Web Scraping and Crawling with Python

Lidiane Taquehara

whoami

Lidiane Taquehara

Web Scraping

Beautiful Soup

Let's scrape the PyCon Austria blog

Show me the code

            import requests
            from bs4 import BeautifulSoup

            # Fetch the blog page and parse the returned HTML
            response = requests.get('https://pycon.pyug.at/en/blog/')
            soup = BeautifulSoup(response.text, 'html.parser')

            # Each post title on the page is an <h2> element
            for element in soup.find_all("h2"):
                print(element.text)
        
            $ python example.py
            Get Sentry
            Speakers, talks, and workshops
            Embracing opportunities in 2025: PyCon Austria is closer than ever!
            Interview with our sponsor Bitpanda
            Interview with Professor Robert Matzinger of the UAS Burgenland
            Interview with Thomas Mitzka
            Interviews with Horst JENS and Ralf SCHLATTERBECK, PhD
            Interviews with Ivana Kellyerova
            call for papers (call for volunteers)
            Interviews: Týna Doležalová, Lubomír Doležal
        

Scraping Workshops

FATEC Jundiaí, Higher Education Institution in São Paulo, Brazil

Web Crawling

Example of a problem

Data spread across different pages
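Before reaching for a crawling framework, the problem can be illustrated with the tools we already have: follow a "next" link from page to page, collecting data as we go. This is a minimal sketch using in-memory HTML pages (the page contents, URLs, and the `next` link class are all made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical paginated "site": each page lists titles
# and may link to the next page.
PAGES = {
    "/blog/page/1": '<h2>Post A</h2><h2>Post B</h2>'
                    '<a class="next" href="/blog/page/2">Next</a>',
    "/blog/page/2": '<h2>Post C</h2>'
                    '<a class="next" href="/blog/page/3">Next</a>',
    "/blog/page/3": '<h2>Post D</h2>',  # last page: no "next" link
}

def crawl(start="/blog/page/1"):
    """Follow 'next' links until no more pages, collecting every h2 title."""
    titles, url = [], start
    while url:
        soup = BeautifulSoup(PAGES[url], "html.parser")
        titles.extend(h2.text for h2 in soup.find_all("h2"))
        next_link = soup.find("a", class_="next")
        url = next_link["href"] if next_link else None
    return titles

print(crawl())  # → ['Post A', 'Post B', 'Post C', 'Post D']
```

Hand-rolling this loop quickly gets tedious (retries, politeness delays, parallel requests), which is exactly what Scrapy takes care of.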

Scrapy

Spider example

See the example on GitHub

Digging the code

name = 'most_popular_movies'

Define a unique identifier for the spider

start_urls = ['https://www.rottentomatoes...popular']

Specify the initial URL(s) the spider will request

parse()

Process the response returned for each request

Running the spider

Data management in the cloud

Zyte (formerly Scrapinghub)

Scrapy Cloud

Dashboard

Data usage options

Scrapy Cloud API

It enables interacting with the spiders, the collected data, and cloud project management

python-scrapinghub

Real life application: Love Mondays

Thank you so much!

This presentation is available in:

https://scraping-slides.netlify.app