Web Scraping and Crawling with Python

Lidiane Taquehara

whoami

Lidiane Taquehara

Web Scraping

Beautiful Soup

Let's scrape the PyCon Austria blog

Show me the code

            import requests
            from bs4 import BeautifulSoup

            # Fetch the blog page and parse the returned HTML
            response = requests.get('https://pycon.pyug.at/en/blog/')
            soup = BeautifulSoup(response.text, 'html.parser')

            # Each post title on the page is an <h2> element
            for element in soup.find_all("h2"):
                print(element.text)
        
            $ python example.py
            Get Sentry
            Speakers, talks, and workshops
            Embracing opportunities in 2025: PyCon Austria is closer than ever!
            Interview with our sponsor Bitpanda
            Interview with Professor Robert Matzinger of the UAS Burgenland
            Interview with Thomas Mitzka
            Interviews with Horst JENS and Ralf SCHLATTERBECK, PhD
            Interviews with Ivana Kellyerova
            call for papers (call for volunteers)
            Interviews: Týna Doležalová, Lubomír Doležal
        

Scraping Workshops

FATEC Jundiaí, Higher Education Institution in São Paulo, Brazil

Web Crawling

Example of a problem

Data spread across different pages
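Before reaching for a crawling framework, the problem can be illustrated with the tools we already have: follow a "next" link from page to page, collecting data as we go. This is a minimal sketch using in-memory HTML pages (the page contents, URLs, and the `next` link class are all made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical paginated "site": each page lists titles
# and may link to the next page.
PAGES = {
    "/blog/page/1": '<h2>Post A</h2><h2>Post B</h2>'
                    '<a class="next" href="/blog/page/2">Next</a>',
    "/blog/page/2": '<h2>Post C</h2>'
                    '<a class="next" href="/blog/page/3">Next</a>',
    "/blog/page/3": '<h2>Post D</h2>',  # last page: no "next" link
}

def crawl(start="/blog/page/1"):
    """Follow 'next' links until no more pages, collecting every h2 title."""
    titles, url = [], start
    while url:
        soup = BeautifulSoup(PAGES[url], "html.parser")
        titles.extend(h2.text for h2 in soup.find_all("h2"))
        next_link = soup.find("a", class_="next")
        url = next_link["href"] if next_link else None
    return titles

print(crawl())  # → ['Post A', 'Post B', 'Post C', 'Post D']
```

Hand-rolling this loop quickly gets tedious (retries, politeness delays, parallel requests), which is exactly what Scrapy takes care of.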

Scrapy

Spider example

See the example on GitHub

Digging the code

name = 'most_popular_movies'

Define a unique identifier for the spider

start_urls = ['https://www.rottentomatoes...popular']

Specify the initial URL(s) the spider will request

parse()

Process the response returned for each request

Running the spider

Data management in the cloud

Zyte (formerly Scrapinghub)

Scrapy Cloud

Dashboard

Data usage options

Scrapy Cloud API

It enables interacting with the spiders, the collected data, and cloud project management

python-scrapinghub

Real life application: Love Mondays

Thank you so much!

This presentation is available in:

https://scraping-slides.netlify.app