256 Kilobytes

An introduction to scraping with Python and BeautifulSoup

Articles in Web Scraping, Data Analysis | By Hash Brown

Published Tue, 08 Jan 2019 22:34:09 -0800

13 views, 1 RAM, and 1 comment

Web scraping is a form of data extraction from web pages online. Simply put, we are using code to simulate human behaviour and save parts of the page for our own use. This could be done to get information which is not available via an API, or to organise the internet's information like Google does.

What is web scraping?

Common uses for scraping include:

  • Collecting phone numbers from directories
  • Automatically archiving temporary web content such as weather reports
  • Buying event tickets or shoes as soon as they go on sale online

Not all information is available from APIs. Learning scraping essentially lets you "create" any API you need for a project. If you wanted an API to display stock prices and none of the current options were available to you, you could just scrape the data from a website. If you wanted to store tweets from public figures so they are still available after they delete them, you can do it with scraping. If you wanted to find press releases before they are made available to the public, again, you can do it with scraping.

Learning scraping opens up a whole host of new opportunities to developers, and it's really not that difficult; skip ahead for the tutorial!

What is the difference between scraping and crawling?

Crawling is using bots to store, read and index web pages. It works in ways very similar to scraping and will follow links on each web page to collate information and usually add it to a central database.

Search engines such as Google and Bing are constantly crawling websites with their bots (referred to as spiders) and adding information to their index, which is what users are served with when they make searches.

The main difference between crawling and scraping is that crawling will visit every single page it can find from URLs collected on pages it has already crawled. Its goal is to get as much information as it can from every source it can.

Scraping is used to target specific pages (which can be combined with crawling to find those pages) with the idea of taking more specific data. For example, crawling might take general information from a Wikipedia page for a book and assign that URL to a topic in a database for Google to serve to users, while a scraper may just store the author's name and book title for you to use later on for your own purpose.

Is scraping legal?

Firstly, we are not legal professionals, so if you are looking for a hard answer, get some real advice.

The answer depends on where you’re scraping and what you’re doing with the information. Much of the content you’re taking could be protected by copyright laws and using it for your own purpose could cause trouble.

Many websites list scraping as against their terms of service. This again could cause legal issues, but it is generally seen as a civil matter rather than a criminal one. You might get sued, but you're probably not going to jail.

What languages are best for scraping?

You can scrape data in most languages: C#, Python, R, Java, and even PHP all have public libraries for parsing HTML and pulling data from it.

However, some languages are better suited than others. PHP, for example, does not support threading, so a PHP scraper would be slow if you needed to scrape any kind of volume. Python (which we are focusing on today) has some great libraries such as BeautifulSoup and Scrapy which are easy for beginners to install and use.

If this turns out to be a popular topic I may expand this guide or create new guides for other languages, leave a comment below letting me know what you want to see next.

BeautifulSoup4 and Python:

BeautifulSoup is a Python library for pulling data from HTML and works with almost any Python parser. The library is highly supported, simple to install and works extremely well.

As HTML is just a markup language it can be navigated in a tree structure, just like XML. BeautifulSoup4 (BS4) makes this very easy, saving us many hours of work over coding our own solution.

Before we start using BS4 though, we need to install it.

How to install BeautifulSoup4

If you're running a recent version of Ubuntu or Debian, installing BS4 is easy and can be done through the included package manager.

Python 2:

$ apt-get install python-bs4

Python 3:

$ apt-get install python3-bs4

Alternatively, on Windows or any other system with easy_install or pip set up, you can do one of these:

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

If none of these methods work for your environment and you're not running a package manager, you can download the source tarball from the BeautifulSoup website and install it by running setup.py:

$ python setup.py install

The BS4 license allows developers to include the entire codebase in their projects, so if you want to, you can use it without installing the library at all. Most users, however, will want to let their package manager handle things so that updates are easy.
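Whichever method you use, a quick way to confirm the install worked is to parse a trivial snippet of HTML from the Python interpreter:

from bs4 import BeautifulSoup

# If this prints "It works!" then BS4 is installed and importable
print(BeautifulSoup("<p>It works!</p>", "html.parser").p.text)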

How to use BeautifulSoup4 to scrape data

In the tutorial below we will be scraping IMDB.com, extracting the film title, rating, and list of character names from the movie page for Aquaman.

This is a fairly simple example, but the template we create for this data extraction will work on any IMDB movie page. If you wanted to extract this data for a list of movies it would be fairly easy to expand the code shown (there's a short sketch of this after the complete code example), but today we will just be looking at how BeautifulSoup4 (BS4) works and how to use it.

Note: Before we begin I’m going to assume you understand the basics of programming and Python as well as have an understanding of HTML and CSS. These are skills you probably need to learn before you start scraping.

Step 1: Understand the page we’re scraping

To scrape the page we want to target, we need to understand where the data is. What elements is it in? Can we use CSS selectors? How many elements are there? Let's take a look at the target URL.

To grab the film title, we first need to find where it sits in the page's markup.

If we look at the code we can see that the title is inside an H1 tag, which is great; a quick Ctrl+F search on the source code reveals this is the only H1 tag on the movie page, which means we can use this element to grab the content.

Next up on our list of things to scrape is the rating.

If we look at the source code we can see the rating is held in a div with the class ratingValue. We can again target this through a CSS selector and the elements around the actual text we want. Very simple.

Finally, we want to gather a list of character names. These are stored on the page in the cast table.

These are a little different: instead of there being just one mention of the value there are multiple, and different movies each have a different number of characters.

Thankfully, the lovely people at IMDB put each of these in a table cell with the class character. These will be easy to pick up, though a little more work will be needed to output the text; more on this later.

Step 2: Start the script

Make a new file called imdb.py in your favourite Python code editor. 

Before we can scrape anything we need to import a couple of things and give BS4 some HTML to work with, so at the top of the document we need to import urllib and BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

Before we can scrape we need something to scrape, so make a variable called url and give it the IMDB URL as its value.

url = "https://www.imdb.com/title/tt1477834/"

To get the HTML from the page we use urllib to make a request; the response body is the HTML we need. This is done like so:

response = urllib.request.urlopen(url)
html = response.read()
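Note: some sites reject requests that don't look like they come from a real browser. If urlopen() gives you an HTTP error, one common workaround is to send a browser-style User-Agent header (a sketch; the header value here is just an example):

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urllib.request.urlopen(req)
html = response.read()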

We can now pass this HTML into BeautifulSoup to create our "soup", which is what we will use to pull values from later on. To do this we use an HTML parser.

soup = BeautifulSoup(html, 'html.parser')

With all these steps done we are ready to scrape.

Step 3: Scraping the data

Now that we have our URL and have created our soup, it's time to unleash BS4 and get to work.

Movie Title:

Our movie title is stored in this block of HTML:

<h1 class="">Aquaman<span id="titleYear">(<a href="/year/2018/?ref_=tt_ov_inf">2018</a>)</span></h1>

It's the only H1 on the page, which makes this fairly easy, but to avoid pulling the text from the child span element we need to set a couple of arguments.

This is the code which will pull text from the H1, without anything else.

title = soup.h1.find(text=True, recursive=False)

This works very simply: we are taking our soup (our parsed HTML from earlier), pulling out just the H1 element, and using the find function on it. The text argument tells BS4 we only want the text (by default it would give us the entire element), and the recursive argument tells BeautifulSoup we're not interested in the text of child elements.
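To see what those two arguments are doing, here's a tiny standalone demonstration using a made-up HTML snippet rather than the real IMDB page:

from bs4 import BeautifulSoup

demo = BeautifulSoup("<h1>Aquaman<span>(2018)</span></h1>", "html.parser")

# Without the arguments, we also get the text of the child span
print(demo.h1.get_text())                        # Aquaman(2018)

# With them, we get only the H1's own direct text
print(demo.h1.find(text=True, recursive=False))  # Aquaman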

Rating:

Our rating value is stored in this block of HTML.

<div class="ratingValue">
<strong title="7.5 based on 115,394 user ratings"><span itemprop="ratingValue">7.5</span></strong><span class="grey">/</span><span class="grey" itemprop="bestRating">10</span>
</div>

Here we need to do something a little different to the above. Pulling the data is fairly easy: we just need the text value from a span element where the attribute itemprop is "ratingValue".

Here is our Python code:

rating = soup.find("span", itemprop="ratingValue")

We are using find() to pull out the span whose itemprop attribute matches our target. This will return the entire element, but we can sort that out later when we output everything.
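Based on the HTML block above, the difference looks roughly like this (illustrative output):

print(rating)       # <span itemprop="ratingValue">7.5</span>
print(rating.text)  # 7.5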

Character Names:

Character names are different to the above examples: there are multiple values, so we need to store them as a list.

We're also going to be targeting these elements with a CSS class, which means we need to go with select() in this specific case. This works in a very similar way to the find() we used for the rating, except it matches elements with CSS selectors and returns all of them.

characters = soup.select("td.character")

This will create our list of characters that we can output next.
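As an aside, if you'd rather stay consistent with find(), the same list can be produced with find_all(); note the trailing underscore in class_, which is needed because class is a reserved word in Python:

characters = soup.find_all("td", class_="character")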

This is all the data we wanted to collect, using three basic BeautifulSoup4 functions and methods. You can probably appreciate how much time this could save you if you were forced to write your own parsing functions; BS4 makes scraping effortless.

Step 4: Outputting data

What you do with your data is down to you; to keep this tutorial simple we will just be printing everything out.

We will be doing this in two parts: we have two very easy values to print (title and rating), and a list of characters we need to output in a loop.

The first print statement is very simple:

print("The movie "+title+"got a rating of "+str(rating.text)+" from IMDB users.")

When we got the data for the title we specified that we only wanted the text, but we didn't do this with the rating, so the rating variable still holds the entire HTML element. We get around this by using ".text" on it inside the print statement, which gives us just the string "7.5". In our case this will print:

The movie Aquaman got a rating of 7.5 from IMDB users.
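As a side note, if you're on Python 3.6 or newer, an f-string avoids the string concatenation entirely (the .strip() is just a precaution against stray whitespace in the scraped title):

print(f"The movie {title.strip()} got a rating of {rating.text} from IMDB users.")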

The second part of outputting these values is printing the list, which we do with a for loop.

print("This is a list of the main characters:")
for character in characters:
    print(character.text.strip())

We are taking each item in the characters list and printing it with ".text" just like above; we are also stripping the whitespace left over from the HTML. The full output is:

The movie Aquaman got a rating of 7.5 from IMDB users.
This is a list of the main characters:
Arthur
Mera
Vulko
King Orm
Atlanna
King Nereus
Manta
Tom Curry
Captain Murk
Jesse (Manta's Father)
Dr. Stephen Shin
King Atlan
Cargo Pilot
Young Arthur (Three Years Old)
Young Arthur (Three Years Old)

Complete Code Example

import urllib.request
from bs4 import BeautifulSoup

#the url we are going to scrape
url = "https://www.imdb.com/title/tt1477834/"

#get the html of the page
response = urllib.request.urlopen(url)
html = response.read()

#create the soup
soup = BeautifulSoup(html, 'html.parser')

#title
title = soup.h1.find(text=True, recursive=False)

#rating
rating = soup.find("span", itemprop="ratingValue")

#characters
characters = soup.select("td.character")

print("The movie " + title + " got a rating of " + rating.text + " from IMDB users.")

print("This is a list of the main characters:")
for character in characters:
    print(character.text.strip())

In just 27 lines (much of it formatting and comments) you have requested a URL, fetched the HTML, parsed it, extracted the data, and printed it out in a nice format.
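As promised earlier, expanding this to a list of movies mostly means wrapping the steps in a function and looping over URLs. Here is a minimal sketch, assuming every URL in the list is an IMDB title page with the same layout as Aquaman's:

import urllib.request
from bs4 import BeautifulSoup

def scrape_movie(url):
    # Fetch and parse a single IMDB title page
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.h1.find(text=True, recursive=False).strip()
    rating = soup.find("span", itemprop="ratingValue").text
    characters = [td.text.strip() for td in soup.select("td.character")]
    return title, rating, characters

urls = [
    "https://www.imdb.com/title/tt1477834/",  # Aquaman
    # ...add more IMDB title URLs here
]

for url in urls:
    title, rating, characters = scrape_movie(url)
    print(title + " (" + rating + "/10) - " + str(len(characters)) + " characters listed")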

Scraping is a superpower.

Other scraping methods

For those who perhaps don't have the skills needed to code something, there are also other options for taking data from websites. These work by pulling HTML elements (or their content) using regular expressions (regex), CSS selectors, or XPath.

Some of the software that lets you do this includes:

Screaming Frog

Traditionally known as an SEO crawler, it can also be used to extract up to 10 pieces of information per crawl with XPath, CSS selectors, or regex. The software can also execute and render JavaScript on pages, so even if websites have made an attempt to stop scraping you can still get the data you need.

The software has a free version but the data extraction feature is only available with the paid license and will cost you Β£150/year.

https://www.screamingfrog.co.uk/

Data Miner

Data Miner is a Chrome Extension that works in your browser. The software lets you pull data very simply and export to a CSV file. Most of the extraction can be done by simply clicking what you want to extract and this option is free for the first few hundred pages. 

You also have access to a library of pre-created extraction queries, so if someone has done the thing you want to do before, you can simply use their query without needing to create your own.

https://data-miner.io/

Visual Web Ripper

This is a desktop application which runs an integrated browser and works very similarly to Data Miner, without any restrictions on page count. The tool is very easy to set up and work with and requires no coding knowledge; however, it can be slow due to its lack of threading.

The other downside is that, compared to the other options, it's also very expensive.

http://visualwebripper.com/
 


Comment from August R. Garcia (Site Owner), posted 09 January, 2019, 03:16 AM PST:

Very nice, as they say.

Other scraping methods

For those who perhaps don't have the skills needed to code something, there are also other options for taking data from websites. These work by pulling HTML elements (or their content) using regular expressions (regex), CSS selectors, or XPath.

[...]

Another option is to use Google Sheets' IMPORTXML function to import webpages and then extract the section that you need.
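For example, something along these lines in a Sheets cell would pull the H1 from the Aquaman page used above (a sketch; the exact XPath depends on the page you're importing):

=IMPORTXML("https://www.imdb.com/title/tt1477834/", "//h1")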

Also, these two sites/tools are helpful when planning out XPath queries and regex pattern matches:


