Scraping tables

The Python Scrapy library is an excellent helper for building simple but powerful scrapers. It’s common to want to scrape HTML tables when scraping text from pages and, as I’m going to show, it really doesn’t need to be difficult.

The rough idea is to find a table, iterate over its rows, and then get the text out of each cell.

Sources

I struggled to scrape a table when all I wanted was an array of arrays of values, and I found this guide on how to scrape tables. On this page I go through the same steps but also offer a quick utility class you can use.

Also, check out my post on debugging Scrapy for a quick and easy way to try this out in your own project.

Table scraper

Here’s the table scraper I’ve put together for my project:

It’s very simple and will give you an array of arrays: one array per row, each holding the text of that row’s cells.

How to use the table scraper

Simply select the table you want to scrape, and you can even get it out as a dictionary. In this example, since the page doesn’t identify each table, I had to use an XPath expression to pick out the first table.

The result of running it (scrapy runspider table-scraping.py) should look something like this:

[Image: Scraping table with helper class]

The data we get out of this will look like:

[Image: Result of outputting the scraper data]

How it works

It is really simple, so I’ll run through the steps here:

First we pull out the root element of the table. This should be your <table> element, probably identified by an ID or class.
Once we have the table, we get all of its rows by listing the <tr> tags.
Then, for each row, we extract the text of all cells. Header cells might be marked up with <th> instead of <td>, so to get both, my selector picks either.

And that’s it, there we have our table.

Final words