Scraping tables
The Python Scrapy library is an excellent helper for building simple but powerful scrapers. Scraping text from pages often means scraping HTML tables, and as I’m going to show, it really doesn’t need to be difficult.
The rough idea is to find a table, iterate over each row and then get the text out of each cell.
Sources
I struggled to scrape a table when all I wanted was an array of arrays of values, and I found this guide on how to scrape tables. On this page I go through the same steps but also offer a quick utility class you can use.
Also, check out my post on debugging Scrapy for a quick and easy way to try this out in your own project.
Table scraper
Here’s the table scraper I’ve put together for my project:
It’s very simple and will give you an array of arrays: one array per row, each holding the text of that row’s cells.
How to use the table scraper
Simply select the table you want to scrape and you can even get it out as a dictionary. In this example, since the page doesn’t identify each table, I had to use an XPath expression to pick out the first table.
The results of running it (scrapy runspider table-scraping.py) should look something like this:
Scraping table with helper class
The data we get out from this will look like:
Result of outputting the scraper data
How it works
It is really simple, so I’ll run through the steps here:
- First we pull out the root element of the table. This should be your <table> element, probably identified by an ID or class.
- When we have the table we get all rows by listing its <tr> tags.
- Then for each row we extract the text of all cells. Header cells might be marked up with <th> instead of <td>, so to get both, my selector will pick either.
And that’s it, there we have our table.
Final words
Though this will get you your table, you will likely want to wrap the data and parse each value individually. You could either post-process each row, or pass a different callback or function in place of the ::text selector to pick out the elements you want, for example to pull out links.