I was trying to pull out a big description block for an item in a recent scraping project. Of course, this contains all kinds of weird and wonderful HTML formatting as it is probably built in a WYSIWYG editor.
I found that Scrapy doesn’t have a good way to handle even the simpler cases. Take this HTML for example:
<div id="complex-text">
<p>This div contains <i>complex</i> text</p>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
<blockquote>Including quotes</blockquote>
</div>
Try to get the text of this using Scrapy’s ::text pseudo-selector, like this: response.css('#complex-text::text').get(), and all you will get is whitespace, which is effectively an empty string. Why?
The ::text pseudo-selector only returns the text nodes that are direct children of the element you select, not the full inner text we would expect from the JavaScript innerText property. But I think that in most cases, except for really simple ones, we need to get the full innerText, styling ignored.
Join the elements
def innertext_quick(elements, delimiter=""):
    return list(delimiter.join(el.strip() for el in element.css('*::text').getall()) for element in elements)
This naive solution, however, has several problems. It simply concatenates the text of every element. So imagine you have some list items:
<div id="complex-text">
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
</div>
This will end up non-delimited, as List item 1List item 2, without any spaces.
Of course, you can add spacing in between these, but that will instead cause issues if you have <span> tags inside text where you don’t want the spaces added.
Use bs4
Better yet, use BeautifulSoup. It walks every element in the HTML you give it and concatenates the text into a single string, keeping the whitespace that was in the source. You can even control the stripping of elements you don’t want.
from bs4 import BeautifulSoup

def innertext(selector):
    html = selector.get()
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text().strip()
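As a rough usage sketch, this is what get_text produces for the example HTML from the start of the post, parsed directly with BeautifulSoup rather than through a Scrapy selector. The separator and strip arguments are get_text parameters. The variable names plain and flat are just for illustration.

```python
from bs4 import BeautifulSoup

HTML = """<div id="complex-text">
<p>This div contains <i>complex</i> text</p>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
<blockquote>Including quotes</blockquote>
</div>"""

soup = BeautifulSoup(HTML, 'html.parser')

# Default behavior: text nodes are joined as-is, so the line breaks
# from the source HTML keep the list items apart.
plain = soup.get_text().strip()

# Strip every text fragment and join the non-empty ones with a space,
# which flattens everything onto a single line.
flat = soup.get_text(separator=" ", strip=True)

print(plain)
print(flat)
```

Note that the flattened variant puts a separator between every fragment, so it can run into the same <span> issue as the naive join. Which form is better depends on how the source HTML is laid out.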
Show me the code
If you’re curious to see other ways of doing this, check out my full article over at Medium. I have also prepared a full repository showing the two solutions.
https://github.com/ddikman/scrapy-innertext
Happy scraping