Let's say we have the following response from a browser:
<div>
<tr id="1"></tr>
<tr id="2">
<!--
<div class="A">AAA</div>
<div class="C">BBB</div>
<div class="C">CCC</div>
-->
</tr>
</div>
Getting the comment string using XPath in scrapy should be something like:
response.xpath('//tr[@id="2"]/comment()')
So my question: is there an easy way to extract the values of the <div class="C"> tags inside the comment?
One way would be to remove the comment markers <!-- (...) --> from the string, use the lxml.html
library to turn the result back into HTML, and run XPath on that, but I'm pretty sure there should be an easier way...
I'd appreciate any help. Cheers!
Parsing the content of the comment with lxml.html
is a good solution in my opinion.
Python Code
from lxml import etree
from io import StringIO
parser = etree.HTMLParser()
html_text = """<div>
<tr id="1"></tr>
<tr id="2">
<!--
<div class="A">AAA</div>
<div class="C">BBB</div>
<div class="C">CCC</div>
-->
</tr>
</div>"""
tree = etree.parse(StringIO(html_text), parser)
# comment() selects the comment node inside <tr id="2">
comment = tree.xpath("//tr[@id='2']/comment()")
comment_text = str(comment[0])
# string needs an outermost element in order to be parseable
comment_text = comment_text.replace("<!--", "<html>").replace("-->", "</html>")
embedded_tree = etree.parse(StringIO(comment_text), parser)
print(embedded_tree.xpath("//div[@class='C']/text()"))
Output
['BBB', 'CCC']
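As a small variation on the above, the comment node returned by comment() already exposes its inner markup through its .text attribute, so the string replacement can be skipped and the fragment handed straight to lxml.html. This is a sketch using the same lxml library (fragment is just an illustrative name):

```python
from io import StringIO

from lxml import etree, html

html_text = """<div>
<tr id="1"></tr>
<tr id="2">
<!--
<div class="A">AAA</div>
<div class="C">BBB</div>
<div class="C">CCC</div>
-->
</tr>
</div>"""

# Parse the page and grab the comment node under <tr id="2">.
tree = etree.parse(StringIO(html_text), etree.HTMLParser())
comment = tree.xpath("//tr[@id='2']/comment()")[0]

# The comment node's .text holds the raw inner markup, so no
# "<!--" / "-->" replacement is needed before reparsing.
fragment = html.fromstring(comment.text)
print(fragment.xpath("//div[@class='C']/text()"))
```

lxml.html.fromstring wraps the sibling divs in a parent element for you, which is exactly the role the <html></html> replacement played in the answer above.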
Thanks Mathias! It's been very useful.