Note: This article is reproduced from stackoverflow.com.
python scrapy xpath

Using XPath in strings

Posted on 2020-03-29 12:47:36

Let's say we have the following response from a browser:

<div>
  <tr id="1"></tr>
  <tr id="2">
  <!--
    <div class="A">AAA</div>
    <div class="C">BBB</div>
    <div class="C">CCC</div>
  -->
  </tr>
</div>

Getting the comment string with XPath in Scrapy should look something like:

response.xpath('//tr[@id="2"]/comment()')

So my question: is there an easy way to extract the values of the <div class="C"> tags inside the comment? One way would be to remove the comment delimiters <!-- (...) --> from the string, use the lxml.html library to turn the result into HTML again, and run XPath on it, but I'm pretty sure there should be an easier way...

I'd appreciate any help. Cheers!

Questioner: willp93
Viewed: 97
Mathias Müller 2020-01-30 03:35

Parsing the content of the comment with lxml.html is a good solution in my opinion.

Python Code

from lxml import etree
from io import StringIO

parser = etree.HTMLParser()

html_text = """<div>
  <tr id="1"></tr>
  <tr id="2">
  <!--
    <div class="A">AAA</div>
    <div class="C">BBB</div>
    <div class="C">CCC</div>
  -->
  </tr>
</div>"""

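# Parse the outer document
tree = etree.parse(StringIO(html_text), parser)

# Select the comment node inside the second row
comment = tree.xpath("//tr[@id='2']/comment()")

# str() on an lxml comment node includes the <!-- and --> delimiters
comment_text = str(comment[0])

# The string needs an outermost element in order to be parseable
comment_text = comment_text.replace("<!--", "<html>").replace("-->", "</html>")

# Parse the comment body as its own HTML document and query it
embedded_tree = etree.parse(StringIO(comment_text), parser)
print(embedded_tree.xpath("//div[@class='C']/text()"))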

Output

['BBB', 'CCC']
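
The same idea can be sketched with lxml.html directly, which avoids the string replacement: an lxml comment node exposes its body (without the <!-- and --> delimiters) through its .text attribute, and lxml.html.fragment_fromstring can wrap that body in a parent element. This is only a minimal sketch, not part of the original answer; the helper name divs_in_comment and its parameters are hypothetical, and it assumes the comment body is reasonably well-formed HTML.

```python
from lxml import html

html_text = """<div>
  <tr id="1"></tr>
  <tr id="2">
  <!--
    <div class="A">AAA</div>
    <div class="C">BBB</div>
    <div class="C">CCC</div>
  -->
  </tr>
</div>"""


def divs_in_comment(page_source, comment_xpath, inner_xpath):
    """Hypothetical helper: parse a comment's body as HTML and query it."""
    page = html.fromstring(page_source)
    comments = page.xpath(comment_xpath)
    if not comments:
        return []
    # .text holds the comment body without the <!-- and --> delimiters
    body = comments[0].text
    # create_parent wraps the sibling <div> elements in a single root element
    fragment = html.fragment_fromstring(body, create_parent="div")
    # The inner XPath is evaluated relative to the wrapper element
    return fragment.xpath(inner_xpath)


values = divs_in_comment(
    html_text,
    "//tr[@id='2']/comment()",
    ".//div[@class='C']/text()",
)
print(values)
```

In Scrapy, the comment string returned by the selector would presumably still carry its <!-- and --> delimiters, so they would need to be stripped before the body is parsed again.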