温馨提示:本文翻译自stackoverflow.com,查看原文请点击:java - Jsoup selectors: 2nd div after h2
css-selectors html java jsoup

java - Jsoup选择器:h2之后的第2个div

发布于 2020-03-29 13:13:04

我有以下HTML:

<html>
<body>

...

<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
                <tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
                <tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
                <tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
                <tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
                </tbody>
        </table>
    </div>
</div>

<h2> Blah Blah 2</h2>

<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
                <tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
                <tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
                <tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
                <tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
                </tbody>
        </table>
    </div>
</div>

</body>
</html>

我想在包含给定文本h2标签之后提取第二个DIV

您可能会在第一和第二个div中注意到,p标签不在同一位置。

要提取第一个h2之后的DIV,可以使用以下公式:

h2:contains(Blah 1) + p + div +div

但是要提取第二个,将“ Blah 1”替换为“ Blah 2”将不起作用,因为“” p“”标签位于其他位置,因此静态选择器将是:

h2:contains(Blah 2) + div + p +div

我需要的是一个选择器公式,无论p块位于何处,更改文本都可以使它起作用

I tried several ways : like ... The selector nth-of-type would not work either, because I know the position of the DIV only wrt the h2 that is not father of DIV but a preceding sibling ...

Help please

查看更多

查看更多

提问者
Bruno C
被浏览
25
Krystian G 2020-02-01 08:49

I have two ideas how to achieve this.
The first one is to remove every <p> and then you will only have to select "h2:contains(" + text + ")+div+div". Be careful and use it only when you're sure your <div> doesn't contain any <p>. Otherwise it will lack some content.

    public void execute1(String html) {
        Document doc = Jsoup.parse(html);
        // first approach: remove every <p> to simplify document
        Elements paragraphs = doc.select("p");
        for (Element paragraph : paragraphs) {
            paragraph.remove();
        }
        // then one selector will return what you want in both cases
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText(Document doc, String text) {
        return doc.select("h2:contains(" + text + ")+div+div").first();
    }

第二种方法是遍历兄弟姐妹,"h2:contains(" + text+ ")"然后“手动”找到第二种<div>忽略其他事物的方法。最好这样做,因为它不会破坏原始文档,并且会跳过任何数量的<p>元素。

    public void execute2(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
        int counter = 2;
        // find h2 with given text
        Element h2 = doc.select("h2:contains(" + text + ")").first();
        // select every sibling after this h2 element
        Elements siblings = h2.nextElementSiblings();
        // loop over them
        for (Element sibling : siblings) {
            // skip everything that's not a div
            if (sibling.tagName().equals("div")) {
                // count how many divs left to skip
                counter--;
                if (counter == 0) {
                    // return when found nth div
                    return sibling;
                }
            }
        }
        return null;
    }

我还有第三个想法要使用"h2:contains(" + text + ")~div:nth-of-type(2)"它适用于第一种情况,但不适用于第二种情况,可能是因为<p>div之间有一个