This article is reproduced from stackoverflow.com.
css-selectors html java jsoup

Jsoup selectors: 2nd div after h2

Posted on 2020-03-29 12:47:52

I have the following HTML:

<html>
<body>

...

<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
                <tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
                <tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
                <tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
                <tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
            </tbody>
        </table>
    </div>
</div>

<h2> Blah Blah 2</h2>

<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
                <tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
                <tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
                <tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
                <tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
            </tbody>
        </table>
    </div>
</div>

</body>
</html>

I would like to extract the second div following an h2 tag that contains a given text.

As you may notice, the p tags are not in the same position in the first and second sections: after the first h2 the p comes before both divs, while after the second h2 it sits between them.

To extract the second div following the first h2, the selector below works:

h2:contains(Blah 1) + p + div + div

But to extract the second one, simply replacing "Blah 1" with "Blah 2" does not work, because the "p" tag is located elsewhere, so a static selector would be:

h2:contains(Blah 2) + div + p + div

What I need is a single selector where changing only the text makes it work, wherever the p blocks may be.

I tried several approaches, like ... The nth-of-type selector would not work either, because I only know the position of the div relative to the h2, which is not the div's parent but a preceding sibling.

Help please

Questioner
Bruno C
Krystian G 2020-02-01 08:49

I have two ideas for how to achieve this.
The first one is to remove every <p>, after which you only have to select "h2:contains(" + text + ")+div+div". Be careful: use it only when you're sure the <div> you want doesn't contain any <p> itself. Otherwise those paragraphs are removed too and the result will be missing content.

    public void execute1(String html) {
        Document doc = Jsoup.parse(html);
        // first approach: remove every <p> to simplify document
        Elements paragraphs = doc.select("p");
        for (Element paragraph : paragraphs) {
            paragraph.remove();
        }
        // then one selector will return what you want in both cases
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText(Document doc, String text) {
        return doc.select("h2:contains(" + text + ")+div+div").first();
    }
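If modifying the parsed document is a concern, one possible variation is to strip the paragraphs from a copy instead. The sketch below assumes jsoup is on the classpath; the class name ClonedLookup, the helper secondDivAfterH2, and the shortened HTML are made up for illustration. Note that the returned element belongs to the copy, not to the original document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ClonedLookup {
    // Hypothetical helper: removes <p> from a deep copy, then applies the
    // same "+div+div" selector as above to that copy.
    static Element secondDivAfterH2(Document original, String text) {
        Document copy = original.clone();   // deep copy, original is untouched
        copy.select("p").remove();          // detach every <p> in the copy
        return copy.select("h2:contains(" + text + ") + div + div").first();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML in the question (assumption).
        String html = "<h2>Blah Blah 1</h2><p>x</p><div>A1</div><div>A2</div>"
                    + "<h2>Blah Blah 2</h2><div>B1</div><p>y</p><div>B2</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(secondDivAfterH2(doc, "Blah 1").text()); // A2
        System.out.println(secondDivAfterH2(doc, "Blah 2").text()); // B2
        System.out.println(doc.select("p").size());                 // 2, original keeps its <p>
    }
}
```

The caveat from the answer still applies: any <p> nested inside the selected div is missing from the copy.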

The second approach is to iterate over the siblings that follow "h2:contains(" + text + ")" and "manually" find the second <div>, ignoring everything else. It's better because it doesn't modify the original document and it skips any number of <p> elements.

    public void execute2(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
        int counter = 2;
        // find h2 with given text
        Element h2 = doc.select("h2:contains(" + text + ")").first();
        // first() returns null when no h2 matches, so guard against it
        if (h2 == null) {
            return null;
        }
        // select every sibling after this h2 element
        Elements siblings = h2.nextElementSiblings();
        // loop over them
        for (Element sibling : siblings) {
            // skip everything that's not a div
            if (sibling.tagName().equals("div")) {
                // count how many divs left to skip
                counter--;
                if (counter == 0) {
                    // return when found nth div
                    return sibling;
                }
            }
        }
        // fewer than two divs follow the h2
        return null;
    }
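One thing the loop above does not do is stop at the next <h2>, so if a section holds fewer than two divs it returns a div from a later section. If that matters, a bounded variant might look like the sketch below. It assumes jsoup is on the classpath; the class name SectionLookup, the helper nthDivInSection, and the shortened HTML are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionLookup {
    // Hypothetical helper: returns the nth <div> sibling after the matching
    // <h2>, or null if the next <h2> is reached before n divs were seen.
    static Element nthDivInSection(Document doc, String text, int n) {
        Element h2 = doc.select("h2:contains(" + text + ")").first();
        if (h2 == null) {
            return null; // no heading with that text
        }
        int seen = 0;
        for (Element el = h2.nextElementSibling(); el != null; el = el.nextElementSibling()) {
            if (el.tagName().equals("h2")) {
                break; // the next section starts here
            }
            if (el.tagName().equals("div") && ++seen == n) {
                return el;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Shortened stand-in HTML (assumption): the first section has only one div.
        String html = "<h2>Blah Blah 1</h2><p>x</p><div>A1</div>"
                    + "<h2>Blah Blah 2</h2><div>B1</div><p>y</p><div>B2</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(nthDivInSection(doc, "Blah 1", 2));        // null
        System.out.println(nthDivInSection(doc, "Blah 2", 2).text()); // B2
    }
}
```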

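I also had a third idea: use "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case but fails for the second one. The reason is not the <p> between the divs: nth-of-type(2) counts divs among all children of the parent, not from the h2, so it can only ever match the second div of the whole parent. In the HTML shown, that div happens to follow the first h2, which is why the first case appears to work.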
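A variation on the third idea that may be worth trying is to keep the general sibling combinator but drop nth-of-type, and pick the second match by index in Java instead. This is only a sketch, assuming jsoup is on the classpath; the class name SiblingIndexLookup, the helper secondDivAfterH2, and the shortened HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SiblingIndexLookup {
    // Hypothetical helper: "h2 ~ div" matches every div sibling that follows
    // the h2, in document order, so index 1 is the second div after it.
    static Element secondDivAfterH2(Document doc, String text) {
        Elements divs = doc.select("h2:contains(" + text + ") ~ div");
        return divs.size() >= 2 ? divs.get(1) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML in the question (assumption).
        String html = "<h2>Blah Blah 1</h2><p>x</p><div>A1</div><div>A2</div>"
                    + "<h2>Blah Blah 2</h2><div>B1</div><p>y</p><div>B2</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(secondDivAfterH2(doc, "Blah 1").text()); // A2
        System.out.println(secondDivAfterH2(doc, "Blah 2").text()); // B2
    }
}
```

Like the sibling loop in the second approach, this does not stop at the next <h2>, and it assumes the text matches a single heading.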