Jsoup selectors: 2nd div after h2

Krystian G 2020-02-01 08:49

I have two ideas how to achieve this.
The first one is to remove every <p> and then you will only have to select "h2:contains(" + text + ")+div+div". Be careful and use it only when you're sure your <div> doesn't contain any <p>. Otherwise it will lack some content.

    public void execute1(String html) {
        Document doc = Jsoup.parse(html);
        // first approach: remove every <p> to simplify document
        Elements paragraphs = doc.select("p");
        for (Element paragraph : paragraphs) {
            paragraph.remove();
        }
        // then one selector will return what you want in both cases
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText(Document doc, String text) {
        return doc.select("h2:contains(" + text + ")+div+div").first();
    }

The second approach would be to iterate over siblings of "h2:contains(" + text+ ")" and "manually" find second <div> ignoring anything else. It's better because it doesn't destroy the original document and it will skip any number of <p> elements.

    public void execute2(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
        System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
    }

    private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
        int counter = 2;
        // find h2 with given text
        Element h2 = doc.select("h2:contains(" + text + ")").first();
        // select every sibling after this h2 element
        Elements siblings = h2.nextElementSiblings();
        // loop over them
        for (Element sibling : siblings) {
            // skip everything that's not a div
            if (sibling.tagName().equals("div")) {
                // count how many divs left to skip
                counter--;
                if (counter == 0) {
                    // return when found nth div
                    return sibling;
                }
            }
        }
        return null;
    }

I had also third idea to use "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case, but fails for the second one probably because there's a <p> between the divs.

Bruno C 2020-02-06 16:30:28

Hy Kristian, I wanted to avoid using java but at the end there is no way to do it without coding:)

Related issues

Alternative approach for checking data types to build an object

Is it possible to get first and last name from Google Authentication on Firebase?

Constructor that receives TXT files to read and store them

Logback with Elastic Beanstalk

Camel Predicate Example in xml DSL

How do I get user input validation into my java calculator program?

firebase getting exact childs from database

Listener for custom dialog null

JFrame very small

AWS Elastic Beanstalk Application Logging with Logback