Note: this article is reproduced from stackoverflow.com.
html html-table kotlin web-scraping

Using skrape{it} to get data from HTML

Posted on 2020-03-27 15:39:48

I am making a Kotlin web scraping app using skrape{it}.

So far I have managed to get some of the information I need, but I cannot get the rest: I need to get an href link from an HTML table. See below:

   <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
   <div class="CollapsiblePanelContent">
   <table width="667px"  class="tabularData">

  <tr>
    <td width="407px" height="21"><a href="link info i need in here">description </a></td>
    <td width="130px">15:28</td>
    <td width="130px">Western</td>
  </tr> 

I can get the collapsible panel tab text but not the href; I can't figure out how to get into the table.

I got this far by adapting the example code on the library's GitHub page.

Does anyone have any ideas?

My code:

import it.skrape.selects.el
import it.skrape.skrape

data class MyScrapedData(
    val userName: String
)

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = el("div.CollapsiblePanel").text() 
            )
        }
    }
    println("${githubUserData.userName} is data selected")
}

Thanks for any info

Questioner: Raif Jackson
Viewed: 101
Answer by Christian Dräger, 2020-02-01 05:10

Looking at the syntax, I'm assuming you are using version 0.6.0. You have to use a more specific CSS selector:

data class MyScrapedData(
   val userName: String,
   val link: String
)

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = el("div.CollapsiblePanel").text(),
                link = el("table tr td a").attr("href")
            )
        }
    }
    println("selected user: ${githubUserData.userName}")
    println("selected link: ${githubUserData.link}")

    // will print:
    // selected user: Today's Interest (1)
    // selected link: link info i need in here
}

You can find more information about CSS selectors here: https://www.w3schools.com/cssref/css_selectors.asp
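
To see why the more specific selector matters, it can help to try selectors against a saved copy of the markup before pointing the scraper at the live site. The sketch below is not skrape{it} code: it assumes the jsoup library (org.jsoup:jsoup) is on the classpath, which skrape{it} builds on as far as I can tell. The sampleHtml string, the extra /unrelated link, the added closing tags and the firstHref helper are all made up for illustration.

```kotlin
import org.jsoup.Jsoup

// Hypothetical sample modelled on the question's markup. The closing tags and
// the unrelated navigation link are assumptions added for illustration.
val sampleHtml = """
    <a href="/unrelated">site navigation</a>
    <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
    <div class="CollapsiblePanelContent">
      <table width="667px" class="tabularData">
        <tr>
          <td width="407px" height="21"><a href="link info i need in here">description </a></td>
          <td width="130px">15:28</td>
          <td width="130px">Western</td>
        </tr>
      </table>
    </div>
""".trimIndent()

// Returns the href of the first element matching the selector,
// or null when nothing matches.
fun firstHref(html: String, cssSelector: String): String? =
    Jsoup.parse(html).selectFirst(cssSelector)?.attr("href")

fun main() {
    // Too broad: matches the navigation link first.
    println(firstHref(sampleHtml, "a"))                      // /unrelated
    // Scoped to the table: matches only the link inside a table cell.
    println(firstHref(sampleHtml, "table.tabularData td a")) // link info i need in here
    // No match: the helper returns null instead of throwing.
    println(firstHref(sampleHtml, "table.missing a"))        // null
}
```

Once a selector returns the expected value here, the same selector string can be passed to el(...) in the skrape{it} code above.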

You could also give version 1.0.0-alpha5 a try. I know it's an alpha version, but it's fully working and lets you do things even more elegantly.

EDIT: If you want to extract multiple links, you can do it like this (using version 0.6.0).

Assuming the HTML you want to parse has the following structure:

<div class="CollapsiblePanel">
    <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
    <div class="CollapsiblePanelContent">
        <table width="667px" class="tabularData">
            <tr>
                <td><a href="1st link">description </a></td>
                <td><a href="2nd link">description </a></td>
                <td><a href="3rd link">description </a></td>
                <td><a href="4th link">description </a></td>
                <td>no link in here</td>
            </tr>
        </table>
    </div>
</div>

Change your data class's link property to a links property of type List<String>:

data class MyScrapedData(
   val userName: String,
   val links: List<String>
)

Use elements instead of element to select all matching occurrences of the CSS selector, and call eachAttr("href") to extract the values of all corresponding href attributes:

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = element("div.CollapsiblePanel").text(),
                links = elements("table tr td a").eachAttr("href")
            )
        }
    }
    println("selected user: ${githubUserData.userName}")
    println("selected links: ${githubUserData.links}")

    // will print:
    // selected user: Today's Interest (1)
    // selected links: [1st link, 2nd link, 3rd link, 4th link]
}
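
The difference between matching one element and matching all of them can also be checked offline. This is a rough sketch that assumes the jsoup library (org.jsoup:jsoup) is available; it is plain jsoup rather than the skrape{it} DSL. The multiLinkHtml sample, including the anchor without an href, and the allHrefs helper are hypothetical names introduced for illustration.

```kotlin
import org.jsoup.Jsoup

// Hypothetical sample: two real links, one anchor without an href,
// and one cell without any anchor at all.
val multiLinkHtml = """
    <table width="667px" class="tabularData">
      <tr>
        <td><a href="1st link">description </a></td>
        <td><a href="2nd link">description </a></td>
        <td><a>anchor without href</a></td>
        <td>no link in here</td>
      </tr>
    </table>
""".trimIndent()

// Collects the href of every element matching the selector.
// jsoup's eachAttr leaves out matched elements that lack the attribute.
fun allHrefs(html: String, cssSelector: String): List<String> =
    Jsoup.parse(html).select(cssSelector).eachAttr("href")

fun main() {
    println(allHrefs(multiLinkHtml, "table tr td a")) // [1st link, 2nd link]
    println(allHrefs(multiLinkHtml, "table tr th a")) // []
}
```

A selector that matches nothing yields an empty list, so the calling code can tell "no links today" apart from a failure without any null handling.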

HINT: The artifact ID changed from core to skrapeit-core as of version 0.4.2. I think that's the reason why you couldn't update the version. You therefore have to add the skrape{it} dependency like this:

Using Gradle:

implementation("it.skrape:skrapeit-core:0.6.0")
// instead of implementation("it.skrape:core:0.4.1")

Using Maven:

<dependency>
   <groupId>it.skrape</groupId>
   <artifactId>skrapeit-core</artifactId>
   <version>0.6.0</version>
</dependency>