Note: this post comes from stackoverflow.com. Original question: kotlin - using skrape{it} to get data from a html
html html-table kotlin web-scraping

kotlin - using skrape{it} to get data from an HTML page

Posted 2020-03-27 15:42:38

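I am building a Kotlin web-scraping application with skrape{it}.

So far I have managed to get some of the information I need, but not the rest: I need to get the href link out of an HTML table. See below...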

   <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
   <div class="CollapsiblePanelContent">
   <table width="667px"  class="tabularData">

  <tr>
    <td width="407px" height="21"><a href="link info i need in here">description </a></td>
    <td width="130px">15:28</td>
    <td width="130px">Western</td>
  </tr> 

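I can get the text of the collapsible panel tab, but not the href; I cannot figure out how to get into the table.

I got this far by tinkering with the example code on the library's GitHub page.

Does anyone have any ideas?

My code: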

import it.skrape.selects.el
import it.skrape.skrape

data class MyScrapedData(
    val userName: String
)

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = el("div.CollapsiblePanel").text() 
            )
        }
    }
    println("${githubUserData.userName} is data selected ")
}

Thanks for any information.

Asked by: Raif Jackson

Christian Dräger 2020-02-01 05:10

Judging by the syntax, I assume you are using version 0.6.0. You need to use a more specific CSS selector.

data class MyScrapedData(
   val userName: String,
   val link: String
)

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = el("div.CollapsiblePanel").text(),
                link = el("table tr td a").attr("href")
            )
        }
    }
    println("selected user: ${githubUserData.userName}")
    println("selected link: ${githubUserData.link}")

    // will print:
    // selected user: Today's Interest (1)
    // selected link: link info i need in here
}

You can find more information about CSS selectors here: https://www.w3schools.com/cssref/css_selectors.asp
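To try the descendant selector without hitting a real site, here is a minimal sketch that runs it against the HTML fragment from the question. It uses jsoup directly rather than skrape{it}, which is an assumption made purely for offline experimentation (jsoup must be on the classpath). The names `sampleHtml`, `tabText` and `firstLink` are hypothetical helpers, not part of any library.

```kotlin
import org.jsoup.Jsoup

// HTML fragment from the question, trimmed to the parts the selectors touch.
val sampleHtml = """
    <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
    <div class="CollapsiblePanelContent">
      <table width="667px" class="tabularData">
        <tr>
          <td><a href="link info i need in here">description </a></td>
          <td>15:28</td>
          <td>Western</td>
        </tr>
      </table>
    </div>
""".trimIndent()

// Text of the panel tab, selected by its class name.
fun tabText(html: String): String =
    Jsoup.parse(html).select("div.CollapsiblePanelTab").text()

// href of the first <a> found anywhere below a table row cell.
// "table tr td a" is a descendant selector, so intermediate elements
// between table and tr do not prevent a match.
fun firstLink(html: String): String =
    Jsoup.parse(html).select("table tr td a").attr("href")

fun main() {
    println(tabText(sampleHtml))
    println(firstLink(sampleHtml))
}
```

Note that the class selector has to match a class token exactly: `div.CollapsiblePanelTab` matches the tab shown in the fragment, while `div.CollapsiblePanel` would only match an element whose class list contains `CollapsiblePanel` itself.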

You could also give version 1.0.0-alpha5 a try. I know it's an alpha version, but it is fully working and lets you do things even more elegantly.

EDIT: If you want to extract multiple links, you can do it like this (using version 0.6.0).

Assuming the HTML you want to parse has the following structure:

<div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
    <div class="CollapsiblePanelContent">
        <table width="667px" class="tabularData">
            <tr>
                <td><a href="1st link">description </a></td>
                <td><a href="2nd link">description </a></td>
                <td><a href="3rd link">description </a></td>
                <td><a href="4th link">description </a></td>
                <td>no link in here</td>
            </tr> 
        </table>
    </div>
</div>

Change the link property of your data class to type List<String> (renamed to links here):

data class MyScrapedData(
   val userName: String,
   val links: List<String>
)

Use elements instead of element to select all matching occurrences of the CSS selector, and call eachAttr("href") to extract the values of all corresponding href attributes.

fun main() {
    val githubUserData = skrape {
        url = "http://www.website" 

        extract {
            MyScrapedData(
                userName = element("div.CollapsiblePanel").text(),
                links = elements("table tr td a").eachAttr("href")
            )
        }
    }
    println("selected user: ${githubUserData.userName}")
    println("selected links: ${githubUserData.links}")

    // will print:
    // selected user: Today's Interest (1)
    // selected links: [1st link, 2nd link, 3rd link, 4th link]
}
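The select-all behaviour can also be checked offline. The sketch below again uses jsoup directly (an assumption for illustration, not skrape{it}'s own API) on a shortened version of the structure above; `multiLinkHtml` and `allLinks` are hypothetical names.

```kotlin
import org.jsoup.Jsoup

// Shortened version of the multi-link table from the answer.
val multiLinkHtml = """
    <table class="tabularData">
      <tr>
        <td><a href="1st link">description </a></td>
        <td><a href="2nd link">description </a></td>
        <td>no link in here</td>
      </tr>
    </table>
""".trimIndent()

// Collects the href of every <a> that matches the selector, in document order.
// Cells without an <a> simply produce no match, so they add nothing to the list.
fun allLinks(html: String): List<String> =
    Jsoup.parse(html).select("table tr td a").eachAttr("href")

fun main() {
    println(allLinks(multiLinkHtml))
}
```

The cell without an anchor is skipped because the selector targets the `a` elements, not the `td` cells.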

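Hint: as of version 0.4.2, the artifact ID changed from core to skrapeit-core. I think that is why you were not able to update the version. You therefore have to add the skrape{it} dependency as follows: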

Using Gradle:

implementation("it.skrape:skrapeit-core:0.6.0")
// instead of implementation("it.skrape:core:0.4.1")

Using Maven:

<dependency>
   <groupId>it.skrape</groupId>
   <artifactId>skrapeit-core</artifactId>
   <version>0.6.0</version>
</dependency>