I am making a Kotlin web scraping app using skrape{it}.
So far I have managed to get some of the information I need, but I cannot get the rest: I need to get an href link from an HTML table. See below:
<div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
<div class="CollapsiblePanelContent">
<table width="667px" class="tabularData">
<tr>
<td width="407px" height="21"><a href="link info i need in here">description </a></td>
<td width="130px">15:28</td>
<td width="130px">Western</td>
</tr>
I can get the collapsible panel tab info but not the href; I can't figure out how to get into the table.
I have done this by adapting the example code on the library's GitHub page.
Does anyone have any ideas?
My code:
import it.skrape.selects.el
import it.skrape.skrape
data class MyScrapedData(
val userName: String
)
fun main() {
val githubUserData = skrape {
url = "http://www.website"
extract {
MyScrapedData(
userName = el("div.CollapsiblePanel").text()
)
}
}
println("${githubUserData.userName} is data selected ")
}
Thanks for any info.
Looking at the syntax, I'm assuming you are using version 0.6.0. You have to use a more specific CSS selector.
data class MyScrapedData(
val userName: String,
val link: String
)
fun main() {
val githubUserData = skrape {
url = "http://www.website"
extract {
MyScrapedData(
userName = el("div.CollapsiblePanel").text(),
link = el("table tr td a").attr("href")
)
}
}
println("selected user: ${githubUserData.userName}")
println("selected link: ${githubUserData.link}")
// will print:
// selected user: Today's Interest (1)
// selected link: link info i need in here
}
You can find more information about CSS selectors here: https://www.w3schools.com/cssref/css_selectors.asp
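To see what the more specific selector matches, it can be tried offline against the HTML from the question. This is only a sketch under assumptions: it uses jsoup directly (assumed to be on the classpath as org.jsoup:jsoup) instead of skrape{it}, the wrapping div with class CollapsiblePanel is guessed from the selector in the question, and the helper names tabText and firstLink are made up for illustration.

```kotlin
import org.jsoup.Jsoup

// HTML from the question; the outer div.CollapsiblePanel is an assumption.
val panelHtml = """
<div class="CollapsiblePanel">
  <div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
  <div class="CollapsiblePanelContent">
    <table width="667px" class="tabularData">
      <tr>
        <td width="407px" height="21"><a href="link info i need in here">description </a></td>
        <td width="130px">15:28</td>
        <td width="130px">Western</td>
      </tr>
    </table>
  </div>
</div>
""".trimIndent()

// Class selector: matches only the tab header div.
fun tabText(html: String): String =
    Jsoup.parse(html).select("div.CollapsiblePanelTab").text()

// Descendant selector: an <a> inside a <td> inside a <tr> inside a <table>.
// attr() on the result returns the value from the first match that has it.
fun firstLink(html: String): String =
    Jsoup.parse(html).select("table tr td a").attr("href")

fun main() {
    println(tabText(panelHtml))
    println(firstLink(panelHtml))
}
```

The spaces in "table tr td a" mean "descendant of", so the selector still matches even if the parser inserts a tbody between the table and its rows.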
You could also give version 1.0.0-alpha5 a try. I know it's an alpha version, but it's fully working and lets you do things even more elegantly.
EDIT: If you want to extract multiple links, you can do it like this (using version 0.6.0).
Assuming the HTML you want to parse has the following structure:
<div class="CollapsiblePanelTab" tabindex="0">Today's Interest (1)</div>
<div class="CollapsiblePanelContent">
<table width="667px" class="tabularData">
<tr>
<td><a href="1st link">description </a></td>
<td><a href="2nd link">description </a></td>
<td><a href="3rd link">description </a></td>
<td><a href="4th link">description </a></td>
<td>no link in here</td>
</tr>
</table>
</div>
</div>
Change your data class's link property to links, of type List<String>:
data class MyScrapedData(
val userName: String,
val links: List<String>
)
Use elements instead of element to select all matching occurrences of the CSS selector, and call eachAttr("href") to extract the values of all corresponding href attributes.
fun main() {
val githubUserData = skrape {
url = "http://www.website"
extract {
MyScrapedData(
userName = element("div.CollapsiblePanel").text(),
links = elements("table tr td a").eachAttr("href")
)
}
}
println("selected user: ${githubUserData.userName}")
println("selected links: ${githubUserData.links}")
// will print:
// selected user: Today's Interest (1)
// selected links: [1st link, 2nd link, 3rd link, 4th link]
}
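The explicit loop asked about in the comments gives the same list as eachAttr. A minimal sketch for comparing the two offline, again assuming jsoup (org.jsoup:jsoup) is used directly instead of skrape{it}; the function names linksWithLoop and linksWithEachAttr are hypothetical.

```kotlin
import org.jsoup.Jsoup

// Shortened version of the table above: two links and one cell without a link.
val rowHtml = """
<table class="tabularData">
  <tr>
    <td><a href="1st link">description </a></td>
    <td><a href="2nd link">description </a></td>
    <td>no link in here</td>
  </tr>
</table>
""".trimIndent()

// Collects the href values one element at a time.
fun linksWithLoop(html: String): List<String> {
    val result = mutableListOf<String>()
    for (anchor in Jsoup.parse(html).select("table tr td a")) {
        result.add(anchor.attr("href"))
    }
    return result
}

// Same result in one call; the cell without an <a> never matches the selector.
fun linksWithEachAttr(html: String): List<String> =
    Jsoup.parse(html).select("table tr td a").eachAttr("href")

fun main() {
    println(linksWithLoop(rowHtml))
    println(linksWithEachAttr(rowHtml))
}
```

The loop form is mainly useful when more than one value is needed per link, for example the href together with the link text.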
HINT: The artifact id has changed from core to skrapeit-core from version 0.4.2 and above. I think that's the reason why you couldn't update the version. So you have to add the skrape{it} dependency like this:
using Gradle:
implementation("it.skrape:skrapeit-core:0.6.0")
// instead of implementation("it.skrape:core:0.4.1")
using Maven:
<dependency>
<groupId>it.skrape</groupId>
<artifactId>skrapeit-core</artifactId>
<version>0.6.0</version>
</dependency>
Thanks for the quick reply! I was using 0.4.1 (it was at the top of the list :/ ). I tried getting the alpha build but it wouldn't download for some reason. I did however get 0.6.0. What do I import into the project, please? I ask because attr and link = are highlighted in red even with the vars being declared. Thanks again!
Scratch that, ha ha, I got it to work; there was a missing comma. Thanks very much!! Just one more question (call me Columbo): if there is more than one href link, do I just need to declare link as a list and do a loop to get them all?
Sorry for the missing comma. I just typed the code example directly in the browser without syntax highlighting :D I updated the answer to work with a list of links.
That is brilliant. No problem about the comma haha, I'll have a look at it tomorrow and let you know how I get on. Thanks again!
Hi @Christian, thanks for the advice. I have managed to get all the information I needed from that first part. I have also looked at the CSS selectors and was wondering one thing: where do I find the information about what to write for each selector in your library? For instance, if I want to get a certain id, the selector reference says that for element #id, #firstname is id="firstname", but this doesn't work... I have tried different ways to write it but to no avail. Any ideas or links to read would be great.
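On the id question: in CSS, an id is selected with a leading #, so an element with id="firstname" is matched by the selector string #firstname (optionally prefixed with a tag name, as in input#firstname). A small offline sketch of that, assuming jsoup (org.jsoup:jsoup) is used directly instead of skrape{it}; the HTML, the ids, and the function names are invented for illustration. Presumably the same selector strings can be passed to el(...) in 0.6.0, but that is not verified here.

```kotlin
import org.jsoup.Jsoup

// Hypothetical HTML, made up for this example.
val formHtml = """
<form>
  <input id="firstname" value="Ada">
  <span id="greeting">Hello</span>
</form>
""".trimIndent()

// "#firstname" selects the element whose id attribute is "firstname".
fun firstNameValue(html: String): String =
    Jsoup.parse(html).select("#firstname").attr("value")

// A tag name can be combined with the id: "span#greeting".
fun greetingText(html: String): String =
    Jsoup.parse(html).select("span#greeting").text()

fun main() {
    println(firstNameValue(formHtml))
    println(greetingText(formHtml))
}
```

If a selector like this returns nothing on a real page, it is worth checking whether the element is present in the raw HTML at all, since content added later by JavaScript will not be there.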