
screen scraping - Parsing table and urls in R with rvest

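Posted on 2020-12-11 12:30:39

Sorry, yet another scraping question.

I need the data from this table: http://rspp.ru/tables/non-financial-reports-library/ It contains the non-financial reports of Russian companies. Scraping it is legal. I need to do some text mining on it for research.

Ideally, I need the following output: company - year - report URL.

I am trying to scrape it, but I cannot match the URLs to the company and year data. Here is my script: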

library(rvest)
library(dplyr)

url = "http://rspp.ru/tables/non-financial-reports-library/"

page = read_html(url)

# table
tab = page %>% 
  html_node("table") %>% 
  html_table(fill = T) 

# links
links = page %>% 
  html_node("table") %>% 
  html_nodes("a") %>% 
  html_attr("href")

Could you please help?

Questioner
Petr
QHarr 2020-12-12 12:55:54

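The table is irregular. One ugly way to handle it is to reconstruct the table from the colspan and rowspan attribute values, within columns and rows respectively, so that it expands into a regular data frame.

You can then add the appropriate headers. To account for merged cells, I just repeat the same URL for each applicable year. I do grab the text description of the years covered by a given report, e.g. 2007-2009 (seen in the cell with the link), but I don't output it because the years in the header row are used instead.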
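To show the colspan idea in isolation, here is a minimal sketch on a made-up table. The company name, years and hrefs are invented for illustration and are not taken from rspp.ru. It assumes a cell without a colspan attribute covers exactly one column, and it uses rep() in place of an explicit loop, which is a shortcut for the toy case rather than the approach in the full script.

```r
library(rvest)

# Toy table (invented): the first report cell spans two year columns.
page <- read_html('
<table>
  <tr><th>Company</th><th>2007</th><th>2008</th><th>2009</th></tr>
  <tr>
    <td>Acme</td>
    <td colspan="2"><a href="/r1.pdf">2007-2008</a></td>
    <td><a href="/r2.pdf">2009</a></td>
  </tr>
</table>')

# All cells of the data row except the company name
cells <- page %>% html_nodes("tr:nth-child(2) td:not(:nth-child(1))")

# colspan says how many year columns each cell covers
spans <- cells %>% html_attr("colspan") %>% as.integer()
spans[is.na(spans)] <- 1L  # assumption: no colspan means one column

# One href per cell (NA if the cell has no link)
hrefs <- cells %>% html_node("a") %>% html_attr("href")

# Repeat each href once per column it spans, giving one value per year
expanded <- rep(hrefs, times = spans)
print(expanded)
```

The expanded vector has one entry per year column, so it lines up with the header row even though the row itself has fewer cells.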

library(rvest)
library(stringr)

url <- 'http://rspp.ru/tables/non-financial-reports-library/'
page <- read_html(url)
headers <- page %>% html_nodes('.company-report-table .register-table__row:nth-child(1) th')%>%html_text()
companies <- page %>% html_nodes('.company-report-table .register-table__row td:nth-child(1) span')%>%html_text()
body_rows <- page %>% html_nodes('.register-table__row ~ .register-table__row')
df <- data.frame(matrix(NA_character_, nrow = length(body_rows), ncol = length(headers)))
n <- 0

for(row in seq_along(body_rows)){
  curr_row <-  body_rows[[row]] 
  rspan <- curr_row %>% html_node('td') %>% html_attr('rowspan') %>% as.integer() #rspan tells us how many rows per company
  
  if(!is.na(rspan)){
    n <- n + 1
    title <- companies[[n]]
  }
  df[row,1] = title 
  # handle other columns excluding first
  columns_minus_first <- curr_row %>% html_nodes('td:not(:nth-child(1))') # not always 21 range 10 > 21 but we use colspan to expand to 21
  c <- 1
  
  for(column in seq_along(columns_minus_first)){
    curr_col <- columns_minus_first[[column]]
    cspan <- curr_col %>% html_attr('colspan') %>% as.integer() #use cspan value to determine how many years report covers
    
    if(!is.na(cspan)){
      link <- paste0('http://rspp.ru', curr_col %>% html_node('a') %>% html_attr('href'))
      year <- str_extract(curr_col %>% html_text() ,'\\b[0-9-]{4,9}\\b') #purists may want a tighter regex for year spans
      
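      for(i in seq_len(cspan)){ #seq_len, not seq_along: cspan is a single number, so seq_along(cspan) would only ever run once
        df[row,i+c] <- link #repeats for each year covered by report (could alter this for only first); col 1 is the company name
      }
    }
    c <- c + if(is.na(cspan)) 1L else cspan #advance past every column this cell spans so later cells stay aligned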
  }
}

colnames(df) <- headers
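df <- tibble::as_tibble(df, .name_repair = "minimal") # neither rvest nor stringr attaches tibble, so call it by namespace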
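The question asked for company - year - report URL, while df above is wide (one column per year). A possible way to reshape it in base R is sketched below on a small stand-in data frame. The column names, companies and URLs are hypothetical, and the sketch assumes the first column holds the company name and every other column is a year.

```r
# Stand-in for the wide result (hypothetical values, same shape as df)
wide <- data.frame(
  Company = c("Acme", "Beta"),
  `2007`  = c("http://example.com/a07.pdf", NA),
  `2008`  = c("http://example.com/a07.pdf", "http://example.com/b08.pdf"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: column 1 is the company, all remaining columns are years
year_cols <- names(wide)[-1]

# Stack the year columns one after another into a long table
long <- data.frame(
  company = rep(wide[[1]], times = length(year_cols)),
  year    = rep(year_cols, each = nrow(wide)),
  url     = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop company/year combinations that have no report
long <- long[!is.na(long$url), ]
rownames(long) <- NULL
print(long)
```

On the real data the same steps would be applied to df instead of wide. A report spanning several years will appear once per year with the same URL, which matches the repetition done in the script above.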