Source: stackoverflow.com, original question: r - Clean tags from SEC Edgar filings in readtext and quanteda
encoding quanteda r

r - Clean tags from SEC Edgar filings in readtext and quanteda

Posted on 2020-04-18 12:22:47

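I am trying to read .txt files into R using readtext and quanteda. The files were parsed from the SEC Edgar database of filings by publicly listed companies. An example of a .txt file is here, and a more user-friendly version is here for comparison (PG&E during the California wildfires).

My code is as follows, for the 1996 folder, which contains many .txt files: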

library("readtext")
library("quanteda")

directory <- "D:"
text <- readtext(paste0(directory, "/1996/*.txt"))
corpus <- corpus(text)
dfm <- dfm(corpus, tolower = TRUE, stem = TRUE, remove = stopwords("english"), remove_punct = TRUE)

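I notice that the dfm still contains many useless tokens, such as "font-style" and "italic", and near the end many more such as "3eyn" and "kq", which I assume come from the .jpg section at the bottom of the .txt files.

The problem persists when I specify an encoding in readtext, for example: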

text <- readtext(paste0(directory, "/*.txt"), encoding = "UTF-8")
text <- readtext(paste0(directory, "/*.txt"), encoding = "ASCII")

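Any help on how to clean these files so that they look more like the user-friendly version above (i.e. containing only the main text) would be much appreciated.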


Asked by
Glen Gostlow
Viewed
52
Ken Benoit 2020-03-12 15:55

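The key here is to find the marker in the text that indicates the start of the text you want, and the marker that indicates where it ends. This can be a set of conditions written as a single regex, with the alternatives separated by |.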
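As a small illustration of combining a start marker and an end marker into one pattern, here is a base R sketch on a made-up string. The text in `txt` is invented for the example; only the two markers are taken from the answer.

```r
# Toy text standing in for a filing; the content is invented for illustration.
txt <- "HEADER junk Item 8.01 Other Events. Body text here. SIGNATURES Signed by X"

# One regex with two alternatives: the start marker OR the end marker.
markers <- "Item 8\\.01 Other Events\\.|SIGNATURES"

# Extract every match of either alternative, in order of appearance.
found <- regmatches(txt, gregexpr(markers, txt))[[1]]
print(found)
```

Checking which markers actually occur in a file this way can help when tuning the pattern for other filing types.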

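Nothing before the first marker is kept (by default), and you can remove the text following the ending marker by dropping that segment from the corpus with corpus_subset(). The actual patterns will no doubt need tweaking once you discover the variety of patterns in your real data.

Here is what I did for your sample document: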
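To see the segment-then-subset behaviour in isolation, here is a minimal sketch on a one-document toy corpus. The document text and the names `toy`, `segs` and `body` are made up for the example, and it assumes a quanteda version in which corpus_segment() stores the matched marker in a docvar called `pattern`, as the answer's own code relies on.

```r
library("quanteda")

# A made-up document: junk, start marker, wanted text, end marker, more junk.
toy <- corpus(c(doc1 = paste(
  "Cover page junk.",
  "Item 8.01 Other Events. The body we want.",
  "SIGNATURES Name and title."
)))

# Split at each marker; each segment records its marker in docvar `pattern`.
segs <- corpus_segment(toy,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
)
print(docvars(segs, "pattern"))

# Drop the segment that starts at the end marker, keeping only the body.
body <- corpus_subset(segs, pattern != "SIGNATURES")
print(ndoc(body))
print(body)
```

The text before the first marker ("Cover page junk.") does not appear as a segment, which is what makes a single start marker enough to discard the filing header.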

library("quanteda")
## Package version: 2.0.0

corp <- readtext::readtext("https://www.sec.gov/Archives/edgar/data/75488/000114036117038612/0001140361-17-038612.txt") %>%
  corpus()

# clean up text
corp <- gsub("<.*?>|&#\\d+;", "", corp)
corp <- gsub("&amp;", "&", corp)

corp <- corpus_segment(corp,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
) %>%
  corpus_subset(pattern != "SIGNATURES")

print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 1 docvar.
## 0001140361-17-038612.txt.1 :
## "Investigation of Northern California Fires   Since October 8, 2017, several catastrophic wildfires have started and remain active in Northern California. The causes of these fires are being investigated by the California Department of Forestry and Fire Protection (Cal Fire), including the possible role of power lines and other facilities of Pacific Gas and Electric Companys (the Utility), a subsidiary of PG&E Corporation.   It currently is unknown whether the Utility would have any liability associated with these fires. The Utility has approximately $800 million in liability insurance for potential losses that may result from these fires. If the amount of insurance is insufficient to cover the Utility's liability or if insurance is otherwise unavailable, PG&E Corporations and the Utilitys financial condition or results of operations could be materially affected."
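The two gsub() calls in the clean-up step can be tried on a single string to see what each one removes. The input below is invented for illustration; it only mimics the kind of markup found in these filings.

```r
# Invented snippet with an HTML tag, an escaped ampersand and a numeric entity.
raw <- "<p style=\"font-style:italic\">PG&amp;E Corporation&#8217;s filing</p>"

# Remove tags (non-greedy match between < and >) and numeric entities like &#8217;.
clean <- gsub("<.*?>|&#\\d+;", "", raw)

# Turn the escaped ampersand back into a literal one.
clean <- gsub("&amp;", "&", clean)

print(clean)
# [1] "PG&E Corporations filing"
```

Because numeric entities are deleted rather than decoded, an apostrophe encoded as &#8217; disappears, which is presumably why the output above reads "Companys" and "Utilitys". If that matters for your analysis, one option is to replace that specific entity with "'" before running the general removal.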