
Clean tags from SEC Edgar filings in readtext and quanteda

Published 2020-04-18 09:47:49

I am trying to read .txt files into R with readtext and quanteda; the files were parsed from the SEC EDGAR database of publicly listed firm filings. An example of the .txt file is here, and a more user-friendly version is here for comparison (PG&E during the California wildfires).

My code is the following, for the 1996 folder, which contains many .txt files:

directory <- "D:"
text <- readtext(paste0(directory, "/1996/*.txt"))
corpus <- corpus(text)
dfm <- dfm(corpus, tolower = TRUE, stem = TRUE,
           remove = stopwords("english"), remove_punct = TRUE)

I notice that the dfm still contains a lot of 'useless' tokens, such as 'font-style' and 'italic', and, at the end, many junk tokens such as '3eyn' and 'kq', which I think come from the encoded .jpg at the bottom of the .txt file.
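Such leftovers can also be filtered after the fact. A minimal base-R sketch of the idea (the token vector and the regex rules below are illustrative assumptions, not from the original post): drop hyphenated CSS property names, very short base64 fragments, and digit-bearing gibberish.

```r
# Hypothetical leftover tokens of the kind described above
toks <- c("wildfire", "font-style", "italic", "3eyn", "kq", "utility")

bad <- grepl("-", toks) |     # CSS property names like "font-style"
  nchar(toks) <= 2 |          # short base64 fragments like "kq"
  grepl("[0-9]", toks)        # digit-bearing gibberish like "3eyn"

toks[!bad]
## [1] "wildfire" "italic"   "utility"
```

Note that CSS values such as "italic" survive these rules and would need an explicit removal list; in quanteda the same patterns could be applied directly to a dfm with dfm_remove() and valuetype = "regex".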

Specifying an encoding in readtext does not fix the problem either, for example:

text <- readtext(paste0(directory, "/*.txt"), encoding = "UTF-8")
text <- readtext(paste0(directory, "/*.txt"), encoding = "ASCII")

Any help on how to clean these files so that they look more like the user-friendly version above (i.e. contain only the main text) would be much appreciated.
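This also explains why the encoding argument has no effect: it only tells readtext how to decode bytes into characters, and the HTML tags are perfectly valid text in any encoding, so they survive regardless. The markup has to be stripped explicitly. A minimal base-R sketch (the sample string is hypothetical):

```r
# A fragment of the kind of markup found in the raw EDGAR filing
line <- '<div style="font-style: italic;">PG&amp;E Corporation</div>'

line <- gsub("<[^>]+>", "", line)   # strip HTML tags
line <- gsub("&#\\d+;", "", line)   # drop numeric character entities
line <- gsub("&amp;", "&", line)    # restore the literal ampersand
line
## [1] "PG&E Corporation"
```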

Questioner: Glen Gostlow
Viewed: 48

Answer by Ken Benoit, 2020-03-12 15:55

The key here is to find the marker in the text that indicates where the text you want starts, and the marker that indicates where it ends. Each marker can be a set of alternatives separated by | in a regular expression.

Nothing before the first marker is kept (by default), and you can remove the text following the ending marker by dropping that segment from the corpus with corpus_subset(). The actual patterns will no doubt need tweaking once you discover the variety of formats in your real data.
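As a small illustration of such an alternation pattern (the sample strings here are made up), a single regex matches whichever marker occurs in each document:

```r
# Start marker OR end marker, combined with | into one pattern
pattern <- "Item 8\\.01 Other Events\\.|SIGNATURES"

texts <- c("preamble Item 8.01 Other Events. body text",
           "closing boilerplate SIGNATURES page")

regmatches(texts, regexpr(pattern, texts))
## [1] "Item 8.01 Other Events." "SIGNATURES"
```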

Here's how I did it for your sample document:

library("quanteda")
## Package version: 2.0.0

corp <- readtext::readtext("https://www.sec.gov/Archives/edgar/data/75488/000114036117038612/0001140361-17-038612.txt") %>%
  corpus()

# clean up text
corp <- gsub("<.*?>|&#\\d+;", "", corp)
corp <- gsub("&amp;", "&", corp)

corp <- corpus_segment(corp,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
) %>%
  corpus_subset(pattern != "SIGNATURES")

print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 1 docvar.
## 0001140361-17-038612.txt.1 :
## "Investigation of Northern California Fires   Since October 8, 2017, several catastrophic wildfires have started and remain active in Northern California. The causes of these fires are being investigated by the California Department of Forestry and Fire Protection (Cal Fire), including the possible role of power lines and other facilities of Pacific Gas and Electric Companys (the Utility), a subsidiary of PG&E Corporation.   It currently is unknown whether the Utility would have any liability associated with these fires. The Utility has approximately $800 million in liability insurance for potential losses that may result from these fires. If the amount of insurance is insufficient to cover the Utility's liability or if insurance is otherwise unavailable, PG&E Corporations and the Utilitys financial condition or results of operations could be materially affected."