
Clean tags from SEC Edgar filings in readtext and quanteda

Published 2020-04-18 09:47:49

I am trying to read .txt files into R with readtext and quanteda; the files were parsed from the SEC EDGAR database of publicly listed firm filings. An example of the .txt file is here, and a more user-friendly version is here for comparison (PG&E during the California wildfires).

My code is the following, for the 1996 folder, which contains many .txt files:

directory <- "D:"
text <- readtext(paste0(directory, "/1996/*.txt"))
corpus <- corpus(text)
dfm <- dfm(corpus, tolower = TRUE, stem = TRUE,
           remove = stopwords("english"), remove_punct = TRUE)

I notice that the dfm still contains a lot of 'useless' tokens, such as 'font-style' and 'italic', and, at the end, many junk tokens such as '3eyn' and 'kq', which I think come from the encoded .jpg at the bottom of the .txt file.
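Such leftovers can also be filtered after the fact. A minimal base-R sketch of the idea (the token vector and the regex rules below are illustrative assumptions, not from the original post): drop hyphenated CSS property names, very short base64 fragments, and digit-bearing gibberish.

```r
# Hypothetical leftover tokens of the kind described above
toks <- c("wildfire", "font-style", "italic", "3eyn", "kq", "utility")

bad <- grepl("-", toks) |     # CSS property names like "font-style"
  nchar(toks) <= 2 |          # short base64 fragments like "kq"
  grepl("[0-9]", toks)        # digit-bearing gibberish like "3eyn"

toks[!bad]
## [1] "wildfire" "italic"   "utility"
```

Note that CSS values such as "italic" survive these rules and would need an explicit removal list; in quanteda the same patterns could be applied directly to a dfm with dfm_remove() and valuetype = "regex".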

Specifying an encoding in readtext does not fix the problem either, for example:

text <- readtext(paste0(directory, "/*.txt"), encoding = "UTF-8")
text <- readtext(paste0(directory, "/*.txt"), encoding = "ASCII")

Any help on how to clean these files so that they look more like the user-friendly version above (i.e. contain only the main text) would be much appreciated.
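This also explains why the encoding argument has no effect: it only tells readtext how to decode bytes into characters, and the HTML tags are perfectly valid text in any encoding, so they survive regardless. The markup has to be stripped explicitly. A minimal base-R sketch (the sample string is hypothetical):

```r
# A fragment of the kind of markup found in the raw EDGAR filing
line <- '<div style="font-style: italic;">PG&amp;E Corporation</div>'

line <- gsub("<[^>]+>", "", line)   # strip HTML tags
line <- gsub("&#\\d+;", "", line)   # drop numeric character entities
line <- gsub("&amp;", "&", line)    # restore the literal ampersand
line
## [1] "PG&E Corporation"
```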

Questioner: Glen Gostlow
Viewed: 48

Answer by Ken Benoit, 2020-03-12 15:55

The key here is to find the marker in the text that indicates where the text you want starts, and the marker that indicates where it ends. Each marker can be a set of alternatives separated by | in a regular expression.

Nothing before the first marker is kept (by default), and you can remove the text following the ending marker by dropping that segment from the corpus with corpus_subset(). The actual patterns will no doubt need tweaking once you discover the variety of formats in your real data.
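As a small illustration of such an alternation pattern (the sample strings here are made up), a single regex matches whichever marker occurs in each document:

```r
# Start marker OR end marker, combined with | into one pattern
pattern <- "Item 8\\.01 Other Events\\.|SIGNATURES"

texts <- c("preamble Item 8.01 Other Events. body text",
           "closing boilerplate SIGNATURES page")

regmatches(texts, regexpr(pattern, texts))
## [1] "Item 8.01 Other Events." "SIGNATURES"
```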

Here's how I did it for your sample document:

library("quanteda")
## Package version: 2.0.0

corp <- readtext::readtext("https://www.sec.gov/Archives/edgar/data/75488/000114036117038612/0001140361-17-038612.txt") %>%
  corpus()

# clean up text
corp <- gsub("<.*?>|&#\\d+;", "", corp)
corp <- gsub("&amp;", "&", corp)

corp <- corpus_segment(corp,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
) %>%
  corpus_subset(pattern != "SIGNATURES")

print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 1 docvar.
## 0001140361-17-038612.txt.1 :
## "Investigation of Northern California Fires   Since October 8, 2017, several catastrophic wildfires have started and remain active in Northern California. The causes of these fires are being investigated by the California Department of Forestry and Fire Protection (Cal Fire), including the possible role of power lines and other facilities of Pacific Gas and Electric Companys (the Utility), a subsidiary of PG&E Corporation.   It currently is unknown whether the Utility would have any liability associated with these fires. The Utility has approximately $800 million in liability insurance for potential losses that may result from these fires. If the amount of insurance is insufficient to cover the Utility's liability or if insurance is otherwise unavailable, PG&E Corporations and the Utilitys financial condition or results of operations could be materially affected."