This article is reproduced from serverfault.com.

How to use Python to parse HTML that is stored in a .txt file?

Posted on 2020-11-29 20:59:26

I am trying to parse a .txt file (linked below) whose content is in the form of HTML. I want to get the "COMPANY CONFORMED NAME", which is located at the top of the file, so my function should return "Monocle Acquisition Corp". https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt

I have tried the following:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")

However, "soup" does not contain "COMPANY CONFORMED NAME" at all. Can someone point me in the right direction?

Questioner: Lisa

Answer by Luca Angioloni, 2020-11-30 05:26:40

The data you are looking for is not inside an HTML structure, so Beautiful Soup is not the best tool here. A simple and fast way to find it is a regular expression, like this:

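import re
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()

# Match the label, skip any tabs after it, and capture the rest of the line.
name_re = re.compile(r"COMPANY CONFORMED NAME:\t*(.+)\n")

# search() returns None when the label is missing, so check before using it.
match = name_re.search(text_string)
if match:
    print(match.group(1).strip())
else:
    print("COMPANY CONFORMED NAME not found")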
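
Since the question asks for a function that returns the name, the same regex idea can be wrapped in a small helper. This is only a sketch: the function name `extract_company_name` and the `sample` text are made up for illustration (the sample just imitates a label followed by tabs and a value, it is not copied from the real filing), and the helper returns None when the label is missing rather than raising an error.

```python
import re
from typing import Optional

# Label, then optional spaces/tabs, then the rest of the line as the name.
NAME_RE = re.compile(r"COMPANY CONFORMED NAME:[ \t]*(.+)")


def extract_company_name(text: str) -> Optional[str]:
    """Return the company name found in `text`, or None if the label is absent."""
    match = NAME_RE.search(text)
    # strip() removes trailing whitespace such as a stray '\r' from CRLF files.
    return match.group(1).strip() if match else None


# Hypothetical excerpt that imitates the header layout (assumption, not real data).
sample = (
    "<SEC-HEADER>\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001754170\n"
    "</SEC-HEADER>\n"
)

print(extract_company_name(sample))
```

To use it on the downloaded file, you would pass the decoded response text (for example `r.text` from requests) instead of `sample`.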