
Python Newspaper with web archive (wayback machine)

Posted on 2017-01-16 15:44:40

I'm trying to use the Python library newspaper with the Wayback Machine, which stores archived versions of websites. In theory, old news articles could be queried and downloaded from these archives.

For instance, the following code queries the CNBC archive for a specific snapshot date.

import newspaper
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)

Although the archived page itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs such as:

https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/

which are not actual articles from this archived version of CNBC. However, newspaper works great with today's version of CNBC.

I suppose it gets confused by the format of the URL, which contains two http:// prefixes. Does anyone have any suggestions on how to extract articles from the Wayback Machine archives?

Questioner
have_beard_will_ski
Life is complex 2020-12-28 19:22:41

This was an interesting problem, which I will add to my Newspaper Usage Overview document available on GitHub.

I attempted to use newspaper.build, but I couldn't get it to work correctly, so I used newspaper's Source class directly.

from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                      memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)

# collect the links found on the archived front page
wayback_cnbc.build()
for article_extract in wayback_cnbc.articles:
    # fetch and parse each linked page from the archive
    article_extract.download()
    article_extract.parse()

    print(article_extract.publish_date)
    print(article_extract.title)
    print(article_extract.url)
    print('')

    # this sleep timer is helping with some timeout issues
    # that were happening when querying
    sleep(randint(1, 3))

The example above outputs this:

None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/
    
None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/

2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html

2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points 
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html

2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
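
The first two results above (Media and CNBC Video) are section pages rather than articles. One possible way to skip them is to split each snapshot URL into its timestamp and original URL, then keep only URLs whose original path contains a date. This is a minimal sketch using only the standard library: the helper names split_wayback_url and is_dated_article are hypothetical, and the 14-digit timestamp and /YYYY/MM/DD/ path patterns are assumptions based on the URLs shown in this post, not a general rule for every site.

```python
import re
from urllib.parse import urlparse

# Assumed snapshot URL shape, taken from the URLs in this post:
# https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(
    r'^https?://web\.archive\.org/web/(?P<timestamp>\d{14})/(?P<original>https?://.+)$'
)

# Assumed article path shape, e.g. /2018/02/28/ in the original CNBC URL
DATED_PATH_RE = re.compile(r'/\d{4}/\d{2}/\d{2}/')


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group('timestamp'), match.group('original')


def is_dated_article(url, host='www.cnbc.com'):
    """Heuristic: True if url is a snapshot of a dated page on the given host."""
    parts = split_wayback_url(url)
    if parts is None:
        return False
    original = urlparse(parts[1])
    return original.netloc == host and DATED_PATH_RE.search(original.path) is not None
```

In the loop above, a check such as `if not is_dated_article(article_extract.url): continue` placed before the download call would skip the section pages, as well as unrelated links like the blog.archive.org URL from the question.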
     

Hopefully, this answer helps with your use case for querying the Wayback Machine for articles. If you have any questions, please let me know.