
archive newspaper3k python python-3.x python-newspaper

Python Newspaper with web archive (wayback machine)

Posted on 2017-01-16 15:44:40

I'm trying to use the Python library newspaper with the archives from the Wayback Machine, which stores old versions of websites that were archived. Theoretically, old news articles could be queried and downloaded from these archives.

For instance, the following code queries the archive of CNBC for a specific archive date.

import newspaper

url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)

Although the archived website itself contains links to actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs such as:

https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/

which are not actual articles from this archived version of CNBC. However, newspaper works great with today's version of CNBC.

I suppose that it gets confused by the format of the URL, which contains two http:// prefixes (one for the archive and one for the original site). Does anyone have any suggestions on how to extract articles from the Wayback Machine archives?
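To make the URL format concrete, here is a minimal sketch that splits a snapshot URL into its timestamp and the original address. The function name split_wayback_url and the regular expression are illustrative assumptions, not part of newspaper; the pattern assumes the plain layout seen in the URLs above (a 14-digit timestamp followed by the original URL).

```python
import re

# Assumed snapshot layout: http(s)://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(?P<timestamp>\d{14})/(?P<original>.+)$'
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL.

    Hypothetical helper for illustration only.
    """
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        return None
    return match.group('timestamp'), match.group('original')
```

Links such as the blog.archive.org one above do not match the pattern, so a check like this could be used to tell snapshot links apart from the Wayback Machine's own pages.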

Questioner

have_beard_will_ski

Life is complex 2020-12-28 19:22:41

This was an interesting problem, which I will add to my Newspaper Usage Overview document available on GitHub.

I attempted to use newspaper.build, but I couldn't get it to work correctly, so I used newspaper's Source class directly.

from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

wayback_cnbc = Source(
    url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/',
    config=config,
    memoize_articles=False,
    language='en',
    number_threads=20,
    thread_timeout_seconds=2,
)

wayback_cnbc.build()
for article_extract in wayback_cnbc.articles:
    article_extract.download()
    article_extract.parse()

    print(article_extract.publish_date)
    print(article_extract.title)
    print(article_extract.url)
    print('')

    # this sleep timer is helping with some timeout issues
    # that were happening when querying
    sleep(randint(1, 3))

The example above outputs this:

None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/

None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/

2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html

2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html

2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
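Note that the output mixes section pages (Media, CNBC Video), which have no publish date, with real articles. As a rough sketch, a filter like the one below could skip the section pages and anything outside the archived site. The helper name looks_like_archived_article and the default domain are assumptions for illustration; treating a missing publish date as "not an article" is a heuristic and may drop genuine articles whose date newspaper could not detect.

```python
from urllib.parse import urlparse

WAYBACK_PREFIX = 'https://web.archive.org/web/'


def looks_like_archived_article(url, publish_date, domain='www.cnbc.com'):
    """Heuristic filter (hypothetical helper, not part of newspaper)."""
    # Section pages in the output above came back with publish_date None.
    if publish_date is None:
        return False
    # Skip links that are not Wayback Machine snapshot URLs.
    if not url.startswith(WAYBACK_PREFIX):
        return False
    # Everything after the timestamp segment is the original URL.
    _, _, original = url[len(WAYBACK_PREFIX):].partition('/')
    return urlparse(original).netloc == domain
```

Inside the loop, this could be called after parse() as looks_like_archived_article(article_extract.url, article_extract.publish_date) to decide whether to print or store the article.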

Hopefully, this answer helps with your use case for querying the Wayback Machine for articles. If you have any questions, please let me know.
