
url - Python scraping pages after hitting the "load more news" button

Posted on 2020-11-30 14:09:30

I can scrape the first page of a financial news website with the following code.

import pandas as pd
import requests

df = pd.DataFrame()
url = 'https://std.stheadline.com/realtime/finance/%E5%8D%B3%E6%99%82-%E8%B2%A1%E7%B6%93'
result = requests.get(url)
result.raise_for_status()  # raise an exception on an HTTP error status
result.encoding = "utf-8"

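To download the subsequent pages, I need to click the "load more news" button. I inspected the site with Chrome > Inspect > Network and found that, after clicking the button, the Request URL is "https://std.stheadline.com/realtime/get_more_news" and the Form Data is "cat=finance&page=3". I put the two together with a "?" in between, but the resulting URL does not work. Is something missing?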

url="https://std.stheadline.com/realtime/get_more_news?cat=finance&page=3"
Questioner: Arthur Law
baduker 2020-11-30 22:23:53

The button actually sends a POST request, so all you need to do is find the API endpoint and then make the right request to it.
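
To illustrate why joining the URL and the form data with a "?" behaves differently, here is a minimal sketch that only prepares two requests with the requests library and never sends them. The names ENDPOINT, form, as_get and as_post are made up for this illustration, and how the server treats each variant is not something this sketch can show.

```python
import requests

# Endpoint as it appears in this thread; nothing is sent over the network here.
ENDPOINT = "https://std.stheadline.com/realtime/get_more_news/"
form = {"cat": "finance", "page": 3}

# params= appends the fields to the URL as a query string (what the "?" URL does).
as_get = requests.Request("GET", ENDPOINT, params=form).prepare()

# data= form-encodes the fields into the request body (what the browser's Form Data is).
as_post = requests.Request("POST", ENDPOINT, data=form).prepare()

for prepared in (as_get, as_post):
    print(prepared.method, prepared.url, prepared.body)

# The POST variant also carries a form content type header.
print(as_post.headers["Content-Type"])
```

In the GET variant the fields are part of the URL and the body is empty; in the POST variant the URL is unchanged and the fields travel in the body, which matches the Form Data shown in the Network tab.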

Here's how:

import requests

headers = {
    "Referer": "https://std.stheadline.com/realtime/finance/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "X-Requested-With": "XMLHttpRequest",
}
payload = {
    "cat": "finance",
    "page": 4,
}
print(requests.post("https://std.stheadline.com/realtime/get_more_news/", data=payload, headers=headers).json())

This will "load" the next page of news for you.
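
If several pages are needed, the same POST request could be repeated in a loop. The following is only a sketch under assumptions: fetch_pages, its post and delay parameters, and the 10 second timeout are hypothetical choices made for this example, and the structure of the JSON the endpoint returns is not described in this thread, so it is stored as-is.

```python
import time

import requests

ENDPOINT = "https://std.stheadline.com/realtime/get_more_news/"
HEADERS = {
    "Referer": "https://std.stheadline.com/realtime/finance/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "X-Requested-With": "XMLHttpRequest",
}


def fetch_pages(first_page, last_page, post=requests.post, delay=1.0):
    """Return {page number: decoded JSON} for first_page..last_page inclusive.

    `post` is injectable (hypothetical parameter) so the loop can be
    exercised without touching the network.
    """
    results = {}
    for page in range(first_page, last_page + 1):
        payload = {"cat": "finance", "page": page}
        response = post(ENDPOINT, data=payload, headers=HEADERS, timeout=10)
        response.raise_for_status()  # stop on an HTTP error status
        results[page] = response.json()
        time.sleep(delay)  # pause between requests to go easy on the server
    return results
```

Calling fetch_pages(2, 4) would request pages 2 through 4 one after another; which page number corresponds to the first "load more news" click is an assumption to check against the Network tab.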