
Trouble scraping Tableau public dashboard

Posted on 2020-12-04 02:00:37

I'm attempting to scrape this Tableau dashboard, but values are missing from my output. Specifically, my code doesn't seem to scrape or print repeated values: a value that appears twice is only scraped/printed once.

Here is the code I am using:

import requests
from bs4 import BeautifulSoup
import json
import re

r = requests.get("https://public.tableau.com/views/COVID-19HospitalsDashboard/Hospitals?%3Aembed=y&%3AshowVizHome=no", 
    params = {
    ":embed": "y",
    ":showVizHome": "no",
    ":host_url": "https://public.tableau.com/",
    ":embed_code_version": 3,
    ":tabs": "no",
    ":toolbar": "no",
    ":animate_transition": "yes",
    ":display_static_image": "no",
    ":display_spinner": "no",
    ":display_overlay": "yes",
    ":display_count": "yes",
    ":language": "en",
    ":loadOrderID": 0
})
soup = BeautifulSoup(r.text, "html.parser")

tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)

dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'

r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
# Raw string so that \d reaches the regex engine as a digit class, not a string escape
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

Questioner: Henry St

Answer from groseries, 2020-12-14 23:26:17

I've looked through both the raw response text and the data captured by your regex, and I can't find the difference you describe. The values in the raw response and the values from your regex search both come back with 1132 distinct entries using the following:

x = data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"][0]["dataValues"]

# Collect the distinct values; a set keeps only one copy of each
unique_values = set(x)

print(len(unique_values))

From this it looks like your code is working properly.
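
If you want to check directly whether repeated values survive the scrape, one option is to compare the total number of entries with the number of distinct entries. This is only a minimal sketch and not part of the original code: `count_repeats` is a hypothetical helper, and `sample_values` is made-up data standing in for the `dataValues` list `x` above.

```python
from collections import Counter


def count_repeats(values):
    """Return (total, distinct, repeated) for a list of hashable values."""
    counts = Counter(values)
    # Keep only the values that occur more than once, with their counts
    repeated = {value: n for value, n in counts.items() if n > 1}
    return len(values), len(counts), repeated


# Made-up sample data standing in for the scraped dataValues list
sample_values = [12, 7, 12, 3, 7, 12]
total, distinct, repeated = count_repeats(sample_values)
print(total, distinct, repeated)
```

If `total` equals `distinct` for the scraped list, the list itself holds no repeats, so any value you expected to see twice is missing from that list rather than being dropped by `print`.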