Note: this article is reproduced from stackoverflow.com.
express javascript node.js puppeteer

Why does data scraping with Puppeteer always give data from the first page?

Posted on 2020-04-16 12:29:46

I am trying to scrape data from a website with Puppeteer. Every request returns data from the first page, even when I pass the URL of a different page. In Google Chrome the same URL shows the correct page, but when I make the request from my API or from Postman I always get the first page's data. Below is my script:

const puppeteer = require('puppeteer');

async function main() {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 720 })
    await page.goto('https://member.daraz.pk/user/login', { waitUntil: 'networkidle0' }); // wait until page load
    await page.type('input[type="text"]', 'username', { delay: 10 });
    await page.type('input[type="password"]', 'pass', { delay: 10 });

    // click and wait for navigation

    await page.click('.next-btn-large');
    await page.waitFor(8000);
    const page1 = await browser.newPage();
    await page1.setViewport({ width: 1200, height: 720 })
    await page.waitFor(1000);
    for (let i = 1; i < 10; i++) {
        await page.goto(`https://www.daraz.pk/air-conditioners/gree/?page=${i}`, { waitUntil: 'networkidle0' });

        // always return first page data

    }

}

main();
Questioner: Md Ch
jfriend00 2020-02-06 15:17

The script I suggested in my comment was loading image src values, and the page requires those images to be visible before it will load them. So, if you didn't make the right tab visible, it probably wouldn't have loaded them. That's some sort of on-demand image loading built into the page. It's better to look at some other aspect of the page that isn't loaded that way, so I modified my script to do that.
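If the image data is still needed, one option is to activate the tab before navigating. The following is only a minimal sketch: the helper name `gotoInForeground` is hypothetical, and it assumes that the site's on-demand loading depends on the tab being in the foreground, which has not been verified here. It uses only `page.bringToFront()` and `page.goto()` from Puppeteer's `Page` API.

```javascript
// Hypothetical helper (not part of the original script): activate the tab
// before navigating, so that content the site only loads for a visible tab
// has a chance to load. `page` is a Puppeteer Page object.
async function gotoInForeground(page, url) {
    // Opening another tab with browser.newPage() can leave this one in the background.
    await page.bringToFront();
    // Same navigation options as the scripts in this post.
    return page.goto(url, { waitUntil: 'networkidle0' });
}
```

In the scraping loop this would take the place of the direct `page.goto(...)` call, for example `await gotoInForeground(page, url)` with the listing URL for page `i`.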

Here's a script that works for me. I don't know what data you want out of the page, but this gets the sku-simple value and the title for each product in the page. For brevity, I only output to the console the first 10 products in each page and I dialed it back to only traverse 3 pages. You can obviously adjust those as you want. I've also removed the username/pwd from my script since I see you don't have it public any more. You can fill that in yourself.

const puppeteer = require('puppeteer');

async function main() {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 720 })
    await page.goto('https://member.daraz.pk/user/login', { waitUntil: 'networkidle0' }); // wait until page load
    await page.type('input[type="text"]', 'xxx', { delay: 10 });
    await page.type('input[type="password"]', 'yyy', { delay: 10 });

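    // click the login button, then pause for a fixed 8 seconds so the login can complete

    await page.click('.next-btn-large');
    await page.waitFor(8000);
    // note: page1 is opened here but never used; all scraping below happens on `page`
    const page1 = await browser.newPage();
    await page1.setViewport({ width: 1200, height: 720 })
    await page.waitFor(1000);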
    // page.on('console', msg => console.log('PAGE LOG:', msg.text()));
    for (let i = 1; i <= 3; i++) {
        await page.goto(`https://www.daraz.pk/air-conditioners/gree/?page=${i}`, { waitUntil: 'networkidle0' });
        let srcs = await page.$$eval(".c2prKC", elements => { 
            return elements.map(el => {
                let skuSimple = el.getAttribute("data-sku-simple");
                let link = el.querySelector(".c16H9d a");
                let title = "<unknown>";
                if (link) {
                    title = link.getAttribute("title");
                }
                return {skuSimple, title};
            });
        });
        console.log(`Data for page ${i}:`);
        console.log(srcs.slice(0,10));
    }
    //await browser.close();    

}

main();

I see output like this in my console, so it definitely appears to be fetching the pages and retrieving data from the DOM in those pages:

Data for page 1:
[
  {
    skuSimple: 'GR678HL0KV5HWNAFAMZ-4744951',
    title: 'Gree Inverter AC - GS-18CITH12G - 1.5 ton - Inverter  Air Conditioner - Cozy Series - Heat N Cool - Grey'
  },
  {
    skuSimple: 'GR678HL09YUCKNAFAMZ-3940302',
    title: 'Gree GS-12FITH1W - Fairy Inverter Air Conditioner Series - White'
  },
  {
    skuSimple: 'GR678HL0RTUHWNAFAMZ-3940305',
    title: 'Gree GS-18FITH1W - Fairy Inverter Air Conditioner Series - White'
  },
  {
    skuSimple: 'GR678HL1E0WZSNAFAMZ-1741958',
    title: 'Gree Split Air Conditioner - GS-12LM4 - 1 Ton - White'
  },
  {
    skuSimple: '2779851_PK-1252862621',
    title: 'Gree 18CITHI 12G- DC Inverter AC - 1.5 Ton'
  },
  {
    skuSimple: 'GR678HLEOKNJNAFAMZ-668566',
    title: 'Gree Gree GS-12LM -1 Ton Air Conditioner - White'
  },
  {
    skuSimple: '114820460_PK-1266640670',
    title: 'Gree Windows AC 0.75 Ton with Remote Control 60% Electricity Saving'
  },
  {
    skuSimple: '2864384_PK-1246026961',
    title: 'Gree Inverter AC - GS-12CITH12G - 1.0ton - Inverter Air Conditioner - Cozy Series - Heat N Cool - Grey'
  },
  {
    skuSimple: '105610333_PK-1253012621',
    title: 'Gree 1.0 Ton Dc Inverter AC Heat & Cool R-410A Air Conditioner - 12cith12G - Grey'
  },
  {
    skuSimple: '105616318_PK-1253002672',
    title: 'Gree 1.5 Ton Dc Inverter AC Heat & Cool R-410A Air Conditioner - 18cith12G - Grey'
  }
]
Data for page 2:
[
  {
    skuSimple: '109636918_PK-1260070281',
    title: 'New Gree DC Inverter Ac 1(ton) 12CIT'
  },
  {
    skuSimple: '114536248_PK-1266322653',
    title: 'Gree 1.0 Ton Heat & Cool DC Inverter Air conditioner 12CITH'
  },
  {
    skuSimple: '109830097_PK-1260278793',
    title: 'AC Dawlance Inspire Plus Inverter 30 1.5 Ton Split Saving 26000 Yearly'
  },
  {
    skuSimple: '121648880_PK-1277580612',
    title: 'Gs-24Lm4L - 2 Ton Ac - White - Brand Warranty'
  },
  {
    skuSimple: '106364064_PK-1254400160',
    title: 'Gree Floor Standing GF-48FW - Floor Standing Low Voltage Startup Series - White'
  },
  {
    skuSimple: '109324039_PK-1259442545',
    title: 'Gree G10 Inverter 1.5 Ton (18000 BTU) GS-18CITH2/2G Split Air Conditioner'
  },
  {
    skuSimple: '122056481_PK-1278142392',
    title: 'AC Gree 12FITH1C 1 Ton DC Inverter Split AC 50% to 70% Energy Saving'
  },
  {
    skuSimple: '115570453_PK-1267506144',
    title: 'AC Gree GS-12CITH13M Inverter 1 Ton (Wifi) Split 60% to 70% Energy Saving'
  },
  {
    skuSimple: 'GR678HL0ZWE2CNAFAMZ-4776611',
    title: 'Gree 1.5 Ton Dc Inverter Heat & Cool R-410A Air Conditioner - 18cith11B - Black'
  },
  {
    skuSimple: '110096660_PK-1260802813',
    title: 'GREE 1.0 TON SPLIT COOL ONLY AIR CONDITIONER 12LM4'
  }
]
Data for page 3:
[
  {
    skuSimple: 'GR678HL017DY0NAFAMZ-4102700',
    title: 'Gree 1.5 Ton Dc Inverter Heat & Cool R-410A Air Conditioner - 18cith11S - Silver'
  },
  {
    skuSimple: '115554341_PK-1267490372',
    title: 'Gree GS-18CITH13M Inverter 1.5 Ton (Wifi) Split Up to 60% Energy Saving'
  },
  {
    skuSimple: '109428468_PK-1259596998',
    title: 'Gree Inverter Air conditioner 2 ton'
  },
  {
    skuSimple: '124818788_PK-1282694870',
    title: 'Gree Inverter Air Conditioner - GS-24CITH11W - Cozy Inverter Series - 02ton - White'
  },
  {
    skuSimple: '3407444_PK-1247135008',
    title: 'Gree 2 Ton Dc Inverter Heat & Cool R-410A Air Conditioner - 24cith11S - Silver'
  },
  {
    skuSimple: '109826799_PK-1260322442',
    title: 'Gree GS-18CITH13M Inverter 1.5 Ton (Wifi) Split Up to 60% Energy Saving'
  },
  {
    skuSimple: '130883483_PK-1290780443',
    title: 'Gree - Inverter Split Air Conditioner - 1.5 Ton'
  },
  {
    skuSimple: '107714050_PK-1256398549',
    title: 'Gree Inverter Air conditioner 1.5 ton'
  },
  {
    skuSimple: 'GR678HL0Q02DENAFAMZ-5098883',
    title: 'GS-18LM4 - Gree Air Conditioner - 1.5 Ton - White'
  },
  {
    skuSimple: 'GR678HL1IIQ8YNAFAMZ-5098768',
    title: 'Gree Gree - GS - 12CITH12G - 1.0 ton - Inverter Air Conditioner - Grey'
  }
]
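Rather than comparing the console output by eye, the same check can be done in code. This is a rough sketch under stated assumptions: the function name `findRepeatedSkus` is hypothetical, and it assumes each page's result array has been collected into one outer array, which the script above does not do. If the scraper were stuck on the first page, every `skuSimple` from the later pages would be reported as repeated.

```javascript
// Hypothetical check (not part of the original script): given one array of
// { skuSimple, title } objects per scraped page, return every skuSimple that
// appears on more than one page.
function findRepeatedSkus(pages) {
    const firstSeenOn = new Map();  // skuSimple -> index of the first page it appeared on
    const repeated = new Set();
    pages.forEach((products, pageIndex) => {
        for (const { skuSimple } of products) {
            if (!firstSeenOn.has(skuSimple)) {
                firstSeenOn.set(skuSimple, pageIndex);
            } else if (firstSeenOn.get(skuSimple) !== pageIndex) {
                repeated.add(skuSimple);
            }
        }
    });
    return [...repeated];
}

// Usage with a few values taken from the output above.
const firstPage = [
    { skuSimple: 'GR678HL0KV5HWNAFAMZ-4744951' },
    { skuSimple: 'GR678HL09YUCKNAFAMZ-3940302' },
];
const secondPage = [
    { skuSimple: '109636918_PK-1260070281' },
    { skuSimple: '114536248_PK-1266322653' },
];
console.log(findRepeatedSkus([firstPage, secondPage]));  // empty: the pages differ
console.log(findRepeatedSkus([firstPage, firstPage]));   // both skus: same page fetched twice
```

An empty result is only a hint that the pages differ. A listing may legitimately show the same product on two pages, so a few repeats do not necessarily mean the scraper is broken.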