How to get all html data after all scripts and page loading is done? (puppeteer)

Question

I finally figured out how to use Node.js and installed all the libraries/extensions. Puppeteer is working, but, as before with XMLHttpRequest, it gets only the template/body of the page, without the information I need. The scripts on the page run a few seconds after it has been opened in a browser (a web app?). I need to get the information inside certain tags after the whole page has loaded. I would also like to know whether this is possible in pure JavaScript, because I do not use jQuery-like code, so it doubles the difficulty for me...

Here is what I have so far.

const puppeteer = require('puppeteer');
const $ = require('cheerio');
let browser;
let page;

const url = "really long link with latitude and attitude";

(async () => puppeteer
  .launch()
  .then(await function(browser) {
    return browser.newPage();
})
  .then(await function(page) {
    return page.goto(url).then(function() {
      return page.content();
    });
  })
  .then(await function(html) {
    $('strong', html).each(function() {
      console.log($(this).text());
    });
  })
  .catch(function(err) {
    //handle error
  }))();

I get only the default template body elements inside the strong tags, but the page should contain a lot more data than just 10 items.

It's a bit odd to use async/await and then(). Usually it would be const browser = await puppeteer.launch(); const page = await browser.newPage();... etc. — Heretic Monkey, Commented Feb 6, 2019 at 22:20

codetinker · Accepted Answer · 2020-03-10 12:47:02Z

43

If you want the full HTML, the same as what you see when you inspect the page, here it is:

    const puppeteer = require('puppeteer');

    (async function main() {
      try {
        const browser = await puppeteer.launch();
        const [page] = await browser.pages();

        await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
        const data = await page.evaluate(() => document.querySelector('*').outerHTML);

        console.log(data);

        await browser.close();
      } catch (err) {
        console.error(err);
      }
    })();


how is this different than await page.content()? – chovy, Commented Dec 23, 2020 at 9:37
@chovy no different from document.documentElement.outerHTML github.com/puppeteer/puppeteer/blob/… – 井上智文, Commented Jan 8, 2023 at 10:45
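
To make the comparison in the comments above concrete, here is a minimal sketch that checks both approaches on an already loaded page. The helper name compareSerializations is hypothetical, and the idea that page.content() is usually the doctype followed by document.documentElement.outerHTML is an assumption based on the linked Puppeteer source, not a documented guarantee.

```javascript
// Hypothetical helper: compare page.content() with documentElement.outerHTML.
// `page` is assumed to be a Puppeteer Page that has already finished loading.
async function compareSerializations(page) {
  const viaContent = await page.content();
  const viaEvaluate = await page.evaluate(
    () => document.documentElement.outerHTML
  );

  // Assumption: content() is usually "<doctype>" + outerHTML of the root element.
  const sameMarkup = viaContent.endsWith(viaEvaluate);
  const prefix = sameMarkup
    ? viaContent.slice(0, viaContent.length - viaEvaluate.length)
    : null;

  return { sameMarkup, prefix };
}
```

If the page is still changing between the two calls, sameMarkup can be false even though both methods read the same DOM, so this is best run after the page has settled.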


Alex G · Accepted Answer · 2022-10-21 11:51:07Z

25

Just one line:

const html = await page.content();

Details:

import puppeteer from 'puppeteer'

const test = async (url) => {
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()

    await page.goto(url, { waitUntil: 'networkidle0' })

    const html = await page.content()
    console.log(html)

    // Close the browser so the Node.js process can exit
    await browser.close()
}

await test('https://stackoverflow.com/')


Makki Anjum · Accepted Answer · 2020-12-04 16:17:42Z

11

let bodyHTML = await page.evaluate(() => document.documentElement.outerHTML);

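This returns the serialized HTML of the whole document (the outerHTML of the root html element, despite the variable name) as it is at the moment of the call.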

edited Dec 4, 2020 at 16:17

answered Dec 3, 2020 at 14:54


While this may answer the question, please add a few words to describe your solution. – zhisme, Commented Dec 3, 2020 at 15:30

vsemozhebuty · Accepted Answer · 2019-02-08 22:57:23Z

Some notes:

You do not need cheerio with puppeteer, and you do not need to reparse page.content(): you already have the full DOM with all scripts run, and you can evaluate any code in the window context, as in a browser, using page.evaluate(), transferring serializable data between the web API context and the Node.js API context.
Try to use async/await only; this will simplify your code and flow.
If you need to wait until all the scripts and other dependencies are loaded, use waitUntil: 'networkidle0' in page.goto().
If you suspect that the document's scripts need some time to reach the needed state, use test functions such as page.waitForSelector() or page.waitForFunction(), or fall back to page.waitFor(milliseconds) (deprecated in later Puppeteer versions).
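
As a rough illustration of these notes, the sketch below waits for late-rendered elements and then reads them with plain DOM APIs, with no cheerio or jQuery-like code. The function name collectText and the selector passed to it are illustrative assumptions; page is expected to be a Puppeteer Page.

```javascript
// Hypothetical helper: load a URL, wait for dynamic content, return element texts.
// `page` is assumed to be a Puppeteer Page; `selector` is any CSS selector.
async function collectText(page, url, selector) {
  // Resolve only once there have been no network connections for 500 ms.
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Additionally wait until at least one matching element is in the DOM.
  await page.waitForSelector(selector);

  // $$eval runs the callback inside the page with all matching elements;
  // only the serializable result (an array of strings) comes back to Node.js.
  return page.$$eval(selector, elems =>
    elems.map(elem => elem.textContent.trim())
  );
}
```

With a page obtained as in the script below, it could be called as `const texts = await collectText(page, url, 'strong');`.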

Here is a simple script that outputs all tag names in a page.

'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://example.org/', { waitUntil: 'networkidle0' });

    const data = await page.evaluate(
      () =>  Array.from(document.querySelectorAll('*'))
                  .map(elem => elem.tagName)
    );

    console.log(data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();

If you describe your task in more detail, we can try to write something more appropriate.

Script for www.bezrealitky.cz (task from a comment below):

'use strict';

const fs = require('fs');
const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    page.setDefaultTimeout(0);

    await page.goto('https://www.bezrealitky.cz/vyhledat?offerType=pronajem&estateType=byt&disposition=&ownership=&construction=&equipped=&balcony=&order=timeOrder_desc&boundary=%5B%5B%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.154133576294%2C%22lng%22%3A14.599004629591036%7D%2C%7B%22lat%22%3A50.14524430128%2C%22lng%22%3A14.58773054712799%7D%2C%7B%22lat%22%3A50.129307131988%2C%22lng%22%3A14.60087568578706%7D%2C%7B%22lat%22%3A50.122604734575%2C%22lng%22%3A14.659116306376973%7D%2C%7B%22lat%22%3A50.106512499343%2C%22lng%22%3A14.657434650206028%7D%2C%7B%22lat%22%3A50.090685542974%2C%22lng%22%3A14.705099547441932%7D%2C%7B%22lat%22%3A50.072175921973%2C%22lng%22%3A14.700004206235008%7D%2C%7B%22lat%22%3A50.056898491904%2C%22lng%22%3A14.640206899053055%7D%2C%7B%22lat%22%3A50.038528576841%2C%22lng%22%3A14.666852728301023%7D%2C%7B%22lat%22%3A50.030955909657%2C%22lng%22%3A14.656128752460972%7D%2C%7B%22lat%22%3A50.013435368522%2C%22lng%22%3A14.66854956530301%7D%2C%7B%22lat%22%3A49.99444182116%2C%22lng%22%3A14.640153080292066%7D%2C%7B%22lat%22%3A50.010839032542%2C%22lng%22%3A14.527474219359988%7D%2C%7B%22lat%22%3A49.970771602447%2C%22lng%22%3A14.46224174052395%7D%2C%7B%22lat%22%3A49.970669964027%2C%22lng%22%3A14.400648545303966%7D%2C%7B%22lat%22%3A49.941901176098%2C%22lng%22%3A14.395563234671044%7D%2C%7B%22lat%22%3A49.948384148423%2C%22lng%22%3A14.337635637038034%7D%2C%7B%22lat%22%3A49.958376114735%2C%22lng%22%3A14.324977842107955%7D%2C%7B%22lat%22%3A49.9676286223%2C%22lng%22%3A14.34491711110104%7D%2C%7B%22lat%22%3A49.971859099005%2C%22lng%22%3A14.326815050839059%7D%2C%7B%22lat%22%3A49.990608728081%2C%22lng%22%3A14.342731259186962%7D%2C%7B%22lat%22%3A50.002211140429%2C%22lng%22%3A14.29483886971002%7D%2C%7B%22lat%22%3A50.023596577558%2C%22lng%22%3A14.315872285282012%7D%2C%7B%22lat%22%3A50.058309376419%2C%22lng%22%3A14.248086830069042%7D%2C%7B%22lat%22%3A50.073179111%2C%22lng%22%3A14.290193274400963%7D%2C%7B%22lat%22%3A50.102973823639%2C%22lng%22%3A14.224439442359994%7D%2C%7B%22lat%22%3A50.130060800171%2C%22lng%22%3A14.302396419107936%7D%2C%7B%22lat%22%3A50.116019827009%2C%22lng%22%3A14.360785349547996%7D%2C%7B%22lat%22%3A50.148005694843%2C%22lng%22%3A14.365662825877052%7D%2C%7B%22lat%22%3A50.14142969454%2C%22lng%22%3A14.394903042943952%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%2C%7B%22lat%22%3A50.171436864513%2C%22lng%22%3A14.506905276796942%7D%5D%5D&hasDrawnBoundary=1&mapBounds=%5B%5B%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.087801111111958%7D%2C%7B%22lat%22%3A50.039169221047985%2C%22lng%22%3A14.68724263943227%7D%2C%7B%22lat%22%3A50.289447077141126%2C%22lng%22%3A14.68724263943227%7D%5D%5D&center=%7B%22lat%22%3A50.16447196305031%2C%22lng%22%3A14.387521875272125%7D&zoom=11&locationInput=praha&limit=15');

    await page.waitForSelector('#search-content button.btn-icon');

    while (await page.$('#search-content button.btn-icon') !== null) {
      const articlesForNow = (await page.$$('#search-content article')).length;
      console.log(`Articles for now: ${articlesForNow}. Getting more...`);

      await Promise.all([
        page.evaluate(
          () => { document.querySelector('#search-content button.btn-icon').click(); }
        ),
        page.waitForFunction(
          old => document.querySelectorAll('#search-content article').length > old,
          {},
          articlesForNow
        ),
      ]);
    }

    const articlesAll = (await page.$$('#search-content article')).length;
    console.log(`All articles: ${articlesAll}.`);

    fs.writeFileSync('full.html', await page.content());
    fs.writeFileSync('articles.html', await page.evaluate(
      () => document.querySelector('#search-content div.b-filter__inner').outerHTML
    ));
    fs.writeFileSync('articles.txt', await page.evaluate(
      () => [...document.querySelectorAll('#search-content article')]
              .map(({ innerText }) => innerText)
              .join(`\n${'-'.repeat(50)}\n`)
    ));
    console.log('Saved.');

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();

Thanks, that works, but I have another question. There is a button on the page, and I need to press it to get more items. How do I do that? Also, if possible, I want to get the HTML with all the data and parse it with querySelector myself; that would be much easier for me. — user10452005, Commented Feb 7, 2019 at 17:58
This depends on the effect of the button click: does it start a navigation, send a fetch or XHR request, or just do some dynamic DOM manipulation? As for the second question, I am not sure I understand the issue. Maybe you can provide the URL and describe what you need to achieve? — vsemozhebuty, Commented Feb 7, 2019 at 18:12
tinyurl.com/y9vgf2h7 There is a button below all the apartment offers to load more. I want to get the HTML of this page with all the apartment offers, to parse it later with querySelector. — user10452005, Commented Feb 8, 2019 at 10:17
Do you mean the "Zobrazit dalších 15 nabídek" ("Show 15 more offers") button? Do you want to click on it until all the offers are shown? I've clicked on it several times and the list still grows. Is this list growth finite? — vsemozhebuty, Commented Feb 8, 2019 at 11:30
Yes, this button. I think it has an end :). At least I remember it did. — user10452005, Commented Feb 8, 2019 at 15:38
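
The three click effects mentioned in the comments above each call for a different wait. The sketch below is one possible way to organize that; the helper name clickAndWait, the effect labels, and the detail argument are all hypothetical, while the page methods used are regular Puppeteer Page methods.

```javascript
// Hypothetical helper: click an element, then wait for the effect of the click.
// `effect` is 'navigation', 'response' or 'dom' (illustrative labels only).
async function clickAndWait(page, selector, effect, detail) {
  switch (effect) {
    case 'navigation':
      // The click loads a new document: start waiting before clicking,
      // so the navigation cannot finish before the wait is registered.
      await Promise.all([page.waitForNavigation(), page.click(selector)]);
      break;
    case 'response':
      // The click sends a fetch/XHR request: wait for a response whose URL
      // contains `detail` (an assumed URL fragment).
      await Promise.all([
        page.waitForResponse(res => res.url().includes(detail)),
        page.click(selector),
      ]);
      break;
    case 'dom':
      // The click only changes the DOM: wait for `detail` (an assumed CSS
      // selector for the newly added content) to appear.
      await page.click(selector);
      await page.waitForSelector(detail);
      break;
    default:
      throw new Error(`Unknown effect: ${effect}`);
  }
}
```

For a "load more" button that appends items matching a selector that already exists, waiting for a selector is not enough; counting elements with page.waitForFunction(), as the script above does, is the safer choice.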

Magnus · Accepted Answer · 2023-06-08 17:20:03Z

The answers above are essentially correct, i.e. the main ingredient is:

await page.goto('https://example.org/', { waitUntil: 'networkidle0' });

However, in practice, some sites will try to make themselves scrape-unfriendly by also checking the User-Agent header. So if you want the DOM to look like it would in a real browser, you might also need:

await page.setExtraHTTPHeaders({
  "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
});
await page.goto(url, { waitUntil: "networkidle0" });
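
Puppeteer also has a dedicated page.setUserAgent() method, which may be a better fit when the site inspects the user agent from page scripts as well as from the request header. The sketch below is only an illustration: the helper name gotoWithUserAgent and the DESKTOP_UA constant are hypothetical, and the user agent string is simply the one used above.

```javascript
// Hypothetical constant: the user agent string from the snippet above.
const DESKTOP_UA =
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36';

// Hypothetical helper: open a URL with a browser-like user agent and
// return the rendered HTML. `page` is assumed to be a Puppeteer Page.
async function gotoWithUserAgent(page, url, userAgent = DESKTOP_UA) {
  // Set the user agent before navigating so the first request already uses it.
  await page.setUserAgent(userAgent);
  await page.goto(url, { waitUntil: 'networkidle0' });
  return page.content();
}
```

Sites that detect automation by other means will not be affected by the user agent alone.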
