Setting Up Headless Chrome and Puppeteer

I’d recommend installing Puppeteer with npm, as it will also download a recent, stable version of Chromium that is guaranteed to work with the library.

Run this command in your project root directory:

npm i puppeteer --save

Note: This might take a while as Puppeteer will need to download and install Chromium in the background.
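
Once the installation finishes, you can run a quick, optional sanity check to confirm that Puppeteer can actually drive the bundled Chromium. This is just a throwaway sketch; the exact version string it prints will vary with your Puppeteer release:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    // Prints something like "HeadlessChrome/<version>"
    console.log(await browser.version());
    await browser.close();
})();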

Okay, now that we are all set and configured, let the fun begin!

Using Puppeteer API for Automated Web Scraping

Let’s start our Puppeteer tutorial with a basic example. We’ll write a script that will cause our headless browser to take a screenshot of a website of our choice.

Create a new file in your project directory named screenshot.js and open it in your favorite code editor.

First, let’s import the Puppeteer library in your script:

const puppeteer = require('puppeteer');

Next up, let’s take the URL from command-line arguments:

const url = process.argv[2];
if (!url) {
    throw new Error("Please provide a URL as the first argument");
}

Now, we need to keep in mind that Puppeteer is a promise-based library: It performs asynchronous calls to the headless Chrome instance under the hood. Let’s keep the code clean by using async/await. For that, we need to define an async function first and put all the Puppeteer code in there:

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();

Altogether, the final code looks like this:

const puppeteer = require('puppeteer');
const url = process.argv[2];
if (!url) {
    throw new Error("Please provide a URL as the first argument");
}
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();

You can run it by executing the following command in the root directory of your project:

node screenshot.js https://github.com

Wait a second, and boom! Our headless browser just created a file named screenshot.png and you can see the GitHub homepage rendered in it. Great, we have a working Chrome web scraper!

Let’s stop for a minute and explore what happens in our run() function above.

First, we launch a new headless browser instance, then we open a new page (tab) and navigate to the URL provided in the command-line argument. Lastly, we use Puppeteer’s built-in method for taking a screenshot, and we only need to provide the path where it should be saved. We also need to make sure to close the headless browser after we are done with our automation.
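
As a side note, launch(), goto(), and screenshot() all accept options that let you tweak this basic flow. For instance, the sketch below (an optional variation of the run() function above, not part of the main script) sets an explicit viewport, waits for network activity to settle, and captures the full scrollable page rather than just the visible area; setViewport(), the waitUntil option, and the fullPage flag are all standard Puppeteer API features:

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Render the page at a fixed desktop-sized viewport
    await page.setViewport({width: 1280, height: 800});
    // Wait until network activity has mostly settled before continuing
    await page.goto(url, {waitUntil: 'networkidle2'});
    // Capture the entire scrollable page instead of just the visible area
    await page.screenshot({path: 'screenshot.png', fullPage: true});
    await browser.close();
}
run();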

Now that we’ve covered the basics, let’s move on to something a bit more complex.

A Second Puppeteer Scraping Example

For the next part of our Puppeteer tutorial, let’s say we want to scrape the newest articles from Hacker News.

Create a new file named ycombinator-scraper.js and paste in the following code snippet:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.storylink');
                items.forEach((item) => {
                    results.push({
                        url:  item.getAttribute('href'),
                        text: item.innerText,
                    });
                });
                return results;
            });
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

Okay, there’s a bit more going on here compared with the previous example.

The first thing you might notice is that the run() function now explicitly returns a promise, so the async keyword has moved to the promise executor function’s definition.

We’ve also wrapped all of our code in a try-catch block so that we can handle any errors that cause our promise to be rejected.

And finally, we’re using Puppeteer’s built-in method called evaluate(). This method lets us run custom JavaScript code as if we were executing it in the DevTools console. Anything returned from that function gets resolved by the promise. This method is very handy when it comes to scraping information or performing custom actions.
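
One thing to keep in mind is that the function passed to evaluate() runs inside the browser, not in your Node script, so it cannot see your Node variables directly; any values it needs must be passed in as extra, serializable arguments. Here is a small, self-contained sketch of that idea (it reuses the same story-link selector as the script above, and the counting logic is purely for illustration):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com/');
    // Values from the Node side must be passed in explicitly and be serializable
    const selector = 'a.storylink';
    const count = await page.evaluate((sel) => {
        return document.querySelectorAll(sel).length;
    }, selector);
    console.log('Found ' + count + ' story links');
    await browser.close();
})();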

The code passed to the evaluate() method is pretty basic JavaScript that builds an array of objects, each having url and text fields that represent the story URLs we see on https://news.ycombinator.com/.

The output of the script looks something like this (truncated here; the actual run returns 30 entries, one for each story on the page):

[ { url: 'https://www.nature.com/articles/d41586-018-05469-3',
    text: 'Bias detectives: the researchers striving to make algorithms fair' },
  { url: 'https://mino-games.workable.com/jobs/415887',
    text: 'Mino Games Is Hiring Programmers in Montreal' },
  { url: 'http://srobb.net/pf.html',
    text: 'A Beginner\'s Guide to Firewalling with pf' },
  // ...
  { url: 'https://tools.ietf.org/html/rfc8439',
    text: 'ChaCha20 and Poly1305 for IETF Protocols' } ]

Pretty neat, I’d say!

Okay, let’s move forward. We only had 30 items returned, while there are many more available—they are just on other pages. We need to click on the “More” button to load the next page of results.

Let’s modify our script a bit to add support for pagination:

const puppeteer = require('puppeteer');
function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.storylink');
                    items.forEach((item) => {
                        results.push({
                            url:  item.getAttribute('href'),
                            text: item.innerText,
                        });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run(5).then(console.log).catch(console.error);

Let’s review what we did here:

1. We added a single argument called pagesToScrape to our main run() function. We’ll use this to limit how many pages our script will scrape.

2. There is also a new variable named currentPage, which represents the number of the results page we are currently looking at; it’s set to 1 initially. We also wrapped our evaluate() function in a while loop, so that it keeps running as long as currentPage is less than or equal to pagesToScrape.

3. We added the block that moves to the next page of results and waits for it to load before the while loop runs again.

You’ll notice that we used the page.click() method to have the headless browser click on the “More” button. We also used the waitForSelector() method to make sure our logic is paused until the page contents are loaded.

Both of those are high-level Puppeteer API methods ready to use out-of-the-box.
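
If you would rather wait for the navigation triggered by the click instead of waiting for a specific selector, Puppeteer’s waitForNavigation() method covers that as well. As a sketch, the if block inside the while loop above could be swapped for something like this (the waitUntil value shown is just one of several supported options):

if (currentPage < pagesToScrape) {
    // Start waiting for the navigation before clicking the "More" link,
    // so the resulting page load isn't missed
    await Promise.all([
        page.waitForNavigation({waitUntil: 'domcontentloaded'}),
        page.click('a.morelink')
    ]);
}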

One of the problems you’ll probably encounter during scraping with Puppeteer is waiting for a page to load. Hacker News has a relatively simple structure and it was fairly easy to wait for its page load completion. For more complex use cases, Puppeteer offers a wide range of built-in functionality, which you can explore in the API documentation on GitHub.

This is all pretty cool, but our Puppeteer tutorial hasn’t covered optimization yet. Let’s see how we can make Puppeteer run faster.
