---
title: Paginating through results
description: Learn how to paginate through results on websites that display only a limited number of items per page, using both Playwright and Puppeteer.
menuWeight: 2
paths:
    - puppeteer-playwright/common-use-cases/paginating-through-results
---

# [](#paginating-through-results) Paginating through results

If you're trying to [collect data]({{@link puppeteer_playwright/executing_scripts/collecting_data.md}}) on a website that has hundreds, thousands, or even millions of results, it is very likely that the site paginates those results to reduce the strain on its backend, as well as on the users loading and rendering the content.

Attempting to scrape thousands to tens of thousands of results using a headless browser on a website that only shows 30 results at a time might be daunting at first, but rest assured that by the end of this lesson you'll feel confident when faced with this use case.

## [](#page-number-based-pagination) Page number-based pagination

At the time of writing, Facebook has [115 repositories on GitHub](https://github.com/orgs/facebook/repositories). By default, GitHub lists repositories in descending order based on when they were last updated (the most recently updated repos are at the top of the list).

We want to scrape the titles, links, and descriptions of all of Facebook's repositories; however, GitHub only displays 30 repos per page, which means we'll need to paginate through all of the results.

Let's first start off by defining some basic variables:

```JavaScript
// This is where we'll push all of the scraped repos
const repositories = [];

const BASE_URL = 'https://github.com';

// We'll use this URL a couple of times within our code, so we'll
// store it in a constant variable to prevent typos and so it's
// easier to tell what it is
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;
```

### [](#finding-the-last-page) Finding the last page

What we want to do now is grab the last page number, so that we know exactly how many requests we need to send in order to paginate through all of the repositories. Luckily, this information is available right on the page:

Let's grab this number now with a little bit of code:
```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
// This will output 4
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

console.log(lastPage);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
// This will output 4
const lastPage = +lastPageLabel.replace(/\D/g, '');

console.log(lastPage);

await browser.close();
</marked-tab>
```

> Learn more about the `:nth-last-child` pseudo-class [on W3Schools](https://www.w3schools.com/cssref/sel_nth-last-child.asp). It works similarly to `:nth-child`, but counts from the bottom of the parent element's children instead of from the top.

When we run this code, here's what we see:

```text
4
```

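If the `a[aria-label*="Page "]:nth-last-child(2)` selector feels cryptic, here's a minimal standalone sketch of what it matches. Note that the pagination markup below is simplified and assumed purely for illustration — GitHub's real markup is more complex:

```JavaScript
import { load } from 'cheerio';

// Simplified, assumed pagination markup: the last child is the "Next"
// link, so the highest page number sits in the second-to-last child
const $ = load(`
    <div>
        <a aria-label="Page 1">1</a>
        <a aria-label="Page 2">2</a>
        <a aria-label="Page 3">3</a>
        <a aria-label="Page 4">4</a>
        <a>Next</a>
    </div>
`);

// Only the "Page 4" link is both labeled with "Page " and the
// second-to-last child, so this logs "Page 4"
console.log($('a[aria-label*="Page "]:nth-last-child(2)').attr('aria-label'));
```
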
Since we're already on the first page, we'll go ahead and scrape the repos from it. However, because we're going to reuse this logic on the other pages as well, let's create a function that handles the data collection and reliably returns a nice array of results:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

// Push all results from the first page to the results array
repositories.push(...(await scrapeRepos(page)));

// Log the 30 repositories scraped from the first page
console.log(repositories);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
const lastPage = +lastPageLabel.replace(/\D/g, '');

// Push all results from the first page to the results array
repositories.push(...(await scrapeRepos(page)));

// Log the 30 repositories scraped from the first page
console.log(repositories);

await browser.close();
</marked-tab>
```

### [](#making-a-request-for-each-results-page) Making a request for each results page

Cool, so now we have all the tools we need to write concise logic that will run for every single page. First, we'll create an array of numbers from 0 to 4:

```JavaScript
// We have to add 1 to lastPage, because we're generating the page
// numbers from the array's indexes, which start at 0
[...Array(lastPage + 1).keys()] // -> [0, 1, 2, 3, 4]
```

Then, we'll slice off the first two values, since there is no page 0 and we've already scraped page 1. The array now starts at 2 and ends at 4:

```JavaScript
[...Array(lastPage + 1).keys()].slice(2) // -> [2, 3, 4]
```

This array now accurately represents the pages we still need to go through. We'll map over it to create an array of promises, each of which opens the corresponding page, scrapes its data, and pushes it to the **repositories** array:

```JavaScript
// Map through the range. The value from the array is the page number
// to make a request for
const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        // Prepare the URL before making the request by setting the "page"
        // parameter to whatever the pageNumber is currently
        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        // Scrape the data and push it to the "repositories" array
        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

console.log(repositories.length);
```

> **IMPORTANT!** Usually, within the map function's callback, you'd want to add the requests to a request queue, especially when paginating through hundreds (or even thousands) of pages. But since we know that we have only 4 pages and just 3 left to go through, it is totally safe to use `Promise.all()` for this specific use case.

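To make the concern about larger scales more concrete, here's a minimal sketch of one way to avoid opening hundreds of tabs at once: processing the page numbers in small batches. This is only an illustration, not the request queue mentioned above — the `BATCH_SIZE` constant and the `scrapePage()` helper are made up for this example, while `lastPage`, `browser`, `REPOSITORIES_URL`, `scrapeRepos()`, and `repositories` come from the code we've already written:

```JavaScript
const BATCH_SIZE = 5;
const pageNumbers = [...Array(lastPage + 1).keys()].slice(2);

// Hypothetical helper wrapping the same per-page logic used above
const scrapePage = async (pageNumber) => {
    const newPage = await browser.newPage();

    const url = new URL(REPOSITORIES_URL);
    url.searchParams.set('page', pageNumber);

    await newPage.goto(url.href);
    repositories.push(...(await scrapeRepos(newPage)));

    await newPage.close();
};

// Process the pages in batches so that no more than BATCH_SIZE
// tabs are ever open in the browser at the same time
for (let i = 0; i < pageNumbers.length; i += BATCH_SIZE) {
    const batch = pageNumbers.slice(i, i + BATCH_SIZE);
    await Promise.all(batch.map(scrapePage));
}
```

For our 3 remaining pages this would be overkill, which is exactly why we're sticking with a single `Promise.all()` call in this lesson.
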
### [](#final-code) Final code

After all is said and done, here's what our final code looks like:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

repositories.push(...(await scrapeRepos(page)));

await page.close();

const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

// Log the final length of the "repositories" array
console.log(repositories.length);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
const lastPage = +lastPageLabel.replace(/\D/g, '');

repositories.push(...(await scrapeRepos(page)));

await page.close();

const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

// Log the final length of the "repositories" array
console.log(repositories.length);

await browser.close();
</marked-tab>
```

If we remember correctly, Facebook has 115 GitHub repositories (at the time of writing this lesson), so the final output should be:

```text
115
```

<!-- ## [](#lazy-loading-pagination) Lazy loading pagination -->

## [](#next) Next up

We're actively working on expanding this section of the course, so stay tuned!