---
title: Paginating through results
description: Learn how to paginate through websites that use either page number-based pagination or lazy-loading pagination.
menuWeight: 2
paths:
    - puppeteer-playwright/common-use-cases/paginating-through-results

@@ -230,7 +230,7 @@ console.log(repositories.length);

> **IMPORTANT!** Usually, within the map function's callback, you'd want to add the requests to a request queue, especially when paginating through hundreds (or even thousands) of pages. But since we know that we have 4 pages and only 3 left to go through, it is totally safe to use `Promise.all()` for this specific use case.
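
To make the pattern from the note concrete, here is a minimal sketch of mapping the remaining page numbers through `Promise.all()`. The `scrapeRepoPage()` function is a hypothetical stand-in for the real per-page scraping logic:

```javascript
// Hypothetical stand-in for a function that scrapes one results
// page and returns the repository names it finds there
const scrapeRepoPage = async (pageNumber) => [`repo-from-page-${pageNumber}`];

// Page 1 has already been handled, so only pages 2-4 remain.
// With just 3 requests, running them concurrently is safe;
// with hundreds of pages, prefer a request queue instead.
const results = await Promise.all([2, 3, 4].map((n) => scrapeRepoPage(n)));

// Flatten the per-page arrays into a single list
const repositories = results.flat();

console.log(repositories.length);
```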

### [](#final-pagination-code) Final code

After all is said and done, here's what our final code looks like:

@@ -360,7 +360,273 @@ If we remember correctly, Facebook has 115 Github repositories (at the time of w

```
115
```

## [](#lazy-loading-pagination) Lazy-loading pagination

Page number-based pagination is quite straightforward to automate, and it is still an extremely common implementation; however, [lazy-loading](https://en.wikipedia.org/wiki/Lazy_loading) is becoming increasingly popular on the modern web, which makes it an important and relevant topic to discuss.

> Note that on websites with lazy-loading pagination, [API scraping]({{@link api_scraping.md}}) is usually a viable option, and a much better one due to its reliability and performance.

Take a moment to look at and scroll through the women's clothing section [on About You's website](https://www.aboutyou.com/c/women/clothing-20204). Notice that the items are loaded as you scroll, and that there are no page numbers. Because this pagination implementation differs so drastically from the previous one, it also requires a different scraping workflow.

We're going to scrape the brand and price from the first 75 results on the **About You** page linked above. Here's our basic setup:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

// Create an array where all scraped products will be pushed
const products = [];

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

// Create an array where all scraped products will be pushed
const products = [];

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

await browser.close();
</marked-tab>
```

### [](#auto-scrolling) Auto scrolling

Now, we'll grab the height in pixels of a result item to use as a rough reference for how far to scroll each time, and create a variable to keep track of how many pixels have been scrolled.

```javascript
// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;
```

Then, within a `while` loop that ends once the length of the **products** array has reached 75, we'll run some logic that scrolls down the page and waits 1 second before running again.

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
while (products.length < 75) {
    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);
}
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
while (products.length < 75) {
    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);
}
</marked-tab>
```

This will work; however, what if we reach the bottom of the page and there are, say, only 55 total products? That would result in an infinite loop. Because of this edge case, we have to keep track of the page's constantly changing available scroll height.

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    // Data collection logic will go here

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled is equal to the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    // Data collection logic will go here

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled is equal to the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}
</marked-tab>
```

Now, the `while` loop will exit if we've reached the bottom of the page.

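Since the exit condition is plain arithmetic, it can be sanity-checked without a browser. Below is a simulation of the loop with made-up pixel values (the real ones come from `page.evaluate()`):

```javascript
// Made-up measurements standing in for the values the real
// loop reads from the browser
const scrollHeight = 5000; // document.body.scrollHeight
const innerHeight = 800;   // window.innerHeight
const itemHeight = 350;    // height of one product tile

let totalScrolled = 0;
let scrolls = 0;

// Scroll three items' worth per iteration until the page runs out
while (totalScrolled < scrollHeight - innerHeight) {
    totalScrolled += itemHeight * 3;
    scrolls += 1;
}

// The loop stops once we've scrolled past all scrollable content,
// no matter how many products have been collected so far
console.log(scrolls, totalScrolled);
```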
### [](#collecting-data) Collecting data

Within the loop, we can grab hold of the total number of items on the page. To avoid collecting and pushing duplicate items to the **products** array, we can use the `.slice()` method to cut out the items we've already scraped.

```javascript
const $ = load(await page.content());

// Grab the newly loaded items
const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

const newItems = items.map((item) => {
    const elem = $(item);

    return {
        brand: elem.find('p[data-testid="brandName"]').text().trim(),
        price: elem.find('span[data-testid="finalPrice"]').text().trim(),
    };
});

products.push(...newItems);
```

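Because the deduplication relies only on array positions, the idea can be illustrated with plain, hypothetical data outside the browser:

```javascript
// Products scraped on previous loop iterations
const products = [{ brand: 'Brand A' }, { brand: 'Brand B' }];

// Everything currently rendered on the page; the first two
// entries are the items we've already scraped
const onPage = [
    { brand: 'Brand A' },
    { brand: 'Brand B' },
    { brand: 'Brand C' },
    { brand: 'Brand D' },
];

// Skipping the first products.length entries leaves only the
// newly loaded items
const newItems = onPage.slice(products.length);
products.push(...newItems);

console.log(products.length); // 4 unique products
```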
### [](#final-lazy-loading-code) Final code

With everything completed, this is what we're left with:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const products = [];

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;

while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    const $ = load(await page.content());

    // Grab the newly loaded items
    const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

    const newItems = items.map((item) => {
        const elem = $(item);

        return {
            brand: elem.find('p[data-testid="brandName"]').text().trim(),
            price: elem.find('span[data-testid="finalPrice"]').text().trim(),
        };
    });

    products.push(...newItems);

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled is equal to the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}

console.log(products.slice(0, 75));

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const products = [];

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;

while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    const $ = load(await page.content());

    // Grab the newly loaded items
    const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

    const newItems = items.map((item) => {
        const elem = $(item);

        return {
            brand: elem.find('p[data-testid="brandName"]').text().trim(),
            price: elem.find('span[data-testid="finalPrice"]').text().trim(),
        };
    });

    products.push(...newItems);

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled is equal to the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}

console.log(products.slice(0, 75));

await browser.close();
</marked-tab>
```

## [](#quick-note) Quick note

The examples shown in this lesson are not the only ways to paginate through websites. They serve as solid examples, but don't view them as the end-all be-all of scraping paginated websites. The methods you use and the algorithms you write might differ to various degrees depending on which pages you're scraping and how your specific target website implements pagination.

## [](#next) Next up