
Commit 129d626

committed
Finish pagination lesson
1 parent eed7c5d commit 129d626

2 files changed: +270 −4 lines changed

content/academy/puppeteer_playwright/common_use_cases/paginating_through_results.md

Lines changed: 270 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
-title: Logging into a website
-description: description
+title: Paginating through results
+description: Learn how to paginate through websites that use either page number-based pagination or lazy-loading pagination.
 menuWeight: 2
 paths:
     - puppeteer-playwright/common-use-cases/paginating-through-results
@@ -230,7 +230,7 @@ console.log(repositories.length);
 
 > **IMPORTANT!** Usually, within the map function's callback you'd want to add the requests to a request queue, especially when paginating through hundreds (or even thousands) of pages. But since we know that we have 4 pages and only 3 left to go through, it is totally safe to use `Promise.all()` for this specific use case.
 
-### [](#final-code) Final code
+### [](#final-pagination-code) Final code
 
 After all is said and done, here's what our final code looks like:
 
@@ -360,7 +360,273 @@ If we remember correctly, Facebook has 115 Github repositories (at the time of w
 115
 ```
 
-<!-- ## [](#lazy-loading-pagination) Lazy loading pagination -->
+## [](#lazy-loading-pagination) Lazy-loading pagination

Though page number-based pagination is quite straightforward to automate and is still an extremely common implementation, [lazy-loading](https://en.wikipedia.org/wiki/Lazy_loading) is becoming increasingly popular on the modern web, which makes it an important and relevant topic to discuss.

> Note that on websites with lazy-loading pagination, [API scraping]({{@link api_scraping.md}}) is usually a viable option, and often a much better one due to its reliability and performance.
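To see why, here's a minimal sketch of what offset-based API pagination can look like. Everything in it is hypothetical — the `fetchPage` function, the `offset`/`limit` parameters, and the 25-item page size are illustrative assumptions, not About You's actual API:

```javascript
// Collect items from an offset-paginated API until a page comes back short.
// `fetchPage(offset, limit)` stands in for a real network call, e.g. fetch()
// against a JSON endpoint that accepts offset/limit query parameters.
const paginate = async (fetchPage, limit = 25) => {
    const results = [];

    for (;;) {
        const page = await fetchPage(results.length, limit);
        results.push(...page);

        // A short (or empty) page means there are no more results
        if (page.length < limit) return results;
    }
};

// Fake in-memory "API" of 60 items for demonstration
const allItems = Array.from({ length: 60 }, (_, i) => `item-${i}`);
const fakeFetchPage = async (offset, limit) => allItems.slice(offset, offset + limit);

paginate(fakeFetchPage).then((all) => console.log(all.length)); // → 60
```

Because no browser rendering or scrolling is involved, an approach like this is usually much faster and less brittle than the simulated scrolling we're about to write.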

Take a moment to look at and scroll through the women's clothing section [on About You's website](https://www.aboutyou.com/c/women/clothing-20204). Notice that the items are loaded as you scroll, and that there are no page numbers. Because this pagination implementation differs so drastically from the previous one, it also requires a different scraping workflow.

We're going to scrape the brand and price from the first 75 results on the **About You** page linked above. Here's our basic setup:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

// Create an array that all scraped products will be pushed to
const products = [];

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

// Create an array that all scraped products will be pushed to
const products = [];

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

await browser.close();
</marked-tab>
```

### [](#auto-scrolling) Auto scrolling

Now, we'll grab the height in pixels of a result item to serve as a rough reference for how much we should scroll each time, and create a variable to keep track of how many pixels have been scrolled.

```JavaScript
// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;
```

Then, within a `while` loop that ends once the length of the **products** array reaches 75, we'll run some logic that scrolls down the page and waits 1 second before running again.

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
while (products.length < 75) {
    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);
}
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
while (products.length < 75) {
    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);
}
</marked-tab>
```

This will work; however, what if we reach the bottom of the page and there are, say, only 55 total products on the page? That would result in an infinite loop. Because of this edge case, we have to keep track of the constantly changing available scroll height of the page.

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    // Data collection logic will go here

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled has reached the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    // Data collection logic will go here

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled has reached the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}
</marked-tab>
```

Now, the `while` loop will exit if we've reached the bottom of the page.
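The stop condition is easy to get subtly wrong, so it can help to pull it out into a small, independently testable helper. `reachedBottom` is a hypothetical name introduced here for illustration, not part of either library:

```javascript
// True once the viewport's bottom edge has hit the end of the page's
// scrollable content (scrollHeight includes the visible viewport, so
// the maximum scrollable distance is scrollHeight - innerHeight).
const reachedBottom = (totalScrolled, scrollHeight, innerHeight) => {
    return totalScrolled >= scrollHeight - innerHeight;
};

// On a 3000px-tall page with a 1000px viewport, 2000px is the furthest we can scroll
console.log(reachedBottom(2000, 3000, 1000)); // → true
console.log(reachedBottom(1500, 3000, 1000)); // → false
```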

### [](#collecting-data) Collecting data

Within the loop, we can grab hold of all of the result items currently on the page. To avoid collecting and pushing duplicate items to the **products** array, we use the `.slice()` method to cut out the items we've already scraped.

```JavaScript
const $ = load(await page.content());

// Grab the newly loaded items
const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

const newItems = items.map((item) => {
    const elem = $(item);

    return {
        brand: elem.find('p[data-testid="brandName"]').text().trim(),
        price: elem.find('span[data-testid="finalPrice"]').text().trim(),
    };
});

products.push(...newItems);
```
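This offset trick works because the page only ever appends new result items to the list, so slicing at the number of items already collected leaves exactly the new ones. In isolation:

```javascript
// Items collected during previous loop iterations
const scrapedSoFar = ['brand-a', 'brand-b'];

// Everything currently on the page, including the two items already scraped
const onPage = ['brand-a', 'brand-b', 'brand-c', 'brand-d'];

// Cut out the already-seen items, keeping only the newly loaded ones
const newOnes = onPage.slice(scrapedSoFar.length);

console.log(newOnes); // → [ 'brand-c', 'brand-d' ]
```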

### [](#final-lazy-loading-code) Final code

With everything completed, this is what we're left with:

```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const products = [];

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;

while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel(0, itemHeight * 3);
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    const $ = load(await page.content());

    // Grab the newly loaded items
    const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

    const newItems = items.map((item) => {
        const elem = $(item);

        return {
            brand: elem.find('p[data-testid="brandName"]').text().trim(),
            price: elem.find('span[data-testid="finalPrice"]').text().trim(),
        };
    });

    products.push(...newItems);

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled has reached the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}

console.log(products.slice(0, 75));

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const products = [];

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.aboutyou.com/c/women/clothing-20204');

// Grab the height of a result item in pixels, which will be used to scroll down
const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight);

// Keep track of how many pixels have been scrolled down
let totalScrolled = 0;

while (products.length < 75) {
    const scrollHeight = await page.evaluate(() => document.body.scrollHeight);

    await page.mouse.wheel({ deltaY: itemHeight * 3 });
    totalScrolled += itemHeight * 3;
    // Allow the products 1 second to load
    await page.waitForTimeout(1000);

    const $ = load(await page.content());

    // Grab the newly loaded items
    const items = [...$('a[data-testid*="productTile"]')].slice(products.length);

    const newItems = items.map((item) => {
        const elem = $(item);

        return {
            brand: elem.find('p[data-testid="brandName"]').text().trim(),
            price: elem.find('span[data-testid="finalPrice"]').text().trim(),
        };
    });

    products.push(...newItems);

    const innerHeight = await page.evaluate(() => window.innerHeight);

    // If the total pixels scrolled has reached the true available scroll
    // height of the page, we've reached the end and should stop scraping,
    // even if we haven't reached our goal of 75 products
    if (totalScrolled >= scrollHeight - innerHeight) {
        break;
    }
}

console.log(products.slice(0, 75));

await browser.close();
</marked-tab>
```

## [](#quick-note) Quick note

The examples shown in this lesson are not the only ways to paginate through websites. They serve as solid examples, but don't view them as the be-all and end-all of scraping paginated websites. The methods you use and the algorithms you write may differ to various degrees based on which pages you're scraping and how your specific target website implements pagination.

## [](#next) Next up
