Commit eed7c5d

Write Paginating lesson (first part)

1 parent 6b0a2de commit eed7c5d

3 files changed: +368 −1 lines changed
content/academy/puppeteer_playwright/common_use_cases/logging_into_a_website.md

Lines changed: 1 addition & 1 deletion

@@ -393,4 +393,4 @@ await browser.close();
 
 ## [](#next) Next up
 
-We're actively working in expanding this section of the course, so stay tuned!
+In the [next lesson]({{@link puppeteer_playwright/common_use_cases/paginating_through_results.md}}), you'll learn how to paginate through results on a website.
content/academy/puppeteer_playwright/common_use_cases/paginating_through_results.md

Lines changed: 367 additions & 0 deletions

@@ -0,0 +1,367 @@
---
title: Paginating through results
description: description
menuWeight: 2
paths:
    - puppeteer-playwright/common-use-cases/paginating-through-results
---
# [](#paginating-through-results) Paginating through results

If you're trying to [collect data]({{@link puppeteer_playwright/executing_scripts/collecting_data.md}}) on a website that has millions, thousands, or even just hundreds of results, it is very likely paginating those results to reduce the strain on its backend, as well as on the users loading and rendering the content.

![Amazon pagination](https://apify-docs.s3.amazonaws.com/master/docs/assets/tutorials/images/pagination.webp)

Attempting to scrape thousands or tens of thousands of results using a headless browser on a website that only shows 30 results at a time might seem daunting at first, but rest assured that by the end of this lesson you'll feel confident when faced with this use case.

## [](#page-number-based-pagination) Page number-based pagination

At the time of writing, Facebook has [115 repositories on GitHub](https://github.com/orgs/facebook/repositories). By default, GitHub lists repositories in descending order based on when they were last updated (the most recently updated repos are at the top of the list).

We want to scrape the titles, links, and descriptions of all of Facebook's repositories; however, GitHub only displays 30 repos per page, which means we'll have to paginate through all of the results.

Let's first start off by defining some basic variables:
```JavaScript
// This is where we'll push all of the scraped repos
const repositories = [];

const BASE_URL = 'https://github.com';

// We'll use this URL a couple of times within our code, so we'll
// store it in a constant variable to prevent typos and so it's
// easier to tell what it is
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;
```
### [](#finding-the-last-page) Finding the last page

What we want to do now is grab the last page number, so that we know exactly how many requests we need to send in order to paginate through all of the repositories. Luckily, this information is available right on the page:

![Final page number]({{@asset puppeteer_playwright/common_use_cases/images/github-last-page.webp}})

Let's grab this number now with a little bit of code:
```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

// Grab the second-to-last link in the pagination bar, which holds the last page number
const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
// At the time of writing, this will output 4
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

console.log(lastPage);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

// Grab the aria-label of the second-to-last link in the pagination bar
const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
// At the time of writing, this will output 4
const lastPage = +lastPageLabel.replace(/\D/g, '');

console.log(lastPage);

await browser.close();
</marked-tab>
```
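
Notice how the page number gets extracted from the link's `aria-label`: the `/\D/g` regex strips out every non-digit character, and the unary `+` coerces the remaining string into a number. Assuming the label reads something like **Page 4** (an assumption based on the `aria-label*="Page "` selector above), it works like this:

```JavaScript
// 'Page 4' -> '4' -> 4 (the exact label text here is an assumption
// based on the aria-label*="Page " selector used above)
const lastPage = +'Page 4'.replace(/\D/g, '');

console.log(lastPage); // -> 4
```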
> Learn more about the `:nth-last-child` pseudo-class [on W3Schools](https://www.w3schools.com/cssref/sel_nth-last-child.asp). It works similarly to `:nth-child`, but counts from the bottom of the parent element's children instead of from the top.
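
To make that selector a bit more concrete, here's a minimal sketch with simplified, hypothetical pagination markup, where the final child is a "Next" link and the second-to-last child holds the highest page number:

```JavaScript
// Simplified, hypothetical pagination markup:
// <div class="pagination">
//   <a aria-label="Page 1">1</a>
//   <a aria-label="Page 2">2</a>
//   <a aria-label="Page 3">3</a>
//   <a aria-label="Page 4">4</a>
//   <a aria-label="Next">Next</a>
// </div>

// Selects <a aria-label="Page 4">, the second child counting from the bottom
document.querySelector('.pagination a:nth-last-child(2)');
```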

When we run the code above, here's what we see:

```text
4
```

And since we're already on the first page, we'll go ahead and scrape the repos from it. However, since we're going to reuse this logic on the other pages as well, let's create a function that will handle the data collection and reliably return a nice array of results:
```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    // Load the page's HTML into Cheerio
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

// Push all results from the first page to the results array
repositories.push(...(await scrapeRepos(page)));

// Log the 30 repositories scraped from the first page
console.log(repositories);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    // Load the page's HTML into Cheerio
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
const lastPage = +lastPageLabel.replace(/\D/g, '');

// Push all results from the first page to the results array
repositories.push(...(await scrapeRepos(page)));

// Log the 30 repositories scraped from the first page
console.log(repositories);

await browser.close();
</marked-tab>
```
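
One small detail in `scrapeRepos` worth calling out: the `href` attributes on the repo title links are root-relative paths, so we resolve them against `BASE_URL` with the `URL` constructor to get absolute links. A quick sketch of what that does (the example path is just an illustration):

```JavaScript
// A root-relative path resolved against a base URL yields an absolute URL
const link = new URL('/facebook/react', 'https://github.com').href;

console.log(link); // -> 'https://github.com/facebook/react'
```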

### [](#making-a-request-for-each-results-page) Making a request for each results page

Cool, so now we have all the tools we need to write concise logic that will run for every single page. First, we'll create an array of numbers from 0 to 4:

```JavaScript
// We must add 1 to lastPage: the array's index values start at 0,
// so Array(lastPage + 1) yields the keys 0 through lastPage
[...Array(lastPage + 1).keys()] // -> [0, 1, 2, 3, 4]
```

Then, we'll slice the first two values off of that array so that it starts at 2 and ends at 4. We can safely drop them, since there is no page 0 and we've already scraped page 1:

```JavaScript
[...Array(lastPage + 1).keys()].slice(2) // -> [2, 3, 4]
```

This array now accurately represents the pages we still need to go through. We'll map over it and create an array of promises, each of which makes a request to a page, scrapes its data, then pushes it to the **repositories** array:

```JavaScript
// Map over the range. The value from the array is the page number
// to make a request for
const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        // Prepare the URL before making the request by setting the "page"
        // parameter to whatever the pageNumber currently is
        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        // Scrape the data and push it to the "repositories" array
        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

console.log(repositories.length);
```

> **IMPORTANT!** Usually, within the map function's callback, you'd want to add the requests to a request queue, especially when paginating through hundreds (or even thousands) of pages. But since we know we have only 4 pages, and just 3 left to go through, it's totally safe to use `Promise.all()` for this specific use case.
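
If you ever do need to paginate through many more pages without a full-blown request queue, one middle-ground approach is to process the page numbers in small batches, so that only a few tabs are open at once. Here's a rough sketch of that idea reusing the variables from above (the `BATCH_SIZE` value is arbitrary and just for illustration):

```JavaScript
const BATCH_SIZE = 2; // Arbitrary; tune to your machine and the target site

const pageNumbers = [...Array(lastPage + 1).keys()].slice(2);

for (let i = 0; i < pageNumbers.length; i += BATCH_SIZE) {
    // Wait for each small batch to finish before starting the next one
    await Promise.all(pageNumbers.slice(i, i + BATCH_SIZE).map(async (pageNumber) => {
        const batchPage = await browser.newPage();

        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await batchPage.goto(url.href);

        repositories.push(...(await scrapeRepos(batchPage)));

        await batchPage.close();
    }));
}
```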

### [](#final-code) Final code

After all is said and done, here's what our final code looks like:
```marked-tabs
<marked-tab header="Playwright" lang="javascript">
import { chromium } from 'playwright';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

// Grab the last page number from the pagination bar
const lastPageElement = page.locator('a[aria-label*="Page "]:nth-last-child(2)');
const lastPage = +(await lastPageElement.getAttribute('aria-label')).replace(/\D/g, '');

// Scrape the first page, which is already open
repositories.push(...(await scrapeRepos(page)));

await page.close();

// Create a scraping promise for each of the remaining pages
const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

// Log the final length of the "repositories" array
console.log(repositories.length);

await browser.close();
</marked-tab>
<marked-tab header="Puppeteer" lang="javascript">
import puppeteer from 'puppeteer';
import { load } from 'cheerio';

const repositories = [];

const BASE_URL = 'https://github.com';
const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`;

// Create a function which grabs all repos from a page
const scrapeRepos = async (page) => {
    const $ = load(await page.content());

    return [...$('li.Box-row')].map((item) => {
        const elem = $(item);
        const titleElement = elem.find('a[itemprop*="name"]');

        return {
            title: titleElement.text().trim(),
            description: elem.find('p[itemprop="description"]').text().trim(),
            link: new URL(titleElement.attr('href'), BASE_URL).href,
        };
    });
};

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto(REPOSITORIES_URL);

// Grab the last page number from the pagination bar
const lastPageLabel = await page.$eval('a[aria-label*="Page "]:nth-last-child(2)', (elem) => elem.getAttribute('aria-label'));
const lastPage = +lastPageLabel.replace(/\D/g, '');

// Scrape the first page, which is already open
repositories.push(...(await scrapeRepos(page)));

await page.close();

// Create a scraping promise for each of the remaining pages
const promises = [...Array(lastPage + 1).keys()].slice(2).map((pageNumber) =>
    (async () => {
        const page2 = await browser.newPage();

        const url = new URL(REPOSITORIES_URL);
        url.searchParams.set('page', pageNumber);

        await page2.goto(url.href);

        repositories.push(...(await scrapeRepos(page2)));

        await page2.close();
    })()
);

await Promise.all(promises);

// Log the final length of the "repositories" array
console.log(repositories.length);

await browser.close();
</marked-tab>
```

If we remember correctly, Facebook has 115 GitHub repositories (at the time of writing this lesson), so the final output should be:

```text
115
```

<!-- ## [](#lazy-loading-pagination) Lazy loading pagination -->

## [](#next) Next up

We're actively working on expanding this section of the course, so stay tuned!
