
Commit 3c29160

Merge pull request #383 from apify/general-api-scraping
API Pagination + API Filters lessons
2 parents 2adc11a + 031839d commit 3c29160


3 files changed: +215 -10 lines changed

content/academy/api_scraping/general_api_scraping/cookies_headers_tokens.md

Lines changed: 1 addition & 9 deletions

@@ -127,12 +127,4 @@ scrapeClientId();

## [](#next) Next up

-This is the last lesson in the API scraping tutorial for now, but be on the lookout for more lessons soon to come! So far, you've learned how to:
-
-1. Locate API endpoints
-2. Understand located API endpoints and their parameters
-3. Parse and modify cookies
-4. Modify/set headers
-5. Farm API tokens using Puppeteer
-
-If you'd still like to read more API scraping, check out the [**GraphQL scraping**]({{@link api_scraping/graphql_scraping.md}}) course! GraphQL is the king of API scraping.
+Keep the code above in mind, because we'll be using it in the [next lesson]({{@link api_scraping/general_api_scraping/handling_pagination.md}}) when paginating through results from SoundCloud's API.

content/academy/api_scraping/general_api_scraping/handling_pagination.md

Lines changed: 213 additions & 0 deletions

@@ -0,0 +1,213 @@
---
title: Handling pagination
description: Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination.
menuWeight: 3
paths:
    - api-scraping/general-api-scraping/handling-pagination
---

# [](#handling-pagination) Handling pagination

When scraping large APIs, you'll quickly realize that most APIs limit the number of results they respond with. For some APIs, the maximum number of results is 5, while for others it's 2000. Either way, they all have something in common - pagination.

If you've never dealt with it before, scraping thousands or even hundreds of thousands of items from a paginated API can be a bit challenging. In this lesson, we'll discuss a few of the different types of pagination, as well as how to work with them.

## [](#page-number) Page-number pagination

The most common and most rudimentary form of pagination simply uses page numbers, which can be compared to paginating through a typical e-commerce website.

![Amazon pagination](https://apify-docs.s3.amazonaws.com/master/docs/assets/tutorials/images/pagination.webp)

This implementation makes it fairly straightforward to paginate through an API programmatically, as it pretty much just entails incrementing or decrementing the page number to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead.
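
To illustrate, here's a minimal sketch of paginating by page number with **got-scraping**. The endpoint, parameter names, and response shape are made up for the example - they aren't from any real API:

```JavaScript
const { gotScraping } = require('got-scraping');

const scrapeAllPages = async () => {
    const items = [];

    // keep incrementing the "page" parameter until the API returns an empty page
    for (let page = 1; ; page++) {
        const res = await gotScraping(`https://api.example.com/items?page=${page}&limit=20`);
        const { results } = JSON.parse(res.body);

        if (!results.length) break;
        items.push(...results);
    }

    return items;
};
```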

## [](#offset-pagination) Offset pagination

The second most popular pagination technique is based on using a **limit** parameter along with an **offset** parameter. The **limit** says how many records should be returned in a single request, while the **offset** says how many records should be skipped.

For example, let's say that we have this dataset and an API route to retrieve its items:

```JavaScript
const myAwesomeDataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
```

If we were to make a request with the **limit** set to **5** and the **offset** also set to **5**, the API would skip over the first five items and return `[6, 7, 8, 9, 10]`.
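
As a quick illustration, here's roughly how an API route might apply those two parameters to the dataset above (a simplified sketch, not SoundCloud's actual implementation):

```JavaScript
// return "limit" items, skipping the first "offset" items
const getItems = (limit, offset) => myAwesomeDataset.slice(offset, offset + limit);

console.log(getItems(5, 5)); // -> [ 6, 7, 8, 9, 10 ]
```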

## [](#cursor-pagination) Cursor pagination

Becoming more and more common is cursor-based pagination. As with offset-based pagination, a **limit** parameter is usually present; however, a **cursor** parameter is used instead of an **offset**. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned by the API will be records that come after the item matching the provided **cursor**.

One of the most painful things about scraping APIs with cursor pagination is that you can't skip ahead to, for example, the 5th page. You have to paginate through each page one by one.

> Note: SoundCloud [migrated](https://developers.soundcloud.com/blog/pagination-updates-on-our-api) over to using cursor-based pagination; however, they did not change the parameter name from **offset** to **cursor**. Always be on the lookout for this type of thing!
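
The loop below sketches what following a cursor generally looks like. The endpoint, the **cursor** parameter, and the idea that each record carries its own cursor value are all assumptions made up for the example:

```JavaScript
const { gotScraping } = require('got-scraping');

const scrapeWithCursor = async () => {
    const items = [];
    let cursor = null;

    while (true) {
        // only send the "cursor" parameter after the first request
        const url = new URL('https://api.example.com/items?limit=20');
        if (cursor) url.searchParams.set('cursor', cursor);

        const { results } = JSON.parse((await gotScraping(url)).body);
        if (!results.length) break;

        items.push(...results);

        // the cursor for the next page is the marker of the last record we received
        cursor = results[results.length - 1].cursor;
    }

    return items;
};
```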

## [](#using-next-page) Using "next page"

In a minute, we're going to create a mini-project which will scrape the first 100 of Tiësto's tracks by keeping a **limit** of 20 and paginating through until we've scraped 100 items.

Luckily for us, SoundCloud's API (and many others) provides a **next_href** property in each response, which means we don't have to directly deal with setting the **offset** (cursor) parameter:

```JSON
//...
{
    "next_href": "https://api-v2.soundcloud.com/users/141707/tracks?offset=2020-03-13T00%3A00%3A00.000Z%2Ctracks%2C00774168919&limit=20&representation=https%3A%2F%2Fapi-v2.soundcloud.com%2Fusers%2F141707%2Ftracks%3Flimit%3D20",
    "query_urn": null
}
```

This URL can take various forms and be given different names, but it generally does the same thing - it brings you to the next page of results.

## [](#mini-project) Mini project

First, create a new folder called **pagination-tutorial** and run this command inside of it:

```shell
# initialize the project and install the puppeteer
# and got-scraping packages
npm init -y && npm i puppeteer got-scraping
```

Now, make a new file called **scrapeClientId.js**, copying the **client_id** scraping code from the previous lesson and making a slight modification:

```JavaScript
// scrapeClientId.js
const puppeteer = require('puppeteer');

const scrapeClientId = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    let clientId = null;

    page.on('response', async (res) => {
        const id = new URL(res.url()).searchParams.get('client_id') ?? null;
        if (id) clientId = id;
    });

    await page.goto('https://soundcloud.com/tiesto/tracks');
    await page.waitForSelector('.sc-classic');
    await browser.close();

    // return the client_id
    return clientId;
};

// export the function to be used in a different file
module.exports = scrapeClientId;
```
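
If you'd like to quickly sanity-check this helper on its own before wiring it into the scraper, a throwaway test file (not part of the lesson's final code) could look like this:

```JavaScript
// testClientId.js - a quick, disposable check that the helper returns an ID
const scrapeClientId = require('./scrapeClientId');

scrapeClientId().then((clientId) => console.log(clientId));
```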

Now, in a new file called **index.js**, we'll write the skeleton for our pagination and item-scraping code:

```JavaScript
// index.js
const scrapeClientId = require('./scrapeClientId');
// we will need gotScraping to make HTTP requests
const { gotScraping } = require('got-scraping');

const scrape100Items = async () => {
    // the initial request URL
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';

    // create an array for all of our scraped items to live in
    let items = [];

    // scrape the client ID with the script from the
    // previous lesson
    const clientId = await scrapeClientId();

    // More code will go here
};
```

Let's now take a step back and think about the conditions under which we should continue paginating:

1. If the API responds with a **next_href** set to **null**, there are no more pages. We have scraped all of the available items and should stop paginating.
2. If our items list has 100 records or more, we should stop paginating. Otherwise, we should continue until 100+ items have been reached.

With a full understanding of these conditions, we can translate them into code:

```JavaScript
// continue making requests until we've reached 100+ items
while (items.flat().length < 100) {
    // if the "next_href" wasn't present in the last call, there
    // are no more pages. return what we have and stop paginating.
    if (!nextHref) return items.flat();

    // continue paginating
}
```

All that's left to do now is flesh out this `while` loop with pagination logic and finally return the **items** array once the loop has finished.

> Note that it's better to add requests to a request queue rather than processing them in memory. The crawlers offered by the [Apify SDK](https://sdk.apify.com) provide this functionality out of the box.

```JavaScript
// index.js
const scrapeClientId = require('./scrapeClientId');
const { gotScraping } = require('got-scraping');

const scrape100Items = async () => {
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';
    let items = [];

    const clientId = await scrapeClientId();

    while (items.flat().length < 100) {
        if (!nextHref) return items.flat();

        // set the "client_id" URL parameter of the
        // nextHref URL
        const nextURL = new URL(nextHref);
        nextURL.searchParams.set('client_id', clientId);

        // make the paginated request and push its results
        // into the in-memory "items" array
        const res = await gotScraping(nextURL);
        const json = JSON.parse(res.body);
        items.push(json.collection);

        // store the next link for the next loop iteration
        nextHref = json.next_href;
    }

    // return an array of all our scraped items
    // once the loop has finished
    return items.flat();
};

// test run
(async () => {
    // run the function
    const data = await scrape100Items();

    // log the length of the items array
    console.log(data.length);
})();
```

> We are using the [`.flat()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat) method when returning the **items** array to turn our array of arrays into a single array of items.

Here's what the output of this code looks like:

```text
105
```
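
As mentioned in the note above, a more robust version of this scraper would push each **next_href** into a request queue instead of looping in memory. Here's a rough sketch of how that could look with the Apify SDK's `BasicCrawler`. Treat it as an illustration rather than part of the lesson's code - option names and APIs can differ between SDK versions:

```JavaScript
// queue-based sketch (illustrative only)
const Apify = require('apify');
const { gotScraping } = require('got-scraping');
const scrapeClientId = require('./scrapeClientId');

Apify.main(async () => {
    const clientId = await scrapeClientId();
    const requestQueue = await Apify.openRequestQueue();

    // seed the queue with the first page
    await requestQueue.addRequest({
        url: `https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0&client_id=${clientId}`,
    });

    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            const { body } = await gotScraping(request.url);
            const { collection, next_href } = JSON.parse(body);

            // store the scraped items instead of keeping them in memory
            await Apify.pushData(collection);

            // enqueue the next page rather than looping over it
            if (next_href) {
                const nextURL = new URL(next_href);
                nextURL.searchParams.set('client_id', clientId);
                await requestQueue.addRequest({ url: nextURL.href });
            }
        },
    });

    await crawler.run();
});
```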

## [](#final-note) Final note

Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear across a set of pages, or that they limit the number of pages themselves. To learn how to handle these cases, take a look at [this short article](https://docs.apify.com/tutorials/scrape-paginated-sites).

## [](#next) Next up

<!-- In this lesson, you learned about how to use API parameters and properties returned in an API response to paginate through results. [Next up](link api_scraping/general_api_scraping/using_api_filters.md), you'll gain a solid understanding of using API filtering parameters. -->

This is the last lesson in the API scraping tutorial for now, but be on the lookout for more lessons soon to come! So far, you've learned how to:

1. Locate API endpoints
2. Understand located API endpoints and their parameters
3. Parse and modify cookies
4. Modify/set headers
5. Farm API tokens using Puppeteer
6. Use paginated APIs
<!-- 7. Utilize API filters to narrow down results -->

If you'd still like to read more about API scraping, check out the [**GraphQL scraping**]({{@link api_scraping/graphql_scraping.md}}) course! GraphQL is the king of API scraping.

content/academy/api_scraping/general_api_scraping/locating_and_learning.md

Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ paths:

# [](#locating-endpoints) Locating API endpoints

-In order to retrive a website's API endpoints, as well as other data about them, the **Network** tab within Chrome's (or another browser's) DevTools can be used. This tab allows you to see the all of the various network requests being made, and even allows you to filter them based on request type, response type, or by a keyword.
+In order to retrieve a website's API endpoints, as well as other data about them, the **Network** tab within Chrome's (or another browser's) DevTools can be used. This tab allows you to see the all of the various network requests being made, and even allows you to filter them based on request type, response type, or by a keyword.

On our target page, we'll open up the Network tab, and filter by request type of `Fetch/XHR`, as opposed to the default of `All`. Next, we'll do some action on the page which causes the request for the target data to be sent, which will enable us to view the request in DevTools. The types of actions that need to be done can vary depending on the website, the type of page, and the type of data being returned. Sometimes, reloading the page is enough, while other times, a button must be clicked, or the page must be scrolled. For our example use case, reloading the page is sufficient.