
Commit 266b8bd

Merge pull request #416 from apify/sdk-crawlee-examples
docs(academy): Update references to Apify SDK with Crawlee
2 parents: ba1cbf4 + c80e963

34 files changed (+684 / -796 lines)

content/academy/anti_scraping/mitigation/generating_fingerprints.md

Lines changed: 2 additions & 2 deletions
@@ -85,8 +85,8 @@ const page = await context.newPage();
 await page.goto('https://google.com');
 ```

-> Note that the Apify SDK automatically applies wide variety fingerprints by default, so it is not required to do this unless you aren't using the Apify SDK or if you need a super specific custom fingerprint to scrape with.
+> Note that [Crawlee](https://crawlee.dev) automatically applies a wide variety of fingerprints by default, so it is not required to do this unless you aren't using Crawlee or if you need a super specific custom fingerprint to scrape with.

-## [](#next) Next up
+## Wrap up

 That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
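
For context on the note above: Crawlee's automatic fingerprinting is driven by `browserPoolOptions`. A minimal, illustrative sketch of narrowing the generated fingerprints on a `PlaywrightCrawler` could look like this (option names follow the current Crawlee docs and may differ between versions):

```JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Fingerprint injection is enabled by default in Crawlee.
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Only generate fingerprints resembling desktop Chrome on Windows.
                browsers: ['chrome'],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    requestHandler: async ({ page, request }) => {
        // The browser context already carries the generated fingerprint here.
        console.log(`${request.url} loaded, title: ${await page.title()}`);
    },
});

await crawler.run(['https://google.com']);
```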

content/academy/anti_scraping/mitigation/proxies.md

Lines changed: 1 addition & 1 deletion
@@ -45,4 +45,4 @@ Web scrapers can implement a method called "proxy rotation" to **rotate** the IP

 ## [](#next) Next up

-Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will be teaching you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
+Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will teach you how to configure your crawler in Crawlee to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
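
As a quick, illustrative aside on what "rotation" means in practice, Crawlee's `ProxyConfiguration` cycles through the pool on successive calls. The proxy URLs below are placeholders, not working proxies:

```JavaScript
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder proxy URLs; substitute your own pool.
    proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
});

// Each call returns the next proxy from the pool. This round-robin
// behaviour is the "proxy rotation" the next lesson builds on.
console.log(await proxyConfiguration.newUrl());
console.log(await proxyConfiguration.newUrl());
```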

content/academy/anti_scraping/mitigation/using_proxies.md

Lines changed: 42 additions & 44 deletions
@@ -1,84 +1,82 @@
 ---
 title: Using proxies
-description: Learn how to use and automagically rotate proxies in your scrapers by using the Apify SDK, and a bit about how to easily obtain pools of proxies.
+description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies.
 menuWeight: 2
 paths:
     - anti-scraping/mitigation/using-proxies
 ---

 # [](#using-proxies) Using proxies

-In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, we learned about the power of the Apify SDK, and how it can streamline the development process of web crawlers. You've already seen how powerful the `apify` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
+In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

-Because proxies are so widely used in the scraping world, we at Apify have equipped our SDK with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into the SDK is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
+Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.

 ## [](#implementing-proxies) Implementing proxies in a scraper

 Let's borrow some scraper code from the end of the [pro-scraping]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) lesson in our **Web Scraping for Beginners** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on [demo-webstore.apify.org](https://demo-webstore.apify.org)'s on-sale page, then makes a request to each product page and scrapes data about each one:

 ```JavaScript
-// proxies.js
-import Apify from 'apify';
-
-await Apify.utils.purgeLocalStorage();
-
-const requestQueue = await Apify.openRequestQueue();
-await requestQueue.addRequest({
-    url: 'https://demo-webstore.apify.org/search/on-sale',
-    userData: {
-        label: 'START',
-    },
-});
-
-const crawler = new Apify.CheerioCrawler({
-    requestQueue,
-    handlePageFunction: async ({ $, request }) => {
-        if (request.userData.label === 'START') {
-            await Apify.utils.enqueueLinks({
-                $,
-                requestQueue,
-                selector: 'a[href*="/product/"]',
-                baseUrl: new URL(request.url).origin,
+// crawlee.js
+import { CheerioCrawler, Dataset } from 'crawlee';
+
+const crawler = new CheerioCrawler({
+    requestHandler: async ({ $, request, enqueueLinks }) => {
+        if (request.label === 'START') {
+            await enqueueLinks({
+                selector: 'a[href*="/product/"]'
             });
+
+            // When on the START page, we don't want to
+            // extract any data after we extract the links.
             return;
         }

+        // We copied and pasted the extraction code
+        // from the previous lesson
         const title = $('h3').text().trim();
         const price = $('h3 + div').text().trim();
         const description = $('div[class*="Text_body"]').text().trim();

-        await Apify.pushData({
+        // Instead of saving the data to a variable,
+        // we immediately save everything to a file.
+        await Dataset.pushData({
             title,
             description,
             price,
         });
     },
 });

+await crawler.addRequests([{
+    url: 'https://demo-webstore.apify.org/search/on-sale',
+    // By labeling the Request, we can very easily
+    // identify it later in the requestHandler.
+    label: 'START',
+}]);
+
 await crawler.run();
 ```

-In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`proxyConfiguration`](https://sdk.apify.com/docs/api/proxy-configuration#docsNav) and configure it with our custom proxies, like so:
+In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and configure it with our custom proxies, like so:

 ```JavaScript
-const proxyConfiguration = await Apify.createProxyConfiguration({
+import { ProxyConfiguration } from 'crawlee';
+
+const proxyConfiguration = new ProxyConfiguration({
     proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
 });
 ```

 Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a three-proxy pool is totally fine for tutorial purposes. Finally, we can pass the `proxyConfiguration` into our crawler's options:

 ```JavaScript
-const crawler = new Apify.CheerioCrawler({
+const crawler = new CheerioCrawler({
     proxyConfiguration,
-    requestQueue,
-    handlePageFunction: async ({ $, request }) => {
-        if (request.userData.label === 'START') {
-            await Apify.utils.enqueueLinks({
-                $,
-                requestQueue,
+    requestHandler: async ({ $, request, enqueueLinks }) => {
+        if (request.label === 'START') {
+            await enqueueLinks({
                 selector: 'a[href*="/product/"]',
-                baseUrl: new URL(request.url).origin,
             });
             return;
         }
@@ -87,7 +85,7 @@ const crawler = new Apify.CheerioCrawler({
         const price = $('h3 + div').text().trim();
         const description = $('div[class*="Text_body"]').text().trim();

-        await Apify.pushData({
+        await Dataset.pushData({
             title,
             description,
             price,
@@ -96,7 +94,7 @@ const crawler = new Apify.CheerioCrawler({
 });
 ```

-> Note that if you run this code, it may not work, as the proxies could potentially be down at the time you are going through this course.
+> Note that if you run this code, it may not work, as the proxies could potentially be down/non-operating at the time you are going through this course.

 That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option.

@@ -105,9 +103,8 @@ That's it! The crawler will now automatically rotate through the proxies we prov
 At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request.

 ```JavaScript
-const crawler = new Apify.CheerioCrawler({
+const crawler = new CheerioCrawler({
     proxyConfiguration,
-    requestQueue,
     // Destructure "proxyInfo" from the "context" object
     handlePageFunction: async ({ $, request, proxyInfo }) => {
         // Log its value
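
The hunk above is cut off by the diff. For reference, a self-contained sketch of the same `proxyInfo` debugging idea, written against current Crawlee (which uses `requestHandler` rather than the older `handlePageFunction` kept in this snippet), might look like the following; the proxy URL is one of the placeholder proxies from the lesson:

```JavaScript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128'], // placeholder proxy from the lesson's pool
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ request, proxyInfo }) => {
        // proxyInfo describes the proxy used for this request (url, hostname, port, ...)
        console.log(`${request.url} was fetched through ${proxyInfo?.url}`);
    },
});

await crawler.addRequests(['https://demo-webstore.apify.org/search/on-sale']);
await crawler.run();
```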
@@ -122,15 +119,16 @@ After modifying your code to log `proxyInfo` to the console and running the scra

 ![proxyInfo being logged by the scraper]({{@asset anti_scraping/mitigation/images/proxy-info-logs.webp}})

-These logs confirm that our proxies are being used and rotated successfully by the Apify SDK, and can also be used to debug slow or broken proxies.
+These logs confirm that our proxies are being used and rotated successfully by Crawlee, and can also be used to debug slow or broken proxies.

 ## [](#higher-level-proxy-scraping) Higher level proxy scraping

-Though we will discuss it more in-depth in future courses, it is still important to mention that the Apify SDK has integrated support for [Apify Proxy](https://apify.com/proxy), which is a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:
+Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy) - a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:

 ```JavaScript
-const proxyConfiguration = await Apify.createProxyConfiguration({
-    groups: ['SHADER'],
+import { Actor } from 'apify';
+
+const proxyConfiguration = await Actor.createProxyConfiguration({
     countryCode: 'US'
 });
 ```

content/academy/anti_scraping/techniques/rate_limiting.md

Lines changed: 8 additions & 7 deletions
@@ -18,15 +18,16 @@ In cases when a higher number of requests is expected for the crawler, using a [

 The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies]({{@link anti_scraping/mitigation/proxies.md}}) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make large amounts of requests to a website without getting restricted.

-In the Apify SDK, proxies are automatically rotated for you when you use `proxyConfiguration` and a [**SessionPool**]((https://sdk.apify.com/docs/api/session-pool)) within a crawler. The SessionPool handles a lot of the nitty gritty of proxy rotating, especially with [browser based crawlers]({{@link puppeteer_playwright.md}}) by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy).
+In Crawlee, proxies are automatically rotated for you when you use `ProxyConfiguration` and a [**SessionPool**](https://crawlee.dev/api/core/class/SessionPool) within a crawler. The SessionPool handles a lot of the nitty gritty of proxy rotating, especially with [browser based crawlers]({{@link puppeteer_playwright.md}}) by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy).

 Here is an example of these features being used in a **PuppeteerCrawler** instance:

 ```JavaScript
-import Apify from 'apify';
+import { PuppeteerCrawler } from 'crawlee';
+import { Actor } from 'apify';

-const myCrawler = new Apify.PuppeteerCrawler({
-    proxyConfiguration: await Apify.createProxyConfiguration({
+const myCrawler = new PuppeteerCrawler({
+    proxyConfiguration: await Actor.createProxyConfiguration({
         groups: ['RESIDENTIAL'],
     }),
     sessionPoolOptions: {
@@ -44,17 +45,17 @@ const myCrawler = new Apify.PuppeteerCrawler({
 });
 ```

-> Take a look at the [**Using proxies**]({{@link anti_scraping/mitigation/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in the Apify SDK.
+> Take a look at the [**Using proxies**]({{@link anti_scraping/mitigation/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in Crawlee.

 ### [](#configuring-session-pool) Configuring a session pool

 There are various configuration options available in `sessionPoolOptions` that can be used to set up the SessionPool for different rate-limiting scenarios. In the example above, we used `maxUsageCount` within `sessionOptions` to prevent more than 15 requests from being sent using a session before it was thrown away; however, a maximum age can also be set using `maxAgeSecs`.

 When dealing with frequent and unpredictable blockage, the `maxErrorScore` option can be set to trash a session after it's hit a certain number of errors.

-To learn more about all configurations available in `sessionPoolOptions`, refer to the [SDK documentation](https://sdk.apify.com/docs/typedefs/session-pool-options).
+To learn more about all configurations available in `sessionPoolOptions`, refer to the [Crawlee documentation](https://crawlee.dev/api/core/interface/SessionPoolOptions).

-> Don't worry too much about these configurations. The Apify SDK's defaults are usually good enough for the majority of use cases.
+> Don't worry too much about these configurations. Crawlee's defaults are usually good enough for the majority of use cases.

 ## [](#next) Next up
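
To make the `sessionPoolOptions` discussion above concrete, here is a minimal, illustrative sketch of how those options can be combined in Crawlee; the numeric values are arbitrary examples rather than recommendations:

```JavaScript
import { PuppeteerCrawler } from 'crawlee';

const myCrawler = new PuppeteerCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        // Upper bound on how many sessions the pool keeps around.
        maxPoolSize: 100,
        sessionOptions: {
            // Retire a session after 15 uses, as in the lesson's example...
            maxUsageCount: 15,
            // ...or after 10 minutes, whichever comes first.
            maxAgeSecs: 600,
            // Throw a session away once it accumulates too many errors.
            maxErrorScore: 3,
        },
    },
    requestHandler: async ({ page, session }) => {
        // Mark the session as bad when the page looks blocked,
        // so the pool retires it sooner.
        const title = await page.title();
        if (title.includes('Access denied')) session?.markBad();
    },
});
```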

content/academy/api_scraping/general_api_scraping/handling_pagination.md

Lines changed: 1 addition & 1 deletion
@@ -138,7 +138,7 @@ while (items.flat().length < 100) {

 All that's left to do now is flesh out this `while` loop with pagination logic and finally return the **items** array once the loop has finished.

-> Note that it's better to add requests to a requests queue rather than processing them in memory. The crawlers offered by the [Apify SDK](https://sdk.apify.com) provide this functionality out of the box.
+> Note that it's better to add requests to a requests queue rather than processing them in memory. The crawlers offered by [Crawlee](https://crawlee.dev/docs/) provide this functionality out of the box.

 ```JavaScript
 // index.js
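
As an illustrative sketch of that note (the URL and the `getNextPageUrl` helper are hypothetical, not part of the lesson), a Crawlee crawler can enqueue the next results page from inside its `requestHandler` instead of accumulating pages in memory:

```JavaScript
import { CheerioCrawler } from 'crawlee';

// Hypothetical helper that derives the next page's URL from the current one.
const getNextPageUrl = (url) => {
    const next = new URL(url);
    const page = Number(next.searchParams.get('page') ?? '1');
    next.searchParams.set('page', String(page + 1));
    return next.href;
};

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // ...extract the current page's items here...

        // Enqueue the next page instead of looping over pages in memory.
        // A real crawler would stop when the site returns an empty page;
        // here we simply cap the crawl at five pages.
        const page = Number(new URL(request.url).searchParams.get('page') ?? '1');
        if (page < 5) {
            await crawler.addRequests([getNextPageUrl(request.url)]);
        }
    },
});

await crawler.addRequests(['https://example.com/items?page=1']);
await crawler.run();
```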

content/academy/apify_platform/deploying_your_code/deploying.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ That's it! the actor should now pull its source code from the repo and automatic

 If you're logged in to the Apify CLI, the `apify push` command can be used to push the code straight onto the Apify platform from your local machine (no GitHub repository required), where it will automatically be built for you. Prior to running this command, make sure that you have an **apify.json** file at the root of the project. If you don't already have one, you can use `apify init .` to automatically generate one for you.

-One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even the even **apify_storage** and **node_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform.
+One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even **storage** and **node_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform.

 > The `apify push` command should only really be used for quickly pushing and testing actors on the platform during development. If you are ready to make your actor public, use a Git repository instead, as you will reap the benefits of using Git and others will be able to contribute to the project.
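
An example `.gitignore` for this purpose (illustrative; adjust to your own project) only needs to list the folders the platform regenerates anyway:

```
# Generated locally and rebuilt on the Apify platform
node_modules
storage
apify_storage
```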
