- > Note that the Apify SDK automatically applies wide variety fingerprints by default, so it is not required to do this unless you aren't using the Apify SDK or if you need a super specific custom fingerprint to scrape with.
+ > Note that [Crawlee](https://crawlee.dev) automatically applies a wide variety of fingerprints by default, so it is not required to do this unless you aren't using Crawlee or you need a super specific custom fingerprint to scrape with.
- ## [](#next) Next up
+ ## Wrap up
That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
content/academy/anti_scraping/mitigation/proxies.md (1 addition, 1 deletion)
@@ -45,4 +45,4 @@ Web scrapers can implement a method called "proxy rotation" to **rotate** the IP
## [](#next) Next up
- Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will be teaching you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
+ Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will teach you how to configure your crawler in Crawlee to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
content/academy/anti_scraping/mitigation/using_proxies.md (42 additions, 44 deletions)
@@ -1,84 +1,82 @@
---
title: Using proxies
- description: Learn how to use and automagically rotate proxies in your scrapers by using the Apify SDK, and a bit about how to easily obtain pools of proxies.
+ description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies.
menuWeight: 2
paths:
- anti-scraping/mitigation/using-proxies
---
# [](#using-proxies) Using proxies
- In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, we learned about the power of the Apify SDK, and how it can streamline the development process of web crawlers. You've already seen how powerful the `apify` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
+ In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
- Because proxies are so widely used in the scraping world, we at Apify have equipped our SDK with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into the SDK is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
+ Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.
## [](#implementing-proxies) Implementing proxies in a scraper
Let's borrow some scraper code from the end of the [pro-scraping]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) lesson in our **Web Scraping for Beginners** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on [demo-webstore.apify.org](https://demo-webstore.apify.org)'s on-sale page, then makes a request to each product page and scrapes data about each one:
- In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`proxyConfiguration`](https://sdk.apify.com/docs/api/proxy-configuration#docsNav) and configure it with our custom proxies, like so:
+ In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and configure it with our custom proxies, like so:
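Since the exact snippet depends on the proxies you scraped, here is a minimal sketch of such a configuration (the proxy URLs below are placeholders, not live proxies):

```JavaScript
import { ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs - substitute the ones you scraped above
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://45.42.177.37:3128',
        'http://43.128.166.24:59394',
        'http://51.79.49.178:3128',
    ],
});
```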
Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a three-proxy pool is totally fine for tutorial purposes. Finally, we can pass the `proxyConfiguration` into our crawler's options:
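A sketch of that wiring, with the scraping logic elided (the on-sale page path is assumed for illustration):

```JavaScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Every request will now be routed through one of our proxies
    proxyConfiguration,
    requestHandler: async ({ $, request }) => {
        // Product-page scraping logic from the pro-scraping lesson goes here
    },
});

// Hypothetical entry point - replace with the actual on-sale page URL
await crawler.run(['https://demo-webstore.apify.org/search/on-sale']);
```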
@@ -96,7 +94,7 @@ const crawler = new Apify.CheerioCrawler({
- > Note that if you run this code, it may not work, as the proxies could potentially be down at the time you are going through this course.
+ > Note that if you run this code, it may not work, as the proxies could potentially be down/non-operational at the time you are going through this course.
That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option.
@@ -105,9 +103,8 @@ That's it! The crawler will now automatically rotate through the proxies we prov
At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request.
```JavaScript
- const crawler = new Apify.CheerioCrawler({
+ const crawler = new CheerioCrawler({
      proxyConfiguration,
-     requestQueue,
      // Destructure "proxyInfo" from the "context" object
      // ...
```
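Filled out with current Crawlee naming (`requestHandler` rather than the older `handlePageFunction`), a debugging sketch might look like this; the log format is illustrative:

```JavaScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Destructure "proxyInfo" from the "context" object
    requestHandler: async ({ $, request, proxyInfo }) => {
        // "proxyInfo.url" tells us which proxy served this request
        console.log(`Fetched ${request.url} through ${proxyInfo.url}`);
    },
});
```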
- Though we will discuss it more in-depth in future courses, it is still important to mention that the Apify SDK has integrated support for [Apify Proxy](https://apify.com/proxy), which is a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:
+ Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy), a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:
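For example, a sketch using the Apify SDK's `Actor` class (assumes an Apify account with proxy access and the relevant environment variables set):

```JavaScript
import { Actor } from 'apify';

await Actor.init();

// Backed by Apify Proxy instead of a hand-made list of proxy URLs
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});
```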
content/academy/anti_scraping/techniques/rate_limiting.md (8 additions, 7 deletions)
@@ -18,15 +18,16 @@ In cases when a higher number of requests is expected for the crawler, using a [
The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies]({{@link anti_scraping/mitigation/proxies.md}}) after every **n** requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted.
- In the Apify SDK, proxies are automatically rotated for you when you use `proxyConfiguration` and a [**SessionPool**]((https://sdk.apify.com/docs/api/session-pool)) within a crawler. The SessionPool handles a lot of the nitty gritty of proxy rotating, especially with [browser based crawlers]({{@link puppeteer_playwright.md}}) by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy).
+ In Crawlee, proxies are automatically rotated for you when you use `ProxyConfiguration` and a [**SessionPool**](https://crawlee.dev/api/core/class/SessionPool) within a crawler. The SessionPool handles a lot of the nitty-gritty of proxy rotation, especially with [browser-based crawlers]({{@link puppeteer_playwright.md}}), retiring a browser instance after a certain number of requests have been sent from it so that a new proxy can be used (a browser instance must be retired in order to use a new proxy).
Here is an example of these features being used in a **PuppeteerCrawler** instance:
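A sketch of such an instance (the option values are illustrative and mirror the discussion below):

```JavaScript
import { PuppeteerCrawler } from 'crawlee';

const myCrawler = new PuppeteerCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        // Retire a session (and its browser instance + proxy) after 15 requests
        sessionOptions: { maxUsageCount: 15 },
    },
    requestHandler: async ({ page, request }) => {
        // Scraping logic goes here
    },
});
```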
@@ -44,17 +45,17 @@ const myCrawler = new Apify.PuppeteerCrawler({
- > Take a look at the [**Using proxies**]({{@link anti_scraping/mitigation/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in the Apify SDK.
+ > Take a look at the [**Using proxies**]({{@link anti_scraping/mitigation/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in Crawlee.
### [](#configuring-session-pool) Configuring a session pool
There are various configuration options available in `sessionPoolOptions` that can be used to set up the SessionPool for different rate-limiting scenarios. In the example above, we used `maxUsageCount` within `sessionOptions` to prevent more than 15 requests from being sent using a session before it was thrown away; however, a maximum age can also be set using `maxAgeSecs`.
When dealing with frequent and unpredictable blocking, the `maxErrorScore` option can be set to trash a session after it has hit a certain number of errors.
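Put together, an illustrative `sessionPoolOptions` combining all three limits could read:

```JavaScript
const sessionPoolOptions = {
    sessionOptions: {
        maxUsageCount: 15, // retire a session after 15 requests,
        maxAgeSecs: 600,   // once it is 10 minutes old,
        maxErrorScore: 3,  // or after it accumulates 3 errors
    },
};
```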
- To learn more about all configurations available in `sessionPoolOptions`, refer to the [SDK documentation](https://sdk.apify.com/docs/typedefs/session-pool-options).
+ To learn more about all configurations available in `sessionPoolOptions`, refer to the [Crawlee documentation](https://crawlee.dev/api/core/interface/SessionPoolOptions).
- > Don't worry too much about these configurations. The Apify SDK's defaults are usually good enough for the majority of use cases.
+ > Don't worry too much about these configurations. Crawlee's defaults are usually good enough for the majority of use cases.
content/academy/api_scraping/general_api_scraping/handling_pagination.md (1 addition, 1 deletion)
@@ -138,7 +138,7 @@ while (items.flat().length < 100) {
All that's left to do now is flesh out this `while` loop with pagination logic and finally return the **items** array once the loop has finished.
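As a rough sketch of that logic (the endpoint, `cursor` parameter, and `nextCursor` field are hypothetical stand-ins for whatever the target API actually uses; this sits inside the surrounding async function):

```JavaScript
let cursor;

while (items.flat().length < 100) {
    const url = new URL('https://api.example.com/v1/products');
    if (cursor) url.searchParams.set('cursor', cursor);

    const response = await fetch(url);
    const data = await response.json();

    items.push(data.items);

    // Stop early if the API reports no further pages
    if (!data.nextCursor) break;
    cursor = data.nextCursor;
}

return items.flat();
```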
- > Note that it's better to add requests to a requests queue rather than processing them in memory. The crawlers offered by the [Apify SDK](https://sdk.apify.com) provide this functionality out of the box.
+ > Note that it's better to add requests to a request queue rather than processing them in memory. The crawlers offered by [Crawlee](https://crawlee.dev/docs/) provide this functionality out of the box.
content/academy/apify_platform/deploying_your_code/deploying.md (1 addition, 1 deletion)
@@ -50,7 +50,7 @@ That's it! the actor should now pull its source code from the repo and automatic
If you're logged in to the Apify CLI, the `apify push` command can be used to push the code straight onto the Apify platform from your local machine (no GitHub repository required), where it will automatically be built for you. Prior to running this command, make sure that you have an **apify.json** file at the root of the project. If you don't already have one, you can use `apify init .` to automatically generate one for you.
- One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even the even **apify_storage** and **node_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform.
+ One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even **storage** and **node_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform.
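A minimal `.gitignore` for this case might contain just the generated directories:

```
# Generated locally; rebuilt on the Apify platform
node_modules
storage
```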
> The `apify push` command should only really be used for quickly pushing and testing actors on the platform during development. If you are ready to make your actor public, use a Git repository instead, as you will reap the benefits of using Git and others will be able to contribute to the project.