
Commit 1056e02 (1 parent: b8fee01)

Switch proxies folder to mitigation instead

13 files changed: +39 −26 lines
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+---
+title: Mitigation
+description: After learning about the various different anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.
+menuWeight: 4.2
+paths:
+    - anti-scraping/mitigation
+---
+
+# [](#anti-scraping-mitigation) Anti-scraping mitigation
+
+In the [techniques]({{@link anti_scraping/techniques.md}}) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content. This **Mitigation** section will be all about how to circumvent these protections using various different techniques.
+
+<!-- Here there should -->
+
+## [](#next) Next up
+
+In the [first lesson]({{@link anti_scraping/mitigation/proxies.md}}) of this section, you'll be learning about what proxies are and how to use them in your own crawler.

content/academy/anti_scraping/proxies.md renamed to content/academy/anti_scraping/mitigation/proxies.md

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
 ---
 title: Proxies
 description: Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.
-menuWeight: 4.2
+menuWeight: 1
 paths:
-    - anti-scraping/proxies
+    - anti-scraping/mitigation/proxies
 ---

 # [](#about-proxies) Proxies
@@ -45,4 +45,4 @@ Web scrapers can implement a method called "proxy rotation" to **rotate** the IP

 ## [](#next) Next up

-This module's first lesson will be teaching you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/proxies/using_proxies.md}})
+Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will be teaching you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
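The "proxy rotation" mentioned in this file can be sketched as a simple round-robin over a pool of proxy URLs. This is a hypothetical standalone helper for illustration only, not the Apify SDK's implementation (the SDK's `ProxyConfiguration` handles rotation for you):

```javascript
// Minimal round-robin proxy rotator — an illustrative sketch of the
// "proxy rotation" concept, not the Apify SDK's actual implementation.
class ProxyRotator {
  constructor(proxyUrls) {
    if (!proxyUrls || proxyUrls.length === 0) {
      throw new Error('At least one proxy URL is required');
    }
    this.proxyUrls = proxyUrls;
    this.nextIndex = 0;
  }

  // Return the next proxy URL, cycling back to the start of the pool.
  nextUrl() {
    const url = this.proxyUrls[this.nextIndex];
    this.nextIndex = (this.nextIndex + 1) % this.proxyUrls.length;
    return url;
  }
}

// Example usage (placeholder proxy URLs):
const rotator = new ProxyRotator([
  'http://proxy-1.example.com:8000',
  'http://proxy-2.example.com:8000',
]);

console.log(rotator.nextUrl()); // http://proxy-1.example.com:8000
console.log(rotator.nextUrl()); // http://proxy-2.example.com:8000
console.log(rotator.nextUrl()); // wraps around: http://proxy-1.example.com:8000
```

Because each request exits through a different IP, per-IP rate limits and blocks are spread across the whole pool instead of hitting a single address.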

content/academy/anti_scraping/proxies/using_proxies.md renamed to content/academy/anti_scraping/mitigation/using_proxies.md

Lines changed: 5 additions & 9 deletions
@@ -1,9 +1,9 @@
 ---
 title: Using proxies
 description: Learn how to use and automagically rotate proxies in your scrapers by using the Apify SDK, and a bit about how to easily obtain pools of proxies.
-menuWeight: 1
+menuWeight: 2
 paths:
-    - anti-scraping/proxies/using-proxies
+    - anti-scraping/mitigation/using-proxies
 ---

 # [](#using-proxies) Using proxies
@@ -120,7 +120,7 @@ const crawler = new Apify.CheerioCrawler({

 After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this:

-![proxyInfo being logged by the scraper]({{@asset anti_scraping/proxies/images/proxy-info-logs.webp}})
+![proxyInfo being logged by the scraper]({{@asset anti_scraping/mitigation/images/proxy-info-logs.webp}})

 These logs confirm that our proxies are being used and rotated successfully by the Apify SDK, and can also be used to debug slow or broken proxies.

@@ -137,10 +137,6 @@ const proxyConfiguration = await Apify.createProxyConfiguration({

 Notice that we didn't provide it a list of proxy URLs. This is because the `SHADER` group already serves as our proxy pool (courtesy of Apify Proxy).

-## More lessons to come
+## [](#next) Next up

-That's it for the proxy course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
-
-<!-- ## [](#next) Next up
-
-Smth -->
+That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
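As a rough model of the `proxyInfo` logs this lesson refers to, each rotated proxy URL can be broken down into the fields you would expect to see. The helper and field names below are hypothetical (a simplified sketch, not the SDK's real `ProxyInfo` class):

```javascript
// Hypothetical sketch: derive a ProxyInfo-like object from a proxy URL,
// similar in spirit to what gets logged as `proxyInfo`. The exact shape of
// the SDK's object differs; this is for illustration only.
function toProxyInfo(proxyUrl, sessionId = null) {
  const parsed = new URL(proxyUrl);
  return {
    url: proxyUrl,
    hostname: parsed.hostname,
    port: Number(parsed.port),
    username: parsed.username,
    sessionId, // which session (if any) is pinned to this proxy
  };
}

// Example with a placeholder proxy URL:
const info = toProxyInfo('http://user:pass@proxy.example.com:8000', 'session_1');
console.log(info.hostname); // proxy.example.com
console.log(info.port);     // 8000
```

Logging a structure like this per request is what makes it possible to spot which specific proxy in the pool is slow or broken.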

content/academy/anti_scraping/techniques/captchas.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ When you've hit a captcha, your first thought should not be how to programmatica

 Have you exhausted all of the possible options to make your scraper appear more human-like? Are you:

-- Using [proxies]({{@link anti_scraping/proxies.md}})?
+- Using [proxies]({{@link anti_scraping/mitigation/proxies.md}})?
 - Making the request with the proper [headers]({{@link concepts/http_headers.md}}) and [cookies]({{@link concepts/http_cookies.md}})?
 - Generating and using a custom [browser fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}})?
 - Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser?
@@ -38,4 +38,4 @@ Another popular captcha is the [Geetest slider captcha](https://www.geetest.com/

 ## Wrap up

-In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the [next section]({{@link anti_scraping/proxies.md}}), we'll be discussing one of the most crucial parts of web scraping and web-automation: how to properly leverage proxies to avoid many of the anti-scraping techniques that were discussed in this section.
+In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the [next section]({{@link anti_scraping/mitigation.md}}), we'll be discussing how to mitigate the anti-scraping techniques you learned about in this section.

content/academy/anti_scraping/techniques/firewalls.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Since there are multiple providers, it is essential to say that the challenges a

 ## [](#bypassing-firewalls) Bypassing web-application firewalls

-- Using [proxies]({{@link anti_scraping/proxies.md}}).
+- Using [proxies]({{@link anti_scraping/mitigation/proxies.md}}).
 - Mocking [headers]({{@link concepts/http_headers.md}}).
 - Overriding the browser's [fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}}) (most effective).
 - Farming the [cookies]({{@link concepts/http_cookies.md}}) from a website with a headless browser, then using the farmed cookies to do HTTP based scraping (most performant).

content/academy/anti_scraping/techniques/geolocation.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ On targets which are just utilizing cookies and headers to identify the location

 The oldest (and still most common) way of geolocating is based on the IP address used to make the request. Sometimes, country-specific sites block themselves from being accessed from any other country (some Chinese, Indian, Israeli, and Japanese websites do this).

-[Proxies]({{@link anti_scraping/proxies.md}}) can be used in a scraper to bypass restrictions for make requests from a different location. Often times, proxies need to be used in combination with location-specific cookies/headers.
+[Proxies]({{@link anti_scraping/mitigation/proxies.md}}) can be used in a scraper to bypass restrictions and make requests from a different location. Oftentimes, proxies need to be used in combination with location-specific [cookies]({{@link concepts/http_cookies.md}})/[headers]({{@link concepts/http_headers.md}}).

 ## [](#override-emulate-geolocation) Override/emulate geolocation when using a browser-based scraper
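With Apify Proxy, the country a request appears to come from can be selected through the proxy username. The helper below is a sketch under the assumption that the documented `groups-…,country-…` username pattern and the `proxy.apify.com:8000` endpoint apply; check the Apify Proxy documentation for the exact syntax before relying on it:

```javascript
// Sketch: build an Apify Proxy URL that routes traffic through a given
// country. Username syntax and endpoint are assumptions based on Apify
// Proxy's documented pattern — verify against the current docs.
function countryProxyUrl(password, countryCode, group = 'RESIDENTIAL') {
  const username = `groups-${group},country-${countryCode.toUpperCase()}`;
  return `http://${username}:${password}@proxy.apify.com:8000`;
}

console.log(countryProxyUrl('<YOUR_PROXY_PASSWORD>', 'us'));
// http://groups-RESIDENTIAL,country-US:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000
```

Pairing a URL like this with the matching location-specific cookies/headers covers both IP-based and header-based geolocation checks.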

content/academy/anti_scraping/techniques/rate_limiting.md

Lines changed: 3 additions & 3 deletions
@@ -12,11 +12,11 @@ When crawling a website, a web scraping bot will typically send many more reques

 In the past, most websites had their own anti-scraping solutions, the most common of which was IP address rate-limiting. In recent years, the popularity of third-party specialized anti-scraping providers has dramatically increased, but a lot of websites still use rate-limiting to only allow a certain number of requests per second/minute/hour to be sent from a single IP; therefore, crawler requests have the potential of being blocked entirely quite quickly.

-In cases when a higher number of requests is expected for the crawler, using a [proxy]({{@link anti_scraping/proxies.md}}) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.
+In cases when a higher number of requests is expected for the crawler, using a [proxy]({{@link anti_scraping/mitigation/proxies.md}}) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.

 ## [](#dealing-with-rate-limiting) Dealing with rate limiting by rotating proxies/sessions

-The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies]({{@link anti_scraping/proxies.md}}) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make large amounts to a website without getting restricted.
+The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies]({{@link anti_scraping/mitigation/proxies.md}}) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted.

 In the Apify SDK, proxies are automatically rotated for you when you use `proxyConfiguration` and a [**SessionPool**](https://sdk.apify.com/docs/api/session-pool) within a crawler. The SessionPool handles a lot of the nitty-gritty of proxy rotation, especially with [browser based crawlers]({{@link puppeteer_playwright.md}}), by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy).

@@ -44,7 +44,7 @@ const myCrawler = new Apify.PuppeteerCrawler({
 });
 ```

-> Take a look at the [**Using proxies**]({{@link anti_scraping/proxies/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in the Apify SDK.
+> Take a look at the [**Using proxies**]({{@link anti_scraping/mitigation/using_proxies.md}}) lesson to learn more about how to use proxies and rotate them in the Apify SDK.

 ### [](#configuring-session-pool) Configuring a session pool
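The "retire after a certain number of requests" behavior this lesson describes can be modeled with a tiny stand-in for a session pool. This is purely illustrative (the real SessionPool in the Apify SDK tracks errors, cookies, and much more):

```javascript
// Illustrative stand-in for a session pool: each session (think proxy +
// cookies + browser instance) is retired after maxUsageCount uses, forcing
// rotation to a fresh session — and therefore a new proxy.
class TinySessionPool {
  constructor(maxUsageCount = 3) {
    this.maxUsageCount = maxUsageCount;
    this.sessionCounter = 0;
    this.current = null;
  }

  getSession() {
    if (!this.current || this.current.usageCount >= this.maxUsageCount) {
      // Retire the exhausted session and create a fresh one.
      this.sessionCounter += 1;
      this.current = { id: `session_${this.sessionCounter}`, usageCount: 0 };
    }
    this.current.usageCount += 1;
    return this.current;
  }
}

// With maxUsageCount = 2, every two requests trigger a rotation:
const pool = new TinySessionPool(2);
const used = [];
for (let i = 0; i < 5; i++) used.push(pool.getSession().id);
console.log(used); // [ 'session_1', 'session_1', 'session_2', 'session_2', 'session_3' ]
```

Rotating at the session level rather than per request lets a scraper keep consistent cookies and fingerprint within a session while still spreading load across IPs.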

content/academy/expert_scraping_with_apify/apify_sdk.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ The SDK factors away and manages the hard parts of the scraping/automation devel
 - Request concurrency
 - Queueing requests
 - Data storage
-- Using and rotating [proxies]({{@link anti_scraping/proxies.md}})
+- Using and rotating [proxies]({{@link anti_scraping/mitigation/proxies.md}})
 - Puppeteer/Playwright setup overhead
 - Plus much more!
