Commit d463504

Merge pull request #390 from apify/anti-scraping-revamp
Revamp Anti-Scraping
2 parents 3c29160 + 97ee440 commit d463504

19 files changed

+237
-107
lines changed
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
---
2+
title: Mitigation
3+
description: Now that you know the anti-scraping techniques websites use, learn how to mitigate them.
4+
menuWeight: 4.2
5+
paths:
6+
- anti-scraping/mitigation
7+
---
8+
9+
# [](#anti-scraping-mitigation) Anti-scraping mitigation
10+
11+
In the [techniques]({{@link anti_scraping/techniques.md}}) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content. This **Mitigation** section covers how to circumvent these protections.
12+
13+
<!-- Here there should be a bit of an outline of what mitigation techniques they'll be learning -->
14+
15+
## [](#next) Next up
16+
17+
In the [first lesson]({{@link anti_scraping/mitigation/proxies.md}}) of this section, you'll learn what proxies are and how to use them in your own crawler.
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
---
2+
title: Generating fingerprints
3+
description: description
4+
menuWeight: 3
5+
paths:
6+
- anti-scraping/mitigation/generating-fingerprints
7+
---
8+
9+
# [](#generating-fingerprints) Generating fingerprints
10+
11+
With the [Fingerprint generator](https://github.com/apify/fingerprint-generator) NPM package, you can easily generate a browser fingerprint.
12+
13+
> It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results.
14+
15+
```JavaScript
16+
import FingerprintGenerator from 'fingerprint-generator';
17+
18+
// Instantiate the fingerprint generator with
19+
// configuration options
20+
const fingerprintGenerator = new FingerprintGenerator({
21+
browsers: [
22+
{ name: "firefox", minVersion: 80 },
23+
],
24+
devices: [
25+
"desktop"
26+
],
27+
operatingSystems: [
28+
"windows"
29+
]
30+
});
31+
32+
// Grab a fingerprint from the fingerprint generator
33+
const { fingerprint } = fingerprintGenerator.getFingerprint({
34+
locales: ["en-US", "en"]
35+
});
36+
```
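
The note above recommends matching the fingerprint to the browser and operating system actually in use. As a minimal sketch of that idea, the hypothetical helper below (it is not part of the `fingerprint-generator` package) maps a Playwright browser name and a Node.js `process.platform` value to generator options. The option values `chrome`, `safari`, `macos` and `linux` are assumptions about what the generator accepts; only `firefox`, `desktop` and `windows` appear in this lesson.

```JavaScript
// Hypothetical lookup tables: Playwright browser names and Node.js
// platform identifiers mapped to assumed generator option values.
const BROWSER_NAMES = { chromium: 'chrome', firefox: 'firefox', webkit: 'safari' };
const OS_NAMES = { darwin: 'macos', win32: 'windows', linux: 'linux' };

// Build a generator configuration that matches the browser being
// automated and the operating system the scraper runs on.
const buildGeneratorOptions = (playwrightBrowser, nodePlatform, minVersion = 80) => {
    const browser = BROWSER_NAMES[playwrightBrowser];
    const os = OS_NAMES[nodePlatform];

    if (!browser || !os) {
        throw new Error(`Unsupported combination: ${playwrightBrowser} on ${nodePlatform}`);
    }

    return {
        browsers: [{ name: browser, minVersion }],
        devices: ['desktop'],
        operatingSystems: [os],
    };
};

// Firefox on macOS, as in the note above
const options = buildGeneratorOptions('firefox', 'darwin');
console.log(options);
```

The resulting object could then be passed to `new FingerprintGenerator(options)`, with `process.platform` as the second argument when the scraper should follow the machine it runs on.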
37+
38+
## [](#injecting-fingerprints) Injecting fingerprints
39+
40+
Once you've generated a fingerprint, it can be injected into the browser using the [Fingerprint injector](https://github.com/apify/fingerprint-injector) package. This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:
41+
42+
```JavaScript
43+
import FingerprintGenerator from 'fingerprint-generator';
44+
import { FingerprintInjector } from 'fingerprint-injector';
45+
import { chromium } from 'playwright';
46+
47+
// Instantiate a fingerprint injector
48+
const fingerprintInjector = new FingerprintInjector();
49+
50+
// Launch a browser in Playwright
51+
const browser = await chromium.launch();
52+
53+
// Instantiate the fingerprint generator with
54+
// configuration options
55+
const fingerprintGenerator = new FingerprintGenerator({
56+
browsers: [
57+
{ name: "chrome", minVersion: 80 },
58+
],
59+
devices: [
60+
"desktop"
61+
],
62+
operatingSystems: [
63+
"windows"
64+
]
65+
});
66+
67+
// Grab a fingerprint
68+
const { fingerprint } = fingerprintGenerator.getFingerprint({
69+
locales: ["en-US", "en"]
70+
});
71+
72+
// Create a new browser context, plugging in
73+
// some values from the fingerprint
74+
const context = await browser.newContext({
75+
userAgent: fingerprint.userAgent,
76+
locale: fingerprint.navigator.language,
77+
});
78+
79+
// Attach the fingerprint to the newly created
80+
// browser context
81+
await fingerprintInjector.attachFingerprintToPlaywright(context, fingerprint);
82+
83+
// Create a new page and go to Google
84+
const page = await context.newPage();
85+
await page.goto('https://google.com');
86+
```
87+
88+
> Note that the Apify SDK automatically applies a wide variety of fingerprints by default, so you only need to do this if you aren't using the Apify SDK or if you need a highly specific custom fingerprint.
89+
90+
## [](#next) Next up
91+
92+
That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.

content/academy/anti_scraping/proxies.md renamed to content/academy/anti_scraping/mitigation/proxies.md

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
11
---
22
title: Proxies
33
description: Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.
4-
menuWeight: 4.2
4+
menuWeight: 1
55
paths:
6-
- anti-scraping/proxies
6+
- anti-scraping/mitigation/proxies
77
---
88

99
# [](#about-proxies) Proxies
@@ -45,4 +45,4 @@ Web scrapers can implement a method called "proxy rotation" to **rotate** the IP
4545

4646
## [](#next) Next up
4747

48-
This module's first lesson will be teaching you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/proxies/using_proxies.md}})
48+
Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will teach you how to configure your crawler in the Apify SDK to use and automatically rotate proxies. [Let's get right into it!]({{@link anti_scraping/mitigation/using_proxies.md}})
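
The proxy rotation mentioned in this lesson can be illustrated with a minimal round-robin sketch. The proxy URLs and the `createProxyRotator` helper are hypothetical placeholders, not part of the Apify SDK, which handles rotation for you once a proxy configuration is set up.

```JavaScript
// Hypothetical placeholder proxy URLs; replace them with your own pool
const proxyUrls = [
    'http://proxy-1.example.com:8000',
    'http://proxy-2.example.com:8000',
    'http://proxy-3.example.com:8000',
];

// Return a function that hands out the proxies one after another,
// starting over from the first once the end of the list is reached.
const createProxyRotator = (urls) => {
    if (urls.length === 0) {
        throw new Error('At least one proxy URL is required');
    }

    let index = 0;

    return () => {
        const url = urls[index];
        index = (index + 1) % urls.length;
        return url;
    };
};

const nextProxy = createProxyRotator(proxyUrls);

// Each request would be sent through the next proxy in the pool
for (let i = 0; i < 4; i++) {
    console.log(`Request ${i + 1} -> ${nextProxy()}`);
}
```

Round-robin is only one possible strategy; a real scraper might also retire proxies that get blocked or tie a proxy to a session.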

content/academy/anti_scraping/proxies/using_proxies.md renamed to content/academy/anti_scraping/mitigation/using_proxies.md

Lines changed: 5 additions & 9 deletions
@@ -1,9 +1,9 @@
11
---
22
title: Using proxies
33
description: Learn how to use and automagically rotate proxies in your scrapers by using the Apify SDK, and a bit about how to easily obtain pools of proxies.
4-
menuWeight: 1
4+
menuWeight: 2
55
paths:
6-
- anti-scraping/proxies/using-proxies
6+
- anti-scraping/mitigation/using-proxies
77
---
88

99
# [](#using-proxies) Using proxies
@@ -120,7 +120,7 @@ const crawler = new Apify.CheerioCrawler({
120120

121121
After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this:
122122

123-
![proxyInfo being logged by the scraper]({{@asset anti_scraping/proxies/images/proxy-info-logs.webp}})
123+
![proxyInfo being logged by the scraper]({{@asset anti_scraping/mitigation/images/proxy-info-logs.webp}})
124124

125125
These logs confirm that our proxies are being used and rotated successfully by the Apify SDK, and can also be used to debug slow or broken proxies.
126126

@@ -137,10 +137,6 @@ const proxyConfiguration = await Apify.createProxyConfiguration({
137137

138138
Notice that we didn't provide it a list of proxy URLs. This is because the `SHADER` group already serves as our proxy pool (courtesy of Apify Proxy).
139139

140-
## More lessons to come
140+
## [](#next) Next up
141141

142-
That's it for the proxy course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
143-
144-
<!-- ## [](#next) Next up
145-
146-
Smth -->
142+
[Next up]({{@link anti_scraping/mitigation/generating_fingerprints.md}}), we'll be checking out how to use two NPM packages to generate and inject [browser fingerprints]({{@link anti_scraping/techniques/fingerprinting.md}}).

content/academy/anti_scraping/techniques/captchas.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ When you've hit a captcha, your first thought should not be how to programmatica
1919

2020
Have you exhausted all of the possible options to make your scraper appear more human-like? Are you:
2121

22-
- Using [proxies]({{@link anti_scraping/proxies.md}})?
22+
- Using [proxies]({{@link anti_scraping/mitigation/proxies.md}})?
2323
- Making the request with the proper [headers]({{@link concepts/http_headers.md}}) and [cookies]({{@link concepts/http_cookies.md}})?
2424
- Generating and using a custom [browser fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}})?
2525
- Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser?
@@ -38,4 +38,4 @@ Another popular captcha is the [Geetest slider captcha](https://www.geetest.com/
3838

3939
## Wrap up
4040

41-
In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the [next section]({{@link anti_scraping/proxies.md}}), we'll be discussing one of the most crucial parts of web scraping and web-automation: how to properly leverage proxies to avoid many of the anti-scraping techniques that were discussed in this section.
41+
In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the [next section]({{@link anti_scraping/mitigation.md}}), we'll be discussing how to mitigate the anti-scraping techniques you learned about in this section.
