Commit e3204bd

authored
Merge pull request #446 from apify/anti-scraping-migrations
feat(anti-scraping): migrations
2 parents 7aaa603 + 7b35a19 commit e3204bd

5 files changed

+69
-2
lines changed

content/academy/anti_scraping.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,14 @@ One of the most successful and advanced methods is collecting the browser's "fin
8989

9090
> It's important to note that this method also blocks all users that cannot evaluate JavaScript (such as bots sending only static HTTP requests), and combines both of the fundamental methods mentioned earlier.
9191
92+
### Honeypots
93+
94+
The honeypot approach is based on providing links that only bots can see. A typical example is hidden pagination. A bot usually needs to go through every page of the pagination, so the website's last "fake" page contains a link that is hidden from human users but has the same selector as the real one. Once the bot visits the link, it is automatically blacklisted. This method needs only the HTTP information.
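
To make the honeypot idea concrete, here is a minimal server-side sketch. It is an illustration only, not any particular website's implementation: the `HONEYPOT_PATHS` set, the `handleRequest` function, the example path, and the status codes are all hypothetical.

```JavaScript
// Hypothetical paths that are only reachable through links hidden from human users.
const HONEYPOT_PATHS = new Set(['/products?page=9999']);

// IP addresses that have followed a honeypot link.
const blacklist = new Set();

// Returns the HTTP status code the site would answer with.
function handleRequest(ip, path) {
    // Anything from a blacklisted IP is rejected.
    if (blacklist.has(ip)) return 403;

    // Only a bot following hidden links ends up here, so remember its IP.
    if (HONEYPOT_PATHS.has(path)) {
        blacklist.add(ip);
        return 403;
    }

    return 200;
}

const beforeTrap = handleRequest('203.0.113.5', '/products?page=2');
const onTrap = handleRequest('203.0.113.5', '/products?page=9999');
const afterTrap = handleRequest('203.0.113.5', '/products?page=2');
const otherVisitor = handleRequest('203.0.113.9', '/products?page=2');
```

Note that the decision uses nothing but the requester's IP address and the requested path, which is why this method works on HTTP information alone.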
95+
96+
### IP-session consistency
97+
98+
This technique is commonly used to block bots from accessing a website. It works on the principle that every entity that accesses the site gets a token. This token is then saved together with the IP address and HTTP request information such as the user-agent and other specific headers. If the entity makes another request without the session cookie, the IP address is added to the grey list.
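
The following is a rough sketch of how such a check could work on the server, under simplifying assumptions: everything is kept in memory, and the names `sessions`, `knownIps`, `greyList`, and `handleRequest` are hypothetical. A real implementation would issue random, unguessable tokens rather than a counter.

```JavaScript
const sessions = new Map(); // token -> { ip, userAgent }
const knownIps = new Set(); // IPs that have already been issued a token
const greyList = new Set(); // IPs that came back without their session cookie

let nextId = 0;

function handleRequest({ ip, userAgent, sessionToken }) {
    // A request carrying a known token continues its existing session.
    if (sessionToken && sessions.has(sessionToken)) {
        return { sessionToken, greyListed: greyList.has(ip) };
    }

    // No valid token: if this IP was already given one, it is suspicious.
    if (knownIps.has(ip)) greyList.add(ip);
    knownIps.add(ip);

    // Issue a new token and store it with the IP and request information.
    const token = `session-${++nextId}`;
    sessions.set(token, { ip, userAgent });
    return { sessionToken: token, greyListed: greyList.has(ip) };
}

// First visit: a token is issued.
const first = handleRequest({ ip: '203.0.113.5', userAgent: 'example-agent' });
// Second visit sends the cookie back, so nothing happens.
const second = handleRequest({ ip: '203.0.113.5', userAgent: 'example-agent', sessionToken: first.sessionToken });
// Third visit drops the cookie, so the IP lands on the grey list.
const third = handleRequest({ ip: '203.0.113.5', userAgent: 'example-agent' });
```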
99+
92100
## [](#first) First up
93101

94102
In our [first section]({{@link anti_scraping/techniques.md}}), we'll discuss in more depth the various anti-scraping methods and techniques websites use, as well as how to mitigate these protections.

content/academy/anti_scraping/mitigation/generating_fingerprints.md

Lines changed: 41 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,29 @@ paths:
88

99
# [](#generating-fingerprints) Generating fingerprints
1010

11-
With the [Fingerprint generator](https://github.com/apify/fingerprint-generator) NPM package, you can easily generate a browser fingerprint.
11+
In [**Crawlee**](https://crawlee.dev), it's extremely easy to automatically generate fingerprints using the [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler.
12+
13+
```JavaScript
14+
import { PlaywrightCrawler } from 'crawlee';
15+
16+
const crawler = new PlaywrightCrawler({
17+
browserPoolOptions: {
18+
fingerprintOptions: {
19+
fingerprintGeneratorOptions: {
20+
browsers: [{ name: 'firefox', minVersion: 80 }],
21+
devices: ['desktop'],
22+
operatingSystems: ['windows'],
23+
},
24+
},
25+
},
26+
});
27+
```
28+
29+
> Note that Crawlee will automatically generate fingerprints for you with no configuration necessary, but the option to configure them yourself is still there within **browserPoolOptions**.
30+
31+
## [](#using-fingerprint-generator) Using the fingerprint-generator package
32+
33+
Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-generator) NPM package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can easily generate browser fingerprints.
1234

1335
> It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results.
1436
@@ -37,7 +59,7 @@ const generated = fingerprintGenerator.getFingerprint({
3759

3860
## [](#injecting-fingerprints) Injecting fingerprints
3961

40-
Once you've generated a fingerprint, it can be injected into the browser using the [Fingerprint injector](https://github.com/apify/fingerprint-injector) package. This tool allows you to inject fingerprints to browsers automated by Playwright or Puppeteer:
62+
Once you've manually generated a fingerprint using the **Fingerprint generator** package, it can be injected into the browser using [**fingerprint-injector**](https://github.com/apify/fingerprint-injector). This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:
4163

4264
```JavaScript
4365
import FingerprintGenerator from 'fingerprint-generator';
@@ -90,3 +112,20 @@ await page.goto('https://google.com');
90112
## Wrap up
91113

92114
That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
115+
116+
## [](#generating-headers) Generating headers
117+
118+
Headers are also used by websites to fingerprint users (or bots), so it might sometimes be necessary to generate some user-like headers to mitigate anti-scraping protections. As with fingerprints, **Crawlee** automatically generates headers for you, but you can have full control by using the [**browser-headers-generator**](https://github.com/apify/browser-headers-generator) package.
119+
120+
```JavaScript
121+
import BrowserHeadersGenerator from 'browser-headers-generator';
122+
123+
const browserHeadersGenerator = new BrowserHeadersGenerator({
124+
operatingSystems: ['windows'],
125+
browsers: ['chrome'],
126+
});
127+
128+
await browserHeadersGenerator.initialize();
129+
130+
const randomBrowserHeaders = await browserHeadersGenerator.getRandomizedHeaders();
131+
```

content/academy/anti_scraping/techniques/fingerprinting.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -184,6 +184,26 @@ Often times, both of these data obfuscation techniques are used together.
184184

185185
Built-in JavaScript encoding functions are used to transform the code into, for example, a hexadecimal string. Alternatively, a custom encoding function is used, and a custom decoding function decodes the code as it is evaluated in the browser.
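
As a toy illustration of the hexadecimal case, the sketch below encodes a snippet of code into a hex string and decodes it again right before it is evaluated. The `encodeHex` and `decodeHex` helpers are hypothetical, and real obfuscators layer many more transformations on top of this.

```JavaScript
// Encode every UTF-16 code unit as four hexadecimal digits.
function encodeHex(source) {
    let hex = '';
    for (let i = 0; i < source.length; i++) {
        hex += source.charCodeAt(i).toString(16).padStart(4, '0');
    }
    return hex;
}

// Reverse the encoding, four hexadecimal digits at a time.
function decodeHex(hex) {
    let source = '';
    for (let i = 0; i < hex.length; i += 4) {
        source += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
    }
    return source;
}

const original = 'return 6 * 7;';

// This unreadable string is what would be shipped to the browser.
const encoded = encodeHex(original);

// The code is decoded only at the moment it is evaluated.
const run = new Function(decodeHex(encoded));
const result = run();
```

Because the decoder has to be shipped alongside the encoded string, this kind of encoding slows a reader down rather than truly hiding the code.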
186186

187+
## Detecting fingerprinting scripts
188+
189+
As mentioned above, many sites obfuscate their fingerprinting scripts to make them harder to detect. Luckily for us, there are ways around this.
190+
191+
### Manual de-obfuscation
192+
193+
Almost all sites using fingerprinting and tracking scripts try to protect them as much as they can. However, it is impossible to make client-side JavaScript immune to reverse engineering. It is only possible to make reverse engineering difficult and unpleasant for the developer. The procedure used to make the code as unreadable as possible is called [obfuscation](https://www.techtarget.com/searchsecurity/definition/obfuscation#:~:text=Obfuscation%20means%20to%20make%20something,code%20is%20one%20obfuscation%20method.).
194+
195+
When you want to dig inside the protection code to determine exactly which data is collected, you will probably have to deobfuscate it. Be aware that this can be a very time-consuming process. Code deobfuscation can take up to 1-2 days to reach a semi-readable state.
196+
197+
We recommend watching some videos from [Jarrod Overson on YouTube](https://www.youtube.com/channel/UCJbZGfomrHtwpdjrARoMVaA/videos) to learn the tooling necessary to deobfuscate code.
198+
199+
### Using browser extensions
200+
201+
Because browser fingerprinting is such a big privacy concern and obfuscated fingerprinting scripts have become so common, many browser extensions have been created to help detect them, such as [**Don't Fingerprint Me**](https://github.com/freethenation/DFPM). In the extension's window, you can see a report on which functions commonly used for fingerprinting have been called, and which navigator properties have been accessed.
202+
203+
![Don't Fingerprint Me extension window]({{@asset anti_scraping/techniques/images/dont-fingerprint-me.webp}})
204+
205+
This extension monitors only a few critical attributes, whereas deceiving anti-scraping protections requires the full list. However, the extension does reveal the scripts that collect the fingerprints.
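
To get a feel for how property access can be monitored at all, here is a small sketch built on the standard `Proxy` object. It is not how **Don't Fingerprint Me** is actually implemented; the `trackAccess` helper and the `fakeNavigator` object are hypothetical stand-ins so that the example runs outside of a browser.

```JavaScript
// Wrap an object so that every property read is recorded in `log`.
function trackAccess(target, label, log) {
    return new Proxy(target, {
        get(obj, prop, receiver) {
            if (typeof prop === 'string') log.push(`${label}.${prop}`);
            return Reflect.get(obj, prop, receiver);
        },
    });
}

const accessLog = [];

// Stand-in for the browser's navigator object, with made-up values.
const fakeNavigator = trackAccess(
    { userAgent: 'Mozilla/5.0 (hypothetical)', hardwareConcurrency: 8, languages: ['en-US'] },
    'navigator',
    accessLog,
);

// Simulates a fingerprinting script reading a couple of attributes.
const collected = [fakeNavigator.userAgent, fakeNavigator.hardwareConcurrency].join('|');

console.log(accessLog);
```

The log shows which attributes the simulated script touched, which is the kind of report the paragraph above describes.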
206+
187207
## [](#anti-bot-fingerprinting) Anti-bot fingerprinting
188208

189209
Websites that implement advanced fingerprinting techniques will tie the fingerprint and certain headers (such as the **User-Agent** header) to the IP address of the user. These sites will block a user (or scraper) if it makes a request with one fingerprint and set of headers, then tries to make another request through the same proxy but with a different fingerprint.
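
A simplified sketch of such a consistency check might look like the code below. The `identityByIp` map, the `isAllowed` function, and the shape of the request object are hypothetical, and real protections compare far more signals than a single fingerprint hash and one header.

```JavaScript
// Remembers which fingerprint and User-Agent were first seen for each IP address.
const identityByIp = new Map();

function isAllowed({ ip, fingerprintHash, userAgent }) {
    const identity = `${fingerprintHash}|${userAgent}`;
    const known = identityByIp.get(ip);

    // First request from this IP: remember its identity.
    if (known === undefined) {
        identityByIp.set(ip, identity);
        return true;
    }

    // Later requests must present the same fingerprint and headers.
    return known === identity;
}

const firstVisit = isAllowed({ ip: '203.0.113.5', fingerprintHash: 'abc123', userAgent: 'agent-a' });
const sameIdentity = isAllowed({ ip: '203.0.113.5', fingerprintHash: 'abc123', userAgent: 'agent-a' });
const changedFingerprint = isAllowed({ ip: '203.0.113.5', fingerprintHash: 'def456', userAgent: 'agent-a' });
```

For a scraper, the practical consequence is to keep one fingerprint and one set of headers for as long as the same proxy IP is in use.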
240 KB
-54.8 KB
Binary file not shown.
