content/academy/anti_scraping/mitigation.md
+1 -1 (1 addition & 1 deletion)
@@ -10,7 +10,7 @@ paths:
10
10
11
11
In the [techniques]({{@link anti_scraping/techniques.md}}) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content. This **Mitigation** section is all about how to circumvent these protections using a variety of techniques.
12
12
13
-
<!-- Here there should -->
13
+
<!-- Here there should be a bit of an outline of what mitigation techniques they'll be learning-->
With the [Fingerprint generator](https://github.com/apify/fingerprint-generator) NPM package, you can easily generate a browser fingerprint.
12
+
13
+
> It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results.
Once you've generated a fingerprint, it can be injected into the browser using the [Fingerprint injector](https://github.com/apify/fingerprint-injector) package. This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:
> Note that the Apify SDK automatically applies a wide variety of fingerprints by default, so this is not required unless you aren't using the Apify SDK or you need a very specific custom fingerprint to scrape with.
89
+
90
+
## [](#next) Next up
91
+
92
+
That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
content/academy/anti_scraping/mitigation/using_proxies.md
+1 -1 (1 addition & 1 deletion)
@@ -139,4 +139,4 @@ Notice that we didn't provide it a list of proxy URLs. This is because the `SHAD
139
139
140
140
## [](#next) Next up
141
141
142
-
That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content! Alternatively, you can subscribe to our mailing list to get periodic updates on the Academy, as well as what Apify is up to.
142
+
[Next up]({{@link anti_scraping/mitigation/generating_fingerprints.md}}), we'll be checking out how to use two NPM packages to generate and inject [browser fingerprints]({{@link anti_scraping/techniques/fingerprinting.md}}).
Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings. All of this information is called the **fingerprint** of the browser, and the act of collecting it is called **fingerprinting**.
12
12
13
-
Often times, websites use fingerprinting to track the online behavior of their users in order to serve hyper-personalized advertisements to them. However, in some cases, it is also used to aid in preventing bots from accessing the websites (or certain sections of it).
13
+
Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is easily accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique.
14
14
15
-
Here is an example of what a browser fingerprint might look like:
15
+
Based on [research](https://www.eff.org/press/archives/2010/05/13) carried out by the Electronic Frontier Foundation, 84% of collected fingerprints are globally unique, and the next 9% fall into sets of size two. The researchers also stated that even though fingerprints change over time, new ones can be matched with old ones with 99.1% accuracy. This makes fingerprinting a very viable option for websites that want to track the online behavior of their users in order to serve hyper-personalized advertisements to them. In some cases, it is also used to help prevent bots from accessing a website (or certain sections of it).
16
+
17
+
## [](#what-makes-up-a-fingerprint) What makes up a fingerprint?
18
+
19
+
To build a good fingerprint, websites must collect data from various places.
20
+
21
+
### [](#from-http-headers) From HTTP headers
22
+
23
+
There are a few [HTTP headers]({{@link concepts/http_headers.md}}) that can be used to create a fingerprint of a user. Here are some of the main ones:
24
+
25
+
1. **User-Agent** provides information about the browser and the operating system (including their versions).
26
+
2. **Accept** tells the server what content types the browser can handle, and **Accept-Encoding** tells it which compression algorithms the browser supports.
27
+
3. **Content-Language** and **Accept-Language** both indicate the user's (and browser's) preferred language.
28
+
4. **Referer** gives the server the address of the previous page from which the link was followed.
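As a rough sketch of how a server might combine the headers listed above, the hypothetical `headerFingerprint` helper below (not part of any library) joins their values into a single key. The lowercase header names, the choice of headers, and the `|` separator are all assumptions made for illustration; real fingerprinting solutions typically use many more signals.

```javascript
// Headers assumed to be relevant here; real solutions may use many more.
const FINGERPRINT_HEADERS = ['user-agent', 'accept', 'accept-language', 'referer'];

// Hypothetical helper: builds a stable key from a request's headers.
// `headers` is a plain object keyed by lowercase header names.
function headerFingerprint(headers) {
    return FINGERPRINT_HEADERS
        // Missing headers become empty strings so the key shape stays stable.
        .map((name) => `${name}=${headers[name] ?? ''}`)
        .join('|');
}

const exampleKey = headerFingerprint({
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
    'accept-language': 'en-US,en;q=0.9',
});
console.log(exampleKey);
```

Two requests that send identical values for these headers produce the same key, which is why headers alone are usually only one part of a fingerprint.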
29
+
30
+
A few other headers commonly used for fingerprinting can be seen below:
### [](#from-window-properties) From window properties
35
+
36
+
The `window` object is a global variable that is accessible from any JavaScript running in the browser. It is home to a vast number of functions, variables, and constructors, and most of the global configuration is stored there.
37
+
38
+
Most of the attributes that are used for fingerprinting are stored under the `window.navigator` object, which holds methods and info about the user's state and identity, starting with the **User-Agent** itself and ending with the device's battery status. All of these properties can be used to fingerprint a device; however, most fingerprinting solutions (such as [FingerprintJS](https://valve.github.io/fingerprintjs/)) only use the most crucial ones.
39
+
40
+
Here is a list of some of the most crucial properties on the `window` object used for fingerprinting:
41
+
42
+
| Property | Example | Description |
43
+
| - | - | - |
44
+
|`screen.width`|`1680`| Defines the width of the device screen. |
45
+
|`screen.height`|`1050`| Defines the height of the device screen. |
46
+
|`screen.availWidth`|`1680`| The portion of the screen width available to the browser window. |
47
+
|`screen.availHeight`|`1050`| The portion of the screen height available to the browser window. |
48
+
|`navigator.userAgent`|`'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0'`| Same as the HTTP header. |
49
+
|`navigator.platform`|`'MacIntel'`| The platform the browser is running on. |
50
+
|`navigator.cookieEnabled`|`true`| Whether or not the browser accepts cookies. |
51
+
|`navigator.doNotTrack`|`'1'`| Indicates the browser's Do Not Track settings. |
52
+
|`navigator.buildID`|`20181001000000`| The build ID of the browser. |
53
+
|`navigator.product`|`'Gecko'`| The layout engine used. |
54
+
|`navigator.productSub`|`20030107`| The version of the layout engine used. |
55
+
|`navigator.vendor`|`'Google Inc.'`| Vendor of the browser. |
56
+
|`navigator.hardwareConcurrency`|`4`| The number of logical processors the user's computer has available to run threads on. |
57
+
|`navigator.javaEnabled()`|`false`| Whether or not the user has enabled Java. |
58
+
|`navigator.deviceMemory`|`8`| The approximate amount of device memory (in gigabytes). |
59
+
|`navigator.language`|`'en-US'`| The user's primary language. |
60
+
|`navigator.languages`|`['en-US', 'cs-CZ', 'es']`| The user's preferred languages, in order of preference. |
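To make the table more concrete, here is a minimal sketch of how a script might gather a few of these properties into one object. The `collectWindowProperties` name and the shape of the returned object are hypothetical; only the `screen` and `navigator` properties themselves come from the browser.

```javascript
// Hypothetical helper: reads a handful of the properties from the table.
// `win` is the browser's `window` (or any object shaped like it).
function collectWindowProperties(win) {
    const { screen, navigator } = win;
    return {
        screenSize: `${screen.width}x${screen.height}`,
        availableSize: `${screen.availWidth}x${screen.availHeight}`,
        userAgent: navigator.userAgent,
        platform: navigator.platform,
        language: navigator.language,
        // Copy into a plain array; fall back to an empty list if missing.
        languages: Array.from(navigator.languages ?? []),
        hardwareConcurrency: navigator.hardwareConcurrency,
        // Not every browser exposes deviceMemory, so fall back to null.
        deviceMemory: navigator.deviceMemory ?? null,
        cookieEnabled: navigator.cookieEnabled,
    };
}

// In a browser: console.log(collectWindowProperties(window));
```

Passing `window` in as a parameter, rather than reading the global directly, is just a design choice that makes the helper easy to try with a mock object outside the browser.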
61
+
62
+
### [](#from-function-calls) From function calls
63
+
64
+
Fingerprinting tools can also collect pieces of information that are retrieved by calling specific functions:
65
+
66
+
```JavaScript
// getParameter() is an instance method, so create a WebGL context first
const gl = document.createElement('canvas').getContext('webgl');
// The unmasked vendor/renderer values are exposed through this extension
const debugInfo = gl.getExtension('WEBGL_debug_renderer_info');

// Get the WebGL vendor information (UNMASKED_VENDOR_WEBGL === 37445)
gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);

// Get the WebGL renderer information (UNMASKED_RENDERER_WEBGL === 37446)
gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

// Pass any codec into this function (ex. "audio/aac"). It will return
// either "maybe", "probably", or "" indicating whether
// or not the browser can play that codec. An empty
// string means that it can't be played.
document.createElement('audio').canPlayType('audio/aac');

// Allows you to know which permissions the user has
// enabled, and which are disabled. Returns a promise that
// resolves with a state of "granted", "denied", or "prompt".
navigator.permissions.query({ name: 'notifications' });
```
84
+
85
+
### [](#with-canvases) With canvases
86
+
87
+
This technique is based on rendering [WebGL](https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API) scenes to a canvas element and observing the pixels rendered. WebGL rendering is tightly connected with the hardware, and therefore provides high entropy. Here's a quick breakdown of how it works:
88
+
89
+
1. A script creates a [`<canvas>` element](https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API) and renders some text in a given font or a custom shape.
90
+
2. The script then gets the pixel-map from the `<canvas>` element.
91
+
3. The collected pixel-map is passed through a cryptographic hash function, producing a value specific to the device's hardware.
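The three steps above can be sketched roughly as follows. This is an illustration under a few assumptions, not how any particular fingerprinting library works: it uses a 2D context instead of a WebGL scene for brevity, the `canvasFingerprint` and `fnv1a` names are hypothetical, and FNV-1a is a simple non-cryptographic hash standing in for the cryptographic hash a real solution would use.

```javascript
// FNV-1a (32-bit): a simple NON-cryptographic hash, used here only as a
// stand-in for the cryptographic hash described in step 3.
function fnv1a(str) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16).padStart(8, '0');
}

// Hypothetical helper. `doc` is the browser's `document`.
function canvasFingerprint(doc) {
    // Step 1: create a canvas and render a shape and some text.
    const canvas = doc.createElement('canvas');
    canvas.width = 240;
    canvas.height = 60;
    const ctx = canvas.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = '16px Arial';
    ctx.fillStyle = '#f60';
    ctx.fillRect(10, 10, 100, 30);
    ctx.fillStyle = '#069';
    ctx.fillText('Hello, fingerprint!', 4, 20);

    // Step 2: read the rendered pixels back (encoded as a data URL).
    const pixels = canvas.toDataURL();

    // Step 3: hash the pixel data into a short ID.
    return fnv1a(pixels);
}

// In a browser: console.log(canvasFingerprint(document));
```

Because fonts, anti-aliasing, and GPU drivers differ between machines, the same drawing commands can yield slightly different pixels, and therefore a different ID.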
92
+
93
+
Canvas fingerprinting takes advantage of the CSS3 feature for importing fonts into CSS (called [WebFonts](https://developer.mozilla.org/en-US/docs/Learn/CSS/Styling_text/Web_fonts)). This means it's not required to use just the machine's preinstalled fonts.
94
+
95
+
Here's an example of multiple WebGL scenes visibly being rendered differently on different machines:
96
+
97
+

98
+
99
+
### [](#from-audiocontext) From AudioContext
100
+
101
+
The [AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) API represents an audio-processing graph built from audio modules linked together, each represented by an [AudioNode](https://developer.mozilla.org/en-US/docs/Web/API/AudioNode) (for example, an [OscillatorNode](https://developer.mozilla.org/en-US/docs/Web/API/OscillatorNode)).
102
+
103
+
In the simplest cases, the fingerprint can be obtained by simply checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `<canvas>` method:
104
+
105
+
1. Audio is passed through an OscillatorNode.
106
+
2. The signal is processed and collected.
107
+
3. The collected signal is cryptographically hashed to provide a short ID.
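A minimal sketch of these steps, assuming an `OfflineAudioContext` so that no sound is actually played. The `audioFingerprint` and `summarizeSamples` names, the oscillator settings, the compressor node, and the sample range are all assumptions made for illustration, and summing the samples is a simple stand-in for the cryptographic hash mentioned in step 3.

```javascript
// Stand-in for the hashing step: reduces the samples to one short value.
function summarizeSamples(samples) {
    let sum = 0;
    for (const sample of samples) sum += Math.abs(sample);
    return sum.toString();
}

// Hypothetical helper. `OfflineCtx` is the browser's OfflineAudioContext.
async function audioFingerprint(OfflineCtx) {
    // One channel, one second of audio at 44.1 kHz.
    const ctx = new OfflineCtx(1, 44100, 44100);

    // Step 1: pass audio through an OscillatorNode.
    const oscillator = ctx.createOscillator();
    oscillator.type = 'triangle';
    oscillator.frequency.value = 10000;

    // Step 2: process the signal (here with a compressor) and collect it.
    const compressor = ctx.createDynamicsCompressor();
    oscillator.connect(compressor);
    compressor.connect(ctx.destination);
    oscillator.start(0);
    const buffer = await ctx.startRendering();

    // Step 3: reduce a slice of the rendered samples to a short ID.
    return summarizeSamples(buffer.getChannelData(0).slice(4500, 5000));
}

// In a browser: audioFingerprint(OfflineAudioContext).then(console.log);
```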
108
+
109
+
> A downside of this method is that two identical machines running the same browser will get the same ID.
110
+
111
+
### [](#from-batterymanager) From BatteryManager
112
+
113
+
The `navigator.getBattery()` function returns a promise which resolves with a [BatteryManager](https://developer.mozilla.org/en-US/docs/Web/API/BatteryManager) interface. BatteryManager offers information about whether or not the battery is charging, and how much time is left until the battery has fully discharged/charged.
114
+
115
+
On its own this method is quite weak, but it can be potent when combined with the `<canvas>` and AudioContext fingerprinting techniques mentioned above.
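As a small illustration of what a script can read from the battery, the sketch below picks out the relevant fields. The `describeBattery` and `readBattery` helpers are hypothetical names; `navigator.getBattery()` is not available in every browser, so the sketch checks for it first.

```javascript
// Hypothetical helper: picks the BatteryManager fields that could be
// used as fingerprinting signals.
function describeBattery(battery) {
    return {
        charging: battery.charging,               // true while plugged in and charging
        level: battery.level,                     // charge level from 0.0 to 1.0
        chargingTime: battery.chargingTime,       // seconds until fully charged
        dischargingTime: battery.dischargingTime, // seconds until fully discharged
    };
}

// `nav` is the browser's `navigator`. Resolves with null when the
// Battery Status API is not supported.
async function readBattery(nav) {
    if (typeof nav.getBattery !== 'function') return null;
    return describeBattery(await nav.getBattery());
}

// In a browser: readBattery(navigator).then(console.log);
```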
116
+
117
+
## [](#fingerprint-example) Fingerprint example
118
+
119
+
When all is said and done, this is what a browser fingerprint might look like:
16
120
17
121
```JSON
18
122
{
@@ -86,85 +190,6 @@ On websites which implement advanced fingerprinting techniques, they will tie th
86
190
87
191
When dealing with these cases, it's important to sync the generation of headers and fingerprints with the rotation of proxies (this is known as session rotation).
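One way to picture session rotation is as a small object that always hands out a proxy together with the headers and fingerprint generated for it. The sketch below is purely illustrative: `createSessionRotator` and `generateIdentity` are hypothetical names, and the round-robin proxy selection is an assumption rather than how any particular library does it.

```javascript
// Hypothetical sketch: a "session" bundles a proxy with the headers and
// fingerprint generated for it, so all three are rotated together.
// `generateIdentity` is a caller-supplied function returning
// `{ headers, fingerprint }`.
function createSessionRotator(proxyUrls, generateIdentity) {
    let index = -1;
    return function nextSession() {
        // Move to the next proxy, wrapping around at the end of the list.
        index = (index + 1) % proxyUrls.length;
        // Generate fresh headers and a fresh fingerprint at the same time.
        const { headers, fingerprint } = generateIdentity();
        return { proxyUrl: proxyUrls[index], headers, fingerprint };
    };
}
```

Each call to the returned `nextSession` function yields a new proxy and a new identity at once, so a website never sees the same fingerprint arriving from two different IP addresses.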
With the [Fingerprint generator](https://github.com/apify/fingerprint-generator) NPM package, you can easily generate a browser fingerprint.
92
-
93
-
> It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results.
Once you've generated a fingerprint, it can be injected into the browser using the [Fingerprint injector](https://github.com/apify/fingerprint-injector) package. This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:
> Note that the Apify SDK automatically applies a wide variety of fingerprints by default, so this is not required unless you need a very specific custom fingerprint to scrape with.
167
-
168
193
## [](#next) Next up
169
194
170
195
[Next up]({{@link anti_scraping/techniques/geolocation.md}}), we'll be covering **geolocation** methods that websites use to grab the location from which a request has been made, and how they relate to anti-scraping.