sources/academy/webscraping/anti_scraping/index.md (4 additions & 4 deletions)
@@ -12,7 +12,7 @@ slug: /anti-scraping
---
- If at any point in time you've strayed away from the Academy's demo content, and into the wild west by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions.
+ If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions.
This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more.
@@ -65,7 +65,7 @@ Unfortunately for these websites, they have to make compromises and tradeoffs. W
Anti-scraping protections can work on many different layers and use a large number of bot-identification techniques.
1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate different IP addresses, but their quality matters a lot.
- 2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, cyphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
+ 2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
3. **What you are scraping** - The same data can be extracted in many ways from a website. You can get the initial HTML, use a browser to render the full page, or reverse engineer internal APIs. Each of those endpoints can be protected differently.
4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks, or key presses.
@@ -91,7 +91,7 @@ A common workflow of a website after it has detected a bot goes as follows:
2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot. Typically a **captcha**. If the bot succeeds, it is added to the whitelist.
3. If the bot fails the captcha, it is added to the blacklist.
- One thing to keep in mind while navigating through this course is that advanced scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations.
+ One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations.
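To make "header combinations" concrete, here is a toy sketch of a server-side check. The header names are real, but the rule itself is a made-up illustration, not any specific vendor's logic:

```js
// Toy header-combination check: each header alone looks plausible,
// but the combination gives a naive scraper away.
function looksLikeBot(headers) {
    const ua = headers['user-agent'] ?? '';
    // Real browsers always send Accept-Language; bare HTTP clients often don't.
    if (!headers['accept-language']) return true;
    // Recent Chrome sends `sec-ch-ua` client hints; a scraper that copies
    // only the User-Agent string typically forgets them.
    if (ua.includes('Chrome/') && !headers['sec-ch-ua']) return true;
    return false;
}

console.log(looksLikeBot({ 'user-agent': 'Mozilla/5.0 ... Chrome/120.0' })); // true
```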
Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them.
@@ -111,7 +111,7 @@ Because we here at Apify scrape for a living, we have discovered many popular an
### IP rate-limiting
- This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address.
+ This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, there is a high potential for false positives, since an IP address is not unique to a single user; in large companies, hundreds of employees can share the same IP address.
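As a rough sketch of what such a protection does on the server side, here is a minimal fixed-window limiter; the limit and window are arbitrary example numbers:

```js
// Minimal fixed-window rate limiter sketch: at most MAX_REQUESTS per IP
// per WINDOW_MS. Both constants are made-up examples.
const MAX_REQUESTS = 100;
const WINDOW_MS = 60_000;
const counters = new Map(); // ip -> { count, windowStart }

function isAllowed(ip) {
    const now = Date.now();
    const entry = counters.get(ip);
    if (!entry || now - entry.windowStart >= WINDOW_MS) {
        counters.set(ip, { count: 1, windowStart: now });
        return true;
    }
    entry.count += 1;
    return entry.count <= MAX_REQUESTS;
}
```

Note that the counter is keyed by IP alone, which is exactly why a low limit produces false positives for offices where many users share one address.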
> Learn more about rate limiting [here](./techniques/rate_limiting.md)
sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md (1 addition & 1 deletion)
@@ -176,7 +176,7 @@ The script is modified with some random JavaScript elements. Additionally, it al
### Data obfuscation
- Two main data obfuscation techniues are widely employed:
+ Two main data obfuscation techniques are widely employed:
1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`.
2. **Keyword replacement** allows the script to mask the properties it accesses. The substrings can then be stored in a random order, which makes the script harder to detect.
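A contrived sketch of the two techniques above, runnable in a browser console; nothing here is taken from any real obfuscated script:

```js
// 1. String splitting: the literal "userAgent" never appears in the source,
//    which defeats naive keyword scanning of the script text.
const ua = window['nav' + 'igator']['user' + 'Agent'];

// 2. Keyword replacement: property names are routed through a lookup table,
//    so the pieces can be stored in any order and reshuffled per build.
const keys = { a: 'plat' + 'form', b: 'lang' + 'uage' };
const info = [navigator[keys.a], navigator[keys.b]];

console.log(ua, info);
```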
sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md (1 addition & 1 deletion)
@@ -17,7 +17,7 @@ In the past, most websites had their own anti-scraping solutions, the most commo
In cases when a higher number of requests is expected for the crawler, using a [proxy](../mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.
- ## Dealing rate limiting with proxy/session rotating {#dealing-with-rate-limiting}
+ ## Dealing with rate limiting by rotating proxy or session {#dealing-with-rate-limiting}
The most popular and effective way of avoiding rate-limiting issues is to rotate [proxies](../mitigation/proxies.md) after every **n** requests, which makes your scraper appear to be making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted.
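For illustration, a minimal sketch of IP rotation using Apify's open-source [Crawlee](https://crawlee.dev) library; the proxy URLs are placeholders you would replace with your own pool:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs; substitute your own.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true, // ties cookies and an IP together per session
    async requestHandler({ request, proxyInfo, $ }) {
        console.log(`${request.url} via ${proxyInfo?.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com/']);
```

Pairing each rotated IP with a session also keeps cookies consistent per identity, which many sites check alongside the address itself.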
sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md (1 addition & 1 deletion)
@@ -37,7 +37,7 @@ If we were to make a request with the **limit** set to **5** and the **offset**
## Cursor pagination {#cursor-pagination}
- Becoming more and more common is cursor-based pagination. Like with offset-based pagination, a **limit** parameter is usually present; however, instead of **offset**, **cursor** is used instead. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned back from the API will be records that come after the item matching the **cursor** parameter provided.
+ Sometimes pagination uses **cursor** instead of **offset**. A cursor is a marker of an item in the dataset. It can be a date, a number, or a more or less random string of letters and numbers. A request with a **cursor** parameter will result in an API response containing the items which follow the item the cursor points to.
One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one.
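A minimal sketch of that one-by-one walk; the endpoint and the `items`/`nextCursor` field names are hypothetical, since every API names them differently:

```js
// Hypothetical cursor-paginated API; adjust the field names to the real response.
async function fetchAllItems() {
    const items = [];
    let cursor = null;
    do {
        const url = new URL('https://api.example.com/items');
        url.searchParams.set('limit', '100');
        if (cursor) url.searchParams.set('cursor', cursor);
        const response = await fetch(url);
        const data = await response.json();
        items.push(...data.items);
        // The API returns the cursor for the next page, or null at the end.
        cursor = data.nextCursor ?? null;
    } while (cursor);
    return items;
}
```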
sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md (1 addition & 5 deletions)
@@ -14,11 +14,7 @@ import TabItem from '@theme/TabItem';
---
- Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website.
-
- > Most web data extraction cases involve looping through a list of items of some sort.
-
- Playwright & Puppeteer offer two main methods for data extraction
+ Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction:
1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`.
2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio)
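A short sketch of both of the methods above with Playwright; the product-link selector is an assumption about Fakestore's markup, so verify it in DevTools first:

```js
import { chromium } from 'playwright';
import * as cheerio from 'cheerio';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://demo-webstore.apify.org/search/on-sale');

// 1. Extract inside the browser context with page.$$eval().
//    The selector is a guess at the product-card markup.
const titles = await page.$$eval('a[href*="/product/"]', (links) =>
    links.map((link) => link.textContent.trim()),
);

// 2. Or pull the rendered HTML into Node.js and parse it with Cheerio.
const $ = cheerio.load(await page.content());
const sameTitles = $('a[href*="/product/"]')
    .map((_, el) => $(el).text().trim())
    .get();

console.log(titles.length, sameTitles.length);
await browser.close();
```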