
Commit 753797d

fix: typos and stylistic improvements
1 parent 069d356 commit 753797d


14 files changed: +29 -25 lines changed


sources/academy/tutorials/apify_scrapers/getting_started.md

Lines changed: 1 addition & 1 deletion
@@ -290,7 +290,7 @@ The scraper:

## [](#scraping-practice) Scraping practice

-We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url` because it is the URL and includes the Unique identifier.
+We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url`, because it is the URL and includes the Unique identifier.

```js
const { url } = request;
```
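
As an illustration, a minimal sketch of how `request.url` could feed the output inside a Web Scraper `pageFunction`; deriving the unique identifier from the last two URL path segments is an assumption made for this example:

```js
async function pageFunction(context) {
    const { request } = context;
    const { url } = request;

    // Assumption for illustration: the unique identifier is formed
    // by the last two segments of the URL path.
    const uniqueIdentifier = url.split('/').slice(-2).join('/');

    return {
        url,
        uniqueIdentifier,
    };
}
```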

sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ You want to crawl a website with a proxy pool, but most of your proxies are bloc

Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](/platform/proxy#residential-proxy), but they can sometimes be too costly.

-However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class.
+However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with the `PuppeteerCrawler` class.

### Solution
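
A minimal sketch of this detect-and-throw pattern with Crawlee's `PuppeteerCrawler`; the captcha selector is an assumption and would differ per site:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 5,
    async requestHandler({ request, page, response }) {
        const status = response?.status();
        // Assumption: the site shows a reCAPTCHA iframe when it blocks us.
        const hasCaptcha = (await page.$('iframe[src*="recaptcha"]')) !== null;

        if ((status && status >= 400) || hasCaptcha) {
            // Throwing marks the request as failed, so it gets retried later,
            // ideally through a different proxy or session.
            throw new Error(`Blocked request (status ${status}): ${request.url}`);
        }

        // ... extract data here ...
    },
});

await crawler.run(['https://example.com']);
```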

sources/academy/webscraping/anti_scraping/index.md

Lines changed: 3 additions & 3 deletions
@@ -12,7 +12,7 @@ slug: /anti-scraping

---

-If at any point in time you've strayed away from the Academy's demo content, and into the wild west by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions.
+If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions.

This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more.

@@ -91,7 +91,7 @@ A common workflow of a website after it has detected a bot goes as follows:

2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot. Typically a **captcha**. If the bot succeeds, it is added to the whitelist.
3. If the captcha is failed, the bot is added to the blacklist.

-One thing to keep in mind while navigating through this course is that advanced scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations.
+One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations.

Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them.

@@ -111,7 +111,7 @@ Because we here at Apify scrape for a living, we have discovered many popular an

### IP rate-limiting

-This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address.
+This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false positives due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address.

> Learn more about rate limiting [here](./techniques/rate_limiting.md)
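
Purely as an illustration of the website's side of this protection, a naive in-memory rate limiter could look like the sketch below; real services use far more robust solutions:

```js
const WINDOW_MS = 60_000; // time span: 1 minute
const MAX_REQUESTS = 100; // defined max number of requests per IP

const hitsPerIp = new Map();

function isRateLimited(ip) {
    const now = Date.now();
    // Keep only the timestamps that fall within the current window.
    const recentHits = (hitsPerIp.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
    recentHits.push(now);
    hitsPerIp.set(ip, recentHits);
    return recentHits.length > MAX_REQUESTS;
}
```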

sources/academy/webscraping/anti_scraping/mitigation/proxies.md

Lines changed: 14 additions & 4 deletions
@@ -26,17 +26,27 @@ Although IP quality is still the most important factor when it comes to using pr

Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md).

-## A bit about proxy links {#understanding-proxy-links}
+## About proxy links {#understanding-proxy-links}

-When using proxies in your crawlers, you'll most likely be using them in a format that looks like this:
+To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials.

```text
http://proxy.example.com:8080
```

-This link is separated into two main components: the **host**, and the **port**. In our case, our hostname is `http://proxy.example.com`, and our port is `8080`. Sometimes, a proxy might use an IP address as the host, such as `103.130.104.33`.
+The proxy link above has several parts:

-If authentication (a username and a password) is required, the format will look a bit different:
+- `http://` tells us we're using the HTTP protocol,
+- `proxy.example.com` is a hostname, i.e. the address of the proxy server,
+- `8080` is a port number.
+
+Sometimes the proxy server has no name, so the link contains an IP address instead:
+
+```text
+http://123.456.789.10:8080
+```
+
+If the proxy requires authentication, the proxy link can contain a username and password:

```text
http://USERNAME:[email protected]:8080
```
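
To make this concrete, one way such a proxy link maps onto real configuration is Playwright's `proxy` launch option; the sketch below reuses the example hostname and placeholder credentials from above:

```js
import { chromium } from 'playwright';

const browser = await chromium.launch({
    proxy: {
        server: 'http://proxy.example.com:8080', // host and port
        username: 'USERNAME', // only needed when the proxy requires authentication
        password: 'PASSWORD',
    },
});

const page = await browser.newPage();
await page.goto('https://example.com');
await browser.close();
```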

sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md

Lines changed: 1 addition & 1 deletion
@@ -176,7 +176,7 @@ The script is modified with some random JavaScript elements. Additionally, it al

### Data obfuscation

-Two main data obfuscation techniues are widely employed:
+Two main data obfuscation techniques are widely employed:

1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`.
2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect.
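
A tiny, purely illustrative sketch of what these two techniques can look like in obfuscated code:

```js
// String splitting: the sensitive expression only exists after concatenation,
// so a plain text search for "webdriver" in the source finds nothing.
eval('navi' + 'gator.web' + 'driver');

// Keyword replacement: property access goes through bracket notation built
// from reordered substrings, which further hides the accessed keyword.
const parts = ['driver', 'web'];
const isAutomated = window['navigator'][parts[1] + parts[0]];
```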

sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ In the past, most websites had their own anti-scraping solutions, the most commo

In cases when a higher number of requests is expected for the crawler, using a [proxy](../mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.

-## Dealing rate limiting with proxy/session rotating {#dealing-with-rate-limiting}
+## Dealing with rate limiting by rotating proxy or session {#dealing-with-rate-limiting}

The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](../mitigation/proxies.md) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make large amounts to a website without getting restricted.
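
A minimal sketch of such rotation with Crawlee, assuming two placeholder proxy URLs; the crawler picks a proxy per session and a session can be rotated out when it gets blocked:

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs - in practice these come from your proxy provider.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8080',
        'http://proxy-2.example.com:8080',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true, // sessions tie requests to proxies and can be rotated when blocked
    async requestHandler({ request, $ }) {
        // ... extract data from the parsed page here ...
    },
});

await crawler.run(['https://example.com']);
```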

sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ If we were to make a request with the **limit** set to **5** and the **offset**

## Cursor pagination {#cursor-pagination}

-Becoming more and more common is cursor-based pagination. Like with offset-based pagination, a **limit** parameter is usually present; however, instead of **offset**, **cursor** is used instead. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned back from the API will be records that come after the item matching the **cursor** parameter provided.
+Sometimes pagination uses a **cursor** instead of an **offset**. A cursor is a marker of an item in the dataset. It can be a date, a number, or a more or less random string of letters and numbers. A request with a **cursor** parameter will result in an API response containing items which follow after the item which the cursor points to.

One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one.
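
For illustration, a sketch of paginating through such an API page by page, assuming a hypothetical endpoint that returns `items` together with a `nextCursor` field:

```js
// Hypothetical endpoint and response shape, for illustration only.
const BASE_URL = 'https://api.example.com/tracks';

const allItems = [];
let cursor;

do {
    const url = new URL(BASE_URL);
    url.searchParams.set('limit', '20');
    if (cursor) url.searchParams.set('cursor', cursor);

    const response = await fetch(url);
    const { items, nextCursor } = await response.json();

    allItems.push(...items);
    // The API hands back the cursor for the next page, or null on the last one.
    cursor = nextCursor;
} while (cursor);
```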

sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ _Here's what we can see in the Network tab after reloading the page:_

Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found.

-> **Note:** The keyword/piece of data that is used in this filtered search should be a target keyword or a piece of target data that that can be assumed will most likely be a part of the endpoint.
+> To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case the string `tracks`).

After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks:

sources/academy/webscraping/puppeteer_playwright/browser_contexts.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ await browser.close();

## Using browser contexts {#using-browser-contexts}

-In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android:
+In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device:

<Tabs groupId="main">
<TabItem value="Playwright" label="Playwright">
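
The Playwright side of this pattern might look like the following sketch; the device names assume Playwright's built-in descriptors:

```js
import { chromium, devices } from 'playwright';

const browser = await chromium.launch();

// Each context is isolated (own cookies, cache, storage) and carries a device profile.
const iphoneContext = await browser.newContext({ ...devices['iPhone 13'] });
const androidContext = await browser.newContext({ ...devices['Pixel 5'] });

const iphonePage = await iphoneContext.newPage();
const androidPage = await androidContext.newPage();

await Promise.all([
    iphonePage.goto('https://example.com'),
    androidPage.goto('https://example.com'),
]);

await browser.close();
```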

sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md

Lines changed: 1 addition & 5 deletions
@@ -14,11 +14,7 @@ import TabItem from '@theme/TabItem';

---

-Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website.
-
-> Most web data extraction cases involve looping through a list of items of some sort.
-
-Playwright & Puppeteer offer two main methods for data extraction
+Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction:

1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`.
2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio)
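
As a sketch of both approaches side by side, assuming Playwright; the `a.product-title` selector is a placeholder rather than the Fakestore's actual markup:

```js
import { chromium } from 'playwright';
import * as cheerio from 'cheerio';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://demo-webstore.apify.org/search/on-sale');

// 1. Extract directly in the browser context with page.$$eval().
const titlesFromBrowser = await page.$$eval('a.product-title', (elements) =>
    elements.map((el) => el.textContent.trim()),
);

// 2. Extract in the Node.js context by loading the page's HTML into Cheerio.
const $ = cheerio.load(await page.content());
const titlesFromNode = $('a.product-title')
    .map((_, el) => $(el).text().trim())
    .get();

console.log(titlesFromBrowser, titlesFromNode);
await browser.close();
```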
