content/academy/analyzing_pages_and_fixing_errors.md · 8 additions & 12 deletions
@@ -9,7 +9,7 @@ paths:
# [](#analyzing-a-page-and-fixing-errors) Analyzing a page and fixing errors
-Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you safe money my allowing you to avoid hiring an expensive developer to solve your issue for you.
+Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.
This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.
@@ -65,8 +65,7 @@ try {
// ...
} catch (error) {
// You know where the code crashed so you can explain here
-console.error('Request failed during login with an error:');
-throw error;
+throw new Error('Request failed during login with an error', { cause: error });
// You know where the code crashed so you can explain here
-console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
-throw error;
+throw new Error('Request failed during login with an error', { cause: error });
}
// ...
```
@@ -125,8 +123,9 @@ To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a ran
Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's just look at simple **dataset** reporting.
-<!-- TODO: Make the code example below make sense without using Apify API or SDK -->
-<!-- This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.
+## [](#with-the-apify-sdk) With the Apify SDK
+
+This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.
```JavaScript
import { Actor } from 'apify';
@@ -172,11 +171,8 @@ try {
await reportingDataset.pushData(report);
// You know where the code crashed so you can explain here
-console.error(
-    `Request failed during login with an error. Screenshot: ${screenshotLink}`
-);
-throw error;
+throw new Error('Request failed during login with an error', { cause: error });
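
Taken together, the changes in this file converge on one pattern: push a descriptive error report to a named dataset, then re-throw with `{ cause }` so the original stack trace survives. A minimal sketch of that pattern using the Apify SDK's `Actor` API is below; the dataset name, report fields, and `screenshotLink` value are illustrative placeholders, not values from the full article.

```JavaScript
import { Actor } from 'apify';

await Actor.init();

// Named datasets have infinite retention, so error reports outlive individual runs.
const reportingDataset = await Actor.openDataset('REPORTS');

try {
    // ... login logic that may fail ...
} catch (error) {
    // Hypothetical link to the ERROR-LOGIN snapshot saved earlier in the run.
    const screenshotLink = 'https://example.com/ERROR-LOGIN.png';

    // Accumulate a human-readable report that links to the snapshot.
    await reportingDataset.pushData({
        errorType: 'login',
        errorMessage: error.toString(),
        screenshot: screenshotLink,
    });

    // Re-throw with context; the `cause` option keeps the original error and stack attached.
    throw new Error('Request failed during login with an error', { cause: error });
}

await Actor.exit();
```

The `cause` option (ES2022) lets a caller inspect `error.cause` later instead of relying on a separately logged message.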
content/academy/caching_responses_in_puppeteer.md · 2 additions & 10 deletions
@@ -9,9 +9,9 @@ paths:
# [](#caching-responses-in-puppeteer) Caching responses in Puppeteer
-> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article. Puppeteer now uses a native cache that should work well enough for most use cases.
+> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).
-When running crawlers that go through a single website, each open page has to load all resources again (sadly, headless browsers don't use cache). The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).
+When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).
For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain header **cache-control** with **max-age** above **0**).
@@ -155,14 +155,6 @@ const crawler = new PuppeteerCrawler({
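
The hunks above only touch this article's intro, but the technique the article describes is easy to summarize: intercept requests, keep responses whose `cache-control` header carries a `max-age` above 0 in memory, and serve repeats from that cache. The sketch below is illustrative only (the URL and the cache shape are assumptions, not the article's full code), and it inherits the trade-off from the note above: interception disables the native cache.

```JavaScript
import puppeteer from 'puppeteer';

// Simple in-memory cache keyed by URL. Enabling request interception disables
// Puppeteer's native cache, which is exactly what the warning above is about.
const cache = {};

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setRequestInterception(true);

page.on('request', (request) => {
    const cached = cache[request.url()];
    if (cached) {
        // Serve the stored response instead of going back to the network.
        request.respond(cached);
    } else {
        request.continue();
    }
});

page.on('response', async (response) => {
    const url = response.url();
    const cacheControl = response.headers()['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    // Only store responses the server itself marks as cacheable.
    if (cache[url] || !maxAgeMatch || Number(maxAgeMatch[1]) <= 0) return;
    const body = await response.buffer().catch(() => null);
    if (body) {
        cache[url] = { status: response.status(), headers: response.headers(), body };
    }
});

await page.goto('https://example.com');
await browser.close();
```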
content/academy/optimizing_scrapers.md · 2 additions & 2 deletions
@@ -9,7 +9,7 @@ paths:
# [](#optimizing-scrapers) Optimizing scrapers
-Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need.
+Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more proxy bandwidth, storage, and [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) you'll use, and the higher the [subscription plan](https://apify.com/pricing) you'll need.
The goal of optimization is simple: Make the code run as fast as possible and use the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). Memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation to smaller parts). So in the rest of this article, we will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU.
@@ -29,7 +29,7 @@ Now, if you want to build your own game and you are not a C/C++ veteran with a t
What are the engines of the scraping world? A [browser](https://github.com/puppeteer/puppeteer/blob/master/docs/api.md), an [HTTP library](https://www.npmjs.com/package/@apify/http-request), an [HTML parser](https://github.com/cheeriojs/cheerio), and a [JSON parser](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse). The CPU spends more than 99% of its workload in these libraries. As with engines, you are not likely gonna write these from scratch - instead you'll use something like [Crawlee](https://crawlee.dev) that handles a lot of the overheads for you.
-It is about how you use these tools. The small amount of code you write in your [`requestHandler`](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the lightweight option from the tools and use it as little as possible. A crawler scraping only JSON API can be as much as 50 times faster/cheaper than a browser based solution.
+It is about how you use these tools. The small amount of code you write in your [`requestHandler`](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the lightweight option from the tools and use it as little as possible. A crawler scraping only a JSON API can be as much as 200 times faster/cheaper than a browser-based solution.
**Ranking of the tools from the most efficient to the least:**
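
To make the "200 times faster/cheaper" figure concrete, the lightweight end of that ranking is a plain HTTP crawler plus an HTML parser. Below is a minimal sketch using Crawlee's `CheerioCrawler`; the URL and selector are placeholders.

```JavaScript
import { CheerioCrawler } from 'crawlee';

// No browser involved: the HTTP library fetches the page and Cheerio parses it,
// so each request costs a fraction of what a Puppeteer-based crawler would spend.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // The code here is negligible next to the work the underlying libraries do.
        const title = $('title').text();
        console.log(`${request.url}: ${title}`);
    },
});

await crawler.run(['https://example.com']);
```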