
Commit 9eff9b0 (parent: 551705c)

Commit message: lukas comments

3 files changed: 12 additions & 24 deletions


content/academy/analyzing_pages_and_fixing_errors.md

Lines changed: 8 additions & 12 deletions
@@ -9,7 +9,7 @@ paths:
 
 # [](#scraping-with-sitemaps) Analyzing a page and fixing errors
 
-Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you safe money my allowing you to avoid hiring an expensive developer to solve your issue for you.
+Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.
 
 This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.
 
@@ -65,8 +65,7 @@ try {
     // ...
 } catch (error) {
     // You know where the code crashed so you can explain here
-    console.error('Request failed during login with an error:');
-    throw error;
+    throw new Error('Request failed during login with an error', { cause: error });
 }
 ```
 
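The hunk above swaps a `console.error` call plus a bare rethrow for a single `Error` that keeps the original exception in its `cause` property (standard since ES2022 / Node.js 16.9). Below is a minimal, self-contained sketch of how the pattern behaves; the `login` helper and its URL are invented for illustration and are not part of the commit:

```JavaScript
// Hypothetical login helper, only here to give the try/catch something to fail on (uses Node 18+ global fetch).
const login = async () => {
    const response = await fetch('https://example.com/login');
    if (!response.ok) throw new Error(`Login endpoint returned status ${response.status}`);
};

try {
    await login();
} catch (error) {
    // Wrap the low-level failure with context about where the crawl broke,
    // while keeping the original error reachable via `error.cause`.
    throw new Error('Request failed during login with an error', { cause: error });
}
```

Anything that catches the rethrown error can still inspect the original one through `err.cause`, so no information is lost compared to the old `console.error` approach.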

@@ -108,8 +107,7 @@ try {
     const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`
 
     // You know where the code crashed so you can explain here
-    console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
-    throw error;
+    throw new Error('Request failed during login with an error', { cause: error });
 }
 // ...
 ```
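
For context, a hedged sketch of how the `storeId` and `key` used to build `screenshotLink` might be produced with the Apify SDK. The key prefix follows the **ERROR-LOGIN** naming mentioned in the next hunk, but the helper code itself is an assumption, not part of this commit:

```JavaScript
import { Actor } from 'apify';

// Sketch only: assumes an Actor run with an open Puppeteer `page` in scope.
const key = `ERROR-LOGIN-${Math.random().toString(36).slice(2, 8)}`; // random suffix so snapshots don't overwrite each other
const screenshot = await page.screenshot({ type: 'jpeg', fullPage: true });
await Actor.setValue(key, screenshot, { contentType: 'image/jpeg' });

const storeId = Actor.getEnv().defaultKeyValueStoreId;
const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;
```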
@@ -125,8 +123,9 @@ To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a ran
 
 Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's just look at simple **dataset** reporting.
 
-<!-- TODO: Make the code example below make sense without using Apify API or SDK -->
-<!-- This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.
+## [](#with-the-apify-sdk) With the Apify SDK
+
+This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.
 
 ```JavaScript
 import { Actor } from 'apify';
@@ -172,11 +171,8 @@ try {
     await reportingDataset.pushData(report);
 
     // You know where the code crashed so you can explain here
-    console.error(
-        `Request failed during login with an error. Screenshot: ${screenshotLink}`
-    );
-    throw error;
+    throw new Error('Request failed during login with an error', { cause: error });
 }
 // ...
 await Actor.exit();
-``` -->
+```
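
Since only fragments of the restored example appear in the hunks above, here is a condensed, hedged sketch of the overall reporting flow: open a named dataset once, push one report per failure, then rethrow with `cause`. The dataset name and report fields are assumptions based on the surrounding text, not the exact code from the file:

```JavaScript
import { Actor } from 'apify';

await Actor.init();

// Named datasets have infinite retention, so error reports accumulate across runs.
const reportingDataset = await Actor.openDataset('REPORTING');

try {
    // ... login and scraping logic ...
} catch (error) {
    // `screenshotLink` would point at the snapshot saved to the key-value store earlier (placeholder here).
    const screenshotLink = 'https://api.apify.com/v2/key-value-stores/<storeId>/records/ERROR-LOGIN.jpg';
    await reportingDataset.pushData({
        errorType: 'LOGIN',
        errorMessage: error.toString(),
        screenshotLink,
    });
    throw new Error('Request failed during login with an error', { cause: error });
}

await Actor.exit();
```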

content/academy/caching_responses_in_puppeteer.md

Lines changed: 2 additions & 10 deletions
@@ -9,9 +9,9 @@ paths:
 
 # [](#caching-responses-in-puppeteer) Caching responses in Puppeteer
 
-> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article. Puppeteer now uses a native cache that should work well enough for most use cases.
+> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).
 
-When running crawlers that go through a single website, each open page has to load all resources again (sadly, headless browsers don't use cache). The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).
+When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).
 
 For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain header **cache-control** with **max-age** above **0**).
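
Below is a condensed, hedged sketch of the technique this article describes: keep responses whose `cache-control` header carries a positive `max-age` in memory, and replay them through request interception on later page loads. Names such as `setupPageCache` are illustrative, not taken from the original article:

```JavaScript
// In-memory cache keyed by URL, shared by every page that calls setupPageCache().
const cache = {};

const setupPageCache = async (page) => {
    await page.setRequestInterception(true);

    page.on('request', (request) => {
        const cached = cache[request.url()];
        if (cached) {
            // Serve the stored response instead of hitting the network (or a proxy) again.
            return request.respond(cached);
        }
        request.continue();
    });

    page.on('response', async (response) => {
        const url = response.url();
        const cacheControl = response.headers()['cache-control'] || '';
        const maxAge = Number((cacheControl.match(/max-age=(\d+)/) || [])[1] || 0);
        if (maxAge > 0 && !cache[url]) {
            try {
                cache[url] = {
                    status: response.status(),
                    headers: response.headers(),
                    body: await response.buffer(),
                };
            } catch {
                // response.buffer() throws for redirects and some preflight responses; skip those.
            }
        }
    });
};
```

As the updated note above warns, interception disables Puppeteer's native cache, so this only pays off for specific needs, for example if the cache object is persisted somewhere so it can be reused across scraper runs.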

@@ -155,14 +155,6 @@ const crawler = new PuppeteerCrawler({
             succeeded: true,
         });
     },
-
-    failedRequestHandler: async ({ request }) => {
-        await Dataset.pushData({
-            url: request.url,
-            succeeded: false,
-            errors: request.errorMessages,
-        });
-    },
 });
 
 await crawler.run(['https://apify.com/store', 'https://apify.com']);
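
If the in-memory cache sketched earlier were used with this crawler, one way to wire it up would be a pre-navigation hook. This is an assumption about how the pieces fit together (it reuses the hypothetical `setupPageCache` from the previous sketch), not code from the commit:

```JavaScript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Register the interception-based cache before each navigation.
    preNavigationHooks: [
        async ({ page }) => {
            await setupPageCache(page);
        },
    ],
    async requestHandler({ request }) {
        await Dataset.pushData({
            url: request.url,
            succeeded: true,
        });
    },
});

await crawler.run(['https://apify.com/store', 'https://apify.com']);
```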

content/academy/optimizing_scrapers.md

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ paths:
 
 # [](#optimizing-scrapers) Optimizing scrapers
 
-Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need.
+Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more proxy bandwidth, storage, [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need.
 
 The goal of optimization is simple: Make the code run as fast as possible and use the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). Memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation to smaller parts). So in the rest of this article, we will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU.
 
@@ -29,7 +29,7 @@ Now, if you want to build your own game and you are not a C/C++ veteran with a t
 
 What are the engines of the scraping world? A [browser](https://github.com/puppeteer/puppeteer/blob/master/docs/api.md), an [HTTP library](https://www.npmjs.com/package/@apify/http-request), an [HTML parser](https://github.com/cheeriojs/cheerio), and a [JSON parser](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse). The CPU spends more than 99% of its workload in these libraries. As with engines, you are not likely gonna write these from scratch - instead you'll use something like [Crawlee](https://crawlee.dev) that handles a lot of the overheads for you.
 
-It is about how you use these tools. The small amount of code you write in your [`requestHandler`](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the lightweight option from the tools and use it as little as possible. A crawler scraping only JSON API can be as much as 50 times faster/cheaper than a browser based solution.
+It is about how you use these tools. The small amount of code you write in your [`requestHandler`](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the lightweight option from the tools and use it as little as possible. A crawler scraping only a JSON API can be as much as 200 times faster/cheaper than a browser-based solution.
 
 **Ranking of the tools from the most efficient to the least:**
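
To make the revised cost comparison in the changed line above concrete, here is a minimal sketch of the lightweight end of the spectrum: a plain HTTP crawler consuming a JSON API with Crawlee, no browser involved. The endpoint and response fields are invented for illustration:

```JavaScript
import { HttpCrawler, Dataset } from 'crawlee';

const crawler = new HttpCrawler({
    async requestHandler({ request, body }) {
        // No browser and no HTML parsing: the response body is parsed straight as JSON.
        const data = JSON.parse(body.toString());
        await Dataset.pushData({ url: request.url, itemCount: data.items?.length ?? 0 });
    },
});

// Hypothetical JSON endpoint, used only to illustrate the pattern.
await crawler.run(['https://example.com/api/items?page=1']);
```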
