
Commit 3df9bde

Merge pull request #447 from apify/new-tutorials
feat(tutorials): new migrations
2 parents e3204bd + 9eff9b0 commit 3df9bde

18 files changed: +385 -7 lines changed

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@

---
title: How to analyze and fix errors when scraping a website
description: Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.
menuWeight: 17
category: tutorials
paths:
    - analyzing-pages-and-fixing-errors
---

# [](#analyzing-pages-and-fixing-errors) Analyzing a page and fixing errors

Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.

This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.

## [](#possible-causes) Possible causes

It is often tricky to see the full scope of what can go wrong. We tend to assume that, once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation.

Websites change, new [anti-scraping technologies]({{@link anti_scraping.md}}) are introduced, programming tools evolve and, in addition, people make mistakes.

Here are the most common reasons your working solution may break:

- The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed).
- A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE&feature=youtu.be).
- A page starts to block you (recognizes you as a bot).
- The website [loads its data later dynamically]({{@link dealing_with_dynamic_pages.md}}), so the code works only sometimes, if you are slow or lucky enough.
- You made a mistake when updating your code.
- Your [proxies]({{@link anti_scraping/mitigation/proxies.md}}) aren't working.
- You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug).

## [](#issue-analysis) Diagnosing/analyzing the issue

Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler.

> Many issues are edge cases, which occur in just one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm).

### [](#logging) Logging

Logging is an essential tool for any programmer. When used correctly, logs help you capture a surprising amount of information. Here are some general rules for logging:

- Usually, **many logs** are better than **no logs** at all.
- Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size.
- Focus on numbers. Log how many items you extract from a page, etc.
- Structure your logs and use the same structure in all your logs.
- Append the current page's URL to each log. This lets you immediately open that page and review it.

Here's an example of what a structured log message might look like:

```text
[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
```

The log begins with the **page type**. Usually, we use labels such as **\[CATEGORY\]** and **\[DETAIL\]**. Then, we log important numbers and other information. Finally, we add the page's URL, so we can check if the log is correct.
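
For illustration, here is one way such a message could be assembled in code. This is only a sketch: the variable names (`products`, `uniqueProducts`, `hasNextPage`) are placeholders, and plain `console.log` works just as well as Crawlee's `log`.

```JavaScript
import { log } from 'crawlee';

// A hypothetical helper for logging category pages in a consistent format.
const logCategoryPage = (url, products, uniqueProducts, hasNextPage) => {
    log.info(`[CATEGORY]: Products: ${products.length}, Unique products: ${uniqueProducts.size}, Next page: ${hasNextPage} --- ${url}`);
};

// Example call inside a request handler:
// logCategoryPage(request.url, products, new Set(products.map((p) => p.id)), true);
```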

#### [](#logging-errors) Logging errors

Errors require a different approach because, if your code crashes, your usual logs will not be called. Instead, exception handlers will print the error, but these are usually ugly messages with a [stack trace](https://en.wikipedia.org/wiki/Stack_trace) that only experts will understand.

You can overcome this by adding [try/catch blocks](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/try...catch) into your code. In the catch block, explain what happened and re-throw the error (so the request is automatically retried).

```JavaScript
try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
```

Read more information about logging and error handling in our developer [best practices]({{@link web_scraping_for_beginners/best_practices.md}}) section.

### [](#saving-snapshots) Saving snapshots

By snapshots, we mean **screenshots**, if you use a [browser with Puppeteer/Playwright]({{@link puppeteer_playwright.md}}), and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can easily display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error may occur on only a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.

Snapshots can tell you if:

- A website has changed its layout. This can also mean A/B testing or different content for different locations.
- You have been blocked – you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
- Data loads later dynamically – the page is empty.
- The page was redirected – the content is different.

You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson]({{@link puppeteer_playwright/page/page_methods.md}}).
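
To get a feel for what a snapshot amounts to, here is a bare-bones version without any helpers. This is only a sketch that assumes the Apify SDK and a Puppeteer/Playwright `page`; the helper name is made up, and in practice you will usually use the ready-made `puppeteerUtils.saveSnapshot()` shown in the next section.

```JavaScript
import { Actor } from 'apify';

// A minimal do-it-yourself snapshot: a screenshot and the raw HTML,
// both stored in the default key-value store under a shared key prefix.
const saveManualSnapshot = async (page, key) => {
    const screenshot = await page.screenshot({ fullPage: true });
    const html = await page.content();
    await Actor.setValue(`${key}-screenshot`, screenshot, { contentType: 'image/png' });
    await Actor.setValue(`${key}-html`, html, { contentType: 'text/html' });
};
```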

#### [](#when-to-save-snapshots) When to save snapshots

The most common approach is to save on error. We can enhance our previous try/catch block like this:

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

// ...
// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;
try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });
    // Direct link to the screenshot, handy to include in your log or error message
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;

    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
// ...
```

To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a random number so that the next **ERROR-LOGIN** snapshots do not overwrite this one and we can see all of them. If you can use an ID of some sort, that is even better.

**Beware:**

- The snapshot's **name** (key) can only contain letter, number, dot and dash characters. Other characters will cause an error, which makes the random number a safe pick (a small helper for building safe keys is sketched below).
- Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.
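
If you want to derive the key from something meaningful, such as the page's URL, you have to strip the characters the key-value store does not accept. A minimal sketch (the helper name is made up):

```JavaScript
// Turns an arbitrary string, e.g. a URL, into a valid key-value store key:
// anything that is not a letter, number, dot, or dash becomes a dash.
const toSnapshotKey = (value) => value.replace(/[^a-zA-Z0-9.-]/g, '-');

// toSnapshotKey('https://apify.com/store?page=2')
//   => 'https---apify.com-store-page-2'
```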

### [](#error-reporting) Error reporting

Logging and snapshotting are great tools, but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's look at simple **dataset** reporting.

## [](#with-the-apify-sdk) With the Apify SDK

This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

await Actor.init();
// ...
// Let's create a reporting dataset
// If you already have one, this will continue adding to it
const reportingDataset = await Actor.openDataset('REPORTING');

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;

// We can also capture the actor and run IDs
// to have easy access in the reporting dataset
const { actorId, actorRunId } = Actor.getEnv();
const linkToRun = `https://console.apify.com/actors/${actorId}#/runs/${actorRunId}`;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),

        // You will have to adjust the keys if you save them in a non-standard way
        htmlSnapshot: `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.html?disableRedirect=true`,
        screenshot: screenshotLink,
        run: linkToRun,
    };

    // And we push the report
    await reportingDataset.pushData(report);

    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
// ...
await Actor.exit();
```

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@

---
title: How to optimize Puppeteer by caching responses
description: Learn why it's important to cache responses in memory when intercepting requests in Puppeteer, and how to do it.
menuWeight: 19
category: tutorials
paths:
    - caching-responses-in-puppeteer
---

# [](#caching-responses-in-puppeteer) Caching responses in Puppeteer

> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).

When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).

For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain the **cache-control** header with **max-age** above **0**).

In this example, we will use a scraper which goes through top stories on the CNN website and takes a screenshot of each opened page. The scraper is very slow right now because it waits until all network requests are finished and because the posts contain videos. If the scraper runs with caching disabled, these statistics will show at the end of the run:

![Bad run stats]({{@asset images/bad-scraper-stats.webp}})

As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page. We also stored all the screenshots, which you can find [here](https://my.apify.com/storage/key-value/q2ipoeLLy265NtSiL).

From the screenshot above, it's clear that most of the traffic comes from script files (124MB) and documents (22.8MB). For this kind of situation, it's always good to check whether the content of the page is cacheable. You can do that using Chrome's developer tools.

## Understanding and reproducing the issue

If we go to the CNN website, open up the developer tools and go to the **Network** tab, we will find an option to disable caching.

![Disabling cache in the Network tab]({{@asset images/cnn-network-tab.webp}})

Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.

![5.3MB of data transferred]({{@asset images/slow-no-cache.webp}})

If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.

![642KB of data transferred]({{@asset images/fast-with-cache.webp}})

Comparing the two measurements (5.3MB without the cache vs. 642KB with it), caching reduces the data transfer by roughly 88%!

## Solving the problem by creating an in-memory cache

We can now emulate this and cache responses in Puppeteer. All we have to do is to check, when a response is received, whether it contains the **cache-control** header and whether it's set with a **max-age** higher than **0**. If so, we'll save the headers, URL, and body of the response to memory, and on the next request check if the requested URL is already stored in the cache.

The code will look like this:

```JavaScript
// At the top of your code
const cache = {};

// The code below should go between the newPage() call and the goto() call

await page.setRequestInterception(true);

page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;
    if (maxAge) {
        // Skip responses that are already cached and have not expired yet
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // Some responses do not have a body and do not need to be cached
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
```

> If the code above looks completely foreign to you, we recommend going through our free [Puppeteer/Playwright course]({{@link puppeteer_playwright.md}}).

After implementing this code, we can run the scraper again.

![Good run results]({{@asset images/good-run-results.webp}})

Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a 92% reduction in data transfer. The related screenshots can be found [here](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
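
We cannot see the scraper's navigation code in this article, but if waiting for a quieter network is the bottleneck, the usual lever is the `waitUntil` option of `page.goto()`. A minimal sketch (the URL and launch settings are placeholders):

```JavaScript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Waits until the network has been (almost) idle, which is slow on sites
// with busy analytics and tracking scripts:
await page.goto('https://edition.cnn.com', { waitUntil: 'networkidle2' });

// Returns as soon as the HTML is parsed, without waiting for late requests:
await page.goto('https://edition.cnn.com', { waitUntil: 'domcontentloaded' });

await browser.close();
```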

## Implementation in Crawlee

Since most of you are likely using [Crawlee](https://crawlee.dev), here is what response caching would look like using `PuppeteerCrawler`:

```JavaScript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const cache = {};

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [async ({ page }) => {
        await page.setRequestInterception(true);

        page.on('request', async (request) => {
            const url = request.url();
            if (cache[url] && cache[url].expires > Date.now()) {
                await request.respond(cache[url]);
                return;
            }
            request.continue();
        });

        page.on('response', async (response) => {
            const url = response.url();
            const headers = response.headers();
            const cacheControl = headers['cache-control'] || '';
            const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
            const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

            if (maxAge) {
                // Skip responses that are already cached and have not expired yet
                if (cache[url] && cache[url].expires > Date.now()) return;

                let buffer;
                try {
                    buffer = await response.buffer();
                } catch (error) {
                    // Some responses do not have a body and do not need to be cached
                    return;
                }

                cache[url] = {
                    status: response.status(),
                    headers: response.headers(),
                    body: buffer,
                    expires: Date.now() + maxAge * 1000,
                };
            }
        });
    }],
    requestHandler: async ({ page, request }) => {
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
});

await crawler.run(['https://apify.com/store', 'https://apify.com']);
```
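
Note that because `cache` lives in module scope, it is shared by every page the crawler opens, so a resource cached during one navigation is served from memory for all later requests to the same URL. The cache also keeps growing for as long as the process runs, so for very large crawls you may want to cap its size.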

content/academy/dealing_with_dynamic_pages.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,5 +1,5 @@
 ---
-title: Dealing with dynamic pages
+title: How to scrape from dynamic pages
 description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
 menuWeight: 13
 category: tutorials
```
Binary image files changed (not shown).
