Commit eb88351

feat(analyzing-pages)

1 parent 14c63cb commit eb88351
File tree

5 files changed: +188 -6 lines changed
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
---
title: How to analyze and fix errors when scraping a website
description: Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.
menuWeight: 17
category: tutorials
paths:
    - analyzing-pages-and-fixing-errors
---

# [](#analyzing-a-page-and-fixing-errors) Analyzing a page and fixing errors

Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.

This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.

## [](#possible-causes) Possible causes

It is often tricky to see the full scope of what can go wrong. We assume that once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation.

Websites change, they introduce new [anti-scraping technologies]({{@link anti_scraping.md}}), programming tools change and, in addition, people make mistakes.

Here are the most common reasons your working solution may break.

- The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed).
- A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE&feature=youtu.be).
- A page starts to block you (recognizes you as a bot).
- The website [loads its data later dynamically]({{@link dealing_with_dynamic_pages.md}}), so the code works only sometimes, if you are slow or lucky enough.
- You made a mistake when updating your code.
- Your [proxies]({{@link anti_scraping/mitigation/proxies.md}}) aren't working.
- You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug).

## [](#issue-analysis) Diagnosing/analyzing the issue

Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler.

> Many issues are edge cases, which occur in just one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm).

### [](#logging) Logging

Logging is an essential tool for any programmer. When used correctly, logs help you capture a surprising amount of information. Here are some general rules for logging:

- Usually, **many logs** are better than **no logs** at all.
- Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size.
- Focus on numbers. Log how many items you extract from a page, etc.
- Structure your logs and use the same structure in all your logs.
- Append the current page's URL to each log. This lets you immediately open that page and review it.

Here's an example of what a structured log message might look like:

```text
[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
```

The log begins with the **page type**. Usually, we use labels such as **\[CATEGORY\]** and **\[DETAIL\]**. Then, we log important numbers and other information. Finally, we add the page's URL, so we can check if the log is correct.
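
For instance, a small helper function can keep every log line in this shape. This is just a sketch; `logPage` and its parameters are illustrative names, not part of any library:

```JavaScript
// A minimal sketch of a structured-log helper (all names are illustrative).
const logPage = (label, numbers, url) => {
    // Join the numeric stats into a single compact line.
    const stats = Object.entries(numbers)
        .map(([name, value]) => `${name}: ${value}`)
        .join(', ');
    console.log(`[${label}]: ${stats} --- ${url}`);
};

// Prints: [CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
logPage('CATEGORY', { 'Products': 20, 'Unique products': 4, 'Next page': true }, 'https://apify.com/store');
```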

#### [](#logging-errors) Logging errors

Errors require a different approach because, if your code crashes, your usual logs will not be called. Instead, exception handlers will print the error, but these are usually ugly messages with a [stack trace](https://en.wikipedia.org/wiki/Stack_trace) that only experts will understand.

You can overcome this by adding [try/catch blocks](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/try...catch) into your code. In the catch block, explain what happened and re-throw the error (so the request is automatically retried).

```JavaScript
try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed so you can explain here
    console.error('Request failed during login with an error:');
    throw error;
}
```

Read more information about logging and error handling in our developer [best practices]({{@link web_scraping_for_beginners/best_practices.md}}) section.

### [](#saving-snapshots) Saving snapshots

By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright]({{@link puppeteer_playwright.md}}) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can easily display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error can happen in only a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.

Snapshots can tell you if:

- A website has changed its layout. This can also mean A/B testing or different content for different locations.
- You have been blocked – you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
- Data loads dynamically later – the page is empty.
- The page was redirected – the content is different.

You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson]({{@link puppeteer_playwright/page/page_methods.md}}).
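
To get a rough idea of what taking both kinds of snapshot looks like, here is a minimal standalone sketch using Playwright's standard `page.screenshot()` and `page.content()` methods; the URL and file name are just examples:

```JavaScript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://apify.com/store');

// The two snapshot types discussed above: a screenshot and the rendered HTML.
await page.screenshot({ path: 'snapshot.png', fullPage: true });
const html = await page.content();
console.log(`Captured ${html.length} characters of HTML`);

await browser.close();
```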

#### [](#when-to-save-snapshots) When to save snapshots

The most common approach is to save on error. We can enhance our previous try/catch block like this:

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

// ...
// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;
try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;

    // You know where the code crashed so you can explain here
    console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
    throw error;
}
// ...
```

To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a random number so the next **ERROR-LOGIN**s will not overwrite this one and we can see all the snapshots. If you can use an ID of some sort, it is even better.

**Beware:**

- The snapshot's **name** (key) can only contain letter, number, dot and dash characters. Other characters will cause an error, which makes the random number a safe pick (see the sketch below for deriving a safe key from an arbitrary string).
- Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.
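
If you want keys more descriptive than a random number, a tiny sanitizer can map any string onto the allowed character set. This is a sketch; `toSnapshotKey` is an illustrative name, not a library function:

```JavaScript
// A sketch of a helper that turns an arbitrary string into a valid
// key-value store key (letters, numbers, dots and dashes only).
const toSnapshotKey = (raw) => raw.replace(/[^a-zA-Z0-9.-]/g, '-');

// Prints: ERROR-LOGIN-https---apify.com-store
console.log(toSnapshotKey('ERROR-LOGIN https://apify.com/store'));
```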

### [](#error-reporting) Error reporting

Logging and snapshotting are great tools, but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's just look at simple **dataset** reporting.
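
As a rough illustration of the idea, error reports can be accumulated in a named dataset with Crawlee alone. This is a minimal sketch assuming Crawlee v3's `Dataset` API; the `REPORTING` name and the report fields are illustrative:

```JavaScript
import { Dataset } from 'crawlee';

// Open (or create) a named dataset where error reports will accumulate.
const reporting = await Dataset.open('REPORTING');

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Push a structured report instead of only logging the error.
    await reporting.pushData({
        errorType: 'login',
        errorMessage: error.toString(),
        time: new Date().toISOString(),
    });
    throw error;
}
```
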
<!-- TODO: Make the code example below make sense without using Apify API or SDK -->

<!-- This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

await Actor.init();
// ...
// Let's create a reporting dataset
// If you already have one, this will continue adding to it
const reportingDataset = await Actor.openDataset('REPORTING');

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;

// We can also capture actor and run IDs
// to have easy access in the reporting dataset
const { actorId, actorRunId } = Actor.getEnv();
const linkToRun = `https://console.apify.com/actors/${actorId}#/runs/${actorRunId}`;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),

        // You will have to adjust the keys if you save them in a non-standard way
        htmlSnapshot: `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.html?disableRedirect=true`,
        screenshot: screenshotLink,
        run: linkToRun,
    };

    // And we push the report
    await reportingDataset.pushData(report);

    // You know where the code crashed so you can explain here
    console.error(
        `Request failed during login with an error. Screenshot: ${screenshotLink}`,
    );
    throw error;
}
// ...
await Actor.exit();
``` -->

content/academy/dealing_with_dynamic_pages.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Dealing with dynamic pages
+title: How to scrape from dynamic pages
 description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
 menuWeight: 13
 category: tutorials

content/academy/js_in_html.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: JavaScript objects in HTML
+title: How to scrape hidden JavaScript objects in HTML
 description: Learn about "hidden" data found within the JavaScript of certain pages, which can increase the scraper reliability and improve your development experience.
 menuWeight: 14
 category: tutorials

content/academy/scraping_with_sitemaps.md renamed to content/academy/scraping_from_sitemaps.md

Lines changed: 3 additions & 3 deletions
@@ -1,13 +1,13 @@
 ---
-title: Scraping with sitemaps
+title: How to scrape from sitemaps
 description: The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn a much easier way to extract data from websites using Crawlee.
 menuWeight: 16
 category: tutorials
 paths:
-    - scraping-with-sitemaps
+    - scraping-from-sitemaps
 ---

-# [](#scraping-with-sitemaps) Scraping with sitemaps
+# [](#scraping-with-sitemaps) Scraping from sitemaps

 Let's say we want to scrape a database of craft beers ([brewbound.com](https://brewbound.com)) before summer starts. If we are lucky, the website will contain a sitemap at [https://www.brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).

content/academy/scraping_shadow_doms.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Scraping sites with a shadow DOM
+title: How to scrape sites with a shadow DOM
 description: The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround.
 menuWeight: 15
 category: tutorials
