
Commit 3df9bde

Merge pull request #447 from apify/new-tutorials
feat(tutorials): new migrations
2 parents e3204bd + 9eff9b0 commit 3df9bde

18 files changed: +385 -7 lines changed

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@

---
title: How to analyze and fix errors when scraping a website
description: Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.
menuWeight: 17
category: tutorials
paths:
    - analyzing-pages-and-fixing-errors
---

# [](#analyzing-pages-and-fixing-errors) Analyzing a page and fixing errors

Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.

This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.

## [](#possible-causes) Possible causes

It is often tricky to see the full scope of what can go wrong. We tend to assume that, once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation.

Websites change, new [anti-scraping technologies]({{@link anti_scraping.md}}) are introduced, programming tools evolve and, in addition, people make mistakes.

Here are the most common reasons your working solution may break:

- The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed).
- A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE&feature=youtu.be).
- A page starts to block you (recognizes you as a bot).
- The website [loads its data later dynamically]({{@link dealing_with_dynamic_pages.md}}), so the code works only sometimes, if you are slow or lucky enough.
- You made a mistake when updating your code.
- Your [proxies]({{@link anti_scraping/mitigation/proxies.md}}) aren't working.
- You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug).

## [](#issue-analysis) Diagnosing/analyzing the issue

Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler.

> Many issues are edge cases, which occur in just one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm).

### [](#logging) Logging

Logging is an essential tool for any programmer. When used correctly, logs help you capture a surprising amount of information. Here are some general rules for logging:

- Usually, **many logs** are better than **no logs** at all.
- Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size.
- Focus on numbers. Log how many items you extract from a page, etc.
- Structure your logs and use the same structure in all your logs.
- Append the current page's URL to each log. This lets you immediately open that page and review it.

Here's an example of what a structured log message might look like:

```text
[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
```

The log begins with the **page type**. Usually, we use labels such as **\[CATEGORY\]** and **\[DETAIL\]**. Then, we log important numbers and other information. Finally, we add the page's URL, so we can check if the log is correct.
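
For illustration, here is one way such a message could be assembled in code. This is only a sketch: the variable names (`products`, `uniqueProducts`, `hasNextPage`) are placeholders, and plain `console.log` works just as well as Crawlee's `log`.

```JavaScript
import { log } from 'crawlee';

// A hypothetical helper for logging category pages in a consistent format.
const logCategoryPage = (url, products, uniqueProducts, hasNextPage) => {
    log.info(`[CATEGORY]: Products: ${products.length}, Unique products: ${uniqueProducts.size}, Next page: ${hasNextPage} --- ${url}`);
};

// Example call inside a request handler:
// logCategoryPage(request.url, products, new Set(products.map((p) => p.id)), true);
```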

#### [](#logging-errors) Logging errors

Errors require a different approach because, if your code crashes, your usual logs will not be called. Instead, exception handlers will print the error, but these are usually ugly messages with a [stack trace](https://en.wikipedia.org/wiki/Stack_trace) that only experts will understand.

You can overcome this by adding [try/catch blocks](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/try...catch) into your code. In the catch block, explain what happened and re-throw the error (so the request is automatically retried).

```JavaScript
try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
```

Read more information about logging and error handling in our developer [best practices]({{@link web_scraping_for_beginners/best_practices.md}}) section.

### [](#saving-snapshots) Saving snapshots

By snapshots, we mean **screenshots**, if you use a [browser with Puppeteer/Playwright]({{@link puppeteer_playwright.md}}), and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can easily display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error may occur on only a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.

Snapshots can tell you if:

- A website has changed its layout. This can also mean A/B testing or different content for different locations.
- You have been blocked – you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
- Data loads later dynamically – the page is empty.
- The page was redirected – the content is different.

You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson]({{@link puppeteer_playwright/page/page_methods.md}}).
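
To get a feel for what a snapshot amounts to, here is a bare-bones version without any helpers. This is only a sketch that assumes the Apify SDK and a Puppeteer/Playwright `page`; the helper name is made up, and in practice you will usually use the ready-made `puppeteerUtils.saveSnapshot()` shown in the next section.

```JavaScript
import { Actor } from 'apify';

// A minimal do-it-yourself snapshot: a screenshot and the raw HTML,
// both stored in the default key-value store under a shared key prefix.
const saveManualSnapshot = async (page, key) => {
    const screenshot = await page.screenshot({ fullPage: true });
    const html = await page.content();
    await Actor.setValue(`${key}-screenshot`, screenshot, { contentType: 'image/png' });
    await Actor.setValue(`${key}-html`, html, { contentType: 'text/html' });
};
```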

#### [](#when-to-save-snapshots) When to save snapshots

The most common approach is to save on error. We can enhance our previous try/catch block like this:

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

// ...
// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;
try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });
    // Direct link to the screenshot, handy to include in your log or error message
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;

    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
// ...
```

To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a random number so that the next **ERROR-LOGIN** snapshots do not overwrite this one and we can see all of them. If you can use an ID of some sort, that is even better.

**Beware:**

- The snapshot's **name** (key) can only contain letter, number, dot and dash characters. Other characters will cause an error, which makes the random number a safe pick (a small helper for building safe keys is sketched below).
- Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.
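
If you want to derive the key from something meaningful, such as the page's URL, you have to strip the characters the key-value store does not accept. A minimal sketch (the helper name is made up):

```JavaScript
// Turns an arbitrary string, e.g. a URL, into a valid key-value store key:
// anything that is not a letter, number, dot, or dash becomes a dash.
const toSnapshotKey = (value) => value.replace(/[^a-zA-Z0-9.-]/g, '-');

// toSnapshotKey('https://apify.com/store?page=2')
//   => 'https---apify.com-store-page-2'
```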

### [](#error-reporting) Error reporting

Logging and snapshotting are great tools, but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's look at simple **dataset** reporting.

## [](#with-the-apify-sdk) With the Apify SDK

This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/storage#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.

```JavaScript
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

await Actor.init();
// ...
// Let's create a reporting dataset
// If you already have one, this will continue adding to it
const reportingDataset = await Actor.openDataset('REPORTING');

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;

// We can also capture the actor and run IDs
// to have easy access in the reporting dataset
const { actorId, actorRunId } = Actor.getEnv();
const linkToRun = `https://console.apify.com/actors/${actorId}#/runs/${actorRunId}`;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),

        // You will have to adjust the keys if you save them in a non-standard way
        htmlSnapshot: `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.html?disableRedirect=true`,
        screenshot: screenshotLink,
        run: linkToRun,
    };

    // And we push the report
    await reportingDataset.pushData(report);

    // You know where the code crashed, so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
// ...
await Actor.exit();
```

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@

---
title: How to optimize Puppeteer by caching responses
description: Learn why it's important to cache responses in memory when intercepting requests in Puppeteer, and how to do it.
menuWeight: 19
category: tutorials
paths:
    - caching-responses-in-puppeteer
---

# [](#caching-responses-in-puppeteer) Caching responses in Puppeteer

> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).

When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).

For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain the **cache-control** header with **max-age** above **0**).

In this example, we will use a scraper which goes through top stories on the CNN website and takes a screenshot of each opened page. The scraper is very slow right now because it waits until all network requests are finished and because the posts contain videos. If the scraper runs with caching disabled, these statistics will show at the end of the run:

![Bad run stats]({{@asset images/bad-scraper-stats.webp}})

As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page. We also stored all the screenshots, which you can find [here](https://my.apify.com/storage/key-value/q2ipoeLLy265NtSiL).

From the screenshot above, it's clear that most of the traffic comes from script files (124MB) and documents (22.8MB). For this kind of situation, it's always good to check whether the content of the page is cacheable. You can do that using Chrome's developer tools.

## Understanding and reproducing the issue

If we go to the CNN website, open up the developer tools and go to the **Network** tab, we will find an option to disable caching.

![Disabling cache in the Network tab]({{@asset images/cnn-network-tab.webp}})

Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.

![5.3MB of data transferred]({{@asset images/slow-no-cache.webp}})

If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.

![642KB of data transferred]({{@asset images/fast-with-cache.webp}})

Comparing the two measurements (5.3MB without the cache vs. 642KB with it), caching reduces the data transfer by roughly 88%!

## Solving the problem by creating an in-memory cache

We can now emulate this and cache responses in Puppeteer. All we have to do is to check, when a response is received, whether it contains the **cache-control** header and whether it's set with a **max-age** higher than **0**. If so, we'll save the headers, URL, and body of the response to memory, and on the next request check if the requested URL is already stored in the cache.

The code will look like this:

```JavaScript
// At the top of your code
const cache = {};

// The code below should go between the newPage() call and the goto() call

await page.setRequestInterception(true);

page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;
    if (maxAge) {
        // Skip responses that are already cached and have not expired yet
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // Some responses do not have a body and do not need to be cached
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
```

> If the code above looks completely foreign to you, we recommend going through our free [Puppeteer/Playwright course]({{@link puppeteer_playwright.md}}).

After implementing this code, we can run the scraper again.

![Good run results]({{@asset images/good-run-results.webp}})

Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a 92% reduction in data transfer. The related screenshots can be found [here](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
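
We cannot see the scraper's navigation code in this article, but if waiting for a quieter network is the bottleneck, the usual lever is the `waitUntil` option of `page.goto()`. A minimal sketch (the URL and launch settings are placeholders):

```JavaScript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Waits until the network has been (almost) idle, which is slow on sites
// with busy analytics and tracking scripts:
await page.goto('https://edition.cnn.com', { waitUntil: 'networkidle2' });

// Returns as soon as the HTML is parsed, without waiting for late requests:
await page.goto('https://edition.cnn.com', { waitUntil: 'domcontentloaded' });

await browser.close();
```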

## Implementation in Crawlee

Since most of you are likely using [Crawlee](https://crawlee.dev), here is what response caching would look like using `PuppeteerCrawler`:

```JavaScript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const cache = {};

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [async ({ page }) => {
        await page.setRequestInterception(true);

        page.on('request', async (request) => {
            const url = request.url();
            if (cache[url] && cache[url].expires > Date.now()) {
                await request.respond(cache[url]);
                return;
            }
            request.continue();
        });

        page.on('response', async (response) => {
            const url = response.url();
            const headers = response.headers();
            const cacheControl = headers['cache-control'] || '';
            const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
            const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

            if (maxAge) {
                // Skip responses that are already cached and have not expired yet
                if (cache[url] && cache[url].expires > Date.now()) return;

                let buffer;
                try {
                    buffer = await response.buffer();
                } catch (error) {
                    // Some responses do not have a body and do not need to be cached
                    return;
                }

                cache[url] = {
                    status: response.status(),
                    headers: response.headers(),
                    body: buffer,
                    expires: Date.now() + maxAge * 1000,
                };
            }
        });
    }],
    requestHandler: async ({ page, request }) => {
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
});

await crawler.run(['https://apify.com/store', 'https://apify.com']);
```
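
Note that because `cache` lives in module scope, it is shared by every page the crawler opens, so a resource cached during one navigation is served from memory for all later requests to the same URL. The cache also keeps growing for as long as the process runs, so for very large crawls you may want to cap its size.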

content/academy/dealing_with_dynamic_pages.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,5 +1,5 @@
 ---
-title: Dealing with dynamic pages
+title: How to scrape from dynamic pages
 description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
 menuWeight: 13
 category: tutorials
```
Binary image files changed (not shown).
