Commit 38284a4

saving_useful_stats
1 parent 69429d7 commit 38284a4

2 files changed: +21 −41 lines changed

content/academy/expert_scraping_with_apify/saving_useful_stats.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -8,16 +8,16 @@ paths:
 
 # [](#savings-useful-run-statistics) Saving useful run statistics
 
-Using the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a Key-Value store or Dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results, or potentially include statistics about each request within its corresponding dataset item.
+Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a Key-Value store or Dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results, or potentially include statistics about each request within its corresponding dataset item.
 
 The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of captchas hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an actor. As your projects scale, this will become more and more useful and important.
 
 ## [](#learning) Learning 🧠
 
 Before moving on, give these valuable resources a quick lookover:
 
-- Refamiliarize with the various available data on the [Request object](https://sdk.apify.com/docs/api/request).
-- Learn about the [`handleFailedRequest` function](https://sdk.apify.com/docs/typedefs/cheerio-crawler-options#handlefailedrequestfunction).
+- Refamiliarize with the various available data on the [Request object](https://crawlee.dev/api/core/class/Request).
+- Learn about the [`failedRequestHandler` function](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#failedRequestHandler).
 - Ensure you are comfortable using [key-value stores](https://sdk.apify.com/docs/guides/data-storage#key-value-store) and [datasets](https://sdk.apify.com/docs/api/dataset#__docusaurus), and understand the differences between the two storage types.
 
 ## [](#quiz) Knowledge check 📝
````

content/academy/expert_scraping_with_apify/solutions/saving_stats.md

Lines changed: 18 additions & 38 deletions
````diff
@@ -11,7 +11,7 @@ paths:
 The code in this solution will be similar to what we already did in the **Handling migrations** solution; however, we'll be storing and logging different data. First, let's create a new file called **Stats.js** and write a utility class for storing our run stats:
 
 ```JavaScript
-const Apify = require('apify');
+import Actor from 'apify';
 
 class Stats {
     constructor() {
@@ -22,12 +22,12 @@ class Stats {
     }
 
     async initialize() {
-        const data = await Apify.getValue('STATS');
+        const data = await Actor.getValue('STATS');
 
         if (data) this.state = data;
 
-        Apify.events.on('persistState', async () => {
-            await Apify.setValue('STATS', this.state);
+        Actor.on('persistState', async () => {
+            await Actor.setValue('STATS', this.state);
         });
 
         setInterval(() => console.log(this.state), 10000);
````
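The pattern this hunk patches can be sanity-checked in plain Node. In the sketch below, the in-memory `fakeStore` object and the synchronous `persist()` method are assumptions standing in for `Actor.getValue`/`Actor.setValue` and the `'persistState'` event, so the example runs without the `apify` package:

```javascript
// Sketch of the Stats utility from the solution. The real class persists
// via Actor.setValue('STATS', ...) inside an Actor.on('persistState')
// listener; a plain object (fakeStore) stands in here.
const fakeStore = {}; // assumption: stand-in for the actor's key-value store

class Stats {
    constructor() {
        this.state = { errors: {}, totalSaved: 0 };
    }

    initialize() {
        // Restore previously persisted state, e.g. after a migration
        if (fakeStore.STATS) this.state = fakeStore.STATS;
    }

    persist() {
        // The real actor does this on every 'persistState' event
        fakeStore.STATS = this.state;
    }

    addError(url, message) {
        if (!this.state.errors[url]) this.state.errors[url] = [];
        this.state.errors[url].push(message);
    }

    success() {
        this.state.totalSaved += 1;
    }
}

const stats = new Stats();
stats.initialize();
stats.success();
stats.addError('https://example.com', 'Request timed out');
stats.persist();

// A second instance restores the persisted state, as after a restart
const restored = new Stats();
restored.initialize();
console.log(restored.state.totalSaved); // 1
```

The real `initialize`/`persist` are `async` because they talk to the platform's storage; the sketch keeps them synchronous purely for readability.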
````diff
@@ -50,24 +50,20 @@ Cool, very similar to the **AsinTracker** class we wrote earlier. We'll now impo
 
 ```JavaScript
 // ...
-const Stats = require('./src/Stats');
+import Stats from './Stats.js';
 
-const { log } = Apify.utils;
-
-Apify.main(async () => {
-    await asinTracker.initialize();
-    await Stats.initialize();
+await Actor.init();
+await asinTracker.initialize();
+await Stats.initialize();
 // ...
 ```
 
 ## [](#tracking-errors) Tracking errors
 
-In order to keep track of errors, we must write a new function within the crawler's configuration called **handleFailedRequestFunction**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request.
+In order to keep track of errors, we must write a new function within the crawler's configuration called **failedRequestHandler**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request.
 
 ```JavaScript
-const crawler = new Apify.CheerioCrawler({
-    requestList,
-    requestQueue,
+const crawler = new CheerioCrawler({
     proxyConfiguration,
     useSessionPool: true,
     sessionPoolOptions: {
````
````diff
@@ -78,25 +74,9 @@ const crawler = new Apify.CheerioCrawler({
         },
     },
     maxConcurrency: 50,
-    handlePageFunction: async (context) => {
-        const { label } = context.request.userData;
-
-        switch (label) {
-            default:
-                return log.info('Unable to handle this request');
-            case labels.START:
-                await handleStart(context);
-                break;
-            case labels.PRODUCT:
-                await handleProduct(context);
-                break;
-            case labels.OFFERS:
-                await handleOffers(context, dataset);
-                break;
-        }
-    },
+    requestHandler: router,
     // Handle all failed requests
-    handleFailedRequestFunction: async ({ error, request }) => {
+    failedRequestHandler: async ({ error, request }) => {
         // Add an error for this url to our error tracker
         Stats.addError(request.url, error?.message);
     },
````
````diff
@@ -108,7 +88,7 @@ const crawler = new Apify.CheerioCrawler({
 Now, we'll just increment our **totalSaved** count for every offer added to the dataset.
 
 ```JavaScript
-exports.handleOffers = async ({ $, request }, dataset) => {
+router.addHandler(labels.OFFERS, async ({ $, request }) => {
     const { data } = request.userData;
 
     const { asin } = data;
````
````diff
@@ -126,15 +106,15 @@ exports.handleOffers = async ({ $, request }, dataset) => {
             offer: element.find('.a-price .a-offscreen').text().trim(),
         });
     }
-};
+});
 ```
 
 ## [](#saving-stats-with-dataset-items) Saving stats with dataset items
 
-Still in the **handleOffers** function, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is easily accessible in the context object.
+Still in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is easily accessible in the context object.
 
 ```JavaScript
-exports.handleOffers = async ({ $, request, crawler: { requestQueue } }, dataset) => {
+router.addHandler(labels.OFFERS, async ({ $, request }) => {
     const { data } = request.userData;
 
     const { asin } = data;
````
````diff
@@ -158,7 +138,7 @@ exports.handleOffers = async ({ $, request, crawler: { requestQueue } }, dataset
         currentPendingRequests: (await requestQueue.getInfo()).pendingRequestCount,
     });
     }
-};
+});
 ```
 
 ## [](#quiz-answers) Quiz answers
````
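The per-item statistics amount to spreading a few extra keys into each object pushed to the dataset. In this sketch, `getPendingCount()` is a hypothetical stub for `(await requestQueue.getInfo()).pendingRequestCount`, and the scraped fields are made-up sample values; only `currentPendingRequests` and `retryCount` mirror the diff and Crawlee's `Request` object:

```javascript
// Sketch of attaching run statistics to each dataset item before pushing.
const getPendingCount = () => 42; // assumption: stubbed queue-info call

const buildItem = (scraped, request) => ({
    ...scraped,
    // Extra statistics stored alongside the scraped offer data
    numberOfRetries: request.retryCount,
    currentPendingRequests: getPendingCount(),
});

const item = buildItem(
    { asin: 'B07XYZ1234', offer: '$12.99' }, // hypothetical scraped fields
    { retryCount: 2 }, // minimal stand-in for the Crawlee Request object
);

console.log(item);
```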
````diff
@@ -177,4 +157,4 @@ exports.handleOffers = async ({ $, request, crawler: { requestQueue } }, dataset
 
 **Q: Is storing these types of values necessary for every single actor?**
 
-**A:** For small actors, it might be a waste of time to do this. For large-scale actors, it can be extremely helpful when debugging and most definitely worth the extra 10-20 minutes of development time. Usually though, the default statistics from the SDK might be enough for simple run stats.
+**A:** For small actors, it might be a waste of time to do this. For large-scale actors, it can be extremely helpful when debugging and most definitely worth the extra 10-20 minutes of development time. Usually though, the default statistics from Crawlee and the SDK might be enough for simple run stats.
````
