
Commit 181ebde

tasks_and_storage
1 parent cd2a289 commit 181ebde

File tree

2 files changed: +108 −116 lines changed


content/academy/expert_scraping_with_apify/solutions/integrating_webhooks.md

Lines changed: 26 additions & 23 deletions
@@ -19,10 +19,10 @@ log.info('Crawl finished.');
 // We don't need the code below anymore!
 
 // log.info('Sending dataset link...');
-// const dataset = await Apify.openDataset();
+// const dataset = await Actor.openDataset();
 // const { id } = await dataset.getInfo();
 
-// await Apify.call('apify/send-mail', {
+// await Actor.call('apify/send-mail', {
 
 // subject: 'Amazon Dataset',
 // text: `https://api.apify.com/v2/datasets/${id}/items?clean=true&format=json`,
@@ -44,19 +44,20 @@ First of all, we should clear out any of the boilerplate code within **main.js**
 
 ```JavaScript
 // main.js
-const Apify = require('apify');
+import { Actor } from 'apify';
 
-Apify.main(async () => {
+await Actor.init();
 
-});
+// ...
+
+await Actor.exit();
 ```
 
 We'll be passing the ID of the Amazon actor's default dataset along to the new actor, so we can expect that as an input:
 
 ```JavaScript
-Apify.main(async () => {
-    const { datasetId } = await Apify.getInput();
-    const dataset = await Apify.openDataset(datasetId);
+const { datasetId } = await Actor.getInput();
+const dataset = await Actor.openDataset(datasetId);
 // ...
 ```
 
@@ -91,31 +92,33 @@ const filtered = items.reduce((acc, curr) => {
 The results should be an array, so finally, we can take the map we just created and push an array of all of its values to the actor's default dataset:
 
 ```JavaScript
-await Apify.pushData(Object.values(filtered));
+await Actor.pushData(Object.values(filtered));
 ```
 
 Our final code looks like this:
 
 ```JavaScript
-const Apify = require('apify');
+import { Actor } from 'apify';
 
-Apify.main(async () => {
-    const { datasetId } = await Apify.getInput();
-    const dataset = await Apify.openDataset(datasetId);
+await Actor.init();
 
-    const { items } = await dataset.getData();
+const { datasetId } = await Actor.getInput();
+const dataset = await Actor.openDataset(datasetId);
 
-    const filtered = items.reduce((acc, curr) => {
-        const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
-        const price = +curr.offer.slice(1);
+const { items } = await dataset.getData();
 
-        if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;
+const filtered = items.reduce((acc, curr) => {
+    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
+    const price = +curr.offer.slice(1);
+
+    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;
+
+    return acc;
+}, {});
 
-        return acc;
-    }, {});
+await Actor.pushData(Object.values(filtered));
 
-    await Apify.pushData(Object.values(filtered));
-});
+await Actor.exit();
 ```
 
 Cool! But **wait**, don't forget to configure the **INPUT_SCHEMA.json** file as well! It's not necessary to do this step, as we'll be calling the actor through Apify's API within a webhook, but it's still good to get into the habit of writing quality input schemas that describe the input values your actors are expecting.
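The `filtered` reduce added above can be exercised on its own; here is a minimal sketch, with hypothetical sample items (the ASINs and prices are invented for illustration):

```javascript
// Keep only the cheapest offer per ASIN, mirroring the reduce in the diff above.
// Sample items are hypothetical.
const items = [
    { asin: 'B001', offer: '$10.99' },
    { asin: 'B001', offer: '$8.49' },
    { asin: 'B002', offer: '$15.00' },
];

const filtered = items.reduce((acc, curr) => {
    // Parse "$8.49" -> 8.49 by dropping the currency symbol
    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
    const price = +curr.offer.slice(1);

    // First time we see this ASIN, or a cheaper offer: keep the current item
    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

    return acc;
}, {});

// One entry per ASIN, each the lowest-priced offer seen
const results = Object.values(filtered);
```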
@@ -185,7 +188,7 @@ Additionally, we should be able to see that our **filter-actor** was run, and ha
 
 **Q: Within itself, can you get the exact time that an actor was started?**
 
-**A:** Yes. The time the actor was started can be retrieved through the `startedAt` property from the `Apify.getEnv()` function, or directly from `process.env.APIFY_STARTED_AT`
+**A:** Yes. The time the actor was started can be retrieved through the `startedAt` property from the `Actor.getEnv()` function, or directly from `process.env.APIFY_STARTED_AT`.
 
 **Q: What are the types of default storages connected to an actor's run?**
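As the answer above notes, the start time is exposed through `process.env.APIFY_STARTED_AT`. A quick sketch of reading it; the timestamp value here is invented, since outside the platform the variable is unset:

```javascript
// On the Apify platform, APIFY_STARTED_AT holds an ISO 8601 timestamp.
// We set a hypothetical value here only to demonstrate parsing it locally.
process.env.APIFY_STARTED_AT = '2022-05-01T12:00:00.000Z';

const startedAt = new Date(process.env.APIFY_STARTED_AT);
const secondsSinceStart = (Date.now() - startedAt.getTime()) / 1000;
```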

content/academy/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md

Lines changed: 82 additions & 93 deletions
@@ -12,49 +12,50 @@ Last lesson, our task was outlined for us. In this lesson, we'll be completing t
 
 ## [](#using-named-dataset) Using a named dataset
 
-Something important to understand is that, in the Apify SDK, when you use `Apify.pushData()`, the data will always be pushed to the default dataset. To open up a named dataset, we'll use the `Apify.openDataset()` function:
+Something important to understand is that, in the Apify SDK, when you use `Actor.pushData()`, the data will always be pushed to the default dataset. To open up a named dataset, we'll use the `Actor.openDataset()` function:
 
 ```JavaScript
 // main.js
-Apify.main(async () => {
-    const { keyword } = await Apify.getInput();
+// ...
+
+await Actor.init();
+
+const { keyword } = await Actor.getInput();
 
-    // Open a dataset with a custom named based on the
-    // keyword which was inputted by the user
-    const dataset = await Apify.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
+// Open a dataset with a custom name based on the
+// keyword which was provided by the user
+const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
 // ...
 ```
 
-If we remember correctly, we are pushing data to the dataset in the `handleOffers()` function we created in **routes.js**. Let's pass the `dataset` variable pointing to our named dataset into `handleOffers()` as an argument:
+If we remember correctly, we are pushing data to the dataset in the `labels.OFFERS` handler we created in **routes.js**. Let's export the `dataset` variable pointing to our named dataset so we can import it in **routes.js** and use it in the handler:
 
 ```JavaScript
-// ...
-    case labels.OFFERS:
-        await handleOffers(context, dataset);
-        break;
-}
-// ...
+export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
 ```
 
-Finally, let's modify the function to use the new `dataset` variable being passed to it rather than the `Apify` class:
+Finally, let's modify the function to use the new `dataset` variable rather than the `Actor` class:
 
 ```JavaScript
-// Expect a second parameter, which will be a dataset
-exports.handleOffers = async ({ $, request }, dataset) => {
+// Import the dataset pointer
+import { dataset } from './main.js';
+
+// ...
+
+router.addHandler(labels.OFFERS, async ({ $, request }) => {
     const { data } = request.userData;
 
     for (const offer of $('#aod-offer')) {
         const element = $(offer);
 
-        // Replace "Apify" with the name of the second
-        // parameter, in this case we called it "dataset"
+        // Replace "Actor" with "dataset"
         await dataset.pushData({
             ...data,
             sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
             offer: element.find('.a-price .a-offscreen').text().trim(),
         });
     }
-};
+});
 ```
 
 That's it! Now, our actor will push its data to a dataset named **amazon-offers-KEYWORD**!
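The dataset name above is derived from the input keyword with `keyword.replace(' ', '-')`; note that `String.replace` with a string pattern only swaps the first space, so a multi-word keyword keeps its later spaces. A small sketch of both behaviors (the keyword values are examples):

```javascript
// Derive a dataset name the way the actor does (first space only).
const toDatasetName = (keyword) => `amazon-offers-${keyword.replace(' ', '-')}`;

// Variant using a global regex, which handles every run of whitespace.
const toDatasetNameAll = (keyword) => `amazon-offers-${keyword.replace(/\s+/g, '-')}`;
```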
@@ -66,6 +67,8 @@ We now want to store the cheapest item in the default key-value store under a ke
 Let's add the following code to the bottom of the actor, after **Crawl finished.** is logged to the console:
 
 ```Javascript
+// ...
+
 const cheapest = items.reduce((prev, curr) => {
     // If there is no previous offer price, or the previous is more
     // expensive, set the cheapest to our current item
@@ -76,7 +79,9 @@ const cheapest = items.reduce((prev, curr) => {
 
 // Set the "CHEAPEST-ITEM" key in the key-value store to be the
 // newly discovered cheapest item
-await Apify.setValue(CHEAPEST_ITEM, cheapest);
+await Actor.setValue(CHEAPEST_ITEM, cheapest);
+
+await Actor.exit();
 ```
 
 > If you start receiving a linting error after adding the following code to your **main.js** file, add `"parserOptions": { "ecmaVersion": "latest" }` to the **.eslintrc** file in the root directory of your project.
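The `cheapest` reduce shown in this hunk can be checked in isolation; a minimal sketch with hypothetical offers following the `"$X.XX"` shape the actor scrapes:

```javascript
// Sample items are invented for illustration.
const items = [
    { asin: 'B001', offer: '$12.50' },
    { asin: 'B002', offer: '$9.99' },
    { asin: 'B003', offer: '$20.00' },
];

const cheapest = items.reduce((prev, curr) => {
    // If there is no previous offer price, or the previous is more
    // expensive, take the current item instead
    if (!prev?.offer) return curr;
    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    return prev;
});
```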
@@ -85,19 +90,17 @@ You might have noticed that we are using a variable instead of a string for the 
 
 ```JavaScript
 // constants.js
-const BASE_URL = 'https://www.amazon.com';
+export const BASE_URL = 'https://www.amazon.com';
 
-const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;
+export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;
 
-const labels = {
+export const labels = {
     START: 'START',
     PRODUCT: 'PRODUCT',
     OFFERS: 'OFFERS',
 };
 
-const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
-
-module.exports = { BASE_URL, OFFERS_URL, labels, CHEAPEST_ITEM };
+export const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
 ```
 
 ## [](#code-check-in) Code check-in
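The `OFFERS_URL` helper exported above builds the offers AJAX endpoint from an ASIN; a quick check of the template (the ASIN is a made-up example):

```javascript
// Constants copied from the constants.js shown above.
const BASE_URL = 'https://www.amazon.com';
const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;

// Hypothetical ASIN, just to exercise the template.
const url = OFFERS_URL('B000000000');
```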
@@ -106,81 +109,67 @@ Just to ensure we're all on the same page, here is what the **main.js** file loo
 
 ```JavaScript
 // main.js
-const Apify = require('apify');
+import { Actor } from 'apify';
+import { CheerioCrawler, log } from '@crawlee/cheerio';
 
-const { handleStart, handleProduct, handleOffers } = require('./src/routes');
-const { BASE_URL, labels, CHEAPEST_ITEM } = require('./src/constants');
+import { router } from './routes.js';
+import { BASE_URL, CHEAPEST_ITEM } from './constants';
 
-const { log } = Apify.utils;
+await Actor.init();
 
-Apify.main(async () => {
-    const { keyword } = await Apify.getInput();
+const { keyword } = await Actor.getInput();
 
-    const dataset = await Apify.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
+const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
 
-    const requestList = await Apify.openRequestList('start-urls', [
-        {
-            url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
-            userData: {
-                label: 'START',
-                keyword,
-            },
-        },
-    ]);
-
-    const requestQueue = await Apify.openRequestQueue();
-    const proxyConfiguration = await Apify.createProxyConfiguration({
-        groups: ['RESIDENTIAL'],
-    });
-
-    const crawler = new Apify.CheerioCrawler({
-        requestList,
-        requestQueue,
-        proxyConfiguration,
-        useSessionPool: true,
-        maxConcurrency: 50,
-        handlePageFunction: async (context) => {
-            const { label } = context.request.userData;
-
-            switch (label) {
-                default:
-                    return log.info('Unable to handle this request');
-                case labels.START:
-                    await handleStart(context);
-                    break;
-                case labels.PRODUCT:
-                    await handleProduct(context);
-                    break;
-                case labels.OFFERS:
-                    await handleOffers(context, dataset);
-                    break;
-            }
-        },
-    });
+const proxyConfiguration = await Actor.createProxyConfiguration({
+    groups: ['RESIDENTIAL'],
+});
+
+const crawler = new CheerioCrawler({
+    proxyConfiguration,
+    useSessionPool: true,
+    maxConcurrency: 50,
+    requestHandler: router,
+});
 
-    log.info('Starting the crawl.');
-    await crawler.run();
-    log.info('Crawl finished.');
+await crawler.addRequests([
+    {
+        url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
+        userData: {
+            label: 'START',
+            keyword,
+        },
+    },
+]);
 
-    const { items } = await dataset.getData();
+log.info('Starting the crawl.');
+await crawler.run();
+log.info('Crawl finished.');
 
-    const cheapest = items.reduce((prev, curr) => {
-        if (!prev?.offer) return curr;
-        if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
-        return prev;
-    });
+const { items } = await dataset.getData();
 
-    await Apify.setValue(CHEAPEST_ITEM, cheapest);
+const cheapest = items.reduce((prev, curr) => {
+    if (!prev?.offer) return curr;
+    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
+    return prev;
 });
+
+await Actor.setValue(CHEAPEST_ITEM, cheapest);
+
+await Actor.exit();
 ```
 
 And here is **routes.js**:
 
 ```JavaScript
 // routes.js
-const { BASE_URL, OFFERS_URL, labels } = require('./constants');
+import { dataset } from './main.js';
+import { BASE_URL, OFFERS_URL, labels } from './constants';
+import { createCheerioRouter } from '@crawlee/cheerio';
 
-exports.handleStart = async ({ $, crawler: { requestQueue }, request }) => {
+export const router = createCheerioRouter();
+
+router.addHandler(labels.START, async ({ $, crawler, request }) => {
     const { keyword } = request.userData;
 
     const products = $('div > div[data-asin]:not([data-asin=""])');
@@ -191,7 +180,7 @@ exports.handleStart = async ({ $, crawler: { requestQueue }, request }) => {
 
         const url = `${BASE_URL}${titleElement.attr('href')}`;
 
-        await requestQueue.addRequest({
+        await crawler.addRequests([{
             url,
             userData: {
                 label: labels.PRODUCT,
@@ -202,16 +191,16 @@ exports.handleStart = async ({ $, crawler: { requestQueue }, request }) => {
                 keyword,
             },
         },
-        });
+        }]);
     }
-};
+});
 
-exports.handleProduct = async ({ $, crawler: { requestQueue }, request }) => {
+router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
     const { data } = request.userData;
 
     const element = $('div#productDescription');
 
-    await requestQueue.addRequest({
+    await crawler.addRequests([{
         url: OFFERS_URL(data.asin),
         userData: {
             label: labels.OFFERS,
@@ -220,10 +209,10 @@ exports.handleProduct = async ({ $, crawler: { requestQueue }, request }) => {
             description: element.text().trim(),
         },
     },
-    });
-};
+    }]);
+});
 
-exports.handleOffers = async ({ $, request }, dataset) => {
+router.addHandler(labels.OFFERS, async ({ $, request }) => {
     const { data } = request.userData;
 
     for (const offer of $('#aod-offer')) {
@@ -235,7 +224,7 @@ exports.handleOffers = async ({ $, request }, dataset) => {
             offer: element.find('.a-price .a-offscreen').text().trim(),
         });
     }
-};
+});
 ```
 
 Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform!
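The `createCheerioRouter` pattern in this commit replaces the old `switch` on `request.userData.label`; conceptually, a router is just a map from label to handler plus a dispatch function. A plain-JavaScript sketch of that idea (an illustration only, not Crawlee's actual implementation):

```javascript
// Minimal stand-in for a label-based router: handlers are registered per
// label, and dispatch looks the label up on the request's userData.
const createRouter = () => {
    const handlers = new Map();
    const router = async (context) => {
        const { label } = context.request.userData;
        const handler = handlers.get(label);
        if (!handler) throw new Error(`No handler for label: ${label}`);
        return handler(context);
    };
    router.addHandler = (label, handler) => handlers.set(label, handler);
    return router;
};

// Usage mirroring the handlers registered in routes.js above:
const router = createRouter();
const seen = [];
router.addHandler('START', async ({ request }) => seen.push(`start:${request.userData.keyword}`));
router.addHandler('OFFERS', async ({ request }) => seen.push('offers'));
```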
