---
title: III - Using storage & creating tasks
description: Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.
sidebar_position: 3
slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
---

# Using storage & creating tasks {#using-storage-creating-tasks}

**Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.**

---

In the last lesson, our task was outlined for us. In this lesson, we'll complete that task by making our Amazon Actor push to a **named dataset** and use the **default key-value store** to store the cheapest item found by the scraper. Finally, we'll create a task for the Actor back on the Apify platform.

## Using a named dataset {#using-named-dataset}

It's important to understand that, in the Apify SDK, `Actor.pushData()` always pushes data to the default dataset. To open a named dataset, we'll use the `Actor.openDataset()` function:

```js
// main.js
// ...

await Actor.init();

const { keyword } = await Actor.getInput();

// Open a dataset with a custom name based on the
// keyword that was inputted by the user
const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
// ...
```
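
One thing to be aware of: calling `String.prototype.replace()` with a plain string pattern only replaces the first occurrence, so a multi-word keyword would keep its remaining spaces in the dataset name. A quick sketch (the keyword below is a made-up example):

```javascript
// 'replace' with a string pattern only swaps the first match
const keyword = 'iphone 13 case';

const firstSpaceOnly = keyword.replace(' ', '-');
// 'iphone-13 case' - the second space survives

// A global regex replaces every run of whitespace instead
const slug = keyword.replace(/\s+/g, '-');
// 'iphone-13-case'

console.log(firstSpaceOnly, slug);
```

If you expect multi-word keywords and want fully hyphenated dataset names, the global-regex variant is the safer choice.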

If you remember correctly, we push data to the dataset in the `labels.OFFERS` handler we created in **routes.js**. Let's export the `dataset` variable pointing to our named dataset so we can import it in **routes.js** and use it in the handler:

```js
export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
```

Finally, let's modify the handler to use the new `dataset` variable rather than the `Actor` class:

```js
// Import the dataset pointer
import { dataset } from './main.js';

// ...

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        // Replace "Actor" with "dataset"
        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```

That's it! Now, our Actor will push its data to a dataset named **amazon-offers-KEYWORD**!

## Using a key-value store {#using-key-value-store}

We now want to store the cheapest item in the default key-value store under a key named **CHEAPEST-ITEM**. The most efficient and practical way of doing this is to read all of the named dataset's items, find the cheapest one, and push it to the store.

Let's add the following code to the bottom of the Actor, after **Crawl finished.** is logged to the console:

```js
// ...

const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    // If there is no previous offer price, or the previous is more
    // expensive, set the cheapest to our current item
    if (!prev?.offer || +prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    // Otherwise, keep our previous item
    return prev;
});

// Set the "CHEAPEST-ITEM" key in the key-value store to be the
// newly discovered cheapest item
await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```
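
The comparison above relies on the offer strings having a shape like `$24.99`: `slice(1)` drops the leading currency symbol, and the unary `+` coerces the remainder to a number. Here is a standalone sketch of that logic using invented sample data:

```javascript
// Invented offer objects, shaped like what the OFFERS handler pushes
const items = [
    { sellerName: 'Seller A', offer: '$24.99' },
    { sellerName: 'Seller B', offer: '$19.49' },
    { sellerName: 'Seller C', offer: '$21.00' },
];

// +'$19.49'.slice(1) strips the '$' and coerces '19.49' to a number
const cheapest = items.reduce((prev, curr) => {
    if (!prev?.offer || +prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    return prev;
});

console.log(cheapest.sellerName); // 'Seller B'
```

Note that this simple parse assumes a single leading currency symbol and no thousands separators; a price like `$1,299.99` would need a more robust parser.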

> If you start receiving a linting error after adding the above code to your **main.js** file, add `"parserOptions": { "ecmaVersion": "latest" }` to the **.eslintrc** file in the root directory of your project.

You might have noticed that we are using a variable instead of a string for the key name in the key-value store. This is because we're using an exported variable from **constants.js** (which is best practice, as discussed in the [**modularity**](../../../webscraping/scraping_basics_javascript/challenge/modularity.md) lesson back in the **Web scraping for beginners** course). Here is what our **constants.js** file looks like:

```js
// constants.js
export const BASE_URL = 'https://www.amazon.com';

export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;

export const labels = {
    START: 'START',
    PRODUCT: 'PRODUCT',
    OFFERS: 'OFFERS',
};

export const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
```
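
To see what the `OFFERS_URL` template function produces, here's a small sketch (the ASIN is a made-up placeholder, not a real product):

```javascript
// Mirrors the template function from constants.js
const BASE_URL = 'https://www.amazon.com';
const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;

// 'B000000000' is a placeholder ASIN
console.log(OFFERS_URL('B000000000'));
// https://www.amazon.com/gp/aod/ajax/ref=auto_load_aod?asin=B000000000&pc=dp
```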

## Code check-in {#code-check-in}

Here is what the **main.js** file looks like now:

```js
// main.js
import { Actor } from 'apify';
import { CheerioCrawler, log } from '@crawlee/cheerio';

import { router } from './routes.js';
import { BASE_URL, CHEAPEST_ITEM } from './constants.js';

await Actor.init();

const { keyword } = await Actor.getInput();

export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    maxConcurrency: 50,
    requestHandler: router,
});

await crawler.addRequests([
    {
        url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
        label: 'START',
        userData: {
            keyword,
        },
    },
]);

log.info('Starting the crawl.');
await crawler.run();
log.info('Crawl finished.');

const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    if (!prev?.offer) return curr;
    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    return prev;
});

await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```

And here is **routes.js**:

```js
// routes.js
import { createCheerioRouter } from '@crawlee/cheerio';
import { dataset } from './main.js';
import { BASE_URL, OFFERS_URL, labels } from './constants.js';

export const router = createCheerioRouter();

router.addHandler(labels.START, async ({ $, crawler, request }) => {
    const { keyword } = request.userData;

    const products = $('div > div[data-asin]:not([data-asin=""])');

    for (const product of products) {
        const element = $(product);
        const titleElement = $(element.find('.a-text-normal[href]'));

        const url = `${BASE_URL}${titleElement.attr('href')}`;

        await crawler.addRequests([{
            url,
            label: labels.PRODUCT,
            userData: {
                data: {
                    title: titleElement.first().text().trim(),
                    asin: element.attr('data-asin'),
                    itemUrl: url,
                    keyword,
                },
            },
        }]);
    }
});

router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
    const { data } = request.userData;

    const element = $('div#productDescription');

    await crawler.addRequests([{
        url: OFFERS_URL(data.asin),
        label: labels.OFFERS,
        userData: {
            data: {
                ...data,
                description: element.text().trim(),
            },
        },
    }]);
});

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```

Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform!

## Creating a task {#creating-task}

Back on the platform, on your Actor's page, you can see a button in the top right-hand corner that says **Create new task**:

Then, configure the task to use **google pixel** as a keyword and click **Save**.

> You can also add a custom name and description for the task in the **Settings** tab!

After saving it, you'll be able to see the newly created task in the **Tasks** tab on the Apify Console. Go ahead and run it. Did it work?

## Quiz answers 📝 {#quiz-answers}

**Q: What is the relationship between Actors and tasks?**