---
title: III - Using storage & creating tasks
description: Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.
sidebar_position: 3
slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
---

# Using storage & creating tasks {#using-storage-creating-tasks}
**Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.**

---

In the last lesson, our task was outlined for us. In this lesson, we'll complete that task by making our Amazon Actor push to a **named dataset** and use the **default key-value store** to store the cheapest item found by the scraper. Finally, we'll create a task for the Actor back on the Apify platform.

## Using a named dataset {#using-named-dataset}

Something important to understand is that, in the Apify SDK, when you use `Actor.pushData()`, the data will always be pushed to the default dataset. To open up a named dataset, we'll use the `Actor.openDataset()` function:

```js
// main.js
// ...

await Actor.init();

const { keyword } = await Actor.getInput();

// Open a dataset with a custom name based on the
// keyword which was inputted by the user
const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
// ...
```
If you remember, we're pushing data to the dataset in the `labels.OFFERS` handler we created in **routes.js**. Let's export the `dataset` variable pointing to our named dataset so we can import it in **routes.js** and use it in the handler:

```js
export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
```
Finally, let's modify the handler to use the new `dataset` variable rather than the `Actor` class:

```js
// Import the dataset pointer
import { dataset } from './main.js';

// ...

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        // Replace "Actor" with "dataset"
        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```

That's it! Now, our Actor will push its data to a dataset named **amazon-offers-KEYWORD**!
## Using a key-value store {#using-key-value-store}

We now want to store the cheapest item in the default key-value store under a key named **CHEAPEST-ITEM**. An efficient and practical way of doing this is to reduce the named dataset's items down to the single cheapest one and push it to the store.
Let's add the following code to the bottom of the Actor, right after **Crawl finished.** is logged to the console:

```js
// ...

// Grab the named dataset's items so we can find the cheapest offer
const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    // If there is no previous offer price, or the previous offer is more
    // expensive, set the cheapest to our current item
    if (!prev?.offer || +prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    // Otherwise, keep our previous item
    return prev;
});

// Set the "CHEAPEST-ITEM" key in the key-value store to the
// newly discovered cheapest item
await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```
> If you start receiving a linting error after adding the above code to your **main.js** file, add `"parserOptions": { "ecmaVersion": "latest" }` to the **.eslintrc** file in the root directory of your project.

You might have noticed that we are using a variable instead of a string for the key name in the key-value store. This is because we're using an exported variable from **constants.js** (which is best practice, as discussed in the [**modularity**](../../../webscraping/scraping_basics_javascript/challenge/modularity.md) lesson back in the **Web scraping for beginners** course). Here is what our **constants.js** file looks like:
```js
// constants.js
export const BASE_URL = 'https://www.amazon.com';

export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;

export const labels = {
    START: 'START',
    PRODUCT: 'PRODUCT',
    OFFERS: 'OFFERS',
};

export const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
```
## Code check-in {#code-check-in}

Here is what the **main.js** file looks like now:
```js
// main.js
import { Actor } from 'apify';
import { CheerioCrawler, log } from '@crawlee/cheerio';

import { router } from './routes.js';
import { BASE_URL, CHEAPEST_ITEM } from './constants.js';

await Actor.init();

const { keyword } = await Actor.getInput();

export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    maxConcurrency: 50,
    requestHandler: router,
});

await crawler.addRequests([
    {
        url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
        label: 'START',
        userData: {
            keyword,
        },
    },
]);

log.info('Starting the crawl.');
await crawler.run();
log.info('Crawl finished.');

const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    if (!prev?.offer) return curr;
    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    return prev;
});

await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```
And here is **routes.js**:
```js
// routes.js
import { createCheerioRouter } from '@crawlee/cheerio';
import { dataset } from './main.js';
import { BASE_URL, OFFERS_URL, labels } from './constants.js';

export const router = createCheerioRouter();

router.addHandler(labels.START, async ({ $, crawler, request }) => {
    const { keyword } = request.userData;

    const products = $('div > div[data-asin]:not([data-asin=""])');

    for (const product of products) {
        const element = $(product);
        const titleElement = $(element.find('.a-text-normal[href]'));

        const url = `${BASE_URL}${titleElement.attr('href')}`;

        await crawler.addRequests([{
            url,
            label: labels.PRODUCT,
            userData: {
                data: {
                    title: titleElement.first().text().trim(),
                    asin: element.attr('data-asin'),
                    itemUrl: url,
                    keyword,
                },
            },
        }]);
    }
});

router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
    const { data } = request.userData;

    const element = $('div#productDescription');

    await crawler.addRequests([{
        url: OFFERS_URL(data.asin),
        label: labels.OFFERS,
        userData: {
            data: {
                ...data,
                description: element.text().trim(),
            },
        },
    }]);
});

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```
Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform!

## Creating a task {#creating-task}

Back on the platform, on your Actor's page, you can see a button in the top right-hand corner that says **Create new task**.

Then, configure the task to use **google pixel** as a keyword and click **Save**.

> You can also add a custom name and description for the task in the **Settings** tab!

After saving it, you'll be able to see the newly created task in the **Tasks** tab on the Apify Console. Go ahead and run it. Did it work?
## Quiz answers 📝 {#quiz-answers}

**Q: What is the relationship between Actors and tasks?**