
Commit 750ac86

docs(academy): remove some noisy exercise tasks from Expert scraping (#1420)
I removed one exercise that felt duplicated and added little value, along with two questions that were repeated elsewhere.
1 parent fb56b9b commit 750ac86

File tree

5 files changed: +3 -260 lines changed


sources/academy/platform/expert_scraping_with_apify/apify_api_and_client.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ You can use one of the two main ways to programmatically interact with the Apify
 
 ## Our task
 
-In the previous lesson, we created a **task** for the Amazon Actor we built in the first two lessons of this course. Now, we'll be creating another new Actor, which will have two jobs:
+We'll be creating another new Actor, which will have two jobs:
 
 1. Programmatically call the task for the Amazon Actor.
 2. Export its results into CSV format under a new key called **OUTPUT.csv** in the default key-value store.
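
For a concrete picture of those two jobs, here is a minimal sketch using the Apify SDK v3 together with the `apify-client` package; the task ID is a placeholder and this is only an illustration, not the course's official solution:

```js
import { Actor } from 'apify';
import { ApifyClient } from 'apify-client';

await Actor.init();

// On the Apify platform, the APIFY_TOKEN environment variable is available to the running Actor.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// 1. Programmatically call the task for the Amazon Actor ('YOUR_TASK_ID' is a placeholder).
const run = await client.task('YOUR_TASK_ID').call();

// 2. Download the run's dataset as CSV and save it under OUTPUT.csv in the default key-value store.
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');
await Actor.setValue('OUTPUT.csv', csv, { contentType: 'text/csv' });

await Actor.exit();
```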

sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md

Lines changed: 1 addition & 3 deletions
@@ -28,9 +28,7 @@ Before moving on, give these valuable resources a quick lookover:
 
 1. Why might you want to store statistics about an Actor's run (or a specific request)?
 2. In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it?
-3. We are building a new imaginary scraper for a website that sometimes displays captchas at unexpected times, rather than displaying the content we want. How would you keep a count of the total number of captchas hit for the entire run? Where would you store this data? Why?
-4. Is storing these types of values necessary for every single Actor?
-5. What is the difference between the `failedRequestHandler` and `errorHandler`?
+3. What is the difference between the `failedRequestHandler` and `errorHandler`?
 
 ## Our task
 

sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md

Lines changed: 0 additions & 8 deletions
@@ -154,14 +154,6 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => {
 
 **A:** This information is available directly on the request object under the property **retryCount**.
 
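
As an illustration (not part of this commit), the **retryCount** property could be pushed along with the scraped data in the `OFFERS` handler from the course's Amazon scraper; the `retries` field name is just an example:

```js
// routes.js (sketch) - the same OFFERS handler, now also recording the retry count
import { createCheerioRouter } from '@crawlee/cheerio';
import { dataset } from './main.js';
import { labels } from './constants.js';

export const router = createCheerioRouter();

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        await dataset.pushData({
            ...data,
            // request.retryCount holds how many times this request has been retried
            retries: request.retryCount,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```
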
-**Q: We are building a new imaginary scraper for a website that sometimes displays captchas at unexpected times, rather than displaying the content we want. How would you keep a count of the total number of captchas hit for the entire run? Where would you store this data? Why?**
-
-**A:** First, build a function that detects whether a captcha has been hit. If so, it throws an error and increments the **numberOfCaptchas** count. This data might be stored on a persisted state object to help better assess which anti-scraping mitigation techniques the scraper should use.
-
-**Q: Is storing these types of values necessary for every single Actor?**
-
-**A:** For small Actors, it might be a waste of time to do this. For large-scale Actors, it can be extremely helpful when debugging, and most definitely worth the extra 10–20 minutes of development time. Usually, though, the default statistics from Crawlee and the SDK are enough for simple run stats.
-
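
The removed answer above mentions a "persisted state object"; as an illustration (not part of this commit), one way to keep such a counter with the Apify SDK v3 could look like the sketch below, where the `STATE` key, the selector, and the helper name are assumptions:

```js
// main.js (sketch) - counting captcha hits in a state object that survives restarts
import { Actor } from 'apify';

await Actor.init();

// Load previously persisted state (e.g. after a migration), or start fresh.
const state = (await Actor.getValue('STATE')) ?? { numberOfCaptchas: 0 };

// Periodically write the state object to the default key-value store.
Actor.on('persistState', async () => {
    await Actor.setValue('STATE', state);
});

// Hypothetical detection helper - call it at the top of each request handler.
// The real selector depends on the target website.
export const throwOnCaptcha = ($) => {
    if ($('form[action*="validateCaptcha"]').length > 0) {
        state.numberOfCaptchas++;
        throw new Error('Captcha hit!');
    }
};
```
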
 **Q: What is the difference between the `failedRequestHandler` and `errorHandler`?**
 
 **A:** `failedRequestHandler` runs after a request has failed and reached its `maxRetries` count. `errorHandler` runs on every failure and retry.
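
As an illustration of where these two handlers plug in (not part of this commit), here is a minimal Crawlee sketch; the log messages are only examples:

```js
import { CheerioCrawler, log } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    // Runs on every failed attempt, before the request is retried.
    errorHandler: async ({ request }, error) => {
        log.warning(`Attempt failed for ${request.url} (retry ${request.retryCount}): ${error.message}`);
    },
    // Runs only once a request has exhausted all of its retries.
    failedRequestHandler: async ({ request }, error) => {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
    requestHandler: async ({ request }) => {
        // ... normal scraping logic goes here
    },
});

await crawler.run(['https://example.com']);
```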

sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md

Lines changed: 1 addition & 240 deletions
@@ -1,251 +1,12 @@
 ---
 title: III - Using storage & creating tasks
-description: Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.
+description: Get quiz answers and explanations for the lesson about using storage and creating tasks on the Apify platform.
 sidebar_position: 3
 slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
 ---
 
 # Using storage & creating tasks {#using-storage-creating-tasks}
 
-**Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.**
-
----
-
-Last lesson, our task was outlined for us. In this lesson, we'll be completing that task by making our Amazon Actor push to a **named dataset** and use the **default key-value store** to store the cheapest item found by the scraper. Finally, we'll create a task for the Actor back on the Apify platform.
-
-## Using a named dataset {#using-named-dataset}
-
-Something important to understand is that, in the Apify SDK, when you use `Actor.pushData()`, the data will always be pushed to the default dataset. To open up a named dataset, we'll use the `Actor.openDataset()` function:
-
-```js
-// main.js
-// ...
-
-await Actor.init();
-
-const { keyword } = await Actor.getInput();
-
-// Open a dataset with a custom name based on the
-// keyword provided by the user
-const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
-// ...
-```
-
-If we remember correctly, we are pushing data to the dataset in the `labels.OFFERS` handler we created in **routes.js**. Let's export the `dataset` variable pointing to our named dataset so we can import it in **routes.js** and use it in the handler:
-
-```js
-export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
-```
-
-Finally, let's modify the function to use the new `dataset` variable rather than the `Actor` class:
-
-```js
-// Import the dataset pointer
-import { dataset } from './main.js';
-
-// ...
-
-router.addHandler(labels.OFFERS, async ({ $, request }) => {
-    const { data } = request.userData;
-
-    for (const offer of $('#aod-offer')) {
-        const element = $(offer);
-
-        // Replace "Actor" with "dataset"
-        await dataset.pushData({
-            ...data,
-            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
-            offer: element.find('.a-price .a-offscreen').text().trim(),
-        });
-    }
-});
-```
-
-That's it! Now, our Actor will push its data to a dataset named **amazon-offers-KEYWORD**!
-
-## Using a key-value store {#using-key-value-store}
-
-We now want to store the cheapest item in the default key-value store under a key named **CHEAPEST-ITEM**. The most efficient and practical way of doing this is by filtering through all of the newly named dataset's items and pushing the cheapest one to the store.
-
-Let's add the following code to the bottom of the Actor, after **Crawl finished** is logged to the console:
-
-```js
-// ...
-
-const cheapest = items.reduce((prev, curr) => {
-    // If there is no previous offer price, or the previous is more
-    // expensive, set the cheapest to our current item
-    if (!prev?.offer || +prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
-
-    // Otherwise, keep our previous item
-    return prev;
-});
-
-// Set the "CHEAPEST-ITEM" key in the key-value store to be the
-// newly discovered cheapest item
-await Actor.setValue(CHEAPEST_ITEM, cheapest);
-
-await Actor.exit();
-```
-
-> If you start receiving a linting error after adding this code to your **main.js** file, add `"parserOptions": { "ecmaVersion": "latest" }` to the **.eslintrc** file in the root directory of your project.
-
-You might have noticed that we are using a variable instead of a string for the key name in the key-value store. This is because we're using an exported variable from **constants.js** (which is best practice, as discussed in the [**modularity**](../../../webscraping/scraping_basics_javascript/challenge/modularity.md) lesson back in the **Web scraping for beginners** course). Here is what our **constants.js** file looks like:
-
-```js
-// constants.js
-export const BASE_URL = 'https://www.amazon.com';
-
-export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;
-
-export const labels = {
-    START: 'START',
-    PRODUCT: 'PRODUCT',
-    OFFERS: 'OFFERS',
-};
-
-export const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
-```
-
-## Code check-in {#code-check-in}
-
-Here is what the **main.js** file looks like now:
-
-```js
-// main.js
-import { Actor } from 'apify';
-import { CheerioCrawler, log } from '@crawlee/cheerio';
-
-import { router } from './routes.js';
-import { BASE_URL, CHEAPEST_ITEM } from './constants.js';
-
-await Actor.init();
-
-const { keyword } = await Actor.getInput();
-
-export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(' ', '-')}`);
-
-const proxyConfiguration = await Actor.createProxyConfiguration({
-    groups: ['RESIDENTIAL'],
-});
-
-const crawler = new CheerioCrawler({
-    proxyConfiguration,
-    useSessionPool: true,
-    maxConcurrency: 50,
-    requestHandler: router,
-});
-
-await crawler.addRequests([
-    {
-        url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
-        label: 'START',
-        userData: {
-            keyword,
-        },
-    },
-]);
-
-log.info('Starting the crawl.');
-await crawler.run();
-log.info('Crawl finished.');
-
-const { items } = await dataset.getData();
-
-const cheapest = items.reduce((prev, curr) => {
-    if (!prev?.offer) return curr;
-    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
-    return prev;
-});
-
-await Actor.setValue(CHEAPEST_ITEM, cheapest);
-
-await Actor.exit();
-```
-
-And here is **routes.js**:
-
-```js
-// routes.js
-import { createCheerioRouter } from '@crawlee/cheerio';
-import { dataset } from './main.js';
-import { BASE_URL, OFFERS_URL, labels } from './constants.js';
-
-export const router = createCheerioRouter();
-
-router.addHandler(labels.START, async ({ $, crawler, request }) => {
-    const { keyword } = request.userData;
-
-    const products = $('div > div[data-asin]:not([data-asin=""])');
-
-    for (const product of products) {
-        const element = $(product);
-        const titleElement = $(element.find('.a-text-normal[href]'));
-
-        const url = `${BASE_URL}${titleElement.attr('href')}`;
-
-        await crawler.addRequests([{
-            url,
-            label: labels.PRODUCT,
-            userData: {
-                data: {
-                    title: titleElement.first().text().trim(),
-                    asin: element.attr('data-asin'),
-                    itemUrl: url,
-                    keyword,
-                },
-            },
-        }]);
-    }
-});
-
-router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
-    const { data } = request.userData;
-
-    const element = $('div#productDescription');
-
-    await crawler.addRequests([{
-        url: OFFERS_URL(data.asin),
-        label: labels.OFFERS,
-        userData: {
-            data: {
-                ...data,
-                description: element.text().trim(),
-            },
-        },
-    }]);
-});
-
-router.addHandler(labels.OFFERS, async ({ $, request }) => {
-    const { data } = request.userData;
-
-    for (const offer of $('#aod-offer')) {
-        const element = $(offer);
-
-        await dataset.pushData({
-            ...data,
-            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
-            offer: element.find('.a-price .a-offscreen').text().trim(),
-        });
-    }
-});
-```
-
-Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform!
-
-## Creating a task {#creating-task}
-
-Back on the platform, on your Actor's page, you can see a button in the top right-hand corner that says **Create new task**:
-
-![Create new task button](./images/create-new-task.jpg)
-
-Then, configure the task to use **google pixel** as a keyword and click **Save**.
-
-> You can also add a custom name and description for the task in the **Settings** tab!
-
-![Creating a task](./images/creating-task.png)
-
-After saving it, you'll be able to see the newly created task in the **Tasks** tab on the Apify Console. Go ahead and run it. Did it work?
-
 ## Quiz answers 📝 {#quiz-answers}
 
 **Q: What is the relationship between Actors and tasks?**

sources/academy/platform/expert_scraping_with_apify/tasks_and_storage.md

Lines changed: 0 additions & 8 deletions
@@ -34,14 +34,6 @@ Storage allows us to save persistent data for further processing. As you'll lear
 2. What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage?
 3. What is data retention, and how does it work for all types of storages (default and named)?
 
-## Our task {#our-task}
-
-Once again, we'll be adding onto our main Amazon-scraping Actor in this activity, but don't worry - this lesson will be quite light, just like the last one.
-
-We have decided that we want to retain the data scraped by the Actor for a long period of time, so instead of pushing to the default dataset, we will be pushing to a named dataset. Additionally, we want to save the absolute cheapest item found by the scraper into the default key-value store under a key named **CHEAPEST-ITEM**.
-
-Finally, we'll create a task for the Actor that saves the configuration with the **keyword** set to **google pixel**.
-
 [**Solution**](./solutions/using_storage_creating_tasks.md)
 
 ## Next up {#next}
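
To make the storage questions above more concrete, here is a small sketch (not part of this commit) contrasting default and named storages with the Apify SDK v3; the storage name and sample values are arbitrary:

```js
import { Actor } from 'apify';

await Actor.init();

// Default (unnamed) storages: cleaned up after the platform's data retention period.
await Actor.pushData({ example: 'goes to the default dataset' });
await Actor.setValue('SOME-KEY', { example: 'goes to the default key-value store' });

// Named storages: kept until you delete them, which suits long-term data.
const dataset = await Actor.openDataset('my-named-dataset');
await dataset.pushData({ example: 'goes to the named dataset' });

await Actor.exit();
```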

0 commit comments
