@@ -30,7 +30,7 @@ You can use one of the two main ways to programmatically interact with the Apify

## Our task

In the previous lesson, we created a **task** for the Amazon Actor we built in the first two lessons of this course. Now, we'll be creating another new Actor, which will have two jobs:
We'll be creating another new Actor, which will have two jobs:

1. Programmatically call the task for the Amazon Actor.
2. Export its results into CSV format under a new key called **OUTPUT.csv** in the default key-value store.
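
Here's a rough sketch of how those two jobs might fit together with the Apify SDK and client (`YOUR_TASK_ID` is a placeholder for the task created earlier, and error handling is omitted):

```js
// main.js of the new Actor - a minimal sketch
import { Actor } from 'apify';

await Actor.init();

// 1. Call the task and wait for its run to finish
const run = await Actor.callTask('YOUR_TASK_ID');

// 2. Download the run's dataset items in CSV format...
const client = Actor.newClient();
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');

// ...and store them under the OUTPUT.csv key in the default key-value store
await Actor.setValue('OUTPUT.csv', csv, { contentType: 'text/csv' });

await Actor.exit();
```
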
@@ -28,9 +28,7 @@ Before moving on, give these valuable resources a quick lookover:

1. Why might you want to store statistics about an Actor's run (or a specific request)?
2. In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it?
3. We are building a new imaginary scraper for a website that sometimes displays captchas at unexpected times, rather than displaying the content we want. How would you keep a count of the total number of captchas hit for the entire run? Where would you store this data? Why?
4. Is storing these types of values necessary for every single Actor?
5. What is the difference between the `failedRequestHandler` and `errorHandler`?
3. What is the difference between the `failedRequestHandler` and `errorHandler`?

## Our task

@@ -154,14 +154,6 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => {

**A:** This information is available directly on the request object under the property **retryCount**.
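
As a sketch, it might look something like this inside the handler that pushes to the dataset (the other fields are illustrative):

```js
import { Actor } from 'apify';

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    await Actor.pushData({
        // ...the scraped fields...
        // The number of retries lives right on the request object
        retries: request.retryCount,
    });
});
```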

**Q: We are building a new imaginary scraper for a website that sometimes displays captchas at unexpected times, rather than displaying the content we want. How would you keep a count of the total number of captchas hit for the entire run? Where would you store this data? Why?**

**A:** First, build a function that detects whether a captcha has been hit. If so, it should throw an error and increment a **numberOfCaptchas** count. This data might be stored in a persisted state object to help better assess which anti-scraping mitigation techniques the scraper should use.
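
As a sketch, one possible approach with a recent version of the Apify SDK (the selector, key, and helper names here are hypothetical):

```js
import { Actor } from 'apify';

// A state object persisted to the key-value store, so the
// count survives restarts and migrations
const state = await Actor.useState('STATS', { numberOfCaptchas: 0 });

// Hypothetical helper: detect a captcha page, record the hit,
// and throw so the request gets retried
const throwOnCaptcha = ($) => {
    if ($('#captcha-form').length > 0) {
        state.numberOfCaptchas++;
        throw new Error('Hit a captcha! Retrying...');
    }
};
```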

**Q: Is storing these types of values necessary for every single Actor?**

**A:** For small Actors, it might be a waste of time to do this. For large-scale Actors, it can be extremely helpful when debugging, and is most definitely worth the extra 10–20 minutes of development time. Usually, though, the default statistics from Crawlee and the SDK are enough for simple run stats.

**Q: What is the difference between the `failedRequestHandler` and `errorHandler`?**

**A:** `failedRequestHandler` runs after a request has failed and reached its `maxRetries` count. `errorHandler` runs on every failure and retry.
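
To illustrate, here's a minimal sketch of where the two handlers plug into a crawler (the handler bodies are illustrative):

```js
import { CheerioCrawler, log } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    // Runs on every failure, before each retry
    errorHandler: async ({ request }, error) => {
        log.warning(`Request for ${request.url} failed: ${error.message}`);
    },
    // Runs once, only after maxRequestRetries has been exhausted
    failedRequestHandler: async ({ request }, error) => {
        log.error(`Request for ${request.url} failed permanently: ${error.message}`);
    },
    requestHandler: async ({ $ }) => { /* ... */ },
});
```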
@@ -1,251 +1,12 @@
---
title: III - Using storage & creating tasks
description: Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.
description: Get quiz answers and explanations for the lesson about using storage and creating tasks on the Apify platform.
sidebar_position: 3
slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
---

# Using storage & creating tasks {#using-storage-creating-tasks}

**Follow along with step-by-step instructions on how to complete the task outlined in the previous lesson. Use different storage types, and create a task.**

---

Last lesson, our task was outlined for us. In this lesson, we'll be completing that task by making our Amazon Actor push to a **named dataset** and use the **default key-value store** to store the cheapest item found by the scraper. Finally, we'll create a task for the Actor back on the Apify platform.

## Using a named dataset {#using-named-dataset}

Something important to understand is that, in the Apify SDK, when you use `Actor.pushData()`, the data will always be pushed to the default dataset. To open up a named dataset, we'll use the `Actor.openDataset()` function:

```js
// main.js
// ...

await Actor.init();

const { keyword } = await Actor.getInput();

// Open a dataset with a custom name based on the
// keyword the user provided (replace all whitespace,
// as dataset names can't contain spaces)
const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(/\s+/g, '-')}`);
// ...
```

If we remember correctly, we are pushing data to the dataset in the `labels.OFFERS` handler we created in **routes.js**. Let's export the `dataset` variable pointing to our named dataset so we can import it in **routes.js** and use it in the handler:

```js
export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(/\s+/g, '-')}`);
```

Finally, let's modify the function to use the new `dataset` variable rather than the `Actor` class:

```js
// Import the dataset pointer
import { dataset } from './main.js';

// ...

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        // Replace "Actor" with "dataset"
        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```

That's it! Now, our Actor will push its data to a dataset named **amazon-offers-KEYWORD**!

## Using a key-value store {#using-key-value-store}

We now want to store the cheapest item in the default key-value store under a key named **CHEAPEST-ITEM**. The most practical way of doing this is to grab all of the named dataset's items, reduce them down to the single cheapest one, and push that to the store.

Let's add the following code to the bottom of the Actor after **Crawl finished** is logged to the console:

```js
// ...

// Grab hold of all the items in our named dataset
const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    // If there is no previous offer price, or the previous is more
    // expensive, set the cheapest to our current item
    if (!prev?.offer || +prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    // Otherwise, keep our previous item
    return prev;
});

// Set the "CHEAPEST-ITEM" key in the key-value store to be the
// newly discovered cheapest item
await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```

> If you start receiving a linting error after adding the code above to your **main.js** file, add `"parserOptions": { "ecmaVersion": "latest" }` to the **.eslintrc** file in the root directory of your project.
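
For reference, a minimal sketch of that **.eslintrc** addition (your existing config will likely contain other fields; only the `parserOptions` entry is new):

```json
{
    "parserOptions": { "ecmaVersion": "latest" }
}
```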

You might have noticed that we are using a variable instead of a string for the key name in the key-value store. This is because we're using an exported variable from **constants.js** (which is best practice, as discussed in the [**modularity**](../../../webscraping/scraping_basics_javascript/challenge/modularity.md) lesson back in the **Web scraping for beginners** course). Here is what our **constants.js** file looks like:

```js
// constants.js
export const BASE_URL = 'https://www.amazon.com';

export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;

export const labels = {
    START: 'START',
    PRODUCT: 'PRODUCT',
    OFFERS: 'OFFERS',
};

export const CHEAPEST_ITEM = 'CHEAPEST-ITEM';
```

## Code check-in {#code-check-in}

Here is what the **main.js** file looks like now:

```js
// main.js
import { Actor } from 'apify';
import { CheerioCrawler, log } from '@crawlee/cheerio';

import { router } from './routes.js';
import { BASE_URL, CHEAPEST_ITEM } from './constants.js';

await Actor.init();

const { keyword } = await Actor.getInput();

export const dataset = await Actor.openDataset(`amazon-offers-${keyword.replace(/\s+/g, '-')}`);

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    maxConcurrency: 50,
    requestHandler: router,
});

await crawler.addRequests([
    {
        url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
        label: 'START',
        userData: {
            keyword,
        },
    },
]);

log.info('Starting the crawl.');
await crawler.run();
log.info('Crawl finished.');

const { items } = await dataset.getData();

const cheapest = items.reduce((prev, curr) => {
    if (!prev?.offer) return curr;
    if (+prev.offer.slice(1) > +curr.offer.slice(1)) return curr;
    return prev;
});

await Actor.setValue(CHEAPEST_ITEM, cheapest);

await Actor.exit();
```

And here is **routes.js**:

```js
// routes.js
import { createCheerioRouter } from '@crawlee/cheerio';
import { dataset } from './main.js';
import { BASE_URL, OFFERS_URL, labels } from './constants.js';

export const router = createCheerioRouter();

router.addHandler(labels.START, async ({ $, crawler, request }) => {
    const { keyword } = request.userData;

    const products = $('div > div[data-asin]:not([data-asin=""])');

    for (const product of products) {
        const element = $(product);
        const titleElement = $(element.find('.a-text-normal[href]'));

        const url = `${BASE_URL}${titleElement.attr('href')}`;

        await crawler.addRequests([{
            url,
            label: labels.PRODUCT,
            userData: {
                data: {
                    title: titleElement.first().text().trim(),
                    asin: element.attr('data-asin'),
                    itemUrl: url,
                    keyword,
                },
            },
        }]);
    }
});

router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
    const { data } = request.userData;

    const element = $('div#productDescription');

    await crawler.addRequests([{
        url: OFFERS_URL(data.asin),
        label: labels.OFFERS,
        userData: {
            data: {
                ...data,
                description: element.text().trim(),
            },
        },
    }]);
});

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    for (const offer of $('#aod-offer')) {
        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```

Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform!

## Creating a task {#creating-task}

Back on the platform, on your Actor's page, you'll see a button in the top right-hand corner that says **Create new task**:

![Create new task button](./images/create-new-task.jpg)

Then, configure the task to use **google pixel** as a keyword and click **Save**.

> You can also add a custom name and description for the task in the **Settings** tab!

![Creating a task](./images/creating-task.png)

After saving it, you'll be able to see the newly created task in the **Tasks** tab on the Apify Console. Go ahead and run it. Did it work?

## Quiz answers 📝 {#quiz-answers}

**Q: What is the relationship between Actors and tasks?**
@@ -34,14 +34,6 @@ Storage allows us to save persistent data for further processing. As you'll lear
2. What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage?
3. What is data retention, and how does it work for all types of storages (default and named)?

## Our task {#our-task}

Once again, we'll be building on our main Amazon-scraping Actor in this activity, but don't worry; this lesson will be quite light, just like the last one.

We have decided that we want to retain the data scraped by the Actor for a long period of time, so instead of pushing to the default dataset, we will be pushing to a named dataset. Additionally, we want to save the absolute cheapest item found by the scraper into the default key-value store under a key named **CHEAPEST-ITEM**.

Finally, we'll create a task for the Actor that saves the configuration with the **keyword** set to **google pixel**.

[**Solution**](./solutions/using_storage_creating_tasks.md)

## Next up {#next}