
Commit 9c702c9

Merge branch 'revamp-beginners-course' of https://github.com/apify/apify-docs into revamp-beginners-course
2 parents: b33401e + 14b6080

3 files changed: +14 −22 lines

content/academy/web_scraping_for_beginners/challenge.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ paths:
 
 Before moving onto the other courses in the academy, we recommend following along with this section, as it combines everything you've learned in the previous lessons into one cohesive project that helps you prove to yourself that you've thoroughly understood the material.
 
-We recommended that you make sure you've gone through both the [data collection]({{@link web_scraping_for_beginners/data_collection.md}}) [crawing]({{@link web_scraping_for_beginners/crawling.md}}) sections of this course to ensure the smoothest development process.
+We recommend that you make sure you've gone through both the [data collection]({{@link web_scraping_for_beginners/data_collection.md}}) and [crawling]({{@link web_scraping_for_beginners/crawling.md}}) sections of this course to ensure the smoothest development process.
 
 ## [](#learning) Learning 🧠
 

@@ -29,7 +29,7 @@ On Amazon, we can use this link to get to the results page of any product we wan
 https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=KEYWORD
 ```
 
-Our actor's input will look like this:
+Our crawler's input will look like this:
 
 ```JSON
 {

content/academy/web_scraping_for_beginners/challenge/initializing_and_setting_up.md

Lines changed: 12 additions & 19 deletions
@@ -11,43 +11,36 @@ paths:
 The Crawlee CLI makes it extremely easy for us to set up a project in Crawlee and hit the ground running. Navigate to the directory you'd like your project's folder to live, then open up a terminal instance and run the following command:
 
 ```shell
-npx crawlee create demo-actor
-```
-
-> You don't have to call it **demo-actor**, but that's what we'll be calling it in this tutorial.
+npx crawlee create amazon-crawler
 
 Once you run this command, you'll get prompted into a menu which you can navigate using your arrow keys. Each of these options will generate different boilerplate code when selected. We're going to work with CheerioCrawler today, so we'll select the **CheerioCrawler template project** template, then press **Enter**.
 
 ![Crawlee CLI "create" command]({{@asset web_scraping_for_beginners/challenge/images/crawlee-create.webp}})
 
-Once it's completed, open up the **demo-actor** folder that was generated by the `npx crawlee create` command. We're going to modify the **main.js** boilerplate to fit our needs:
+Once it's completed, open up the **amazon-crawler** folder that was generated by the `npx crawlee create` command. We're going to modify the **main.js** boilerplate to fit our needs:
 
 ```JavaScript
 // main.js
 import { CheerioCrawler, KeyValueStore, log } from 'crawlee';
 import { router } from './routes.js';
 
 // Grab our keyword from the input
-const { keyword = 'iphone' } = (await KeyValueStore.getInput()) ?? {};
+const { keyword } = await KeyValueStore.getInput();
 
 const crawler = new CheerioCrawler({
     requestHandler: router,
 });
 
-// Add our initial requests
-await crawler.addRequests([
-    {
-        // Turn the inputted keyword into a link we can make a request with
-        url: `https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
-        label: 'START',
-        userData: {
-            keyword,
-        },
-    },
-]);
 
 log.info('Starting the crawl.');
-await crawler.run();
+await crawler.run([{
+    // Turn the keyword into a link we can make a request with
+    url: `https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
+    label: 'START',
+    userData: {
+        keyword,
+    },
+}]);
 log.info('Crawl finished.');
 ```
 
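The **main.js** above imports `router` from **./routes.js**, but this commit only touches the tail end of that file (shown in the next hunk). For orientation, here is a minimal sketch of what a Crawlee CheerioCrawler router file generally looks like; the handler bodies are illustrative assumptions, not code from this commit:

```JavaScript
// routes.js — illustrative sketch only, not part of this commit
import { createCheerioRouter } from 'crawlee';

export const router = createCheerioRouter();

// Handles any request without a matching labelled handler
router.addDefaultHandler(async ({ log }) => {
    log.info('Default handler reached.');
});

// Handles requests enqueued with label: 'START' (as in main.js above)
router.addHandler('START', async ({ $, log }) => {
    log.info(`Search results page title: ${$('title').text()}`);
});
```

Note that the change in **main.js** passes the initial request directly to `crawler.run()` instead of calling `crawler.addRequests()` first; Crawlee supports both approaches, the former just being more compact.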
@@ -62,7 +55,7 @@ router.addDefaultHandler(({ log }) => {
 });
 ```
 
-Finally, we'll modify our input file in **storage/key_value_stores/default/INPUT.json** to look like this:
+Finally, we'll add the following input file to **INPUT.json** in the project's root directory (next to `package.json`, `node_modules` and others)
 
 ```JSON
 {
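The hunk cuts off before the JSON contents are shown. Judging by the `keyword` field that **main.js** reads from the input (and the `'iphone'` default that was removed above), the input file presumably contains something along these lines; the exact value here is an assumption:

```JSON
{
    "keyword": "iphone"
}
```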

content/academy/web_scraping_for_beginners/challenge/scraping_amazon.md

Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
 });
 ```
 
-> If you are sometimes getting an error along the lines of **RequestError: Proxy responded with 407**, don't worry, this is totally normal. The request will retry and succeed.
 
 Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman]({{@link tools/proxyman.md}}) to analyze requests which we might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
 