Commit b33401e

feat(beginners-academy): resolve Ondra's comments
1 parent 19cbd51 commit b33401e

2 files changed: 11 additions, 21 deletions


content/academy/web_scraping_for_beginners/challenge.md

Lines changed: 4 additions & 4 deletions
@@ -16,10 +16,10 @@ We recommended that you make sure you've gone through both the [data collection]
 
 Before continuing, it is highly recommended to do the following:
 
-- Look over [how to build a crawler in Crawlee](https://crawlee.dev/docs/introduction/first-crawler) and ideally **code along**
-- Read [this short article](https://help.apify.com/en/articles/1829103-request-labels-and-how-to-pass-data-to-other-requests) about **request labels** and [`userData`](https://crawlee.dev/api/core/class/Request#userData) (this will be extremely useful later on)
-- Check out [this article](https://blog.apify.com/what-is-a-dynamic-page/) about dynamic pages
-- Read about the [RequestList](https://crawlee.dev/api/core/class/RequestList) and [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue)
+- Look over [how to build a crawler in Crawlee](https://crawlee.dev/docs/introduction/first-crawler) and ideally **code along**.
+- Read [this short article](https://help.apify.com/en/articles/1829103-request-labels-and-how-to-pass-data-to-other-requests) about [**request labels**](https://crawlee.dev/api/core/class/Request#label) (this will be extremely useful later on).
+- Check out [this lesson]({{@link tutorials/dealing_with_dynamic_pages.md}}) about dynamic pages.
+- Read about the [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue).
 
 ## [](#our-task) Our task
 
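For readers skimming the updated reading list above: a minimal, self-contained sketch of how request labels and `userData` work together in Crawlee. The URLs, the `PRODUCT` label, and the scraped field below are illustrative placeholders, not part of the Academy code.

```JavaScript
import { CheerioCrawler, createCheerioRouter } from '@crawlee/cheerio';

const router = createCheerioRouter();

// Runs for requests that have no label, e.g. the start URL.
router.addDefaultHandler(async ({ $, crawler }) => {
    // Enqueue a follow-up request, tagging it with a label and attaching data to it.
    await crawler.addRequests([{
        url: 'https://example.com/some-product',
        label: 'PRODUCT',
        userData: { data: { title: $('title').text() } },
    }]);
});

// Runs only for requests enqueued with the 'PRODUCT' label.
router.addHandler('PRODUCT', async ({ request, log }) => {
    // The userData attached above travels with the request.
    log.info(`Scraping ${request.url}`, request.userData.data);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);
```

Requests added via `crawler.addRequests()` end up in the crawler's default RequestQueue, which is the class the last bullet point links to.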

content/academy/web_scraping_for_beginners/challenge/scraping_amazon.md

Lines changed: 7 additions & 17 deletions
@@ -28,12 +28,14 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
 
 > If you are sometimes getting an error along the lines of **RequestError: Proxy responded with 407**, don't worry, this is totally normal. The request will retry and succeed.
 
-Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman]({{@link tools/proxyman.md}}) to analyze requests which we can't see inside the network tab, then we'll click the button on the product page that loads up all of the product offers:
+Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman]({{@link tools/proxyman.md}}) to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
 
 ![View offers button]({{@asset web_scraping_for_beginners/challenge/images/view-offers-button.webp}})
 
 After clicking this button and checking back in Proxyman, we discovered this link:
 
+> You can find the request below in the network tab just fine, but with Proxyman, it is much easier and faster due to the extended filtering options.
+
 ```text
 https://www.amazon.com/gp/aod/ajax/ref=auto_load_aod?asin=B07ZPKBL9V&pc=dp
 ```
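As a quick sanity check outside the crawler, this discovered URL can be fetched directly and parsed with cheerio to confirm the offers are present in the static HTML. A rough sketch assuming Node 18+ (global `fetch`); the `#aod-offer` selector is a guess at Amazon's offer container, and Amazon may refuse plain requests without realistic headers or a proxy:

```JavaScript
import { load } from 'cheerio';

const asin = 'B07ZPKBL9V'; // the example ASIN from the link above

// Fetch the AJAX endpoint that the "View offers" button calls behind the scenes.
const response = await fetch(`https://www.amazon.com/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`);
const $ = load(await response.text());

// If offer blocks are already in the raw HTML, a plain CheerioCrawler is enough —
// no headless browser is needed to render them.
console.log('Offers found in static HTML:', $('#aod-offer').length);
```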
@@ -46,17 +48,7 @@ Here's what this page looks like:
 
 Wow, that's ugly. But for our scenario, this is really great. When we click the **View offers** button, we usually have to wait for the offers to load and render, which would mean we could have to switch our entire crawler to a **PuppeteerCrawler** or **PlaywrightCrawler**. The data on this page we've just found appears to be loaded statically, which means we can still use CheerioCrawler and keep the scraper as efficient as possible 😎
 
-First, we'll create a function which can generate an offers URL for us in **constants.js**:
-
-```JavaScript
-// constants.js
-
-// ...
-
-export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;
-```
-
-Then, we'll import and use that function to create a request for each product's offers page:
+First, we'll create a request for each product's offers page:
 
 ```JavaScript
 // routes.js
@@ -70,7 +62,7 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
 
     // Add to the request queue
    await crawler.addRequests([{
-        url: OFFERS_URL(data.asin),
+        url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
         label: labels.OFFERS,
         userData: {
             data: {
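Since the diff only shows fragments of the **PRODUCT** handler, here is an illustrative sketch of how it fits together after the inlining. The `description` field and its selector are placeholders; only the imports, the inline URL, and the enqueueing pattern come from this commit.

```JavaScript
// routes.js (sketch, not the Academy's exact code)
import { createCheerioRouter } from '@crawlee/cheerio';
import { BASE_URL, labels } from './constants';

export const router = createCheerioRouter();

router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
    // Data collected on earlier pages travels along in userData.
    const { data } = request.userData;

    // Add to the request queue, building the offers URL inline
    // (this replaces the removed OFFERS_URL helper).
    await crawler.addRequests([{
        url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
        label: labels.OFFERS,
        userData: {
            data: {
                ...data,
                description: $('#productDescription').text().trim(), // placeholder field
            },
        },
    }]);
});
```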
@@ -111,8 +103,6 @@ That should be it! Let's just make sure we've all got the same code:
 // constants.js
 export const BASE_URL = 'https://www.amazon.com';
 
-export const OFFERS_URL = (asin) => `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${asin}&pc=dp`;
-
 export const labels = {
     START: 'START',
     PRODUCT: 'PRODUCT',
@@ -124,7 +114,7 @@ export const labels = {
 // routes.js
 import { Actor } from 'apify';
 import { createCheerioRouter } from '@crawlee/cheerio';
-import { BASE_URL, OFFERS_URL, labels } from './constants';
+import { BASE_URL, labels } from './constants';
 
 const router = createCheerioRouter();
 
@@ -163,7 +153,7 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
 
     await crawler.addRequests([
         {
-            url: OFFERS_URL(data.asin),
+            url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
             label: labels.OFFERS,
             userData: {
                 data: {
