content/academy/web_scraping_for_beginners/challenge.md (4 additions & 4 deletions)
@@ -16,10 +16,10 @@ We recommended that you make sure you've gone through both the [data collection]
Before continuing, it is highly recommended to do the following:
-- Look over [how to build a crawler in Crawlee](https://crawlee.dev/docs/introduction/first-crawler) and ideally **code along**
-- Read [this short article](https://help.apify.com/en/articles/1829103-request-labels-and-how-to-pass-data-to-other-requests) about **request labels** and [`userData`](https://crawlee.dev/api/core/class/Request#userData) (this will be extremely useful later on)
-- Check out [this article](https://blog.apify.com/what-is-a-dynamic-page/) about dynamic pages
-- Read about the [RequestList](https://crawlee.dev/api/core/class/RequestList) and [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue)
+- Look over [how to build a crawler in Crawlee](https://crawlee.dev/docs/introduction/first-crawler) and ideally **code along**.
+- Read [this short article](https://help.apify.com/en/articles/1829103-request-labels-and-how-to-pass-data-to-other-requests) about [**request labels**](https://crawlee.dev/api/core/class/Request#label) (this will be extremely useful later on).
+- Check out [this lesson]({{@link tutorials/dealing_with_dynamic_pages.md}}) about dynamic pages.
+- Read about the [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue).
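Since request labels and `userData` do most of the routing work in this challenge, here is a minimal sketch of how the two fit together in a Crawlee crawler. Every URL, label name, and selector below is a hypothetical placeholder for illustration, not something taken from the lesson:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // request.label is a shortcut for request.userData.label
        if (request.label === 'PRODUCT') {
            // Collect some data on the product page...
            const title = $('h1').text().trim();
            // ...and pass it along to the next request via userData.
            await crawler.addRequests([{
                url: `${request.url}/offers`, // hypothetical offers URL
                label: 'OFFERS',
                userData: { data: { title } },
            }]);
        } else if (request.label === 'OFFERS') {
            // The data collected earlier travels with the request.
            const { data } = request.userData;
            console.log(data.title, $('.offer').length);
        }
    },
});

await crawler.run([{ url: 'https://example.com/product/123', label: 'PRODUCT' }]);
```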
> If you are sometimes getting an error along the lines of **RequestError: Proxy responded with 407**, don't worry, this is totally normal. The request will retry and succeed.
-Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman]({{@link tools/proxyman.md}}) to analyze requests which we can't see inside the network tab, then we'll click the button on the product page that loads up all of the product offers:
+Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman]({{@link tools/proxyman.md}}) to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:
@@ -46,17 +48,7 @@ Here's what this page looks like:
Wow, that's ugly. But for our scenario, this is really great. When we click the **View offers** button, we usually have to wait for the offers to load and render, which would mean we might have to switch our entire crawler to a **PuppeteerCrawler** or **PlaywrightCrawler**. The data on this page we've just found appears to be loaded statically, which means we can still use **CheerioCrawler** and keep the scraper as efficient as possible 😎
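One quick way to verify that "loaded statically" assumption before committing to **CheerioCrawler** is to fetch the page over plain HTTP, with no browser and no JavaScript execution, and check whether the offer elements are already in the raw HTML. A sketch using `got-scraping` and `cheerio` (both standard in Crawlee projects); the URL and selector are placeholders:

```js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Plain HTTP request: no rendering happens, so whatever we find
// here must have been served statically by the server.
const { body } = await gotScraping('https://example.com/product/123/offers');
const $ = cheerio.load(body);

// A non-zero count means the offers are in the static payload.
console.log(`Offers found in raw HTML: ${$('.offer-item').length}`);
```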
-First, we'll create a function which can generate an offers URL for us in **constants.js**:
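The function itself is not shown in this diff, so here is a minimal sketch of what such a **constants.js** might look like, assuming the offers endpoint discovered in Proxyman takes the product identifier as a query parameter. The base URL and query shape are placeholders, not the lesson's actual endpoint:

```js
// constants.js — a sketch only; the real URL pattern comes from the
// Proxyman analysis above, so treat this shape as a placeholder.
export const BASE_URL = 'https://example.com';

// Generates the offers URL for a given product identifier.
export const OFFERS_URL = (productId) => `${BASE_URL}/offers?product=${productId}`;
```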