**content/academy/concepts.md** (1 addition & 1 deletion)

paths:
- concepts
---

# [](#concepts) Concepts 🤔

There are some terms and concepts you'll see frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental, in the scraping world, which makes it necessary to explain them to our course-takers; however, it would be inconvenient for our readers if we explained these terms each time they appear in a lesson.

### [](#jquery-or-cheerio) jQuery or Cheerio

We'll be using the [`cheerio`](https://www.npmjs.com/package/cheerio) package a whole lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.

[Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough.

### [](#crawlee-apify-sdk-and-cli) Crawlee, Apify SDK, and the Apify CLI

If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) in the **Web scraping for beginners** course (and ideally follow along). To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform]({{@link apify_platform.md}}) course.

The Apify CLI will play a core role in running and testing the actor you will build, so if you haven't installed it already, please refer to [this short lesson]({{@link tools/apify_cli.md}}).

<!-- todo: remove all requirements up to this point -->
### [](#git) Git
### [](#actor-basics) The basics of actors
Part of this course involves learning about actors in more depth; however, some basic knowledge is already assumed. If you haven't yet gone through the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward.

**content/academy/tools.md** (1 addition & 1 deletion)

paths:
- tools
---

# [](#tools) Tools 🔧

Here at Apify, we've found many tools, some quite popular and well-known and some niche, that can aid any developer in their scraper development process. We've compiled some of our favorite developer tools into this short section. Each tool featured here serves a specific purpose, if not multiple purposes, directly relevant to web scraping and web automation.

**content/academy/tutorials/dealing_with_dynamic_pages.md** (10 additions & 20 deletions)

---
title: Dealing with dynamic pages
description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
---

# [](#dealing-with-dynamic-pages) Dealing with dynamic pages
<!--In the last few lessons, we learned about Crawlee, which is a powerful library for writing reliable and efficient scrapers. We recommend reading up on those last two lessons in order to install the `crawlee` package and familiarize yourself with it before moving forward with this lesson.-->
In this lesson, we'll be discussing dynamic content and how to scrape it while utilizing Crawlee.
## [](#quick-experiment) A quick experiment
From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. Easy enough! We did something very similar in the previous modules.

First, create a file called **dynamic.js** and copy-paste the following boilerplate code into it:
```JavaScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // ...
});
```
> Here, we are using the [`Array.prototype.map()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) function to loop through all of the product elements and save them into an array we call `results` all at the same time.
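The same collect-everything-at-once pattern can be sketched in plain JavaScript, with invented product data standing in for the Cheerio elements the lesson maps over:

```javascript
// Hypothetical stand-ins for the product elements Cheerio would return;
// the titles, prices, and paths below are made up for illustration.
const productElements = [
    { title: 'Denim Jacket', price: '$88.00', img: '/images/denim.webp' },
    { title: 'Canvas Tote', price: '$24.00', img: '/images/tote.webp' },
];

// One pass over all elements, building every result object at the same time.
const results = productElements.map((elem) => ({
    title: elem.title,
    price: elem.price,
    image: elem.img,
}));

console.log(results);
```

The shape of `results` mirrors what the crawler logs: one object per product, built in a single `map()` pass.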
After running it, you might say, "Great! It works!" **But wait...** What are those results being logged to console?

81
+

Every single image seems to have the same exact "URL," but they are most definitely not the image URLs we are looking for. This is strange, because in the browser, we were getting URLs that looked like this:
Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling.
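Stripped of any browser specifics, the "keep scrolling until nothing new loads" logic boils down to a simple loop. Here is a minimal sketch; both callback names are invented for illustration, and in a real Puppeteer or Playwright scraper they would be async page operations:

```javascript
// Generic "load incrementally until stable" loop. The callbacks are
// hypothetical stand-ins: in a real browser scraper, getItemCount would
// count product elements on the page and triggerLoad would scroll down.
function loadAllItems(getItemCount, triggerLoad) {
    let previousCount = -1;
    let count = getItemCount();
    while (count > previousCount) {
        triggerLoad(); // e.g. scroll to the bottom so more items lazy-load
        previousCount = count;
        count = getItemCount();
    }
    return count;
}

// Mock "page" that reveals 3 more items per load action, 9 in total.
let visibleItems = 3;
const finalCount = loadAllItems(
    () => visibleItems,
    () => { visibleItems = Math.min(9, visibleItems + 3); },
);
```

The loop only stops when a load action produces no new items, which is exactly the termination condition a lazy-loaded listing needs.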
## [](#small-recap) Small Recap
Making static HTTP requests only downloads the HTML content available at the `DOMContentLoaded` event. We must use a browser to allow dynamic code to load, or find a different means of scraping the data altogether (see [API Scraping]({{@link api_scraping.md}})).
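To make the recap concrete, here is a toy illustration with invented markup (the `data-src` attribute and URL are assumptions for the sketch, not taken from Fakestore): the HTML delivered over plain HTTP carries only a placeholder `src`, while the real URL sits in a data attribute that the page's own JavaScript would apply once it runs in a browser.

```javascript
// Invented example markup: what a static HTTP request actually receives.
const downloadedHtml = '<img class="product" src="data:image/gif;base64,R0lGOD" data-src="https://example.com/images/jacket.webp">';

// A plain HTTP scraper only ever sees the placeholder in src...
const staticSrc = downloadedHtml.match(/ src="([^"]+)"/)[1];

// ...while a browser would eventually display the data-src URL after
// the page's lazy-loading JavaScript copies it into src.
const realSrc = downloadedHtml.match(/data-src="([^"]+)"/)[1];
```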
## [](#wrap-up) Wrap up 💥
And this is it for the [Basics of crawling section]({{@link web_scraping_for_beginners/crawling.md}}) of the [Web scraping for beginners]({{@link web_scraping_for_beginners.md}}) course. For now, this is also the last section of the **Web scraping for beginners** course. If you want to learn more about web scraping, we recommend venturing out and following the other lessons in the Academy. We will keep updating the Academy with more content regularly until we cover all the advanced and expert topics we promised at the beginning.