Commit a8d6145

academy: create tutorials group
1 parent e28fdcc commit a8d6145

14 files changed: +41 −58 lines changed

content/academy/concepts.md

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ paths:
 - concepts
 ---

-# [](#concepts) Concepts
+# [](#concepts) Concepts 🤔

 There are some terms and concepts you'll see frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson.

content/academy/expert_scraping_with_apify.md

Lines changed: 9 additions & 7 deletions

@@ -33,19 +33,21 @@ Throughout the next lessons, we will sometimes use certain technologies and term
 - [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)
 - [DevTools]({{@link web_scraping_for_beginners/data_collection/browser_devtools.md}})

-### [](#crawlee-apify-sdk-and-cli) Crawlee, Apify SDK, and the Apify CLI
-
-If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) in the **Web scraping for beginners** course (and ideally follow along). To familiarize with the Apify SDK,you can refer to the [Apify Platform]({{@link apify_platform.md}}) course.
+### [](#jquery-or-cheerio) jQuery or Cheerio

-The Apify CLI will play a core role in the running and testing of the actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson]({{@link tools/apify_cli.md}}).
+We'll be using the [`cheerio`](https://www.npmjs.com/package/cheerio) package a whole lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.

 ### [](#puppeteer-playwright) Puppeteer/Playwright

 [Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough.

-### [](#jquery-or-cheerio) jQuery or Cheerio
+### [](#crawlee-apify-sdk-and-cli) Crawlee, Apify SDK, and the Apify CLI

-We'll be using the [`cheerio`](https://www.npmjs.com/package/cheerio) package a whole lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.
+If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) in the **Web scraping for beginners** course (and ideally follow along). To familiarize with the Apify SDK,you can refer to the [Apify Platform]({{@link apify_platform.md}}) course.
+
+The Apify CLI will play a core role in the running and testing of the actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson]({{@link tools/apify_cli.md}}).
+
+<!-- todo: remove all requirements up to this point -->

 ### [](#git) Git

@@ -57,7 +59,7 @@ Docker is a massive topic on its own, but don't be worried! We only expect you t

 ### [](#actor-basics) The basics of actors

-Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet read the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to give it a glance before moving forward.
+Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet gone through the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward.

 ## [](#next) Next up

content/academy/tools.md

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ paths:
 - tools
 ---

-# [](#tools) Tools
+# [](#tools) Tools 🔧

 Here at Apify, we've found many tools, some quite popular and well-known and some niche, which can aid any developer in their scraper development process. We've compiled some of our favorite developer tools into to this short section. Each tool featured here serves a specific purpose, if not multiple purposes, which are directly relevant to web-scraping and web-automation.

content/academy/tutorials.md

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+title: Tutorials
+description: Learn
+menuWeight: 11
+category: glossary
+paths:
+- tutorials
+---
+
+# Tutorials 📚
+
+<!-- blah blah this section has tutorials -->
Lines changed: 10 additions & 20 deletions

@@ -1,24 +1,24 @@
 ---
 title: Dealing with dynamic pages
 description: Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
-menuWeight: 10
+menuWeight: 1
 paths:
-- web-scraping-for-beginners/crawling/dealing-with-dynamic-pages
+- tutorials/dealing-with-dynamic-pages
 ---

 # [](#dealing-with-dynamic-pages) Dealing with dynamic pages

-In the last few lessons, we learned about Crawlee, which is a powerful library for writing reliable and efficient scrapers. We recommend reading up on those last two lessons in order to install the `crawlee` package and familiarize yourself with it before moving forward with this lesson.
+<!-- In the last few lessons, we learned about Crawlee, which is a powerful library for writing reliable and efficient scrapers. We recommend reading up on those last two lessons in order to install the `crawlee` package and familiarize yourself with it before moving forward with this lesson. -->

 In this lesson, we'll be discussing dynamic content and how to scrape it while utilizing Crawlee.

 ## [](#quick-experiment) A quick experiment

 From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. Easy enough! We did something very similar in the previous modules.

-![New arrival products in Fakestore]({{@asset web_scraping_for_beginners/crawling/images/new-arrivals.webp}})
+![New arrival products in Fakestore]({{@asset tutorials/images/new-arrivals.webp}})

-In your project from the previous lessons, or in a new project, create a file called **dynamic.js** and copy-paste the following boiler plate code into it:
+First, create a file called **dynamic.js** and copy-paste the following boiler plate code into it:

 ```JavaScript
 import { CheerioCrawler } from 'crawlee';

@@ -71,16 +71,14 @@ const crawler = new CheerioCrawler({
 },
 });

-await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
-
-await crawler.run();
+await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
 ```

 > Here, we are using the [`Array.prototype.map()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) function to loop through all of the product elements and save them into an array we call `results` all at the same time.

 After running it, you might say, "Great! It works!" **But wait...** What are those results being logged to console?

-![Bad results in console]({{@asset web_scraping_for_beginners/crawling/images/bad-results.webp}})
+![Bad results in console]({{@asset tutorials/images/bad-results.webp}})

 Every single image seems to have the same exact "URL," but they are most definitely not the image URLs we are looking for. This is strange, because in the browser, we were getting URLs that looked like this:
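Aside from the diff itself: the `Array.prototype.map()` pattern the quoted note describes can be sketched in isolation. The product objects below are hypothetical stand-ins for the parsed HTML elements, not real Fakestore output:

```javascript
// Hypothetical stand-in data; in the lesson, these come from parsed HTML elements.
const products = [
    { title: 'Denim Jacket', price: '$24.99', image: '/images/jacket.webp' },
    { title: 'Canvas Tote', price: '$9.99', image: '/images/tote.webp' },
];

// One map() call builds the whole results array at once, instead of
// pushing into it item by item inside a for loop.
const results = products.map(({ title, price, image }) => ({
    title,
    price,
    // Resolve the relative image path against the site origin.
    image: new URL(image, 'https://demo-webstore.apify.org').href,
}));

console.log(results[0].image); // https://demo-webstore.apify.org/images/jacket.webp
```

The same shape works when mapping over elements matched by Cheerio: return one plain object per product and you get the whole dataset in a single expression.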

@@ -130,14 +128,12 @@ const crawler = new PuppeteerCrawler({
 },
 });

-await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }])
-
-await crawler.run();
+await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
 ```

 After running this one, we can see that our results look different from before. We're getting the image links!

-![Not perfect results]({{@asset web_scraping_for_beginners/crawling/images/almost-there.webp}})
+![Not perfect results]({{@asset tutorials/images/almost-there.webp}})

 Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling.
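Stepping outside the commit for a moment: the usual cure for scroll-triggered lazy-loading is to scroll the page in increments, pausing so each newly revealed batch of images can load. The sketch below is illustrative only; it takes a `window`-like object as a parameter so it can run anywhere, whereas in a real Puppeteer/Playwright run the equivalent logic executes inside the browser (for example via `page.evaluate()`):

```javascript
// Illustrative only: scroll a window-like object to the bottom in steps,
// pausing between steps so lazy-loaded images get a chance to load.
async function scrollToBottom(win, step = 250, pauseMs = 100) {
    let position = 0;
    while (position < win.document.body.scrollHeight) {
        position += step;
        win.scrollTo(0, position);
        // Give the page a moment to fetch the images that just entered view.
        await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
    return position;
}

// A minimal stub standing in for the browser's window, just to show the call.
const fakeWindow = {
    document: { body: { scrollHeight: 1000 } },
    scrollTo: (x, y) => {},
};

scrollToBottom(fakeWindow, 250, 0).then((finalPosition) => {
    console.log(finalPosition >= 1000); // true
});
```

Crawling libraries typically ship a ready-made helper for exactly this, so in practice you would reach for that rather than hand-rolling the loop.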

@@ -177,9 +173,7 @@ const crawler = new PuppeteerCrawler({
 },
 });

-await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }])
-
-await crawler.run();
+await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
 ```

 Let's run this and check our dataset results...

@@ -197,7 +191,3 @@ Each product looks like this, and each image is a valid link that can be visited
 ## [](#small-recap) Small Recap

 Making static HTTP requests only downloads the HTML content from the `DOMContentLoaded` event. We must use a browser to allow dynamic code to load, or find different means altogether of scraping the data (see [API Scraping]({{@link api_scraping.md}}))
-
-## [](#wrap-up) Wrap up 💥
-
-And this is it for the [Basics of crawling section]({{@link web_scraping_for_beginners/crawling.md}}) of the [Web scraping for beginners]({{@link web_scraping_for_beginners.md}}) course. For now, this is also the last section of the **Web scraping for beginners** course. If you want to learn more about web scraping, we recommend checking venturing out and following the other lessons in the Academy. We will keep updating the Academy with more content regularly, until we cover all the advanced and expert topics we promised at the beginning.

0 commit comments
