Commit 778cc5b

Merge branch 'master' into percy-test
2 parents 9b7fd5c + 5145913 commit 778cc5b

55 files changed, with 746 additions and 805 deletions.

content/academy/anti_scraping.md

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@ Solely based on the way how the bots operate. It comperes data-rich pages visits

 By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder, and requiring more maintenance of the bot.

-One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link web_scraping_for_beginners/data_collection/js_in_html.md}}))
+One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link tutorials/js_in_html.md}}))

 ### IP session consistency

@@ -89,6 +89,6 @@ One of the most successful and advanced methods is collecting the browser's "fin

 > It's important to note that this method also blocks all users that cannot evaluate JavaScript (such as bots sending only static HTTP requests), and combines both of the fundamental methods mentioned earlier.

-## [](#next) Next up
+## [](#first) First up

 In our [first section]({{@link anti_scraping/techniques.md}}), we'll be discussing more in-depth about the various anti-scraping methods and techniques websites use, as well as how to mitigate these protections.

content/academy/api_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -94,6 +94,6 @@ const decoded = Buffer.from(value, 'base64').toString('utf-8')
 console.log(decoded)
 ```

-## [](#next) Next up
+## [](#first) First up

 Get started with this course by learning some general knowledge about API scraping in the [General API Scraping]({{@link api_scraping/general_api_scraping.md}}) section! This section will teach you everything you need to know about scraping APIs before moving into more complex sections.

content/academy/apify_platform.md

Lines changed: 1 addition & 1 deletion
@@ -15,6 +15,6 @@ The [Apify platform](https://apify.com) was built to serve large-scale and high-

 In this course, you'll be learning how to become an Apify platform developer from the ground up. From creating your first account, to developing actors, this is your one-stop-shop for understanding how the platform works, and how to work with it.

-## [](#next) Next up
+## [](#first) First up

 We'll start off this course light, by showing you how to create an Apify account and get everything ready for development with the platform. [Let's go!]({{@link apify_platform/getting_started.md}})

content/academy/concepts.md

Lines changed: 2 additions & 2 deletions
@@ -1,13 +1,13 @@
 ---
 title: Concepts
 description: Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.
-menuWeight: 9
+menuWeight: 11
 category: glossary
 paths:
 - concepts
 ---

-# [](#concepts) Concepts
+# [](#concepts) Concepts 🤔

 There are some terms and concepts you'll see frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson.

content/academy/expert_scraping_with_apify.md

Lines changed: 5 additions & 25 deletions
@@ -17,36 +17,16 @@ Before developing a pro-level Apify scraper, there are some important things you

 > If you've already gone through the [Web scraping for beginners course]({{@link web_scraping_for_beginners.md}}) and the first lessons of the [Apify platform course]({{@link apify_platform.md}}), you will be more than well equipped to continue on with the lessons in this course.

-### [](#javascript-and-node) JavaScript + Node.js
+<!-- ### [](#puppeteer-playwright) Puppeteer/Playwright

-It is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to developing an actor on the Apify platform. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:
-
-- [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)
-- [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)
-- [Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)
-
-### [](#general-web-development) General web development
-
-Throughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).
-
-- [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)
-- [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)
-- [DevTools]({{@link web_scraping_for_beginners/data_collection/browser_devtools.md}})
+[Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough. -->

 ### [](#crawlee-apify-sdk-and-cli) Crawlee, Apify SDK, and the Apify CLI

 If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) in the **Web scraping for beginners** course (and ideally follow along). To familiarize with the Apify SDK,you can refer to the [Apify Platform]({{@link apify_platform.md}}) course.

 The Apify CLI will play a core role in the running and testing of the actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson]({{@link tools/apify_cli.md}}).

-### [](#puppeteer-playwright) Puppeteer/Playwright
-
-[Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough.
-
-### [](#jquery-or-cheerio) jQuery or Cheerio
-
-We'll be using the [`cheerio`](https://www.npmjs.com/package/cheerio) package a whole lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.
-
 ### [](#git) Git

 In one of the later lessons, we'll be learning how to integrate our actor on the Apify platform with a Github repository. For this, you'll need to understand at least the basics of [Git](https://git-scm.com/docs). Here's a [great tutorial](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners) to help you get started with Git.
@@ -57,10 +37,10 @@ Docker is a massive topic on its own, but don't be worried! We only expect you t

 ### [](#actor-basics) The basics of actors

-Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet read the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to give it a glance before moving forward.
+Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet gone through the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward.

-## [](#next) Next up
+## [](#first) First up

-[Next up]({{@link expert_scraping_with_apify/crawlee.md}}), we'll be learning in-depth about the most important tool in your actor-development toolbelt: The **Crawlee**.
+[First up]({{@link expert_scraping_with_apify/actors_webhooks.md}}), we'll be learning in-depth about integrating actors with each other using webhooks.

 > Each lesson will have a short _(and optional)_ quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers.

content/academy/expert_scraping_with_apify/actors_webhooks.md

Lines changed: 5 additions & 3 deletions
@@ -1,7 +1,7 @@
 ---
-title: II - Webhooks & advanced actor overview
+title: I - Webhooks & advanced actor overview
 description: Learn more advanced details about actors, how they work, and the default configurations they can take. Also learn how to integrate your actor with webhooks.
-menuWeight: 6.2
+menuWeight: 6.1
 paths:
 - expert-scraping-with-apify/actors-webhooks
 ---
@@ -12,7 +12,9 @@ Thus far, you've run actors on the platform and written an actor of your own, wh

 ## [](#advanced-actors) Advanced actor overview

-Take another look at the files within your project from the previous lesson. You'll notice that there is a **Dockerfile**. Every single actor has a Dockerfile (the actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the actor's code. "Apify Actors" is basically just a serverless platform that is running multiple Docker containers. For a deeper understanding of actor Dockerfiles, refer to the [Apify actor Dockerfile docs](https://sdk.apify.com/docs/guides/docker-images#example-dockerfile).
+In this course, we'll be working out of the Amazon scraper project from the **Web scraping for beginners** course. If you haven't already built that project, you can do it in three short lessons [here]({{@link web_scraping_for_beginners/challenge.md}}). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.
+
+Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single actor has a Dockerfile (the actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the actor's code. "Apify Actors" is basically just a serverless platform that is running multiple Docker containers. For a deeper understanding of actor Dockerfiles, refer to the [Apify actor Dockerfile docs](https://sdk.apify.com/docs/guides/docker-images#example-dockerfile).

 ## [](#webhooks) Webhooks

content/academy/expert_scraping_with_apify/apify_api_and_client.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
-title: V - Apify API & client
+title: IV - Apify API & client
 description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
-menuWeight: 6.5
+menuWeight: 6.4
 paths:
 - expert-scraping-with-apify/apify-api-and-client
 ---

content/academy/expert_scraping_with_apify/bypassing_anti_scraping.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
2-
title: VII - Bypassing anti-scraping methods
2+
title: VI - Bypassing anti-scraping methods
33
description: Learn about bypassing anti-bot methods with proxies and proxy/session rotation. Use Crawlee and the Apify SDK to abstract away the overheads that come with these concepts.
4-
menuWeight: 6.7
4+
menuWeight: 6.6
55
paths:
66
- expert-scraping-with-apify/bypassing-anti-scraping
77
---

content/academy/expert_scraping_with_apify/crawlee.md

Lines changed: 0 additions & 121 deletions
This file was deleted.

content/academy/expert_scraping_with_apify/managing_source_code.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 ---
-title: III - Managing source code
+title: II - Managing source code
 description: Learn how to manage your actor's source code more efficiently by integrating it with a Github repository. This is the standard on the Apify platform.
-menuWeight: 6.3
+menuWeight: 6.2
 paths:
 - expert-scraping-with-apify/managing-source-code
 ---
