File: content/academy/anti_scraping.md (2 additions & 2 deletions)
@@ -73,7 +73,7 @@ Solely based on the way how the bots operate. It comperes data-rich pages visits
By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder, and requiring more maintenance of the bot.

-One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@linkweb_scraping_for_beginners/data_collection/js_in_html.md}}))
+One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@linktutorials/js_in_html.md}}))
### IP session consistency
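
The changed paragraph in this hunk recommends reading data from JavaScript objects embedded in the HTML instead of relying on CSS selectors. A minimal sketch of that idea follows. The page markup, the `Product` shape, and the `extractJsonLd` helper are all hypothetical and only illustrate the approach; a real page may embed its data differently.

```typescript
// Hypothetical page markup: the class names look randomized, but the same
// product data is also embedded as JSON inside a <script> tag.
const html: string = `
<html>
  <body>
    <div class="x9f2a"><span class="q81zz">Example product</span></div>
    <script type="application/ld+json">
      {"name": "Example product", "offers": {"price": "19.99", "priceCurrency": "USD"}}
    </script>
  </body>
</html>`;

// Assumed shape of the embedded object (illustration only).
interface Product {
    name: string;
    offers: { price: string; priceCurrency: string };
}

// Pull the first JSON-LD block out of the HTML and parse it.
// Returns null when no such block exists or the JSON is malformed.
function extractJsonLd(source: string): Product | null {
    const match = source.match(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/);
    if (!match) return null;
    try {
        return JSON.parse(match[1]) as Product;
    } catch {
        return null;
    }
}

const product = extractJsonLd(html);
console.log(product?.name, product?.offers.price);
```

Because the lookup never touches the `x9f2a` or `q81zz` class names, a change to those selectors would not affect this extraction, although it would still break if the site removed or restructured the embedded JSON.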
@@ -89,6 +89,6 @@ One of the most successful and advanced methods is collecting the browser's "fin
> It's important to note that this method also blocks all users that cannot evaluate JavaScript (such as bots sending only static HTTP requests), and combines both of the fundamental methods mentioned earlier.

-## [](#next) Next up
+## [](#first) First up

In our [first section]({{@link anti_scraping/techniques.md}}), we'll be discussing more in-depth about the various anti-scraping methods and techniques websites use, as well as how to mitigate these protections.
Get started with this course by learning some general knowledge about API scraping in the [General API Scraping]({{@link api_scraping/general_api_scraping.md}}) section! This section will teach you everything you need to know about scraping APIs before moving into more complex sections.
File: content/academy/apify_platform.md (1 addition & 1 deletion)
@@ -15,6 +15,6 @@ The [Apify platform](https://apify.com) was built to serve large-scale and high-
In this course, you'll be learning how to become an Apify platform developer from the ground up. From creating your first account, to developing actors, this is your one-stop-shop for understanding how the platform works, and how to work with it.

-## [](#next) Next up
+## [](#first) First up

We'll start off this course light, by showing you how to create an Apify account and get everything ready for development with the platform. [Let's go!]({{@link apify_platform/getting_started.md}})
File: content/academy/concepts.md (2 additions & 2 deletions)
@@ -1,13 +1,13 @@
 ---
 title: Concepts
 description: Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.
-menuWeight: 9
+menuWeight: 11
 category: glossary
 paths:
 - concepts
 ---

-# [](#concepts) Concepts
+# [](#concepts) Concepts 🤔

There are some terms and concepts you'll see frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson.
File: content/academy/expert_scraping_with_apify.md (5 additions & 25 deletions)
@@ -17,36 +17,16 @@ Before developing a pro-level Apify scraper, there are some important things you
> If you've already gone through the [Web scraping for beginners course]({{@link web_scraping_for_beginners.md}}) and the first lessons of the [Apify platform course]({{@link apify_platform.md}}), you will be more than well equipped to continue on with the lessons in this course.
It is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to developing an actor on the Apify platform. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:
-[Modularity in Node.js](https://www.section.io/engineering-education/how-to-use-modular-patterns-in-nodejs/)
-
-### [](#general-web-development) General web development
-
-Throughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because the knowledge of them will be **assumed** (unless we're showing something out of the ordinary).
[Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough. -->
### [](#crawlee-apify-sdk-and-cli) Crawlee, Apify SDK, and the Apify CLI
If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson]({{@link web_scraping_for_beginners/crawling/pro_scraping.md}}) in the **Web scraping for beginners** course (and ideally follow along). To familiarize with the Apify SDK,you can refer to the [Apify Platform]({{@link apify_platform.md}}) course.
The Apify CLI will play a core role in the running and testing of the actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson]({{@link tools/apify_cli.md}}).
[Puppeteer](https://pptr.dev/) is a library for running and controlling a [headless browser]({{@link web_scraping_for_beginners/crawling/headless_browser.md}}) in Node.js, and was developed at Google. The team working on it was hired by Microsoft to work on the [Playwright](https://playwright.dev/) project; therefore, many parallels can be seen between both the `puppeteer` and `playwright` packages. Proficiency in at least one of these will be good enough.
-
-### [](#jquery-or-cheerio) jQuery or Cheerio
-
-We'll be using the [`cheerio`](https://www.npmjs.com/package/cheerio) package a whole lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js.
-
### [](#git) Git
In one of the later lessons, we'll be learning how to integrate our actor on the Apify platform with a Github repository. For this, you'll need to understand at least the basics of [Git](https://git-scm.com/docs). Here's a [great tutorial](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners) to help you get started with Git.
@@ -57,10 +37,10 @@ Docker is a massive topic on its own, but don't be worried! We only expect you t
### [](#actor-basics) The basics of actors

-Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet read the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to give it a glance before moving forward.
+Part of this course will be learning more in-depth about actors; however, some basic knowledge is already assumed. If you haven't yet gone through the [actors]({{@link apify_platform/getting_started/actors.md}}) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward.

-## [](#next) Next up
+## [](#first) First up

-[Next up]({{@link expert_scraping_with_apify/crawlee.md}}), we'll be learning in-depth about the most important tool in your actor-development toolbelt: The **Crawlee**.
+[First up]({{@link expert_scraping_with_apify/actors_webhooks.md}}), we'll be learning in-depth about integrating actors with each other using webhooks.

> Each lesson will have a short _(and optional)_ quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers.
File: content/academy/expert_scraping_with_apify/actors_webhooks.md (5 additions & 3 deletions)
@@ -1,7 +1,7 @@
 ---
-title: II - Webhooks & advanced actor overview
+title: I - Webhooks & advanced actor overview
 description: Learn more advanced details about actors, how they work, and the default configurations they can take. Also learn how to integrate your actor with webhooks.
-menuWeight: 6.2
+menuWeight: 6.1
 paths:
 - expert-scraping-with-apify/actors-webhooks
 ---
@@ -12,7 +12,9 @@ Thus far, you've run actors on the platform and written an actor of your own, wh
## [](#advanced-actors) Advanced actor overview

-Take another look at the files within your project from the previous lesson. You'll notice that there is a **Dockerfile**. Every single actor has a Dockerfile (the actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the actor's code. "Apify Actors" is basically just a serverless platform that is running multiple Docker containers. For a deeper understanding of actor Dockerfiles, refer to the [Apify actor Dockerfile docs](https://sdk.apify.com/docs/guides/docker-images#example-dockerfile).
+In this course, we'll be working out of the Amazon scraper project from the **Web scraping for beginners** course. If you haven't already built that project, you can do it in three short lessons [here]({{@link web_scraping_for_beginners/challenge.md}}). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.
+
+Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single actor has a Dockerfile (the actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the actor's code. "Apify Actors" is basically just a serverless platform that is running multiple Docker containers. For a deeper understanding of actor Dockerfiles, refer to the [Apify actor Dockerfile docs](https://sdk.apify.com/docs/guides/docker-images#example-dockerfile).
File: content/academy/expert_scraping_with_apify/apify_api_and_client.md (2 additions & 2 deletions)
@@ -1,7 +1,7 @@
 ---
-title: V - Apify API & client
+title: IV - Apify API & client
 description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
File: content/academy/expert_scraping_with_apify/bypassing_anti_scraping.md (2 additions & 2 deletions)
@@ -1,7 +1,7 @@
 ---
-title: VII - Bypassing anti-scraping methods
+title: VI - Bypassing anti-scraping methods
 description: Learn about bypassing anti-bot methods with proxies and proxy/session rotation. Use Crawlee and the Apify SDK to abstract away the overheads that come with these concepts.
File: content/academy/expert_scraping_with_apify/managing_source_code.md (2 additions & 2 deletions)
@@ -1,7 +1,7 @@
 ---
-title: III - Managing source code
+title: II - Managing source code
 description: Learn how to manage your actor's source code more efficiently by integrating it with a Github repository. This is the standard on the Apify platform.