diff --git a/sources/academy/glossary/concepts/css_selectors.md b/sources/academy/glossary/concepts/css_selectors.md index 4c2fd79ed..ee36b53c0 100644 --- a/sources/academy/glossary/concepts/css_selectors.md +++ b/sources/academy/glossary/concepts/css_selectors.md @@ -59,7 +59,7 @@ CSS selectors are important for web scraping because they allow you to target sp For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project. -Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it easily. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. +Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. ## Resources diff --git a/sources/academy/glossary/concepts/http_headers.md b/sources/academy/glossary/concepts/http_headers.md index 4e3c71528..64266bc8d 100644 --- a/sources/academy/glossary/concepts/http_headers.md +++ b/sources/academy/glossary/concepts/http_headers.md @@ -23,7 +23,7 @@ For some websites, you won't need to worry about modifying headers at all, as th Some websites will require certain default browser headers to work properly, such as **User-Agent** (though, this header is becoming more obsolete, as there are more sophisticated ways to detect and block a suspicious user). -Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which simply would not know which data to return without knowing which exact website is requesting it. +Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it. ## Custom headers required {#needs-custom-headers} @@ -44,7 +44,7 @@ You could use Chrome DevTools to inspect request headers, and [Insomnia](../tool HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of: 1. HTTP/2 headers do not include status messages. They only contain status codes. -2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will simply ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem. +2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. 
While some browsers will ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. Easy to do by accident, and a big problem. 3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**). > To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out [this](https://httptoolkit.tech/blog/translating-http-2-into-http-1/) article diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md index 3e96ba1b9..cfdf1eefe 100644 --- a/sources/academy/glossary/tools/edit_this_cookie.md +++ b/sources/academy/glossary/tools/edit_this_cookie.md @@ -11,7 +11,7 @@ slug: /tools/edit-this-cookie --- -**EditThisCookie** is a simple Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain. +**EditThisCookie** is a Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain. ![EditThisCookie popup](./images/edit-this-cookie-popup.png) @@ -21,11 +21,11 @@ At the top of the popup, there is a row of buttons. From left to right, here is ### Delete all cookies -Clicking this button will simply remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. +Clicking this button will remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. ### Reset -Basically just a refresh button. +A refresh button. ### Add a new cookie diff --git a/sources/academy/glossary/tools/insomnia.md b/sources/academy/glossary/tools/insomnia.md index dc3523f19..143e57a4e 100644 --- a/sources/academy/glossary/tools/insomnia.md +++ b/sources/academy/glossary/tools/insomnia.md @@ -1,13 +1,13 @@ --- title: Insomnia -description: Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers. +description: Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers. 
sidebar_position: 9.2 slug: /tools/insomnia --- # What is Insomnia {#what-is-insomnia} -**Learn about Insomnia, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.** +**Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.** --- @@ -66,4 +66,4 @@ This will bring up the **Manage cookies** window, where all cached cookies can b ## Postman or Insomnia {#postman-or-insomnia} -The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, just choose the one that has the most intuitive interface for you. +The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, choose the one that has the most intuitive interface for you. diff --git a/sources/academy/glossary/tools/modheader.md b/sources/academy/glossary/tools/modheader.md index 7cafa0ae5..581e6628d 100644 --- a/sources/academy/glossary/tools/modheader.md +++ b/sources/academy/glossary/tools/modheader.md @@ -19,9 +19,9 @@ If you read about [Postman](./postman.md), you might remember that you can use i After you install the ModHeader extension, you should see it pinned in Chrome's task bar. When you click it, you'll see an interface like this pop up: -![Modheader's simple interface](./images/modheader.jpg) +![Modheader's interface](./images/modheader.jpg) -Here, you can add headers, remove headers, and even save multiple collections of headers that you can easily toggle between (which are called **Profiles** within the extension itself). +Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself). ## Use cases {#use-cases} diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md index d1671cc68..27fb8a523 100644 --- a/sources/academy/glossary/tools/postman.md +++ b/sources/academy/glossary/tools/postman.md @@ -1,19 +1,19 @@ --- title: Postman -description: Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers. +description: Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers. sidebar_position: 9.3 slug: /tools/postman --- # What is Postman? {#what-is-postman} -**Learn about Postman, a simple yet super valuable tool for testing requests and proxies when building scalable web scrapers.** +**Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.** --- -[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. 
Postman allows us to easily test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. +[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper. -The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a simple signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/). +The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/getting-started/introduction/). ## Understanding the interface {#understanding-the-interface} @@ -43,7 +43,7 @@ In order to use a proxy, the proxy's server and configuration must be provided i ![Proxy configuration in Postman settings](./images/postman-proxy.png) -After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings just needs to be un-ticked to disable it. +After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings needs to be un-ticked to disable it. ## Managing the cookies cache {#managing-cookies} @@ -55,7 +55,7 @@ In order to check whether there are any cookies associated with a certain reques ![Button to view the cached cookies](./images/postman-cookies-button.png) -Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to ****, within this window we would be able to easily find cached cookies associated with github.com. Cookies can also be easily edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here. +Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to ****, within this window we would be able to find cached cookies associated with github.com. Cookies can also be edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here. 
![Managing cookies in Postman with the "MANAGE COOKIES" window](./images/postman-manage-cookies.png) diff --git a/sources/academy/glossary/tools/quick_javascript_switcher.md b/sources/academy/glossary/tools/quick_javascript_switcher.md index ba2f3d580..eca2c21b3 100644 --- a/sources/academy/glossary/tools/quick_javascript_switcher.md +++ b/sources/academy/glossary/tools/quick_javascript_switcher.md @@ -1,17 +1,17 @@ --- title: Quick JavaScript Switcher -description: Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. +description: Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. sidebar_position: 9.9 slug: /tools/quick-javascript-switcher --- # Quick JavaScript Switcher -**Discover a super simple tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.** +**Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.** --- -**Quick JavaScript Switcher** is a very simple Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. +**Quick JavaScript Switcher** is a Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. If JavaScript is enabled - clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website will work without JavaScript (and thus could be parsed without using a browser with a plain HTTP request) or not. diff --git a/sources/academy/glossary/tools/user_agent_switcher.md b/sources/academy/glossary/tools/user_agent_switcher.md index 7b86fcbcc..65a1445a7 100644 --- a/sources/academy/glossary/tools/user_agent_switcher.md +++ b/sources/academy/glossary/tools/user_agent_switcher.md @@ -1,17 +1,17 @@ --- title: User-Agent Switcher -description: Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. +description: Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. sidebar_position: 9.8 slug: /tools/user-agent-switcher --- # User-Agent Switcher -**Learn how to easily switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** +**Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** --- -**User-Agent Switcher** is a simple Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups. 
+**User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups. ![User-Agent Switcher groups](./images/user-agent-switcher-groups.png) diff --git a/sources/academy/platform/deploying_your_code/deploying.md b/sources/academy/platform/deploying_your_code/deploying.md index 8e6b2c89c..bfabc2aa7 100644 --- a/sources/academy/platform/deploying_your_code/deploying.md +++ b/sources/academy/platform/deploying_your_code/deploying.md @@ -21,12 +21,10 @@ Before we deploy our project onto the Apify platform, let's ensure that we've pu ### Creating the Actor -Before anything can be integrated, we've gotta create a new Actor. Luckily, this is super easy to do. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **New** button, then select the **Empty** template. +Before anything can be integrated, we've gotta create a new Actor. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal) and click on the **New** button, then select the **Empty** template. ![Create new button](../getting_started/images/create-new-actor.png) -Easy peasy! - ### Changing source code location {#change-source-code} In the **Source** tab on the new Actor's page, we'll click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**. diff --git a/sources/academy/platform/deploying_your_code/docker_file.md b/sources/academy/platform/deploying_your_code/docker_file.md index 81ab4704c..43e0902dc 100644 --- a/sources/academy/platform/deploying_your_code/docker_file.md +++ b/sources/academy/platform/deploying_your_code/docker_file.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in. Every Actor must have a Dockerfile, as Actors run in Docker containers. -> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to just run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, GO, etc). +> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, GO, etc). ## Base images {#base-images} @@ -24,7 +24,7 @@ If your project doesn’t already contain a Dockerfile, don’t worry! Apify off > Tip: You can see all of Apify's Docker images [on DockerHub](https://hub.docker.com/r/apify/). -At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or just install them yourself during the build step. 
+At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or install them yourself during the build step. Once you find the base image you need, you can add it as the initial `FROM` statement: @@ -111,7 +111,7 @@ CMD python3 main.py ## Examples {#examples} -The examples we just showed were for Node.js and Python, however, to drive home the fact that Actors can be written in any language, here are some examples of some Dockerfiles for Actors written in different programming languages: +The examples above show how to deploy Actors written in Node.js or Python, but you can use any language. As an inspiration, here are a few examples for other languages: Go, Rust, Julia. diff --git a/sources/academy/platform/deploying_your_code/index.md b/sources/academy/platform/deploying_your_code/index.md index cbbe233c1..c016bd8bb 100644 --- a/sources/academy/platform/deploying_your_code/index.md +++ b/sources/academy/platform/deploying_your_code/index.md @@ -1,6 +1,6 @@ --- title: Deploying your code -description: In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor in just a few minutes! +description: In this course learn how to take an existing project of yours and deploy it to the Apify platform as an actor. sidebar_position: 9 category: apify platform slug: /deploying-your-code @@ -11,25 +11,24 @@ import TabItem from '@theme/TabItem'; # Deploying your code to Apify {#deploying} -**In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor in just a few minutes!** +**In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor.** --- This section will discuss how to use your newfound knowledge of the Apify platform and Actors from the [**Getting started**](../getting_started/index.md) section to deploy your existing project's code to the Apify platform as an Actor. - -Because Actors are basically just chunks of code running in Docker containers, you're able to **_Actorify_** just about anything! +Any program running in a Docker container can become an Apify Actor. ![The deployment workflow](../../images/deployment-workflow.png) -Actors are language agnostic, which means that the language your project is written in does not affect your ability to actorify it. +Apify provides detailed guidance on how to deploy Node.js and Python programs as Actors, but apart from that you're not limited in what programming language you choose for your scraper. 
![Supported languages](../../images/supported-languages.jpg) -Though the majority of Actors currently on the platform were written in Node.js, and despite the fact our current preferred languages are JavaScript and Python, there are a few examples of Actors in other languages: +Here are a few examples of Actors in other languages: -- [Actor written in Rust](https://apify.com/lukaskrivka/rust-actor-example) -- [GO Actor](https://apify.com/jirimoravcik/go-actor-example) -- [Actor written with Julia](https://apify.com/jirimoravcik/julia-actor-example) +- [Rust actor](https://apify.com/lukaskrivka/rust-actor-example) +- [Go actor](https://apify.com/jirimoravcik/go-actor-example) +- [Julia actor](https://apify.com/jirimoravcik/julia-actor-example) ## The "actorification" workflow {#workflow} diff --git a/sources/academy/platform/deploying_your_code/input_schema.md b/sources/academy/platform/deploying_your_code/input_schema.md index d1f78da24..4a60c8c9f 100644 --- a/sources/academy/platform/deploying_your_code/input_schema.md +++ b/sources/academy/platform/deploying_your_code/input_schema.md @@ -28,7 +28,7 @@ In the root of our project, we'll create a file named **INPUT_SCHEMA.json** and } ``` -The **title** and **description** simply describe what the input schema is for, and a bit about what the Actor itself does. +The **title** and **description** describe what the input schema is for, and a bit about what the Actor itself does. ## Properties {#properties} @@ -102,7 +102,7 @@ Here is what the input schema we wrote will render on the platform: ![Rendered UI from input schema](./images/rendered-ui.png) -Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the Actor and not become overwhelmed. +Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to understand the Actor and not become overwhelmed. It's not expected to memorize all of the fields that properties can take or the different editor types available, which is why it's always good to reference the [input schema documentation](/platform/actors/development/actor-definition/input-schema) when writing a schema. diff --git a/sources/academy/platform/deploying_your_code/inputs_outputs.md b/sources/academy/platform/deploying_your_code/inputs_outputs.md index 3bae75dc0..45a5be745 100644 --- a/sources/academy/platform/deploying_your_code/inputs_outputs.md +++ b/sources/academy/platform/deploying_your_code/inputs_outputs.md @@ -11,7 +11,7 @@ slug: /deploying-your-code/inputs-outputs --- -Most of the time when you're creating a project, you are expecting some sort of input from which your software will run off. Oftentimes as well, you want to provide some sort of output once your software has completed running. With Apify, it is extremely easy to take in inputs and deliver outputs. +Most of the time when you're creating a project, you are expecting some sort of input from which your software will run off. Oftentimes as well, you want to provide some sort of output once your software has completed running. Apify provides a convenient way to handle inputs and deliver outputs. An important thing to understand regarding inputs and outputs is that they are read/written differently depending on where the Actor is running: @@ -221,4 +221,4 @@ After running our script, there should be a single item in the default dataset t ## Next up {#next} -That's it! 
We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](./input_schema.md), we'll be learning how to easily generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format. +That's it! We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](./input_schema.md), we'll be learning how to generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format. diff --git a/sources/academy/platform/deploying_your_code/output_schema.md b/sources/academy/platform/deploying_your_code/output_schema.md index afdd1eaec..25dc9bc6b 100644 --- a/sources/academy/platform/deploying_your_code/output_schema.md +++ b/sources/academy/platform/deploying_your_code/output_schema.md @@ -69,7 +69,7 @@ Next, copy-paste the following template code into your `actor.json` file. } ``` -To configure the output schema, simply replace the fields in the template with the relevant fields to your Actor. +To configure the output schema, replace the fields in the template with the relevant fields to your Actor. For reference, you can use the [Zappos Scraper source code](https://github.com/PerVillalva/zappos-scraper-actor/blob/main/.actor/actor.json) as an example of how the final implementation of the output tab should look in a live Actor. diff --git a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md index 4beda0b22..bbf31f730 100644 --- a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md +++ b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md @@ -17,7 +17,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh In this course, we'll be working out of the Amazon scraper project from the **Web scraping for beginners** course. If you haven't already built that project, you can do it in three short lessons [here](../../webscraping/web_scraping_for_beginners/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same. -Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is basically just a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile). +Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile). 
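+
+For a quick illustration, here is a rough sketch of what a typical Node.js Actor Dockerfile looks like, assuming a project that starts with `npm start` - the exact contents of the file in your own project may differ:
+
+```dockerfile
+# Start from an Apify base image that already bundles Node.js.
+FROM apify/actor-node:20
+
+# Copy only the package manifests first so the dependency
+# installation layer can be cached between builds.
+COPY package*.json ./
+RUN npm install --omit=dev
+
+# Copy the rest of the Actor's source code and define how it starts.
+COPY . ./
+CMD npm start
+```
+
+The `apify/actor-node` base image is one of Apify's prebuilt Docker images; swapping the base image, the dependency installation step, and the start command is what changes when an Actor is written in another language.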
## Webhooks {#webhooks} diff --git a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md index 09cb219ab..38d3fdaa9 100644 --- a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md +++ b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md @@ -31,8 +31,6 @@ Also, try to explore the **Multifile editor** in one of the Actors you developed ## Our task {#our-task} -> This lesson's task is so quick and easy, we won't even be splitting this topic into two parts like the previous two topics! - First, we must initialize a GitHub repository (you can use Gitlab if you like, but this lesson's examples will be using GitHub). Then, after pushing our main Amazon Actor's code to the repo, we must switch its source code to use the content of the GitHub repository instead. ## Integrating GitHub source code {#integrating-github} diff --git a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md index c3f9d5e15..c59da8053 100644 --- a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md +++ b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md @@ -11,7 +11,7 @@ slug: /expert-scraping-with-apify/migrations-maintaining-state --- -We already know that Actors are basically just Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**. +We already know that Actors are Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**. On migration, the process inside of an Actor is completely restarted and everything in its memory is lost, meaning that any values stored within variables or classes are lost. diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md index 2cfba52a4..fe23b28fc 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md @@ -11,7 +11,7 @@ slug: /expert-scraping-with-apify/solutions/handling-migrations --- -Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to easily store, modify, persist, and log our tracked ASIN data. +Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to store, modify, persist, and log our tracked ASIN data. 
Here's the skeleton of our class: diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md index f8740cfa3..f2fe3b880 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md @@ -68,7 +68,7 @@ const filtered = items.reduce((acc, curr) => { }, {}); ``` -The results should be an array, so finally, we can take the map we just created and push an array of all of its values to the Actor's default dataset: +The results should be an array, so we can take the map we just created and push an array of its values to the Actor's default dataset: ```js await Actor.pushData(Object.values(filtered)); diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md index a9cf0ac81..73c4741d1 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md @@ -73,7 +73,7 @@ And that's it! We've successfully configured the session pool to match the task' ## Limiting proxy location {#limiting-proxy-location} -The final requirement was to only use proxies from the US. Back in our **ProxyConfiguration**, we just need to add the **countryCode** key and set it to **US**: +The final requirement was to use proxies only from the US. Back in our **ProxyConfiguration**, we need to add the **countryCode** key and set it to **US**: ```js const proxyConfiguration = await Actor.createProxyConfiguration({ @@ -94,7 +94,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({ **Q: How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios?** -**A:** By making the proxy for the scraper to use be configurable by the user through the Actor's input. That way, they can easily switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy instead of specific groups. +**A:** By making the proxy for the scraper to use be configurable by the user through the Actor's input. That way, they can switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy instead of specific groups. **Q: Does it make sense to rotate proxies when you are logged into a website?** @@ -106,7 +106,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({ **Q: What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler?** -**A:** Simply making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`. +**A:** Making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. 
While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`. **Q: Name a few different ways how a website can prevent you from scraping it.** diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md index a8bcd851b..95970aab0 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md @@ -88,7 +88,7 @@ const crawler = new CheerioCrawler({ ## Tracking total saved {#tracking-total-saved} -Now, we'll just increment our **totalSaved** count for every offer added to the dataset. +Now, we'll increment our **totalSaved** count for every offer added to the dataset. ```js router.addHandler(labels.OFFERS, async ({ $, request }) => { @@ -114,7 +114,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { ## Saving stats with dataset items {#saving-stats-with-dataset-items} -Still, in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is easily accessible in the context object. +Still, in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is accessible in the context object. ```js router.addHandler(labels.OFFERS, async ({ $, request }) => { diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md index b533b5838..87ae6b910 100644 --- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md +++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md @@ -108,7 +108,7 @@ export const CHEAPEST_ITEM = 'CHEAPEST-ITEM'; ## Code check-in {#code-check-in} -Just to ensure we're all on the same page, here is what the **main.js** file looks like now: +Here is what the **main.js** file looks like now: ```js // main.js @@ -232,7 +232,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { Don't forget to push your changes to GitHub using `git push origin MAIN_BRANCH_NAME` to see them on the Apify platform! -## Creating a task (It's easy!) {#creating-task} +## Creating a task {#creating-task} Back on the platform, on your Actor's page, you can see a button in the top right hand corner that says **Create new task**: diff --git a/sources/academy/platform/get_most_of_actors/actor_readme.md b/sources/academy/platform/get_most_of_actors/actor_readme.md index a5c766cc4..834de5878 100644 --- a/sources/academy/platform/get_most_of_actors/actor_readme.md +++ b/sources/academy/platform/get_most_of_actors/actor_readme.md @@ -42,7 +42,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move 3. **How much will it cost to scrape (target site)?** - - Simple text explaining what type of proxies are needed and how many platform credits (calculated mainly from consumption units) are needed for 1000 results. 
+ - Explanation of what type of proxies are needed and how many platform credits (calculated mainly from consumption units) are needed for 1000 results. - This is calculated from carrying out several runs (or from runs saved in the DB). - Here’s an example for this section: @@ -57,7 +57,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move - Add a video tutorial or GIF from an ideal Actor run. - > Tip: For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Simply add a separate line with the URL of your YouTube video. + > Tip: For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Add a separate line with the URL of your YouTube video. - Consider adding a short numbered tutorial as Google will sometimes pick these up as rich snippets. Remember that this might be in search results, so you can repeat the name of the Actor and give a link, e.g. @@ -71,7 +71,7 @@ Aim for sections 1–6 below and try to include at least 300 words. You can move 6. **Input** - - Each Actor detail page has an input tab, so you just need to refer to that. If you like, you can add a screenshot showing the user what the input fields will look like. + - Refer to the input tab on Actor's detail page. If you like, you can add a screenshot showing the user what the input fields will look like. - This is an example of how to refer to the input tab: > Twitter Scraper has the following input options. Click on the [input tab](https://apify.com/vdrmota/twitter-scraper/input-schema) for more information. diff --git a/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md b/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md index dbe76dc26..43ad39b74 100644 --- a/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md +++ b/sources/academy/platform/get_most_of_actors/guidelines_for_writing.md @@ -1,5 +1,5 @@ --- -title: Guidelines for writing tutorials +title: Guidelines for writing tutorials description: Create a guide for your users so they can get the best out of your Actor. Make sure your tutorial is both user- and SEO-friendly. Your tutorial will be published on Apify Blog. sidebar_position: 3 slug: /get-most-of-actors/guidelines-writing-tutorials @@ -42,12 +42,12 @@ These guidelines are of course not set in stone. They are here to give you a gen ## Tutorial template -A simple tutorial template for you to start from. Feel free to expand and modify it as you see fit. +A tutorial template for you to start from. Feel free to expand and modify it as you see fit. ```markdown # How to [perform task] automatically -A simple step-by-step guide to [describe what the guide helps achieve]. +A step-by-step guide to [describe what the guide helps achieve]. The web is a vast and dynamic space, continuously expanding and evolving. Often, there's a need to [describe the problem or need the tool addresses]. A handy tool for anyone who wants to [describe what the tool helps with] would be invaluable. @@ -69,7 +69,7 @@ Here's how to [quick intro to the tutorial itself] ### Step 1. Find the [Actor name] -Navigate to [Tool Name] and click the [CTA button]. You'll be redirected to Apify Console. +Navigate to [Tool Name] and click the [CTA button]. You'll be redirected to Apify Console. ### Step 2. 
Add URL or choose [setting 1], [setting 2], and [setting 3] diff --git a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md index 5f6351cde..b3d7f99b4 100644 --- a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md +++ b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md @@ -94,7 +94,7 @@ To dive deep into numbers for a specific Actor, you can visit the Actor insights Your paid Actors’ profits are directly related to the amount of paying users you have for your tool. After publishing and monetizing your software, comes a crucial step for your Actor’s success: **attracting users**. -Getting new users can be an art in itself, but there are **two simple steps** you can take to ensure your Actor is getting the attention it deserves. +Getting new users can be an art in itself, but there are **two proven steps** you can take to ensure your Actor is getting the attention it deserves. 1. **SEO-optimized description and README** diff --git a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md index f0e806e65..ec50b909d 100644 --- a/sources/academy/platform/get_most_of_actors/seo_and_promotion.md +++ b/sources/academy/platform/get_most_of_actors/seo_and_promotion.md @@ -23,7 +23,7 @@ On the other hand, if you precisely address a niche segment of users who will be ## Keywords -Several freemium tools exist that make it easy to identify the right phrases and keywords: +Several freemium tools exist that help with identifying the right phrases and keywords: - [wordstream.com/keywords](https://www.wordstream.com/keywords) - [neilpatel.com/ubersuggest](https://neilpatel.com/ubersuggest/) @@ -41,7 +41,7 @@ The best combinations are those with **high search volume** and **low competitio - Page body (e.g., README). - The texts in your links. -> While crafting your content with keywords, beware of [over-optimizing or keyword stuffing](https://yoast.com/over-optimized-website/) the page. You can use synonyms or related keywords to help this. Google is smart enough to evaluate the page based on how well the whole topic is covered (not just based on keywords), but using them helps. +> While crafting your content with keywords, beware of [over-optimizing or keyword stuffing](https://yoast.com/over-optimized-website/) the page. You can use synonyms or related keywords to help this. Google is smart enough to evaluate the page based on how well the whole topic is covered (not only by keywords), but using them helps. ## Optimizing your Actor details @@ -49,9 +49,7 @@ While blog posts and promotion are important, your Actor is the main product. He ### Name -The Actor name is your Actor's developer-style name, which is prefixed by your username (e.g. `jancurn/find-broken-links`). The name is used to generate URL used for your Actor (e.g. ), making it an important signal for search engines. - -However, the name should also be readable and clear enough, so that people using your Actor can understand what it does just from the name. +The Actor name is your Actor's developer-style name, which is prefixed by your username (e.g. `jancurn/find-broken-links`). The name is used to generate URL used for your Actor (e.g. ), making it an important signal for search engines. The name should also be readable and clear enough, so that people using your Actor can understand what it does. 
[Read more about naming your Actor](./naming_your_actor.md)!. @@ -114,7 +112,7 @@ Now that you’ve created a cool new Actor, let others see it! Share it on your - Use relevant and widely used hashtags (Twitter). -> **GOOD**: Need to crawl #Amazon or #Yelp? See my Amazon crawler ... +> **GOOD**: Need to crawl #Amazon or #Yelp? See my Amazon crawler https:\/\/... >
**AVOID**: I just #created something, check it out on Apify... - Post in groups or pages with relevant target groups (Facebook and LinkedIn). diff --git a/sources/academy/platform/getting_started/actors.md b/sources/academy/platform/getting_started/actors.md index 565a878a1..1f89ee263 100644 --- a/sources/academy/platform/getting_started/actors.md +++ b/sources/academy/platform/getting_started/actors.md @@ -15,11 +15,11 @@ After you've followed the **Getting started** lesson, you're almost ready to sta ## What's an Actor? {#what-is-an-actor} -When you deploy your script to the Apify platform, it is then called an **Actor**, which is simply a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. +When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours or even infinitely. An Actor can perform anything from a basic action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Once an Actor has been pushed to the Apify platform, they can be shared to the world through the [Apify Store](https://apify.com/store), and even monetized after going public. -> Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to just scraping. They are just pieces of code running in Docker containers, which means they can be used for nearly anything. +> Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to scraping. They can be any program running in a Docker container. ## Actors on the Apify platform {#actors-on-platform} @@ -29,7 +29,7 @@ On the front page of the Actor, click the green **Try for free** button. If you' ![Actor configuration](./images/seo-actor-config.png) -This is where we can provide input to the Actor. The defaults here are just fine, so we'll just leave it as is and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself. +This is where we can provide input to the Actor. The defaults here are just fine, so we'll leave it as is and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself. 
![Actor logs](./images/actor-logs.jpg) diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md index 8a7cbceb6..0f410f7b7 100644 --- a/sources/academy/platform/getting_started/apify_api.md +++ b/sources/academy/platform/getting_started/apify_api.md @@ -39,7 +39,7 @@ Our **adding-actor** takes in two input values (`num1` and `num2`). When using t ## Parameters {#parameters} -Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by simply passing the **format** parameter with a value of **csv** to change the output format: +Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by passing the **format** parameter with a value of **csv** to change the output format: ```text https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv @@ -47,7 +47,7 @@ https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-it Additional parameters can be passed to this endpoint. You can learn about them [here](/api/v2#/reference/actors/run-actor-synchronously-and-get-dataset-items/run-actor-synchronously-with-input-and-get-dataset-items) -> Note: It is safer to put your API token in the **Authorization** header like so: `Authorization: Bearer YOUR_TOKEN`. This is very easy to configure in popular HTTP clients, such as [Postman](../../glossary/tools/postman.md), [Insomnia](../../glossary/tools/insomnia.md). +> Network components can record visited URLs, so it's more secure to send the token as a HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md), provide a convenient way to configure the Authorization header for all your API requests. ## Sending the request {#sending-the-request} @@ -69,7 +69,7 @@ What we've done in this lesson only scratches the surface of what the Apify API ## Next up {#next} -[Next up](./apify_client.md), we'll be learning about how to use Apify's JavaScript and Python clients to easily interact with the API right within our code. +[Next up](./apify_client.md), we'll be learning about how to use Apify's JavaScript and Python clients to interact with the API right within our code. - -In this lesson, we'll be discussing dynamic content and how to scrape it while utilizing Crawlee. - ## A quick experiment {#quick-experiment} -From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. Easy enough! We did something very similar in the previous modules. +From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. 
![New arrival products in Fakestore](./images/new-arrivals.jpg) diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md index 1bac20897..de668ba9a 100644 --- a/sources/academy/tutorials/node_js/debugging_web_scraper.md +++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md @@ -7,7 +7,7 @@ slug: /node-js/debugging-web-scraper A lot of beginners struggle through trial and error while scraping a simple site. They write some code that might work, press the run button, see that error happened and they continue writing more code that might work but probably won't. This is extremely inefficient and gets tedious really fast. -What beginners are missing are simple tools and tricks to get things done quickly. One of these wow tricks is the option to run the JavaScript code directly in your browser. +What beginners are missing are basic tools and tricks to get things done quickly. One of these wow tricks is the option to run the JavaScript code directly in your browser. Pressing F12 while browsing with Chrome, Firefox, or other popular browsers opens up the browser console, the magic toolbox of any web developer. The console allows you to run a code in the context of the website you are in. Don't worry, you cannot mess the site up (well, unless you start doing really nasty tricks) as the page content is downloaded on your computer and any change is only local to your PC. @@ -29,7 +29,7 @@ You can test a `pageFunction` code in two ways in your console: ## Pasting and running a small code snippet -Usually, you don't need to paste in the whole pageFunction as you can simply isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1. +Usually, you don't need to paste in the whole pageFunction as you can isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1. I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). My declaration will change from: @@ -44,7 +44,7 @@ into results = []; ``` -You can easily get all the information you need by running a small snippet of your pageFunction like this +You can get all the information you need by running a snippet of your `pageFunction` like this: ```js results = []; @@ -56,7 +56,7 @@ $('.my-list-item').each((i, el) => { }); ``` -Now the `results` variable stays on the page and you can do whatever you wish with it. Usually, simply log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console. +Now the `results` variable stays on the page and you can do whatever you wish with it. Log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console. ```js results; @@ -65,7 +65,7 @@ results; ## Pasting and running a full pageFunction -If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will just have to mock the context object when calling it. 
If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere. +If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will have to mock the context object when calling it. If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere. ```js diff --git a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md index f107388cc..61a869e32 100644 --- a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md +++ b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md @@ -5,7 +5,7 @@ sidebar_position: 16 slug: /node-js/filter-blocked-requests-using-sessions --- -_This article explains how the problem was solved before the [SessionPool](/sdk/js/docs/api/session-pool) class was added into [Apify SDK](/sdk/js/). We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, just follow the documentation and examples of SessionPool._ +_This article explains how the problem was solved before the [SessionPool](/sdk/js/docs/api/session-pool) class was added into [Apify SDK](/sdk/js/). We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, follow the documentation and examples of SessionPool._ ### Overview of the problem @@ -23,7 +23,7 @@ You want to crawl a website with a proxy pool, but most of your proxies are bloc Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](/platform/proxy#residential-proxy), but they can sometimes be too costly. -However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (simply it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually just throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. +However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). 
Check out [this article](https://help.apify.com/en/articles/2190650-how-to-handle-blocked-requests-in-puppeteercrawler) as inspiration for how to handle this situation with the `PuppeteerCrawler` class. ### Solution @@ -52,7 +52,7 @@ Apify.main(async () => { ### Algorithm -You don't necessarily need to understand the solution below - it should be fine to just copy/paste it to your Actor. +You don't necessarily need to understand the solution below - it should be fine to copy/paste it into your Actor. `sessions`  will be an object whose keys will be the names of the sessions and values will be objects with the name of the session (we choose a random number as a name here) and user agent (you can add any other useful properties that you want to match with each session.) This will be created automatically, for example: @@ -162,7 +162,7 @@ const crawler = new Apify.PuppeteerCrawler({ }); ``` -We picked the session and added it to the browser as `apifyProxySession` but for userAgent, we didn't simply passed the user agent as it is but added the session name into it. That is the hack because we can retrieve the user agent from the Puppeteer browser itself. +We picked the session and added it to the browser as `apifyProxySession`, but for userAgent, we didn't pass the User-Agent as is; instead, we added the session name into it. This hack works because we can later retrieve the user agent from the Puppeteer browser itself. Now we need to retrieve the session name back in the `gotoFunction`, pass it into userData and fix the hacked userAgent back to normal so it is not suspicious for the website. @@ -202,7 +202,7 @@ Things to consider 1. Since the good and bad proxies are getting filtered over time, this solution only makes sense for crawlers with at least hundreds of requests. -2. This solution will not help you if you simply don't have enough proxies for your job. It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl. +2. This solution will not help you if you don't have enough proxies for your job. It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl. 3. If you are more concerned about the speed of your crawler and less about banning proxies, set the `maxSessions` parameter of `pickSession` function to a number relatively lower than your total number of proxies. If on the other hand, keeping your proxies alive is more important, set `maxSessions`  relatively higher so you will always pick new proxies. diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md index 1e4e0d9a9..da2a425b3 100644 --- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md +++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md @@ -5,13 +5,13 @@ sidebar_position: 15.9 slug: /node-js/handle-blocked-requests-puppeteer --- -One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://www.apify.com/docs/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to easily cover your crawling needs.
Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js). +One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://www.apify.com/docs/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to cover your crawling needs. Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js). # BasicCrawler > Getting around website defense mechanisms when crawling. -Setting proxy rotation in [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler) is pretty simple. When using plain HTTP requests (like with the popular '[request-promise](https://www.npmjs.com/package/request-promise)' npm package), a fresh proxy is set up on each request. +You can use `handleRequestFunction` to set up proxy rotation for a [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). The following example shows how to use a fresh proxy on each request if you make requests through the popular [request-promise](https://www.npmjs.com/package/request-promise) npm package: ```js const Apify = require('apify'); @@ -31,7 +31,7 @@ const crawler = new Apify.BasicCrawler({ }); ``` -Each time handleRequestFunction is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not easily burn through your proxies. +Each time `handleRequestFunction` is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies. # Puppeteer Crawler @@ -39,7 +39,7 @@ With [PuppeteerCrawler](/sdk/js/docs/api/puppeteer-crawler) the situation is a l The straightforward solution would be to set the 'retireInstanceAfterRequestCount' option to 1. PuppeteerCrawler would then rotate the proxies in the same way as BasicCrawler. While this approach could sometimes be useful for the toughest websites, the price you pay is in performance. Restarting the browser is an expensive operation. -That's why PuppeteerCrawler offers a utility retire() function through a PuppeteerPool class. You can access PuppeteerPool by simply passing it into the object parameter of gotoFunction or handlePageFunction. +That's why PuppeteerCrawler offers a utility retire() function through a PuppeteerPool class. You can access PuppeteerPool by passing it into the object parameter of gotoFunction or handlePageFunction. ```js const crawler = new PuppeteerCrawler({ @@ -54,9 +54,9 @@ const crawler = new PuppeteerCrawler({ }); ``` -It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be just missing or corrupted. The developer can then choose if he will try to handle these problems in the code or just focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error. +It is really up to a developer to spot if something is wrong with his request. 
A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). Page loading can be cancelled right away, it can time out, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose if he will try to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error. -Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code some simple Google search crawler. The two main blocking mechanisms used by Google is either to display their (in)famous 'sorry' captcha or to not load the page at all so we will focus on covering these. +Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code a Google search crawler. The two main blocking mechanisms used by Google are either to display their (in)famous 'sorry' captcha or to not load the page at all, so we will focus on covering these. For example, let's assume we have already initialized a requestList of Google search pages. Let's show how you can use the retire() function in both gotoFunction and handlePageFunction. diff --git a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md index af1347488..ac42f1600 100644 --- a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md +++ b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md @@ -17,13 +17,13 @@ The `Target closed` error happens when you try to access the `page` object (or s ![Chrome crashed tab](./images/chrome-crashed-tab.png) -Browsers create a separate process for each tab. That means each tab lives with a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot simply close your old tabs to free extra memory so it will usually kill your current memory hungry tab. +Browsers create a separate process for each tab. That means each tab lives in a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot close your old tabs to free extra memory, so it will usually kill your current memory-hungry tab. ### Memory solution If you use [Crawlee](https://crawlee.dev/), your concurrency automatically scales up and down to fit in the allocated memory. You can change the allocated memory using the environment variable or the [Configuration](https://crawlee.dev/docs/guides/configuration) class. But very hungry pages can still occasionally cause sudden memory spikes, and you might have to limit the [maxConcurrency](https://crawlee.dev/docs/guides/scaling-crawlers#minconcurrency-and-maxconcurrency) of the crawler. This problem is very rare, though. -Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or just increase the allocated memory. +Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or increase the allocated memory.
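To illustrate the Crawlee approach described above, here is a minimal sketch of capping `maxConcurrency` so a few memory-hungry pages cannot exhaust the allocated memory. The crawler type and the limit of 10 are assumptions you would tune for your own use case:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Cap concurrency so memory spikes from heavy pages stay within the allocated memory.
    maxConcurrency: 10,
    async requestHandler({ request, page }) {
        // Access the page right away, before a potential crash can close it.
        const title = await page.title();
        console.log(`${request.url}: ${title}`);
    },
});

await crawler.run(['https://example.com']);
```

If memory is still tight, the allocated memory itself can be raised as described in the linked Configuration guide.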
## Page closed prematurely diff --git a/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md b/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md index 86891b591..da90807f5 100644 --- a/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md +++ b/sources/academy/tutorials/node_js/how_to_save_screenshots_puppeteer.md @@ -28,7 +28,7 @@ Because this is so common use-case Apify SDK has a utility function called [save - You can also save the HTML of the page -A simple example in an Apify Actor: +An example of such an Apify Actor: ```js import { Actor } from 'apify'; diff --git a/sources/academy/tutorials/node_js/index.md b/sources/academy/tutorials/node_js/index.md index 7873f0d4f..c8abaa847 100644 --- a/sources/academy/tutorials/node_js/index.md +++ b/sources/academy/tutorials/node_js/index.md @@ -12,4 +12,4 @@ slug: /node-js --- -This section contains various web-scraping or web-scraping related tutorials for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or just need some general tips for scraping in Node.js, this section is right for you. +This section contains various web-scraping or web-scraping related tutorials for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or need some general tips for scraping in Node.js, this section is right for you. diff --git a/sources/academy/tutorials/node_js/optimizing_scrapers.md b/sources/academy/tutorials/node_js/optimizing_scrapers.md index 790af4709..616009678 100644 --- a/sources/academy/tutorials/node_js/optimizing_scrapers.md +++ b/sources/academy/tutorials/node_js/optimizing_scrapers.md @@ -13,7 +13,7 @@ slug: /node-js/optimizing-scrapers Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more proxy bandwidth, storage, [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need. -The goal of optimization is simple: Make the code run as fast as possible and use the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU. +The goal of optimization is to make the code run as fast as possible while using as few resources as possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU. One more thing to remember. Optimization has its own cost: development time.
You should always think about how much time you're able to spend on it and if it's worth it. diff --git a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md index f6c000c92..cfc564886 100644 --- a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md +++ b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md @@ -5,11 +5,11 @@ sidebar_position: 15.6 slug: /node-js/processing-multiple-pages-web-scraper --- -There is a certain scraping scenario in which you need to process the same URL many times, but each time with a different setup (e.g. filling in a form with different data each time). This is easy to do with Apify, but how to go about it may not be obvious at first glance. +Sometimes you need to process the same URL several times, but each time with a different setup. For example, you may want to submit the same form with different data each time. -We'll show you how to do this with a simple example: starting a scraper with an array of keywords, inputting each of the keywords separately into Google, and retrieving the results on the last page. The tutorial will be split into these three main parts. +Let's illustrate a solution to this problem by creating a scraper which starts with an array of keywords and inputs each of them into Google, one by one. Then it retrieves the results. -This whole thing could be done in a much easier way, by directly enqueuing the search URL, but we're choosing this approach to demonstrate some of the not so obvious features of the Apify scraper. +> This isn't an efficient solution for searching keywords on Google. You could directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD`. # Enqueuing start pages for all keywords diff --git a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md index 01b27a5fe..1d3a06dfc 100644 --- a/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md +++ b/sources/academy/tutorials/node_js/request_labels_in_apify_actors.md @@ -5,7 +5,7 @@ sidebar_position: 15.1 slug: /node-js/request-labels-in-apify-actors --- -Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request easily? +Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request? Here's how to do it. @@ -50,13 +50,13 @@ await requestQueue.addRequest({ }); ``` -Now, in the "SELLERDETAIL" url, we can just evaluate the page and extracted data merge to the object from the item detail, for example like this +Now, in the "SELLERDETAIL" URL, we can evaluate the page and merge the extracted data with the object from the item detail, for example like this: ```js const result = { ...request.userData.data, ...sellerDetail }; ``` -So next just save the results and we're done! +Save the results and we're done!
```js await Apify.pushData(result); diff --git a/sources/academy/tutorials/node_js/scraping_shadow_doms.md b/sources/academy/tutorials/node_js/scraping_shadow_doms.md index ada3afcb0..eaeca879d 100644 --- a/sources/academy/tutorials/node_js/scraping_shadow_doms.md +++ b/sources/academy/tutorials/node_js/scraping_shadow_doms.md @@ -1,17 +1,17 @@ --- title: How to scrape sites with a shadow DOM -description: The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround. +description: The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround. sidebar_position: 14.8 slug: /node-js/scraping-shadow-doms --- # How to scrape sites with a shadow DOM {#scraping-shadow-doms} -**The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround.** +**The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.** --- -Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) allows the separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript codes of separate shadow DOM components do not clash, but the downside is that you can't easily access the content from outside. +Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) allows the separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript codes of separate shadow DOM components do not clash, but the downside is that you can't access the content from outside. Let's take a look at this page [alodokter.com](https://www.alodokter.com/). If you click on the menu and open a Chrome debugger, you will see that the menu tree is attached to the main DOM as shadow DOM under the element ``. @@ -32,7 +32,7 @@ const links = $(shadowRoot.innerHTML).find('a'); const urls = links.map((obj, el) => el.href); ``` -However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't easily take advantage of all the scripts and tools you already have. +However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't take advantage of all the scripts and tools you already have. So instead of that, we can replace the content of each element containing shadow DOM with the HTML of shadow DOM. @@ -45,7 +45,7 @@ for (const el of document.getElementsByTagName('*')) { } ``` -After you run this, you can access all the elements and content easily using jQuery or plain JavaScript. The downside is that it breaks all the interactive components because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered. +After you run this, you can access all the elements and content using jQuery or plain JavaScript. 
The downside is that it breaks all the interactive components because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered. Some websites may contain shadow DOMs recursively inside of shadow DOMs. In these cases, we must replace them with HTML recursively: diff --git a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md index 93cc43188..c73105246 100644 --- a/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md +++ b/sources/academy/tutorials/node_js/when_to_use_puppeteer_scraper.md @@ -19,7 +19,7 @@ Ok, so both Web Scraper and Puppeteer Scraper use Puppeteer to give commands to ## Execution environment -It may sound fancy, but it's just a technical term for "where does my code run". When you open the DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. Browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Just remember that it's the environment where Puppeteer runs. +It may sound fancy, but it's just a technical term for "where does my code run". When you open the DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. Browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Remember that it's the environment where Puppeteer runs. By now you probably figured this out on your own, so this will not come as a surprise. The difference between Web Scraper and Puppeteer Scraper is where your page function gets executed. When using the Web Scraper, it's executed in the browser environment. It means that it gets access to all the browser specific features such as the `window` or `document` objects, but it cannot control the browser with Puppeteer directly. This is done automatically in the background by the scraper. Whereas in Puppeteer Scraper, the page function is executed in the Node.js environment, giving you full access to Puppeteer and all its features. @@ -28,11 +28,11 @@ _This does not mean that you can't execute in-browser code with Puppeteer Scrape ## Practical differences -Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can easily do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping. +Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping. ## Evaluating in-browser code -In Web Scraper, everything runs in the browser, so there's really not much to talk about there. With Puppeteer Scraper, it's just a single function call away. +In Web Scraper, everything runs in the browser, so there's really not much to talk about there. 
With Puppeteer Scraper, it's a single function call away. ```js const bodyHTML = await context.page.evaluate(() => { @@ -41,7 +41,7 @@ const bodyHTML = await context.page.evaluate(() => { }); ``` -The `context.page.evaluate()` call executes the provided function in the browser environment and passes back the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot simply use our existing variables, such as `context` inside of the evaluated function, because they are not available there. Different environments, different variables. +The `context.page.evaluate()` call executes the provided function in the browser environment and passes the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot use our existing variables, such as `context` inside of the evaluated function, because they are not available there. Different environments, different variables. _See the_ `page.evaluate()` _[documentation](https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) for info on how to pass variables from Node.js to browser._ @@ -102,7 +102,7 @@ await context.page.goto('https://some-new-page.com'); Some very useful scraping techniques revolve around listening to network requests and responses and even modifying them on the fly. Web Scraper's page function doesn't have access to the network, besides calling JavaScript APIs such as `fetch()`. Puppeteer Scraper, on the other hand, has full control over the browser's network activity. -With a simple call, you can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console. +You can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console. ```js context.page.on('request', (req) => console.log(req.url())); @@ -116,7 +116,7 @@ _Explaining how to do interception properly is out of scope of this article. See A large number of websites use either form submissions or JavaScript redirects for navigation and displaying of data. With Web Scraper, you cannot crawl those websites, because there are no links to find and enqueue on those pages. Puppeteer Scraper enables you to automatically click all those elements that cause navigation, intercept the navigation requests and enqueue them to the request queue. -If it seems complicated, don't worry. We've abstracted all the complexity away into a simple `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`. +If it seems complicated, don't worry. We've abstracted all the complexity away into a `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`. _The_ `Clickable elements selector` _will also work on regular non-JavaScript links, however, it is significantly slower than using the plain_ `Link selector`_.
Unless you know you need it, use the_ `Link selector` _for best performance._ @@ -126,7 +126,7 @@ Since we're actually clicking in the page, which may or may not trigger some nas ## Plain form submit navigations -This is easy and will work out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can just set the `Clickable elements selector` and you're good to go: +This works out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can set the `Clickable elements selector` and you're good to go: ```js 'a[onclick ^= getPage]'; @@ -142,7 +142,7 @@ Those are similar to the ones above with an important caveat. Once you click the ## Frontend navigations -Websites often won't navigate away just to fetch the next set of results. They will do it in the background and just update the displayed data. To paginate websites like that is quite easy actually and it can be done in both Web Scraper and Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Just click the next button to load the next set of courses. +Websites often won't navigate away just to fetch the next set of results. They will do it in the background and update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Click the next button to load the next set of courses. ```js // Web Scraper\ @@ -170,6 +170,6 @@ And we're only scratching the surface here. ## Wrapping it up -Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. For basic scraping of simple websites Web Scraper is a great tool, because it goes right to the point and uses in-browser JavaScript which is well-known to millions of people, even non-developers. +Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. Web Scraper is a great tool for basic scraping, because it goes right to the point and uses in-browser JavaScript which is well-known to millions of people, even non-developers. Once you start hitting some roadblocks, you may find that Puppeteer Scraper is just what you need to overcome them. And if Puppeteer Scraper still doesn't cut it, there's still Apify SDK to rule them all. We hope you found this tutorial helpful and happy scraping. diff --git a/sources/academy/tutorials/php/index.md b/sources/academy/tutorials/php/index.md index 241dac6f7..dbf075161 100644 --- a/sources/academy/tutorials/php/index.md +++ b/sources/academy/tutorials/php/index.md @@ -12,4 +12,4 @@ slug: /php --- -This section contains web-scraping or web-scraping related tutorials for PHP. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or just need some general tips for scraping in Apify with PHP, this section is right for you. +This section contains web-scraping or web-scraping related tutorials for PHP. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Apify with PHP, this section is right for you. 
diff --git a/sources/academy/tutorials/php/using_apify_from_php.md b/sources/academy/tutorials/php/using_apify_from_php.md index 885354b3b..b6ea6425c 100644 --- a/sources/academy/tutorials/php/using_apify_from_php.md +++ b/sources/academy/tutorials/php/using_apify_from_php.md @@ -166,7 +166,7 @@ $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` -We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Downloading it is simple: +We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Let's download it: ```php // Don't forget to replace the @@ -230,9 +230,7 @@ $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ ## How to use Apify Proxy -A [proxy](/platform/proxy) is another important Apify feature you will need. Guzzle makes it easy to use. - -If you just want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. +Let's use another important feature: [proxy](/platform/proxy). If you want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. ```php $client = new \GuzzleHttp\Client([ diff --git a/sources/academy/tutorials/python/index.md b/sources/academy/tutorials/python/index.md index c01869468..ea2a5e088 100644 --- a/sources/academy/tutorials/python/index.md +++ b/sources/academy/tutorials/python/index.md @@ -12,4 +12,4 @@ slug: /python --- -This section contains various web-scraping or web-scraping related tutorials for Python. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or just need some general tips for scraping in Python, this section is right for you. +This section contains various web-scraping or web-scraping related tutorials for Python. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Python, this section is right for you. diff --git a/sources/academy/tutorials/python/process_data_using_python.md b/sources/academy/tutorials/python/process_data_using_python.md index 785f66529..5e72eaddb 100644 --- a/sources/academy/tutorials/python/process_data_using_python.md +++ b/sources/academy/tutorials/python/process_data_using_python.md @@ -21,7 +21,7 @@ In this tutorial, we will use the Actor we created in the [previous tutorial](/a In the previous tutorial, we set out to select our next holiday destination based on the forecast of the upcoming weather there. We have written an Actor that scrapes the BBC Weather forecast for the upcoming two weeks for three destinations: Prague, New York, and Honolulu. It then saves the scraped data to a [dataset](/platform/storage/dataset) on the Apify platform. -Now, we need to process the scraped data and make a simple visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. +Now, we need to process the scraped data and make a visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. ### Setting up the Actor {#setting-up-the-actor} @@ -29,7 +29,7 @@ First, we need to create another Actor. 
You can do it the same way as before - g In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-parser`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. -First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -77,7 +77,7 @@ dataset_client = client.dataset(scraper_run['defaultDatasetId']) ### Processing the data -Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we just create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. +Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. ```py # Load the dataset items into a pandas dataframe diff --git a/sources/academy/tutorials/python/scrape_data_python.md b/sources/academy/tutorials/python/scrape_data_python.md index 7be6866f6..47a7ae39a 100644 --- a/sources/academy/tutorials/python/scrape_data_python.md +++ b/sources/academy/tutorials/python/scrape_data_python.md @@ -11,7 +11,7 @@ slug: /python/scrape-data-python --- -Web scraping is not limited to the JavaScript world. The Python ecosystem contains some pretty powerful scraping tools as well. One of those is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), a library for parsing HTML and easy navigation or modification of a DOM tree. +Web scraping is not limited to the JavaScript world. The Python ecosystem contains some pretty powerful scraping tools as well. One of those is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), a library for parsing HTML and navigating or modifying its DOM tree. This tutorial shows you how to write a Python [Actor](../../platform/getting_started/actors.md) for scraping the weather forecast from [BBC Weather](https://www.bbc.com/weather) and process the scraped data using [Pandas](https://pandas.pydata.org/). @@ -61,7 +61,7 @@ First, we need to create a new Actor. To do this, go to [Apify Console](https:// In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-scraper`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`.
-First we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `requests` package for downloading the BBC Weather pages, and the `beautifulsoup4` package for parsing and processing the downloaded pages. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `requests` package for downloading the BBC Weather pages, and the `beautifulsoup4` package for parsing and processing the downloaded pages. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -221,7 +221,7 @@ Earlier in this tutorial, we learned how to scrape data from the web in Python u In the previous tutorial, we set out to select our next holiday destination based on the forecast of the upcoming weather there. We have written an Actor that scrapes the BBC Weather forecast for the upcoming two weeks for three destinations: Prague, New York, and Honolulu. It then saves the scraped data to a [dataset](/platform/storage/dataset) on the Apify platform. -Now, we need to process the scraped data and make a simple visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. +Now, we need to process the scraped data and make a visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. ### Setting up the Actor {#setting-up-the-actor} @@ -229,7 +229,7 @@ First, we need to create another Actor. You can do it the same way as before - g In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-parser`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. -First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't particularly care about the specific versions of these packages, so we just list them in the file: +First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't care about versions of these packages, so we list just their names: ```py # Add your dependencies here. @@ -277,7 +277,7 @@ dataset_client = client.dataset(scraper_run['defaultDatasetId']) ### Processing the data -Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we just create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. +Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. 
```py # Load the dataset items into a pandas dataframe diff --git a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md index badda5bb0..a1f9b42a1 100644 --- a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md +++ b/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md @@ -30,7 +30,7 @@ This is usually the first solution that comes to mind. You traverse the smallest 1. Any subcategory might be bigger than the pagination limit. 2. Some listings from the parent category might not be present in any subcategory. -While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is just not enough. It can be used as a first step of the solution but not as the solution itself. +While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is not enough. It can be used as a first step of the solution but not as the solution itself. ### Using filters {#using-filters} @@ -49,7 +49,7 @@ This has several benefits: 1. All listings can eventually be found in a range. 2. The ranges do not overlap, so we scrape the smallest possible number of pages and avoid duplicate listings. -3. Ranges can be controlled by a generic algorithm that is simple to reuse for different sites. +3. Ranges can be controlled by a generic algorithm that can be reused for different sites. ## Splitting pages with range filters {#splitting-pages-with-range-filters} @@ -59,7 +59,7 @@ In the previous section, we analyzed different options to split the pages to ove ### The algorithm {#the-algorithm} -The core algorithm is simple and can be used on any (even overlapping) range. This is a simplified presentation, we will discuss the details later. +The core algorithm can be used on any (even overlapping) range. This is a simplified presentation, we will discuss the details later. 1. We choose a few pivot ranges with a similar number of products and enqueue them. For example, **$0-$10**, **$100-$1000**, **$1000-$10000**, **$10000-**. 2. For each range, we open the page and check if the listings are below the limit. If yes, we continue to step 3. If not, we split the filter in half, e.g. **$0-$10** to **$0-$5** and **$5-$10** and enqueue those again. We recursively repeat step **2** for each range as long as needed. @@ -83,7 +83,7 @@ If the website supports only overlapping ranges (e.g. **$0-$5**, **$5–10**), i In rare cases, a listing can have more than one value that you are filtering in a range. A typical example is Amazon, where each product has several offers and those offers have different prices. If any of those offers is within the range, the product is shown. -No easy way exists to get around this but the price range split works even with duplicate listings, just use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or request queue to deduplicate them. +No easy way exists to get around this but the price range split works even with duplicate listings, use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or request queue to deduplicate them. #### How is the range passed to the URL? 
{#how-is-the-range-passed-to-the-url} @@ -97,7 +97,7 @@ In addition, XHRs are smaller and faster than loading an HTML page. On the other #### Does the website show the number of products for each filtered page? {#does-the-website-show-the-number-of-products-for-each-filtered-page} -If it does, it is a nice bonus. It gives us an easy way to check if we are over or below the pagination limit and helps with analytics. +If it does, it's a nice bonus. It gives us a way to check if we are over or below the pagination limit and helps with analytics. If it doesn't, we have to find a different way to check if the number of listings is within a limit. One option is to go to the last allowed page of the pagination. If that page is still full of products, we can assume the filter is over the limit. @@ -105,7 +105,7 @@ If it doesn't, we have to find a different way to check if the number of listing Logically, every full (price) range starts at 0 and ends at infinity. But the way this is encoded will differ on each site. The end of the price range can be either closed (0) or open (infinity). Open ranges require special handling when you split them (we will get to that). -Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally you can just assign `null` to the value. +Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally you can assign `null` to the value. Here are a few examples of a query parameter with an open and closed high-end range: @@ -120,7 +120,7 @@ In this rare case, you will need to add another range or other filters to combin ### Implementing a range filter {#implementing-a-range-filter} -This section shows a simple code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems but the previous section should prepare you for some of them. +This section shows a code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems but the previous section should prepare you for some of them. 
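Before the full example, here is a minimal, hypothetical sketch of the range-splitting step described above. The `splitFilter` helper below is an assumption for illustration only (integer prices, `null` for an open-ended maximum); the full example linked at the end of this article defines its own version:

```js
// Hypothetical helper: split a price range in half.
// `max: null` stands for an open-ended (infinite) upper bound.
const splitFilter = ({ min, max }) => {
    // For an open range we have to guess a pivot above the minimum.
    const middle = max === null ? (min * 2 || 100) : Math.floor((min + max) / 2);
    return [
        { min, max: middle },
        { min: middle + 1, max },
    ];
};

console.log(splitFilter({ min: 0, max: 10 }));
// [ { min: 0, max: 5 }, { min: 6, max: 10 } ]
console.log(splitFilter({ min: 100, max: null }));
// [ { min: 100, max: 200 }, { min: 201, max: null } ]
```

The two returned sub-ranges are what the crawler enqueues again until every range fits under the pagination limit.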
First, let's define our imaginary site: @@ -144,7 +144,7 @@ await Actor.init(); const MAX_PRODUCTS_PAGINATION = 1000; -// These is just an example, choose what makes sense for your site +// Just an example, choose what makes sense for your site const PIVOT_PRICE_RANGES = [ { min: 0, max: 9.99 }, { min: 10, max: 99.99 }, @@ -208,7 +208,7 @@ const crawler = new CheerioCrawler({ // The filter is either good enough of we have to split it if (numberOfProducts <= MAX_PRODUCTS_PAGINATION) { - // We just pass the URL for scraping, we could optimize it so the page is not opened again + // We pass the URL for scraping, we could optimize it so the page is not opened again await crawler.addRequests([{ url: `${request.url}&page=1`, userData: { label: 'PAGINATION' }, @@ -268,7 +268,7 @@ const { min, max } = getFiltersFromUrl(request.url); // Our generic splitFilter function doesn't account for decimal values so we will have to convert to cents and back to dollars const newFilters = splitFilter({ min: min * 100, max: max * 100 }); -// And we just enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase +// And we enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase const requestsToEnqueue = []; for (const filter of newFilters) { requestsToEnqueue.push({ @@ -283,7 +283,7 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant solution to a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). diff --git a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md index 4d5290778..601536ddb 100644 --- a/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md +++ b/sources/academy/webscraping/advanced_web_scraping/tips_and_tricks_robustness.md @@ -36,7 +36,7 @@ async function isPaymentSuccessful() { } ``` -**Avoid**: Relying on the absence of an element that may have been simply updated or changed. +**Avoid**: Relying on the absence of an element that may have been updated or changed. ```js async function isPaymentSuccessful() { @@ -80,7 +80,7 @@ async function submitPayment() { } ``` -**Avoid**: Not verifying an outcome. It can easily fail despite output claiming otherwise. +**Avoid**: Not verifying an outcome. It can fail despite output claiming otherwise. 
```js async function submitPayment() { diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index 8446d0308..f7dcaa562 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -12,15 +12,15 @@ slug: /anti-scraping --- -If at any point in time you've strayed away from the Academy's demo content, and into the wild west by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. +If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more. -In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as simply doing this can fix blocking issues on the majority of websites. +In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as doing this can fix blocking issues on the majority of websites. ## Quick start {#quick-start} -If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you just need to get unblocked ASAP, here are some quick tips: +If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you need to get unblocked ASAP, here are some quick tips: - Use high-quality proxies. [Residential proxies](/platform/proxy/residential-proxy) are the least blocked. You can find many providers out there like Apify, BrightData, Oxylabs, NetNut, etc. - Set **real-user-like HTTP settings** and **browser fingerprints**. [Crawlee](https://crawlee.dev/) uses statistically generated realistic HTTP headers and browser fingerprints by default for all of its crawlers. @@ -65,8 +65,8 @@ Unfortunately for these websites, they have to make compromises and tradeoffs. W Anti-scraping protections can work on many different layers and use a large amount of bot-identification techniques. 1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate a different IP addresses but their quality matters a lot. -2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, cyphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration). -3. **What you are scraping** - The same data can be extracted in many ways from a website. You can just get the inital HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently. +2. 
**How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration). +3. **What you are scraping** - The same data can be extracted in many ways from a website. You can get the initial HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently. 4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses. These are the 4 main principles that anti-scraping protections are based on. @@ -91,7 +91,7 @@ A common workflow of a website after it has detected a bot goes as follows: 2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot. Typically a **captcha**. If the bot succeeds, it is added to the whitelist. 3. If the captcha is failed, the bot is added to the blacklist. -One thing to keep in mind while navigating through this course is that advanced scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations. +One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but also through more complex things such as header combinations. Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them. @@ -107,11 +107,11 @@ Although the talk, given in 2021, features some outdated code examples, it still Because we here at Apify scrape for a living, we have discovered many popular and niche anti-scraping techniques. We've compiled them into a short and comprehensible list here to help understand the roadblocks before this course teaches you how to get around them. -> However, not all issues you encounter are caused by anti-scraping systems. Sometimes, it's just a simple configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). +> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a configuration issue. Learn [how to effectively debug your programs here](/academy/node-js/analyzing-pages-and-fixing-errors). ### IP rate-limiting -This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address. +This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than a defined number of requests from one IP address in a certain time span.
If the max-request number is low, then there is a high potential for false positives, because an IP address is rarely unique to one user; in large companies, for example, hundreds of employees can share the same IP address. > Learn more about rate limiting [here](./techniques/rate_limiting.md) diff --git a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md index 5ba143ef4..6616be75b 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md @@ -1,17 +1,17 @@ --- title: Generating fingerprints -description: Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page. +description: Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page. sidebar_position: 3 slug: /anti-scraping/mitigation/generating-fingerprints --- # Generating fingerprints {#generating-fingerprints} -**Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.** +**Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.** --- -In [**Crawlee**](https://crawlee.dev), it's extremely easy to automatically generate fingerprints using the [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler. +In [**Crawlee**](https://crawlee.dev), you can use [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler to automatically generate fingerprints. ```js import { PlaywrightCrawler } from 'crawlee'; @@ -33,7 +33,7 @@ const crawler = new PlaywrightCrawler({ ## Using the fingerprint-generator package {#using-fingerprint-generator} -Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-suite) npm package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can easily generate browser fingerprints. +Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-suite) npm package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can generate browser fingerprints. > It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results. diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md index b328cfebe..e1109acff 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md @@ -22,21 +22,31 @@ There are a few factors that determine the quality of a proxy IP: - How long was the proxy left to "heal" before it was resold? - What is the quality of the underlying server of the proxy?
(latency) -Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer just rely on simple IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. +Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer rely on IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md). -## A bit about proxy links {#understanding-proxy-links} +## About proxy links {#understanding-proxy-links} -When using proxies in your crawlers, you'll most likely be using them in a format that looks like this: +To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials. ```text http://proxy.example.com:8080 ``` -This link is separated into two main components: the **host**, and the **port**. In our case, our hostname is `http://proxy.example.com`, and our port is `8080`. Sometimes, a proxy might use an IP address as the host, such as `103.130.104.33`. +The proxy link above has several parts: -If authentication (a username and a password) is required, the format will look a bit different: +- `http://` tells us we're using the HTTP protocol, +- `proxy.example.com` is the hostname, i.e. the address of the proxy server, +- `8080` is the port number. + +Sometimes the proxy server has no name, so the link contains an IP address instead: + +```text +http://103.130.104.33:8080 +``` + +If the proxy requires authentication, the proxy link can also contain a username and password: ```text http://USERNAME:PASSWORD@proxy.example.com:8080 diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md index cb0e21f38..d32cf122d 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md @@ -1,19 +1,19 @@ --- title: Using proxies -description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies.
+description: Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies. sidebar_position: 2 slug: /anti-scraping/mitigation/using-proxies --- # Using proxies {#using-proxies} -**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to easily obtain pools of proxies.** +**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.** --- In the [**Web scraping for beginners**](../../web_scraping_for_beginners/crawling/pro_scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg. -Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool. +Because proxies are so widely used in the scraping world, Crawlee has built-in features for implementing them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool. ## Implementing proxies in a scraper {#implementing-proxies} @@ -53,8 +53,8 @@ const crawler = new CheerioCrawler({ await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/on-sale', - // By labeling the Request, we can very easily - // identify it later in the requestHandler. + // By labeling the Request, we can identify it + // later in the requestHandler. label: 'START', }]); @@ -103,7 +103,7 @@ That's it! The crawler will now automatically rotate through the proxies we prov ## A bit about debugging proxies {#debugging-proxies} -At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request. +At the time of writing, the scraper above utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request. 
```js const crawler = new CheerioCrawler({ diff --git a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md index 3a606317b..521d11c9c 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md +++ b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md @@ -13,7 +13,7 @@ slug: /anti-scraping/techniques/browser-challenges Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a JavaScript program that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as User-Agent, video card, and number of CPU cores available. Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](./fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](./fingerprinting.md#from-audiocontext). We were covering the details in the previous [fingerprinting](./fingerprinting.md) lesson. -While some browser challenges are relatively straightforward - for example, just loading an image and checking if it renders correctly - others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks. +While some browser challenges are relatively straightforward - for example, loading an image and checking if it renders correctly - others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks. Many online protections incorporate browser challenges into their security measures, but the specific techniques used can vary. diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index b666f178a..1fadca91f 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -13,7 +13,7 @@ slug: /anti-scraping/techniques/fingerprinting Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings. All of this information is called the **fingerprint** of the browser, and the act of collecting it is called **fingerprinting**. -Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is easily accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique. +Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is accessible to websites! 
Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique. Based on [research](https://www.eff.org/press/archives/2010/05/13) carried out by the Electronic Frontier Foundation, 84% of collected fingerprints are globally exclusive, and they found that the next 9% were in sets with a size of two. They also stated that even though fingerprints are dynamic, new ones can be matched up with old ones with 99.1% correctness. This makes fingerprinting a very viable option for websites that want to track the online behavior of their users in order to serve hyper-personalized advertisements to them. In some cases, it is also used to aid in preventing bots from accessing the websites (or certain sections of it). @@ -103,7 +103,7 @@ Here's an example of multiple WebGL scenes visibly being rendered differently on The [AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) API represents an audio-processing graph built from audio modules linked together, each represented by an [AudioNode](https://developer.mozilla.org/en-US/docs/Web/API/AudioNode) ([OscillatorNode](https://developer.mozilla.org/en-US/docs/Web/API/OscillatorNode)). -In the simplest cases, the fingerprint can be obtained by simply checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `` method: +In the simplest cases, the fingerprint can be obtained by checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `` method: 1. Audio is passed through an OscillatorNode. 2. The signal is processed and collected. @@ -176,7 +176,7 @@ The script is modified with some random JavaScript elements. Additionally, it al ### Data obfuscation -Two main data obfuscation techniues are widely employed: +Two main data obfuscation techniques are widely employed: 1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`. 2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect. diff --git a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md index 774b8f0d5..fe964d1c0 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md +++ b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md @@ -17,7 +17,7 @@ Geolocation is yet another way websites can detect and block access or show limi Certain websites might use certain location-specific/language-specific [headers](../../../glossary/concepts/http_headers.md)/[cookies](../../../glossary/concepts/http_cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/using-cloudfront-headers.html)). -On targets which are just utilizing cookies and headers to identify the location from which a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. 
+On targets which are utilizing just cookies and headers to identify the location a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. ## IP address {#ip-address} diff --git a/sources/academy/webscraping/anti_scraping/techniques/index.md b/sources/academy/webscraping/anti_scraping/techniques/index.md index 95ab678cf..b1dfdb3ee 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/index.md +++ b/sources/academy/webscraping/anti_scraping/techniques/index.md @@ -27,7 +27,7 @@ Probably the most common blocking method. The website gives you a chance to prov ## Redirect {#redirect} -Another common method is simply redirecting to the home page of the site (or a different location). +Another common method is redirecting to the home page of the site (or a different location). ## Request timeout/Socket hangup {#request-timeout} diff --git a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md index ce2c02976..ee239ea4c 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md @@ -11,13 +11,13 @@ --- -When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can easily monitor how many requests they receive from a single IP address, and block it or require a [captcha](./captchas.md) test to continue making requests. +When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can monitor how many requests they receive from a single IP address, and block it or require a [captcha](./captchas.md) test to continue making requests. In the past, most websites had their own anti-scraping solutions, the most common of which was IP address rate-limiting. In recent years, the popularity of third-party specialized anti-scraping providers has dramatically increased, but a lot of websites still use rate-limiting to only allow a certain number of requests per second/minute/hour to be sent from a single IP; therefore, crawler requests have the potential of being blocked entirely quite quickly. In cases when a higher number of requests is expected for the crawler, using a [proxy](../mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked. -## Dealing rate limiting with proxy/session rotating {#dealing-with-rate-limiting} +## Dealing with rate limiting by rotating proxy or session {#dealing-with-rate-limiting} The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](../mitigation/proxies.md) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted.
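To make the idea concrete, here is a minimal sketch of that rotation with Crawlee's `CheerioCrawler`, assuming a pool of placeholder proxy URLs; the `maxUsageCount` value is only an illustrative choice for how many requests a single session (and the proxy tied to it) should handle before being retired.

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs - swap in real ones from your provider.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Retire each session (and its proxy) after ~50 requests,
    // so no single IP sends a suspicious amount of traffic.
    sessionPoolOptions: { sessionOptions: { maxUsageCount: 50 } },
    async requestHandler({ request, proxyInfo }) {
        console.log(`${request.url} loaded via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);
```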
diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md index 7444657e2..8c96eb343 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md @@ -92,7 +92,7 @@ const response = await gotScraping({ For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman](../../../glossary/tools/postman.md) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud. -Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. +Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. Here is a way you could dynamically scrape the `client_id` using Puppeteer: diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index d2a87ffc9..22049bacf 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -17,11 +17,11 @@ If you've never dealt with it before, trying to scrape thousands to hundreds of ## Page-number pagination {#page-number} -The most common and rudimentary form of pagination is simply having page numbers, which can be compared to paginating through a typical e-commerce website. +The most common and rudimentary forms of pagination have page numbers. Imagine paginating through a typical e-commerce website. ![Amazon pagination](https://apify-docs.s3.amazonaws.com/master/docs/assets/tutorials/images/pagination.jpg) -This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails just incrementing up or down in order to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead. 
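As a rough sketch of the URL-parameter variant, here's how a loop over page numbers might look with `got-scraping` (used elsewhere in this course); the endpoint, the `page` parameter, and the `items` field are made up for the example.

```js
import { gotScraping } from 'got-scraping';

const results = [];

// Keep requesting pages until the API stops returning items.
for (let page = 1; page <= 100; page++) {
    const { body } = await gotScraping({
        url: `https://api.example.com/products?page=${page}`,
        responseType: 'json',
    });

    if (!body.items?.length) break;
    results.push(...body.items);
}

console.log(`Collected ${results.length} items`);
```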
+This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails incrementing up or down in order to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead. ## Offset pagination {#offset-pagination} @@ -37,7 +37,7 @@ If we were to make a request with the **limit** set to **5** and the **offset** ## Cursor pagination {#cursor-pagination} -Becoming more and more common is cursor-based pagination. Like with offset-based pagination, a **limit** parameter is usually present; however, instead of **offset**, **cursor** is used instead. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned back from the API will be records that come after the item matching the **cursor** parameter provided. +Sometimes pagination uses **cursor** instead of **offset**. Cursor is a marker of an item in the dataset. It can be a date, number, or a more or less random string of letters and numbers. Request with a **cursor** parameter will result in an API response containing items which follow after the item which the cursor points to. One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index bdf7d691e..d8909832b 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -21,7 +21,7 @@ _Here's what we can see in the Network tab after reloading the page:_ Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found. -> **Note:** The keyword/piece of data that is used in this filtered search should be a target keyword or a piece of target data that that can be assumed will most likely be a part of the endpoint. +> To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case a string `tracks`). After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks: @@ -37,13 +37,13 @@ Here's what our target endpoint's URL looks like coming directly from the Networ https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en ``` -Since our request doesn't have any body/payload, we just need to analyze the URL. We can break this URL down into chunks that help us understand what each value does. +Since our request doesn't have any body/payload, we need to analyze the URL. 
We can break this URL down into chunks that help us understand what each value does. ![Breaking down the request url into understandable chunks](./images/analyzing-the-url.png) -Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for easy pagination, or easy data-filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile. +Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for pagination, or data-filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile. -Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than just twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request: +Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request: ```text https://api-v2.soundcloud.com/users/141707/tracks?client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=99999 diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md index 0a433b31b..b7aaabeb3 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md @@ -42,9 +42,9 @@ Finally, create a file called **index.js**. This is the file we will be working ## Preparations {#preparations} -If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we just copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. +If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. -Since we know requests with this header are sent right when the front page is loaded, it can be farmed by simply visiting the page and intercepting requests in Puppeteer like so: +Since we know requests with this header are sent right when the front page is loaded, it can be farmed by visiting the page and intercepting requests in Puppeteer like so: ```js // scrapeAppToken.js @@ -71,7 +71,7 @@ const scrapeAppToken = async () => { await page.waitForNetworkIdle(); - // otherwise, just close the browser after networkidle + // otherwise, close the browser after networkidle // has been fired await browser.close(); @@ -135,7 +135,7 @@ query SearchQuery($query: String!) 
{ } ``` -The next step is to just fill out the fields we'd like back, and we've got our final query! +The next step is to fill out the fields we'd like back, and we've got our final query! ```graphql query SearchQuery($query: String!) { diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md index 79f3a995f..c6b9e328b 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md @@ -21,7 +21,7 @@ Not only does becoming comfortable with and understanding the ins and outs of us ! Cheddar website was changed and the below example no longer works there. Nonetheless, the general approach is still viable on some websites even though introspection is disabled on most. -In order to perform introspection on our [target website](https://cheddar.com), we just need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL: +In order to perform introspection on our [target website](https://cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](../../../glossary/tools/insomnia.md) or another HTTP client that supports GraphQL: > To make a GraphQL query in Insomnia, make sure you've set the HTTP method to **POST** and the request body type to **GraphQL Query**. @@ -132,7 +132,7 @@ The response body of our introspection query contains a whole lot of useful info ## Understanding the response {#understanding-the-response} -An introspection query's response body size will vary depending on how big the target API is. In our case, what we got back is a 27 thousand line JSON response 🤯 If you just thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestable visualization of the API. +An introspection query's response body size will vary depending on how big the target API is. In our case, what we got back is a 27 thousand line JSON response 🤯 If you thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestable visualization of the API. Let's copy the response to our clipboard by clicking inside of the response body and pressing **CMD** + **A**, then subsequently **CMD** + **C**. Now, we'll head over to [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) and click on **Change Schema**. In the modal, we'll click on the **Introspection** tab and paste our data into the text area. @@ -146,9 +146,9 @@ Now that we have this visualization to work off of, it will be much easier to bu ## Building a query {#building-a-query} -In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's just get our feet wet by using the data we have from GraphQL Voyager to build a simple query. 
+In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query. -Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After just a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out! +Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out! ![The media field pointing to datatype slugable](./images/media-field.jpg) @@ -181,7 +181,7 @@ Let's send it! Oh, okay. That didn't work. But **why**? -Rest assured, nothing is wrong with our query. We are most likely just missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there: +Rest assured, nothing is wrong with our query. We are most likely missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there: ![Request headers back on the Cheddar website](./images/cheddar-headers.jpg) diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md index 429cae9c5..9e8da6635 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md @@ -103,7 +103,7 @@ query SearchQuery($query: String!, $count: Int!, $cursor: String) { } ``` -If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole. Rather, you can just change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so: +If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole. Rather, you can change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so: ```json { @@ -112,8 +112,8 @@ If the query provided in the payload you find in the **Network** tab is good eno } ``` -Depending on the API, just doing this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or just to receive more data than is being provided in the response of the request found in the **Network** tab. 
This is what we will be discussing in the next lessons. +Depending on the API, doing just this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or to receive more data than is being provided in the response of the request found in the **Network** tab. This is what we will be discussing in the next lessons. ## Next up {#next} -In the [next lesson](./introspection.md) we will be walking you through how to learn about a GraphQL API before scraping it by using **introspection** (don't worry - it's a fancy word, but a simple concept). +In the [next lesson](./introspection.md) we will be walking you through how to learn about a GraphQL API before scraping it by using **introspection**. diff --git a/sources/academy/webscraping/api_scraping/index.md b/sources/academy/webscraping/api_scraping/index.md index a6467f804..ddf9609f1 100644 --- a/sources/academy/webscraping/api_scraping/index.md +++ b/sources/academy/webscraping/api_scraping/index.md @@ -59,7 +59,7 @@ Since the data is coming directly from the site's API, as opposed to the parsing ### 2. Configurable -Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to easily filter and/or limit data results. +Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to filter and/or limit data results. ### 3. Fast and efficient @@ -91,7 +91,7 @@ For complex APIs that require certain headers and/or payloads in order to make a APIs come in all different shapes and sizes. That means every API will vary in not only the quality of the data that it returns, but also the format that it is in. The two most common formats are JSON and HTML. -JSON responses are the most ideal, as they are easily manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be easily filtered and formatted to fit a scraper's output schema. +JSON responses are the ideal, as they can be manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be filtered and formatted to fit a scraper's output schema. APIs which output HTML generally return the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response. diff --git a/sources/academy/webscraping/puppeteer_playwright/browser.md b/sources/academy/webscraping/puppeteer_playwright/browser.md index 38170b380..7cbc78725 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser.md @@ -82,7 +82,7 @@ There are a whole lot more options that we can pass into the `launch()` function ## Browser methods {#browser-methods} -The `launch()` function also returns an object representation of the browser that we can use to interact with the browser right from our code. 
This **Browser** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-class-browser), [Playwright](https://playwright.dev/docs/api/class-browser)) has many functions which make it easy to do this. One of these functions is `close()`. Until now, we've been using **control^** + **C** to force quit the process, but with this function, we'll no longer have to do that. +The `launch()` function also returns a **Browser** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-class-browser), [Playwright](https://playwright.dev/docs/api/class-browser)), which is a representation of the browser. This object has many methods, which allow us to interact with the browser from our code. One of them is `close()`. Until now, we've been using **control^** + **C** to force quit the process, but with this function, we'll no longer have to do that. diff --git a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md index 3fd94756c..8891772f1 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md @@ -77,7 +77,7 @@ await browser.close(); ## Using browser contexts {#using-browser-contexts} -In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android: +In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device: diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md index fc10172f6..bb1767686 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md @@ -11,7 +11,7 @@ slug: /puppeteer-playwright/common-use-cases/downloading-files --- -Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it the easy way. However, there are different techniques that work (most of the time). +Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it in a straightforward way. However, there are different techniques that work (most of the time). These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on more complicated data export. 
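One technique that often works when there is no direct link is to listen for the response that carries the file and save its body. The sketch below uses Puppeteer's `response.buffer()`; the page URL, the button selector, and the `.csv` check are assumptions for illustration only.

```js
import fs from 'fs/promises';
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Save the response that looks like the exported file.
page.on('response', async (response) => {
    if (response.url().endsWith('.csv')) {
        await fs.writeFile('export.csv', await response.buffer());
    }
});

await page.goto('https://example.com/reports');
// Hypothetical button that triggers the export.
await page.click('button#export-csv');
await page.waitForNetworkIdle();

await browser.close();
```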
diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md index d4249ce37..c79c8ae33 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md @@ -11,7 +11,7 @@ slug: /puppeteer-playwright/common-use-cases --- -You can do just about anything with a headless browser, but, there are some extremely common use cases that are important to understand and be prepared for when you might run into them. This short section will be all about solving these common situations. Here's what we'll be covering: +You can do about anything with a headless browser, but, there are some extremely common use cases that are important to understand and be prepared for when you might run into them. This short section will be all about solving these common situations. Here's what we'll be covering: 1. Login flow (logging into an account) 2. Paginating through results on a website diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index e8e28d3cb..004a4a4e8 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website. Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a simple login flow from start to finish with Playwright or Puppeteer. +Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website. Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a login flow from start to finish with Playwright or Puppeteer. > In this lesson, we'll be using [yahoo.com](https://yahoo.com) as an example. Feel free to follow along using the academy Yahoo account credentials, or even deviate from the lesson a bit and try building a login flow for a different website of your choosing! @@ -124,9 +124,9 @@ const emailsToSend = [ ]; ``` -What we could do is log in 3 different times, then simply automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. +What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. 
When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](../../../glossary/concepts/http_cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. -With this knowledge of cookies, it can be concluded that we can just pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. +With this knowledge of cookies, it can be concluded that we can pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. ### Retrieving cookies {#retrieving-cookies} diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index 233048d99..008590485 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even just hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. +If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. ![Amazon pagination](../../advanced_web_scraping/images/pagination.png) diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md index bcc2e68a0..760baf0c4 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md @@ -17,7 +17,7 @@ Getting information from inside iFrames is a known pain, especially for new deve If you are using basic methods of page objects like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you are using with page object can be used the same way with frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use. -As a simple demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). +As a demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). 
```js import puppeteer from 'puppeteer'; diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md index 66ec431c2..1bb718d01 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md @@ -22,7 +22,7 @@ import * as fs from 'fs/promises'; import request from 'request-promise'; ``` -The actual downloading is slightly different for text and binary files. For a text file, it can simply be done like this: +The actual downloading is slightly different for text and binary files. For a text file, it can be done like this: ```js const fileData = await request('https://some-site.com/file.txt'); diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index 52b203897..8bb93b2b0 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -14,11 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. - -> Most web data extraction cases involve looping through a list of items of some sort. - -Playwright & Puppeteer offer two main methods for data extraction +Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../web_scraping_for_beginners/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction: 1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`. 2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio) @@ -142,7 +138,7 @@ This will output the same exact result as the code in the previous section. One of the most popular parsing libraries for Node.js is [Cheerio](https://www.npmjs.com/package/cheerio), which can be used in tandem with Playwright and Puppeteer. It is extremely beneficial to parse the page's HTML in the Node.js context for a number of reasons: -- You can easily port the code between headless browser data extraction and plain HTTP data extraction +- You can port the code between headless browser data extraction and plain HTTP data extraction - You don't have to worry in which context you're working (which can sometimes be confusing) - Errors are easier to handle when running in the base Node.js context @@ -306,4 +302,4 @@ await browser.close(); ## Next up {#next} -Our [next lesson](../reading_intercepting_requests.md) will be discussing something super cool - request interception and reading data from requests and responses. It's just like using DevTools, except programmatically! 
+Our [next lesson](../reading_intercepting_requests.md) will be discussing something super cool - request interception and reading data from requests and responses. It's like using DevTools, except programmatically! diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md index 3261e08a9..16a0a0479 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md @@ -71,7 +71,7 @@ await browser.close(); ## Exposing functions {#exposing-functions} -Here's a super awesome function we've created called `returnMessage()`, which simply returns the string **Apify Academy!**: +Here's a super awesome function we've created called `returnMessage()`, which returns the string **Apify Academy!**: ```js const returnMessage = () => 'Apify academy!'; diff --git a/sources/academy/webscraping/puppeteer_playwright/index.md b/sources/academy/webscraping/puppeteer_playwright/index.md index 672adfdc5..186b3b8d9 100644 --- a/sources/academy/webscraping/puppeteer_playwright/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/index.md @@ -17,7 +17,7 @@ import TabItem from '@theme/TabItem'; [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser. -> A headless browser is just a regular browser like the one you're using right now, but without the user-interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them. +> A headless browser is a regular browser like the one you're using right now, but without the user interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them. Both packages were developed by the same team and are very similar, which is why we have combined the Puppeteer course and the Playwright course into one super-course that shows code examples for both technologies. There are some small differences between the two, which will be highlighted in the examples. @@ -25,7 +25,7 @@ Both packages were developed by the same team and are very similar, which is why ## Advantages of using a headless browser {#advantages-of-headless-browsers} -When automating a headless browser, you can do a whole lot more in comparison to just making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. +When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped).
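As a tiny illustration of those capabilities, here's what typing, clicking, and screenshotting could look like in Playwright (the course shows both libraries); the page URL and selectors are placeholders, not a real site.

```js
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

await page.goto('https://example.com/search');

// The same actions a human would perform, driven from code.
await page.fill('input[name="q"]', 'hello world'); // placeholder selector
await page.click('button[type="submit"]'); // placeholder selector
await page.screenshot({ path: 'results.png' });

await browser.close();
```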
diff --git a/sources/academy/webscraping/puppeteer_playwright/page/index.md b/sources/academy/webscraping/puppeteer_playwright/page/index.md index 5344c0ed5..d96db0d00 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/index.md @@ -47,7 +47,7 @@ await browser.close(); -Then, we can visit a website with the `page.goto()` method. Let's go to [Google](https://google.com) for now. We'll also use the `page.waitForTimeout()` function, which will force the program to wait for a number of seconds before quitting (otherwise, everything will just flash before our eyes and we won't really be able to tell what's going on): +Then, we can visit a website with the `page.goto()` method. Let's go to [Google](https://google.com) for now. We'll also use the `page.waitForTimeout()` function, which will force the program to wait for a number of seconds before quitting (otherwise, everything will flash before our eyes and we won't really be able to tell what's going on): diff --git a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md index aaa689c1d..60456e2ce 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md @@ -26,7 +26,7 @@ Let's say that we want to automate searching for **hello world** on Google, then 6. Read the title of the clicked result's loaded page 7. Screenshot the page -Though it seems complex, the wonderful **Page** API makes all of these actions extremely easy to perform. +Though it seems complex, the wonderful **Page** API can help us with all the steps. ## Clicking & pressing keys {#clicking-and-pressing-keys} diff --git a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md index 6bba4dad6..7874517f6 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md @@ -14,9 +14,9 @@ import TabItem from '@theme/TabItem'; --- -Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-pagereloadoptions), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/#?product=Puppeteer&show=api-pagecontent). +Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/api/puppeteer.page.reload), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/api/puppeteer.page.content/). -Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to easily achieve both of those things. 
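The search-automation steps listed in the hunk above read well as a quick sketch. The selectors below are illustrative guesses only (Google changes its markup often, and the lesson also handles a cookie-consent dialog that this sketch skips):

```js
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.google.com/');

// Click the search box, type the query, and submit it with the Enter key.
// 'textarea[name="q"]' is an assumption about Google's current markup.
await page.click('textarea[name="q"]');
await page.keyboard.type('hello world');
await page.keyboard.press('Enter');

// Give the results page a moment before closing, so we can see what happened.
await page.waitForTimeout(5000);
await browser.close();
```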
+Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to achieve both of those things. ## Grabbing the title {#grabbing-the-title} diff --git a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md index fe3eae068..a47697d48 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md @@ -58,7 +58,7 @@ Now, we won't see the error message anymore, and the first result will be succes If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-pagewaitfornavigationoptions). -> A navigation is simply when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. +> A navigation is when a new [page load](../../../glossary/concepts/dynamic_pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. Naively, you might immediately think that this is the way we should wait for navigation after clicking the first result: diff --git a/sources/academy/webscraping/puppeteer_playwright/proxies.md b/sources/academy/webscraping/puppeteer_playwright/proxies.md index 8b3d3532a..60c1d0441 100644 --- a/sources/academy/webscraping/puppeteer_playwright/proxies.md +++ b/sources/academy/webscraping/puppeteer_playwright/proxies.md @@ -169,7 +169,7 @@ const browser = await puppeteer.launch({ -However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed into the `page.authenticate()` prior to any navigations being made, while in Playwright they just need to be passed into the **proxy** option object. +However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed to the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed to the **proxy** option object. diff --git a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md index 0d2c2449a..2d7d4d007 100644 --- a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md +++ b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md @@ -231,7 +231,7 @@ Upon running this code, we'll see the API response logged into the console: One of the most popular ways of speeding up website loading in Puppeteer and Playwright is by blocking certain resources from loading. These resources are usually CSS files, images, and other miscellaneous resources that aren't super necessary (mainly because the computer doesn't have eyes - it doesn't care how the website looks!). 
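Blocking usually comes down to matching each request's URL against a list of file extensions and aborting anything that matches. A minimal Puppeteer sketch, with an arbitrary extension list (the lesson's own list may differ):

```js
import puppeteer from 'puppeteer';

// An arbitrary choice of extensions to block - adjust to your own needs.
const blockedExtensions = ['.png', '.jpg', '.jpeg', '.svg', '.css'];

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Interception must be enabled first; afterwards every request is either
// aborted or explicitly allowed to continue inside the 'request' listener.
await page.setRequestInterception(true);
page.on('request', (request) => {
    if (blockedExtensions.some((ext) => request.url().endsWith(ext))) request.abort();
    else request.continue();
});

await page.goto('https://demo-webstore.apify.org/');
await browser.close();
```

In Playwright, the same idea is expressed with `page.route()` and the **Route** object instead, as described next.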
-In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. +In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked file extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. With Playwright, request interception is a bit different. We use the [`page.route()`](https://playwright.dev/docs/api/class-page#page-route) function instead of `page.on()`, passing in a string, regular expression, or a function that will match the URL of the request we'd like to read from. The second parameter is also a callback function, but with the [**Route**](https://playwright.dev/docs/api/class-route) object passed into it instead. diff --git a/sources/academy/webscraping/switching_to_typescript/enums.md b/sources/academy/webscraping/switching_to_typescript/enums.md index 252ff2cc9..8f3b81bb4 100644 --- a/sources/academy/webscraping/switching_to_typescript/enums.md +++ b/sources/academy/webscraping/switching_to_typescript/enums.md @@ -1,13 +1,13 @@ --- title: Enums -description: Learn how to easily define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table. +description: Learn how to define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table. sidebar_position: 7.4 slug: /switching-to-typescript/enums --- # Enums! {#enums} -**Learn how to easily define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table.** +**Learn how to define, use, and manage constant values using a cool feature called "enums" that TypeScript brings to the table.** --- @@ -66,7 +66,7 @@ Because of the custom type definition for `fileExtensions` and the type annotati ## Creating enums {#creating-enums} -The [`enum`](https://www.typescriptlang.org/docs/handbook/enums.html) keyword is a new keyword brought to us by TypeScript that allows us the same functionality we implemented in the above section, plus more. To create one, simply use the keyword followed by the name you'd like to use (the naming convention is generally **CapitalizeEachFirstLetterAndSingular**). +The [`enum`](https://www.typescriptlang.org/docs/handbook/enums.html) keyword is a new keyword brought to us by TypeScript that allows us the same functionality we implemented in the above section, plus more. To create one, use the keyword followed by the name you'd like to use (the naming convention is generally **CapitalizeEachFirstLetterAndSingular**). ```ts enum FileExtension { @@ -80,7 +80,7 @@ enum FileExtension { ## Using enums {#using-enums} -Using enums is straightforward. Simply use dot notation as you normally would with a regular object. +Using enums is straightforward. Use dot notation as you would with a regular object. 
```ts enum FileExtension { diff --git a/sources/academy/webscraping/switching_to_typescript/installation.md b/sources/academy/webscraping/switching_to_typescript/installation.md index 361bb16ba..330cd794d 100644 --- a/sources/academy/webscraping/switching_to_typescript/installation.md +++ b/sources/academy/webscraping/switching_to_typescript/installation.md @@ -85,7 +85,7 @@ Let's create a folder called **learning-typescript**, adding a new file within i ![Example pasted into first-lines.ts](./images/pasted-example.png) -As seen above, TypeScript has successfully recognized our code; however, there are now red underlines under the `price1` and `price2` parameters in the function declaration of `addPrices`. This is because right now, the compiler has no idea what data types we're expecting to be passed in. This can be solved with the simple addition of **type annotations** to the parameters by using a colon (`:`) and the name of the parameter's type. +As seen above, TypeScript has successfully recognized our code; however, there are now red underlines under the `price1` and `price2` parameters in the function declaration of `addPrices`. This is because right now, the compiler has no idea what data types we're expecting to be passed in. This can be solved with the addition of **type annotations** to the parameters by using a colon (`:`) and the name of the parameter's type. ```ts const products = [ diff --git a/sources/academy/webscraping/switching_to_typescript/interfaces.md b/sources/academy/webscraping/switching_to_typescript/interfaces.md index e210c19e6..b83cd9670 100644 --- a/sources/academy/webscraping/switching_to_typescript/interfaces.md +++ b/sources/academy/webscraping/switching_to_typescript/interfaces.md @@ -29,7 +29,7 @@ We can keep this just as it is, which would be totally okay, or we could use an > When working with object types, it usually just comes down to preference whether you decide to use an interface or a type alias. -Using the `interface` keyword, we can easily turn our `Person` type into an interface. +Using the `interface` keyword, we can turn our `Person` type into an interface. ```ts // Interfaces don't need an "=" sign diff --git a/sources/academy/webscraping/switching_to_typescript/mini_project.md b/sources/academy/webscraping/switching_to_typescript/mini_project.md index 211432d38..014bc2cd6 100644 --- a/sources/academy/webscraping/switching_to_typescript/mini_project.md +++ b/sources/academy/webscraping/switching_to_typescript/mini_project.md @@ -21,7 +21,7 @@ Here's a rundown of what our project should be able to do: 2. Fetch the data and get full TypeScript support on the response object (no `any`!). 3. Sort and modify the data, receiving TypeScript support for the new modified data. 
-We'll be using a single external package called [**Axios**](https://www.npmjs.com/package/axios) to easily fetch the data from the API, which can be installed with the following command: +We'll be using a single external package called [**Axios**](https://www.npmjs.com/package/axios) to fetch the data from the API, which can be installed with the following command: ```shell npm i axios diff --git a/sources/academy/webscraping/switching_to_typescript/type_aliases.md b/sources/academy/webscraping/switching_to_typescript/type_aliases.md index 1b70f926c..ba8910270 100644 --- a/sources/academy/webscraping/switching_to_typescript/type_aliases.md +++ b/sources/academy/webscraping/switching_to_typescript/type_aliases.md @@ -72,7 +72,7 @@ console.log(returnValueAsString(myValue)); ## Function types {#function-types} -Before we learn about how to write function types, let's learn about a problem they can solve. We have a simple function called `addAll` which takes in array of numbers, adds them all up, and then returns the result. +Before we learn about how to write function types, let's learn about a problem they can solve. We have a function called `addAll` which takes in array of numbers, adds them all up, and then returns the result. ```ts const addAll = (nums: number[]) => { diff --git a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md index 080fc907d..1c961e072 100644 --- a/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md +++ b/sources/academy/webscraping/switching_to_typescript/unknown_and_type_assertions.md @@ -80,7 +80,7 @@ This works, and in fact, it's the most optimal solution for this use case. But w ## Type assertions {#type-assertions} -Despite the fancy name, [type assertions](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#type-assertions) are a simple concept based around a single keyword: `as`. We usually use this on values that we can't control the return type of, or values that we're sure have a certain type, but TypeScript needs a bit of help understanding that. +Despite the fancy name, [type assertions](https://www.typescriptlang.org/docs/handbook/2/everyday-types.html#type-assertions) are a concept based around a single keyword: `as`. We usually use this on values that we can't control the return type of, or values that we're sure have a certain type, but TypeScript needs a bit of help understanding that. ```ts @@ -107,7 +107,7 @@ let job: undefined | string; const chars = job.split(''); ``` -TypeScript will yell at you when trying to compile this code, stating that **Object is possibly 'undefined'**, which is true. In order to assert that `job` will not be `undefined` in this case, we can simply add an exclamation mark before the dot. +TypeScript will yell at you when trying to compile this code, stating that **Object is possibly 'undefined'**, which is true. To assert that `job` will not be `undefined` in this case, we can add an exclamation mark before the dot. 
```ts let job: undefined | string; diff --git a/sources/academy/webscraping/switching_to_typescript/using_types_continued.md b/sources/academy/webscraping/switching_to_typescript/using_types_continued.md index ab68ab5e7..eda345b70 100644 --- a/sources/academy/webscraping/switching_to_typescript/using_types_continued.md +++ b/sources/academy/webscraping/switching_to_typescript/using_types_continued.md @@ -84,7 +84,7 @@ const course2 = { }; ``` -Then, in the type definition, we can add a `typesLearned` key. Then, by simply writing the type that the array's elements are followed by two square brackets (`[]`), we can form an array type. +Then, in the type definition, we can add a `typesLearned` key. Then, by writing the type that the array's elements are followed by two square brackets (`[]`), we can form an array type. ```ts const course: { diff --git a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md index a6fb8ec61..b3e1540cc 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/best_practices.md @@ -90,7 +90,7 @@ When allowing your users to pass input properties which could break the scraper Validate the input provided by the user! This should be the very first thing your scraper does. If the fields in the input are missing or in an incorrect type/format, either parse the value and correct it programmatically or throw an informative error telling the user how to fix the error. -> On the Apify platform, you can use the [input schema](../../platform/deploying_your_code/input_schema.md) to both easily validate inputs and generate a clean UI for those using your scraper. +> On the Apify platform, you can use the [input schema](../../platform/deploying_your_code/input_schema.md) to both validate inputs and generate a clean UI for those using your scraper. ## Error handling {#error-handling} @@ -124,7 +124,7 @@ This really depends on your use case though. If you want 100% clean data, you mi ## Recap {#recap} -Wow, that's a whole lot of things to abide by! How will you remember all of them? Well, to simplify everything, just try to follow these three points: +Wow, that's a whole lot of things to abide by! How will you remember all of them? Try to follow these three points: 1. Describe your code as you write it with good naming, constants, and comments. It **should read like a book**. 2. Add log messages at points throughout your code so that when it's running, you (and everyone else) know what's going on. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md index a0e351b68..0db3f48a1 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/index.md @@ -85,4 +85,4 @@ From this course, you should have all the knowledge to build this scraper by you The challenge can be completed using either [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [PlaywrightCrawler](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler). Playwright is significantly slower but doesn't get blocked as much. You will learn the most by implementing both. 
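To make the choice concrete, here is the bare shape of a `CheerioCrawler`, not a challenge solution; the start URL and the `title` extraction are placeholders, and swapping in `PlaywrightCrawler` mostly means receiving a `page` (or `parseWithCheerio`) in the handler instead of `$`.

```js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // Cheerio's $ is available directly, no browser involved.
        console.log(`Scraping ${request.url}: ${$('title').text()}`);
        await Dataset.pushData({ url: request.url });
        // Follow links found on the page (Crawlee defaults to same-hostname links).
        await enqueueLinks();
    },
});

// Placeholder start URL - the challenge uses its own.
await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```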
-Let's start off this section easy by [initializing and setting up](./initializing_and_setting_up.md) our project with the Crawlee CLI (don't worry, no additional installation is required). +Let's start off this section by [initializing and setting up](./initializing_and_setting_up.md) our project with the Crawlee CLI (don't worry, no additional installation is required). diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md index 7021eff60..c0cf40bc1 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/initializing_and_setting_up.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/challenge/initializing-and-setting-up --- -The Crawlee CLI makes it extremely easy for us to set up a project in Crawlee and hit the ground running. Navigate to the directory you'd like your project's folder to live, then open up a terminal instance and run the following command: +The Crawlee CLI speeds up the process of setting up a Crawlee project. Navigate to the directory you'd like your project's folder to live, then open up a terminal instance and run the following command: ```shell npx crawlee create amazon-crawler diff --git a/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md b/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md index eebeb0962..de17ebc4f 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/challenge/scraping_amazon.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/challenge/scraping-amazon --- -In our quick chat about modularity, we finished the code for the results page and added a request for each product to the crawler's **RequestQueue**. Here, we just need to scrape the description, so it shouldn't be too hard: +In our quick chat about modularity, we finished the code for the results page and added a request for each product to the crawler's **RequestQueue**. Here, we need to scrape the description, so it shouldn't be too hard: ```js // routes.js @@ -99,7 +99,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => { ## Final code {#final-code} -That should be it! Let's just make sure we've all got the same code: +That should be it! Let's make sure we've all got the same code: ```js // constants.js diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md index 8113e34d2..31969fc26 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/exporting_data.md @@ -20,7 +20,7 @@ But when we look inside the folder, we see that there are a lot of files, and we ## Exporting data to CSV {#export-csv} -Crawlee's `Dataset` provides an easy way to export all your scraped data into one big CSV file. You can then open it in Excel or any other data processor. To do that, you simply need to call [`Dataset.exportToCSV()`](https://crawlee.dev/api/core/class/Dataset#exportToCSV) after collecting all the data. That means, after your crawler run finishes. 
+Crawlee's `Dataset` provides a way to export all your scraped data into one big CSV file. You can then open it in Excel or any other data processor. To do that, you need to call [`Dataset.exportToCSV()`](https://crawlee.dev/api/core/class/Dataset#exportToCSV) after collecting all the data. That means, after your crawler run finishes. ```js title=browser.js // ... @@ -43,7 +43,7 @@ After you add this one line and run the code, you'll find your CSV with all the ## Exporting data to JSON {#export-json} -Exporting to JSON is very similar to exporting to CSV, we just need to use a different function: [`Dataset.exportToJSON`](https://crawlee.dev/api/core/class/Dataset#exportToJSON). Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results. +Exporting to JSON is very similar to exporting to CSV, but we'll use a different function: [`Dataset.exportToJSON`](https://crawlee.dev/api/core/class/Dataset#exportToJSON). Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results. ```js title=browser.js // ... diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md index 551947377..34d4961aa 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/filtering_links.md @@ -18,7 +18,7 @@ Web pages are full of links, but frankly, most of them are useless to us when sc ## Filtering with unique CSS selectors {#css-filtering} -In the previous lesson, we simply grabbed all the links from the HTML document. +In the previous lesson, we grabbed all the links from the HTML document. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md index 77f8c3c88..1e74eddf2 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/finding_links.md @@ -25,9 +25,7 @@ On a webpage, the link above will look like this: [This is a link to example.com ## Extracting links 🔗 {#extracting-links} -So, if a link is just an HTML element, and the URL is just an attribute, this means that we can extract links exactly the same way as we extracted data.💡 Easy! - -To test this theory in the browser, we can try running the following code in our DevTools console on any website. +If a link is an HTML element, and the URL is an attribute, this means that we can extract links the same way as we extracted data. To test this theory in the browser, we can try running the following code in our DevTools console on any website. ```js // Select all the elements. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md index 588f4177f..9123e9dab 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/first_crawl.md @@ -107,7 +107,7 @@ for (const url of productUrls) { console.log(productPageTitle); } catch (error) { // In the catch block, we handle errors. 
- // This time, we will just print + // This time, we will print // the error message and the url. console.error(error.message, url); } diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md index bdfc3d13e..ef3fdd00b 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md @@ -14,13 +14,13 @@ import TabItem from '@theme/TabItem'; --- -A headless browser is simply a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/). **Crawlee** supports both. +A headless browser is a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/). **Crawlee** supports both. ## Building a Playwright scraper {#playwright-scraper} > Our focus will be on Playwright, which boasts additional features and better documentation. Notably, it originates from the same team responsible for Puppeteer. -Building a Playwright scraper with Crawlee is extremely easy. To show you how easy it really is, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines of code, we'll turn it into a full headless scraper. +Crawlee has a built-in support for building Playwright scrapers. Let's reuse code of the Cheerio scraper from the previous lesson. It'll take us just a few changes to turn it into a full headless scraper. First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large as it bundles all the browsers. @@ -85,7 +85,7 @@ The `parseWithCheerio` function is available even in `CheerioCrawler` and all th When you run the code with `node browser.js`, you'll see a browser window open and then the individual pages getting scraped, each in a new browser tab. -So, that's it. In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs exactly the same as before, but uses a Chromium browser instead of plain HTTP requests. This simply is not possible without Crawlee. +So, that's it. In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs exactly the same as before, but uses a Chromium browser instead of plain HTTP requests. This isn't possible without Crawlee. Using Playwright in combination with Cheerio like this is only one of many ways how you can utilize Playwright (and Puppeteer) with Crawlee. In the advanced courses of the Academy, we will go deeper into using headless browsers for scraping and web automation (RPA) use cases. 
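The four-line transformation described above is easier to picture with a sketch. This one assumes the Warehouse store URL used throughout the course and reduces the extraction to a placeholder `title` field; the real lesson extracts full product data.

```js
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, parseWithCheerio }) => {
        // parseWithCheerio() returns a Cheerio instance of the rendered page,
        // so extraction code written for CheerioCrawler keeps working unchanged.
        const $ = await parseWithCheerio();
        const title = $('title').text();
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
```

The same pattern should work with `PuppeteerCrawler`, which exposes `parseWithCheerio` in its handler context as well.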
@@ -121,7 +121,7 @@ $env:CRAWLEE_HEADLESS=1; & node browser.js ## Dynamically loaded data {#dynamic-data} -One of the important benefits of using a browser is that it allows you to easily extract data that's dynamically loaded, such as data that's only fetched after a user scrolls or interacts with the page. In our case, it's the "**You may also like**" section of the product detail pages. Those products aren't available in the initial HTML, but the browser loads them later using an API. +One of the important benefits of using a browser is that it allows you to extract data that's dynamically loaded, such as data that's only fetched after a user scrolls or interacts with the page. In our case, it's the "**You may also like**" section of the product detail pages. Those products aren't available in the initial HTML, but the browser loads them later using an API. ![headless-dynamic-data.png](./images/headless-dynamic-data.png) diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md index 45e1ed02d..e69994a5c 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md @@ -152,7 +152,7 @@ const crawler = new CheerioCrawler({ }, }); -// Instead of using a simple URL string, we're now +// Instead of using a string with URL, we're now // using a request object to add more options. await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', @@ -170,7 +170,7 @@ When you run the code, you'll see the names and URLs of all the products printed ## Extracting data {#extracting-data} -We have the crawler in place, and it's time to extract data. We already have the extraction code from the previous lesson, so we can just copy and paste it into the `requestHandler` with tiny changes. Instead of printing results to the terminal, we will save it to disk. +We have the crawler in place, and it's time to extract data. We already have the extraction code from the previous lesson, so we can copy and paste it into the `requestHandler` with tiny changes. Instead of printing results to the terminal, we will save it to disk. ```js title=crawlee.js // To save data to disk, we need to import Dataset. @@ -224,7 +224,7 @@ When you run the code as usual, you'll see the product URLs printed to the termi ./storage/datasets/default/*.json ``` -Thanks to **Crawlee**, we were able to create a **faster and more robust scraper**, but **with less code** than what was needed for the simple scraper in the earlier lessons. +Thanks to **Crawlee**, we were able to create a **faster and more robust scraper**, but **with less code** than what was needed for the scraper in the earlier lessons. ## Next up {#next} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md index 1d44782c7..fdcf2e93d 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/recap_extraction_basics.md @@ -11,7 +11,7 @@ slug: /web-scraping-for-beginners/crawling/recap-extraction-basics --- -We finished off the [first section](../data_extraction/index.md) of the _Web Scraping for Beginners_ course by creating a simple web scraper in Node.js. 
The scraper collected all the on-sale products from [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). Let's see the code with some comments added. +We finished off the [first section](../data_extraction/index.md) of the _Web Scraping for Beginners_ course by creating a web scraper in Node.js. The scraper collected all the on-sale products from [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). Let's see the code with some comments added. ```js // First, we imported all the libraries we needed to diff --git a/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md b/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md index 8ab7b5e52..f9487c80a 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/crawling/relative_urls.md @@ -52,7 +52,7 @@ for (const link of productLinks) { } ``` -When you run this file in your terminal, you'll immediately see the difference. Unlike in the browser, where looping over elements produced absolute URLs, here in Node.js it only produces the relative ones. This is bad, because we can't use the relative URLs to crawl. They simply don't include all the necessary information. +When you run this file in your terminal, you'll immediately see the difference. Unlike in the browser, where looping over elements produced absolute URLs, here in Node.js it only produces the relative ones. This is bad, because we can't use the relative URLs to crawl. They don't include all the necessary information. ## Resolving URLs {#resolving-urls} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md index 38583801c..8afb17ce6 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/browser_devtools.md @@ -9,9 +9,7 @@ slug: /web-scraping-for-beginners/data-extraction/browser-devtools --- -Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. - -Now go to [Wikipedia](https://wikipedia.com) and open your DevTools there. Inspecting the same website as us will make this lesson easier to follow. +Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. Now go to [Wikipedia](https://wikipedia.com) and open your DevTools there. ![Wikipedia with Chrome DevTools open](./images/browser-devtools-wikipedia.png) @@ -25,7 +23,7 @@ When you first open Chrome DevTools on Wikipedia, you will start on the Elements Each element is enclosed in an HTML tag. For example `
<div>`, `<p>`, and `<h1>` are all tags. When you add something inside of those tags, like `<p>Hello!</p>
` you create an element. You can also see elements inside other elements in the **Elements** tab. This is called nesting, and it gives the page its structure. -At the bottom, there's the **JavaScript console**, which is a powerful tool which can be used to manipulate the website. If the console is not there, you can press **ESC** to toggle it. All of this might look super complicated at first, but don't worry, there's no need to understand everything just yet - we'll walk you through all the important things you need to know. +At the bottom, there's the **JavaScript console**, which is a powerful tool which can be used to manipulate the website. If the console is not there, you can press **ESC** to toggle it. All of this might look super complicated at first, but don't worry, there's no need to understand everything yet - we'll walk you through all the important things you need to know. ![Console in Chrome DevTools](./images/browser-devtools-console.png) @@ -71,4 +69,4 @@ By changing HTML elements from the Console, you can change what's displayed on t In this lesson, we learned the absolute basics of interaction with a page using the DevTools. In the [next lesson](./using_devtools.md), you will learn how to extract data from it. We will extract data about the on-sale products on the [Warehouse store](https://warehouse-theme-metal.myshopify.com). -It isn't a real store, but a full-featured demo of a Shopify online store. And that is just perfect for our purposes. Shopify is one of the largest e-commerce platforms in the world, and it uses all the latest technologies that a real e-commerce web application would use. Learning to scrape a Shopify store is useful, because you can immediately apply the learnings to millions of websites. +It isn't a real store, but a full-featured demo of a Shopify online store. And that is perfect for our purposes. Shopify is one of the largest e-commerce platforms in the world, and it uses all the latest technologies that a real e-commerce web application would use. Learning to scrape a Shopify store is useful, because you can immediately apply the learnings to millions of websites. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md index 724df66e7..c4b9baf78 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md @@ -15,13 +15,13 @@ Before you can start writing scraper code, you need to have your computer set up ## Install Node.js {#install-node} -Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Simply put, because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. +Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. 
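To try the Console out on the point above about manipulating a page, you can paste a snippet like this one into it (it assumes the page has an `<h1>` element, and it only changes your local view of the page):

```js
// Pick the first <h1> on the page and change its text.
const heading = document.querySelector('h1');
if (heading) heading.textContent = 'Hello from the Console!';
```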
-If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, just use your package manager to install `nodejs`. +If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, use your package manager to install `nodejs`. ## Install a text editor {#install-an-editor} -Many text editors are available for you to choose from when programming. You might already have a preferred one so feel free to use that. Just make sure it has syntax highlighting and support for Node.js. If you don't have a text editor, we suggest starting with VSCode. It's free, very popular, and well maintained. [Download it here](https://code.visualstudio.com/download). +Many text editors are available for you to choose from when programming. You might already have a preferred one so feel free to use that. Make sure it has syntax highlighting and support for Node.js. If you don't have a text editor, we suggest starting with VSCode. It's free, very popular, and well maintained. [Download it here](https://code.visualstudio.com/download). Once you downloaded and installed it, you can open a folder where we will build your scraper. We recommend starting with a new, empty folder. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md index e10d149af..79278386a 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/devtools_continued.md @@ -30,7 +30,7 @@ The `length` property of `products` tells us how many products we have in the li > [Visit this tutorial](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration) if you need to refresh the concept of loops in programming. -Now, we will loop over each product and print their titles. We will use a so-called `for..of` loop to do it. It is a simple loop that iterates through all items of an array. +Now, we will loop over each product and print their titles. We will use a so-called `for..of` loop to do it. It is a loop that iterates through all items of an array. Run the following command in the Console. Some notes: @@ -52,7 +52,7 @@ for (const product of products) { ## Extracting more data {#extracting-data-in-loop} -We will add the price extraction from the previous lesson to the loop. We will also save all the data to an array so that we can easily work with it. Run this in the Console: +We will add the price extraction from the previous lesson to the loop. We will also save all the data to an array so that we can work with it. Run this in the Console: > The `results.push()` function takes its argument and pushes (adds) it to the `results` array. [Learn more about it here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push). 
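A sketch of what that extraction loop can look like once `results.push()` is in play. The `.product-item`, `.product-item__title`, and `.price` selectors are assumptions for illustration; use whichever selectors you found while inspecting the store.

```js
// Run in the DevTools Console on the store's Sales page.
const results = [];
const products = document.querySelectorAll('.product-item');

for (const product of products) {
    const title = product.querySelector('.product-item__title').textContent.trim();
    const price = product.querySelector('.price').textContent.trim();
    results.push({ title, price });
}

console.log(results);
```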
diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md index 05abd5211..be86277dd 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/project_setup.md @@ -41,19 +41,19 @@ Node.js and npm support two types of projects, let's call them legacy and modern ## Installing necessary libraries {#install-libraries} -Now that we have a project set up, we can install npm modules into the project. Let's install libraries that will help us easily download and process websites' HTML. In the project directory, run the following command, which will install two libraries into your project. **got-scraping** and Cheerio. +Now that we have a project set up, we can install npm modules into the project. Let's install libraries that will help us with downloading and processing websites' HTML. In the project directory, run the following command, which will install two libraries into your project. **got-scraping** and Cheerio. ```shell npm install got-scraping cheerio ``` -[**got-scraping**](https://github.com/apify/got-scraping) is a library that's made especially for scraping and downloading page's HTML. It's based on the very popular [**got** library](https://github.com/sindresorhus/got), which means any features of **got** are also available in **got-scraping**. Both **got** and **got-scraping** are HTTP clients. To learn more about HTTP, [visit this MDN tutorial](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP). +[**got-scraping**](https://github.com/apify/got-scraping) is a library that's made especially for scraping and downloading page's HTML. It's based on the popular [**got** library](https://github.com/sindresorhus/got), which means any features of **got** are also available in **got-scraping**. Both **got** and **got-scraping** are HTTP clients. To learn more about HTTP, [visit this MDN tutorial](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP). -[Cheerio](https://github.com/cheeriojs/cheerio) is a very popular Node.js library for parsing (processing) HTML. If you're familiar with good old [jQuery](https://jquery.com/), you'll find working with Cheerio really easy. +[Cheerio](https://github.com/cheeriojs/cheerio) is a popular Node.js library for parsing and processing HTML. If you know how to work with [jQuery](https://jquery.com/), you'll find Cheerio familiar. ## Test everything {#testing} -With the libraries installed, create a new file in the project's folder called **main.js**. This is where we will put all our code. Before we start scraping, though, let's do a simple check that everything was installed correctly. Add this piece of code inside **main.js**. +With the libraries installed, create a new file in the project's folder called **main.js**. This is where we will put all our code. Before we start scraping, though, let's do a check that everything was installed correctly. Add this piece of code inside **main.js**. 
```js import { gotScraping } from 'got-scraping'; diff --git a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md index cb7c721c4..5cd197f39 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md @@ -151,7 +151,7 @@ price.textContent; ![Extract product price](./images/devtools-extract-product-price.png) -It worked, but the price was not alone in the result. We extracted it together with some extra text. This is very common in web scraping. Sometimes it's not possible to easily separate the data we need by element selection alone, and we have to clean the data using other methods. +It worked, but the price was not alone in the result. We extracted it together with some extra text. This is very common in web scraping. Sometimes it's impossible to separate the data we need by element selection alone, and we have to clean the data using other methods. ### Cleaning extracted data {#cleaning-extracted-data} @@ -192,7 +192,7 @@ price.textContent.split('$')[1]; And there you go. Notice that this time we extracted the price without the `$` dollar sign. This could be desirable, because we wanted to convert the price from a string to a number, or not, depending on individual circumstances of the scraping project. -Which method to choose? Neither is the perfect solution. The first method could easily break if the website's developers change the structure of the `` elements and the price will no longer be in the third position - a very small change that can happen at any moment. +Which method to choose? Neither is the perfect solution. The first method could break if the website's developers change the structure of the `` elements and the price will no longer be in the third position - a very small change that can happen at any moment. The second method seems more reliable, but only until the website adds prices in other currency or decides to replace `$` with `USD`. It's up to you, the scraping developer to decide which of the methods will be more resilient on the website you scrape. diff --git a/sources/academy/webscraping/web_scraping_for_beginners/index.md b/sources/academy/webscraping/web_scraping_for_beginners/index.md index c0ede3221..6842ca4c6 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/index.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/index.md @@ -65,7 +65,7 @@ Throughout the next lessons, we will sometimes use certain technologies and term ### jQuery or Cheerio {#jquery-or-cheerio} -We'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides a simple API using jQuery syntax to help traverse downloaded HTML within Node.js. +We'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides an API using jQuery syntax to help traverse downloaded HTML within Node.js. 
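If you have never used jQuery, a tiny standalone Cheerio example may help show what that syntax looks like; the HTML string here is made up for illustration.

```js
import * as cheerio from 'cheerio';

// A made-up HTML snippet, loaded the same way a downloaded page would be.
const html = '<ul><li class="sale">Sonos speaker</li><li>Denon receiver</li></ul>';
const $ = cheerio.load(html);

// Selecting and reading elements uses the familiar jQuery-style syntax.
console.log($('li').length); // 2
console.log($('li.sale').text()); // Sonos speaker
```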
## Next up {#next} diff --git a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md index bf4504aaf..aff6571d1 100644 --- a/sources/academy/webscraping/web_scraping_for_beginners/introduction.md +++ b/sources/academy/webscraping/web_scraping_for_beginners/introduction.md @@ -12,11 +12,11 @@ slug: /web-scraping-for-beginners/introduction --- -Web scraping or crawling? Web data extraction, mining, or collection? You can find various definitions on the web. Let's agree on simple explanations that we will use throughout this beginner course on web scraping. +Web scraping or crawling? Web data extraction, mining, or collection? You can find various definitions on the web. Let's agree on explanations that we will use throughout this beginner course on web scraping. ## What is web data extraction? {#what-is-data-extraction} -Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be easily processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, and so on. +Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be processed by data analysis tools or integrated with computer systems. The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, etc. ![product data extraction from Amazon](./images/beginners-data-extraction.png)