diff --git a/.github/styles/config/vocabularies/Docs/accept.txt b/.github/styles/config/vocabularies/Docs/accept.txt
index c25f6c7ab0..dfb5437af3 100644
--- a/.github/styles/config/vocabularies/Docs/accept.txt
+++ b/.github/styles/config/vocabularies/Docs/accept.txt
@@ -45,7 +45,7 @@
 idempotency
 backoff
 Authy
-reCaptcha
+reCAPTCHA?
 OAuth
 untrusted
 unencrypted
diff --git a/.vale.ini b/.vale.ini
index 8178ef411d..282c31ca5a 100644
--- a/.vale.ini
+++ b/.vale.ini
@@ -26,6 +26,7 @@
 Microsoft.Foreign = NO
 Microsoft.We = NO
 Microsoft.Quotes = NO
 Microsoft.Auto = NO
+Microsoft.Units = NO
 Microsoft.URLFormat = NO
 Microsoft.GeneralURL = NO
diff --git a/sources/academy/glossary/concepts/dynamic_pages.md b/sources/academy/glossary/concepts/dynamic_pages.md
index ba38d1cc7c..e85f1e9bed 100644
--- a/sources/academy/glossary/concepts/dynamic_pages.md
+++ b/sources/academy/glossary/concepts/dynamic_pages.md
@@ -1,12 +1,10 @@
 ---
-title: Dynamic pages
+title: Dynamic pages and single-page applications
 description: Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.
 sidebar_position: 8.3
 slug: /concepts/dynamic-pages
 ---

-# Dynamic pages and single-page applications (SPAs) {#dynamic-pages}
-
 **Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.**

 ---
diff --git a/sources/academy/glossary/concepts/http_cookies.md b/sources/academy/glossary/concepts/http_cookies.md
index 8f7b79501b..472d1f0a86 100644
--- a/sources/academy/glossary/concepts/http_cookies.md
+++ b/sources/academy/glossary/concepts/http_cookies.md
@@ -5,8 +5,6 @@ sidebar_position: 8.2
 slug: /concepts/http-cookies
 ---

-# HTTP cookies {#cookies}
-
 **Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking.**

 ---
diff --git a/sources/academy/glossary/concepts/http_headers.md b/sources/academy/glossary/concepts/http_headers.md
index 2fce1b8339..7b5fec6b3e 100644
--- a/sources/academy/glossary/concepts/http_headers.md
+++ b/sources/academy/glossary/concepts/http_headers.md
@@ -5,8 +5,6 @@ sidebar_position: 8.1
 slug: /concepts/http-headers
 ---

-# HTTP headers {#headers}
-
 **Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.**

 ---
diff --git a/sources/academy/glossary/concepts/index.md b/sources/academy/glossary/concepts/index.md
index c0a8436477..c61d8ed237 100644
--- a/sources/academy/glossary/concepts/index.md
+++ b/sources/academy/glossary/concepts/index.md
@@ -6,8 +6,6 @@ category: glossary
 slug: /concepts
 ---

-# Concepts 🤔 {#concepts}
-
 **Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.**

 ---
diff --git a/sources/academy/glossary/concepts/robot_process_automation.md b/sources/academy/glossary/concepts/robot_process_automation.md
index 27d61dcdee..2fc55424b6 100644
--- a/sources/academy/glossary/concepts/robot_process_automation.md
+++ b/sources/academy/glossary/concepts/robot_process_automation.md
@@ -5,8 +5,6 @@ sidebar_position: 8.7
 slug: /concepts/robotic-process-automation
 ---

-# What is robotic process automation (RPA)? {#what-is-robotic-process-automation-rpa}
-
 **Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.**

 ---
diff --git a/sources/academy/glossary/glossary.md b/sources/academy/glossary/glossary.md
index 5608bf6a8d..9e86844db6 100644
--- a/sources/academy/glossary/glossary.md
+++ b/sources/academy/glossary/glossary.md
@@ -6,8 +6,6 @@ category: glossary
 slug: /glossary
 ---

-# Why a glossary? {#why-a-glossary}
-
 **Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.**

 ---
diff --git a/sources/academy/glossary/tools/apify_cli.md b/sources/academy/glossary/tools/apify_cli.md
index 82cb187e77..e4e5bbee6a 100644
--- a/sources/academy/glossary/tools/apify_cli.md
+++ b/sources/academy/glossary/tools/apify_cli.md
@@ -5,8 +5,6 @@ sidebar_position: 9.1
 slug: /tools/apify-cli
 ---

-# The Apify CLI {#the-apify-cli}
-
 **Learn about, install, and log into the Apify CLI - your best friend for interacting with the Apify platform via your terminal.**

 ---
@@ -15,7 +13,7 @@ The [Apify CLI](/cli) helps you create, develop, build and run Apify Actors, and

 ## Installing {#installing}

-To install the Apfiy CLI, you'll first need npm, which comes preinstalled with Node.js. If you haven't yet installed Node, learn how to do that [here](../../webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md). Additionally, make sure you've got an Apify account, as you will need to log in to the CLI to gain access to its full potential.
+To install the Apify CLI, you'll first need npm, which comes preinstalled with Node.js. If you haven't yet installed Node, [learn how to do that](../../webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md). Additionally, make sure you've got an Apify account, as you will need to log in to the CLI to gain access to its full potential.

 Open up a terminal instance and run the following command:
diff --git a/sources/academy/glossary/tools/edit_this_cookie.md b/sources/academy/glossary/tools/edit_this_cookie.md
index fbcd2e8856..47aea1f2c5 100644
--- a/sources/academy/glossary/tools/edit_this_cookie.md
+++ b/sources/academy/glossary/tools/edit_this_cookie.md
@@ -5,8 +5,6 @@ sidebar_position: 9.7
 slug: /tools/edit-this-cookie
 ---

-# What's EditThisCookie? {#what-is-it}
-
 **Learn how to add, delete, and modify different cookies in your browser for testing purposes using the EditThisCookie Chrome extension.**

 ---
diff --git a/sources/academy/glossary/tools/index.md b/sources/academy/glossary/tools/index.md
index 6a4cf4f716..393b8dd6c5 100644
--- a/sources/academy/glossary/tools/index.md
+++ b/sources/academy/glossary/tools/index.md
@@ -6,8 +6,6 @@ category: glossary
 slug: /tools
 ---

-# Tools 🔧 {#tools}
-
 **Discover a variety of tools that can be used to enhance the scraper development process, or even unlock doors to new scraping possibilities.**

 ---
diff --git a/sources/academy/glossary/tools/insomnia.md b/sources/academy/glossary/tools/insomnia.md
index 143e57a4ec..f0e9058a85 100644
--- a/sources/academy/glossary/tools/insomnia.md
+++ b/sources/academy/glossary/tools/insomnia.md
@@ -5,8 +5,6 @@ sidebar_position: 9.2
 slug: /tools/insomnia
 ---

-# What is Insomnia {#what-is-insomnia}
-
 **Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.**

 ---
diff --git a/sources/academy/glossary/tools/modheader.md b/sources/academy/glossary/tools/modheader.md
index 581e6628dd..e8c92eac5e 100644
--- a/sources/academy/glossary/tools/modheader.md
+++ b/sources/academy/glossary/tools/modheader.md
@@ -5,8 +5,6 @@ sidebar_position: 9.5
 slug: /tools/modheader
 ---

-# What is ModHeader? {#what-is-modheader}
-
 **Discover a super useful Chrome extension called ModHeader, which allows you to modify your browser's HTTP request headers.**

 ---
diff --git a/sources/academy/glossary/tools/postman.md b/sources/academy/glossary/tools/postman.md
index 5f37b8f4e0..d897a92dbb 100644
--- a/sources/academy/glossary/tools/postman.md
+++ b/sources/academy/glossary/tools/postman.md
@@ -5,8 +5,6 @@ sidebar_position: 9.3
 slug: /tools/postman
 ---

-# What is Postman? {#what-is-postman}
-
 **Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.**

 ---
diff --git a/sources/academy/glossary/tools/proxyman.md b/sources/academy/glossary/tools/proxyman.md
index 52dd194b66..3d48028dc1 100644
--- a/sources/academy/glossary/tools/proxyman.md
+++ b/sources/academy/glossary/tools/proxyman.md
@@ -5,8 +5,6 @@ sidebar_position: 9.4
 slug: /tools/proxyman
 ---

-# What's Proxyman? {#what-is-proxyman}
-
 **Learn about Proxyman, a tool for viewing all network requests that are coming through your system. Filter by response type, by a keyword, or by application.**

 ---
diff --git a/sources/academy/glossary/tools/quick_javascript_switcher.md b/sources/academy/glossary/tools/quick_javascript_switcher.md
index eca2c21b34..543771697e 100644
--- a/sources/academy/glossary/tools/quick_javascript_switcher.md
+++ b/sources/academy/glossary/tools/quick_javascript_switcher.md
@@ -5,8 +5,6 @@ sidebar_position: 9.9
 slug: /tools/quick-javascript-switcher
 ---

-# Quick JavaScript Switcher
-
 **Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.**

 ---
diff --git a/sources/academy/glossary/tools/switchyomega.md b/sources/academy/glossary/tools/switchyomega.md
index 8a0eb4b9c5..60c72afdce 100644
--- a/sources/academy/glossary/tools/switchyomega.md
+++ b/sources/academy/glossary/tools/switchyomega.md
@@ -5,8 +5,6 @@ sidebar_position: 9.6
 slug: /tools/switchyomega
 ---

-# What is SwitchyOmega? {#what-is-switchyomega}
-
 **Discover SwitchyOmega, a Chrome extension to manage and switch between proxies, which is extremely useful when testing proxies for a scraper.**

 ---
diff --git a/sources/academy/glossary/tools/user_agent_switcher.md b/sources/academy/glossary/tools/user_agent_switcher.md
index 65a1445a7b..3fa3211bcc 100644
--- a/sources/academy/glossary/tools/user_agent_switcher.md
+++ b/sources/academy/glossary/tools/user_agent_switcher.md
@@ -5,8 +5,6 @@ sidebar_position: 9.8
 slug: /tools/user-agent-switcher
 ---

-# User-Agent Switcher
-
 **Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.**

 ---
diff --git a/sources/academy/platform/apify_platform.md b/sources/academy/platform/apify_platform.md
index afaf819dbe..8b56843984 100644
--- a/sources/academy/platform/apify_platform.md
+++ b/sources/academy/platform/apify_platform.md
@@ -1,5 +1,5 @@
 ---
-title: Introduction to Apify platform
+title: Introduction to the Apify platform
 description: Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience.
 sidebar_position: 7
 category: apify platform
diff --git a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md
index 53814c0033..97f2a8ca3c 100644
--- a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md
+++ b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md
@@ -5,8 +5,6 @@ sidebar_position: 6.1
 slug: /expert-scraping-with-apify/actors-webhooks
 ---

-# Webhooks & advanced Actor overview {#webhooks-and-advanced-actors}
-
 **Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.**

 ---
@@ -15,7 +13,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh

 ## Advanced Actor overview {#advanced-actors}

-In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in three short lessons [here](../../webscraping/scraping_basics_javascript/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.
+In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in [three short lessons](../../webscraping/scraping_basics_javascript/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.

 Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile).
diff --git a/sources/academy/platform/expert_scraping_with_apify/apify_api_and_client.md b/sources/academy/platform/expert_scraping_with_apify/apify_api_and_client.md
index 02e55777f3..cb5a1fd734 100644
--- a/sources/academy/platform/expert_scraping_with_apify/apify_api_and_client.md
+++ b/sources/academy/platform/expert_scraping_with_apify/apify_api_and_client.md
@@ -5,8 +5,6 @@ sidebar_position: 6.4
 slug: /expert-scraping-with-apify/apify-api-and-client
 ---

-# Apify API & client {#api-and-client}
-
 **Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/bypassing_anti_scraping.md b/sources/academy/platform/expert_scraping_with_apify/bypassing_anti_scraping.md
index ccc9c62f3e..4f6f8757b5 100644
--- a/sources/academy/platform/expert_scraping_with_apify/bypassing_anti_scraping.md
+++ b/sources/academy/platform/expert_scraping_with_apify/bypassing_anti_scraping.md
@@ -5,8 +5,6 @@ sidebar_position: 6.6
 slug: /expert-scraping-with-apify/bypassing-anti-scraping
 ---

-# Bypassing anti-scraping methods {#bypassing-anti-scraping-methods}
-
 **Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/index.md b/sources/academy/platform/expert_scraping_with_apify/index.md
index 95bc0a92c7..6c1be153b5 100644
--- a/sources/academy/platform/expert_scraping_with_apify/index.md
+++ b/sources/academy/platform/expert_scraping_with_apify/index.md
@@ -6,8 +6,6 @@ category: apify platform
 slug: /expert-scraping-with-apify
 ---

-# Expert scraping with Apify {#expert-scraping}
-
 **After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md
index 38d3fdaa95..f8e4255e90 100644
--- a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md
+++ b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md
@@ -5,8 +5,6 @@ sidebar_position: 6.2
 slug: /expert-scraping-with-apify/managing-source-code
 ---

-# Managing source code {#managing-source-code}
-
 **Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md
index c59da80531..fa6eb61f1b 100644
--- a/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md
+++ b/sources/academy/platform/expert_scraping_with_apify/migrations_maintaining_state.md
@@ -5,8 +5,6 @@ sidebar_position: 6.5
 slug: /expert-scraping-with-apify/migrations-maintaining-state
 ---

-# Migrations & maintaining state {#migrations-maintaining-state}
-
 **Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
index bcb7cee71d..1cd8feca90 100644
--- a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
+++ b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
@@ -5,15 +5,13 @@ sidebar_position: 6.7
 slug: /expert-scraping-with-apify/saving-useful-stats
 ---

-# Saving useful run statistics {#savings-useful-run-statistics}
-
 **Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.**

 ---

 Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a Key-Value store or Dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results or potentially include statistics about each request within its corresponding dataset item.

-The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of captchas hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important.
+The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of CAPTCHAs hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important.

 ## Learning 🧠 {#learning}

diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
index 635971ff65..056bcb736f 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
@@ -5,8 +5,6 @@ sidebar_position: 5
 slug: /expert-scraping-with-apify/solutions/handling-migrations
 ---

-# Handling migrations {#handling-migrations}
-
 **Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/index.md b/sources/academy/platform/expert_scraping_with_apify/solutions/index.md
index e9d71e69db..171cfd8dc3 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/index.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/index.md
@@ -5,8 +5,6 @@ sidebar_position: 6.7
 slug: /expert-scraping-with-apify/solutions
 ---

-# Solutions
-
 **View all of the solutions for all of the activities and tasks of this course. Please try to complete each task on your own before reading the solution!**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md
index 301eab24d2..d90ba7f76f 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/integrating_webhooks.md
@@ -5,8 +5,6 @@ sidebar_position: 1
 slug: /expert-scraping-with-apify/solutions/integrating-webhooks
 ---

-# Integrating webhooks {#integrating-webhooks}
-
 **Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/managing_source.md b/sources/academy/platform/expert_scraping_with_apify/solutions/managing_source.md
index 2273af81ed..5ccc2b42a9 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/managing_source.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/managing_source.md
@@ -5,8 +5,6 @@ sidebar_position: 2
 slug: /expert-scraping-with-apify/solutions/managing-source
 ---

-# Managing source
-
 **View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
index 04fdd869d6..7b0193f248 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
@@ -5,8 +5,6 @@ sidebar_position: 6
 slug: /expert-scraping-with-apify/solutions/rotating-proxies
 ---

-# Rotating proxies/sessions {#rotating-proxy-sessions}
-
 **Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md
index a730536c7d..b46ec068cc 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md
@@ -5,8 +5,6 @@ sidebar_position: 7
 slug: /expert-scraping-with-apify/solutions/saving-stats
 ---

-# Saving run stats {#saving-stats}
-
 **Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
index c5633be937..dce2ba016f 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
@@ -5,8 +5,6 @@ sidebar_position: 4
 slug: /expert-scraping-with-apify/solutions/using-api-and-client
 ---

-# Using the Apify API & JavaScript client {#using-api-and-client}
-
 **Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.**

 ---
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md
index 5c01c45a8f..91528d5c5c 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_storage_creating_tasks.md
@@ -5,8 +5,6 @@ sidebar_position: 3
 slug: /expert-scraping-with-apify/solutions/using-storage-creating-tasks
 ---

-# Using storage & creating tasks {#using-storage-creating-tasks}
-
 ## Quiz answers 📝 {#quiz-answers}

 **Q: What is the relationship between Actors and tasks?**
diff --git a/sources/academy/platform/expert_scraping_with_apify/tasks_and_storage.md b/sources/academy/platform/expert_scraping_with_apify/tasks_and_storage.md
index d18009c241..22c15a5eaa 100644
--- a/sources/academy/platform/expert_scraping_with_apify/tasks_and_storage.md
+++ b/sources/academy/platform/expert_scraping_with_apify/tasks_and_storage.md
@@ -5,8 +5,6 @@ sidebar_position: 6.3
 slug: /expert-scraping-with-apify/tasks-and-storage
 ---

-# Tasks & storage {#tasks-and-storage}
-
 **Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.**

 ---
diff --git a/sources/academy/platform/getting_started/apify_api.md b/sources/academy/platform/getting_started/apify_api.md
index ca5a6a2350..5038f9f23e 100644
--- a/sources/academy/platform/getting_started/apify_api.md
+++ b/sources/academy/platform/getting_started/apify_api.md
@@ -1,5 +1,5 @@
 ---
-title: Apify API
+title: The Apify API
 description: Learn how to use the Apify API to programmatically call your Actors, retrieve data stored on the platform, view Actor logs, and more!
 sidebar_position: 4
 slug: /getting-started/apify-api
diff --git a/sources/academy/tutorials/api/index.md b/sources/academy/tutorials/api/index.md
index 8c1f212c93..4cd998bdd9 100644
--- a/sources/academy/tutorials/api/index.md
+++ b/sources/academy/tutorials/api/index.md
@@ -6,8 +6,6 @@ category: tutorials
 slug: /api
 ---

-# Using Apify API
-
 **A collection of various tutorials explaining how to interact with the Apify platform programmatically using its API.**

 ---
diff --git a/sources/academy/tutorials/api/using_apify_from_php.md b/sources/academy/tutorials/api/using_apify_from_php.md
index 3a1575d1fb..d890390429 100644
--- a/sources/academy/tutorials/api/using_apify_from_php.md
+++ b/sources/academy/tutorials/api/using_apify_from_php.md
@@ -4,8 +4,6 @@ description: Learn how to access Apify's REST API endpoints from your PHP projec
 slug: /php/use-apify-from-php
 ---

-# How to use Apify from PHP
-
 Apify's [RESTful API](https://docs.apify.com/api/v2#) allows you to use the platform from basically anywhere. Many projects are and will continue to be built using [PHP](https://www.php.net/). This tutorial enables you to use Apify in these projects in PHP and frameworks built on it.

 Apify does not have an official PHP client (yet), so we are going to use [guzzle](https://github.com/guzzle/guzzle), a great library for HTTP requests. By covering a few fundamental endpoints, this tutorial will show you the principles you can use for all Apify API endpoints.
diff --git a/sources/academy/tutorials/apify_scrapers/index.md b/sources/academy/tutorials/apify_scrapers/index.md
index a28b2587ad..f22f4999ef 100644
--- a/sources/academy/tutorials/apify_scrapers/index.md
+++ b/sources/academy/tutorials/apify_scrapers/index.md
@@ -5,8 +5,6 @@ sidebar_position: 5
 slug: /apify-scrapers
 ---

-# Using ready-made Apify scrapers
-
 **Discover Apify's ready-made web scraping and automation tools. Compare Web Scraper, Cheerio Scraper and Puppeteer Scraper to decide which is right for you.**

 ---
diff --git a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md
index 892a3dd59b..316beb5ee2 100644
--- a/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md
+++ b/sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md
@@ -5,8 +5,6 @@ sidebar_position: 14.1
 slug: /node-js/analyzing-pages-and-fixing-errors
 ---

-# How to analyze and fix errors when scraping a website {#scraping-with-sitemaps}
-
 **Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.**

 ---
diff --git a/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.md b/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.md
index 70b0f2ab07..3ce0a59d2c 100644
--- a/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.md
+++ b/sources/academy/tutorials/node_js/caching_responses_in_puppeteer.md
@@ -7,8 +7,6 @@ slug: /node-js/caching-responses-in-puppeteer

 import Example from '!!raw-loader!roa-loader!./caching_responses_in_puppeteer.js';

-# How to optimize Puppeteer by caching responses {#caching-responses-in-puppeteer}
-
 **Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code.**

 ---
@@ -99,7 +97,7 @@ After implementing this code, we can run the scraper again.

 ![Good run results](./images/good-run-results.png)

-Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a reduction of data transfer by 92%. The related screenshots can be found [here](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).
+Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a reduction of data transfer by 92%. The related screenshots can be found [in the Apify storage](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).

 It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
diff --git a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md
index 1af2b43118..40bec3fa39 100644
--- a/sources/academy/tutorials/node_js/choosing_the_right_scraper.md
+++ b/sources/academy/tutorials/node_js/choosing_the_right_scraper.md
@@ -5,8 +5,6 @@ sidebar_position: 14.3
 slug: /node-js/choosing-the-right-scraper
 ---

-# How to choose the right scraper for the job {#choosing-the-right-scraper}
-
 **Learn basic web scraping concepts to help you analyze a website and choose the best scraper for your particular use case.**

 ---
@@ -30,7 +28,7 @@ Some websites do not load any data without a browser, as they need to execute so

 ## Making the choice {#making-the-choice}

-When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
+When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.

 It also depends of course on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers. In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users.
diff --git a/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md b/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md
index 21b8aee9a7..f5a691fdeb 100644
--- a/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md
+++ b/sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md
@@ -7,8 +7,6 @@ slug: /node-js/dealing-with-dynamic-pages

 import Example from '!!raw-loader!roa-loader!./dealing_with_dynamic_pages.js';

-# How to scrape from dynamic pages {#dealing-with-dynamic-pages}
-
 **Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?**

 ---
diff --git a/sources/academy/tutorials/node_js/debugging_web_scraper.md b/sources/academy/tutorials/node_js/debugging_web_scraper.md
index de668ba9a7..d3bf5015e0 100644
--- a/sources/academy/tutorials/node_js/debugging_web_scraper.md
+++ b/sources/academy/tutorials/node_js/debugging_web_scraper.md
@@ -11,8 +11,6 @@ What beginners are missing are basic tools and tricks to get things done quickly

 Pressing F12 while browsing with Chrome, Firefox, or other popular browsers opens up the browser console, the magic toolbox of any web developer. The console allows you to run a code in the context of the website you are in. Don't worry, you cannot mess the site up (well, unless you start doing really nasty tricks) as the page content is downloaded on your computer and any change is only local to your PC.

-# Running code in a browser console
-
 > Test your Page Function's code directly in your browser's console.

 First, you need to inject jQuery. You can try to paste and run this snippet.
diff --git a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md
index bf0fc4b30a..da57305df1 100644
--- a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md
+++ b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md
@@ -180,7 +180,7 @@ const gotoFunction = async ({ request, page }) => {
 };
 ```

-Now we have access to the session in the `handlePageFunction` and the rest of the logic is the same as in the first example. We extract the session from the userData, try/catch the whole code and on success we add the session and on error we delete it. Also it is useful to retire the browser completely (check [here](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer) for reference) since the other requests will probably have similar problem.
+Now we have access to the session in the `handlePageFunction` and the rest of the logic is the same as in the first example. We extract the session from the userData, try/catch the whole code and on success we add the session and on error we delete it. Also, it is useful to retire the browser completely (check the [handling blocked requests guide](/academy/node-js/handle-blocked-requests-puppeteer) for reference), since the other requests will probably have a similar problem.

 ```js
 const handlePageFunction = async ({ request, page, puppeteerPool }) => {
diff --git a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md
index 313802e0be..e762716b0b 100644
--- a/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md
+++ b/sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md
@@ -7,8 +7,6 @@ slug: /node-js/handle-blocked-requests-puppeteer

 One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://docs.apify.com/platform/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to cover your crawling needs. Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js).

-# BasicCrawler
-
 > Getting around website defense mechanisms when crawling.

 You can use `handleRequestFunction` to set up proxy rotation for a [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). The following example shows how to use a fresh proxy on each request if you make requests through the popular [request-promise](https://www.npmjs.com/package/request-promise) npm package:
@@ -33,7 +31,7 @@ const crawler = new Apify.BasicCrawler({

 Each time `handleRequestFunction` is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies.

-# Puppeteer Crawler
+## Puppeteer Crawler

 With [PuppeteerCrawler](/sdk/js/docs/api/puppeteer-crawler) the situation is a little more complicated. That's because you have to restart the browser to change the proxy the browser is using. By default, PuppeteerCrawler restarts the browser every 100 requests, which can lead to a number of requests being wasted because the IP address the browser is using is already blocked by the website.
diff --git a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md
index ac42f16009..d8eeae8fdc 100644
--- a/sources/academy/tutorials/node_js/how_to_fix_target_closed.md
+++ b/sources/academy/tutorials/node_js/how_to_fix_target_closed.md
@@ -1,12 +1,10 @@
 ---
-title: How to fix the 'Target closed' error in Puppeteer and Playwright
+title: How to fix 'Target closed' error in Puppeteer and Playwright
 description: Learn about common causes for the 'Target closed' error in your browser automation workflow and what you can do to fix it.
 sidebar_position: 14.2
 slug: /node-js/how_to_fix_target-closed
 ---

-# How to fix 'Target closed' error in Puppeteer and Playwright
-
 **Learn about common causes for the 'Target closed' error in browser automation and what you can do to fix it.**

 ---
diff --git a/sources/academy/tutorials/node_js/index.md b/sources/academy/tutorials/node_js/index.md
index 9ed6c6db43..8513387ed2 100644
--- a/sources/academy/tutorials/node_js/index.md
+++ b/sources/academy/tutorials/node_js/index.md
@@ -6,8 +6,6 @@ category: tutorials
 slug: /node-js
 ---

-# Scraping with Node.js
-
 **A collection of various Node.js tutorials on scraping sitemaps, optimizing your scrapers, using popular Node.js web scraping libraries, and more.**

 ---
diff --git a/sources/academy/tutorials/node_js/js_in_html.md b/sources/academy/tutorials/node_js/js_in_html.md
index 1df5643639..26e4225524 100644
--- a/sources/academy/tutorials/node_js/js_in_html.md
+++ b/sources/academy/tutorials/node_js/js_in_html.md
@@ -5,8 +5,6 @@ sidebar_position: 14.5
 slug: /node-js/js-in-html
 ---

-# How to scrape hidden JavaScript objects in HTML {#what-is-js-in-html}
-
 **Learn about "hidden" data found within the JavaScript of certain pages, which can increase the scraper reliability and improve your development experience.**

 ---
diff --git a/sources/academy/tutorials/node_js/multiple-runs-scrape.md b/sources/academy/tutorials/node_js/multiple-runs-scrape.md
index a4f0895046..f56edce6ae 100644
--- a/sources/academy/tutorials/node_js/multiple-runs-scrape.md
+++ b/sources/academy/tutorials/node_js/multiple-runs-scrape.md
@@ -5,8 +5,6 @@ sidebar_position: 15.10
 slug: /node-js/multiple-runs-scrape
 ---

-# Scrape website in parallel with multiple Actor runs
-
 **Learn how to run multiple instances of an Actor to scrape a website faster. This tutorial will guide you through the process of setting up your scraper.**

 ---
diff --git a/sources/academy/tutorials/node_js/optimizing_scrapers.md b/sources/academy/tutorials/node_js/optimizing_scrapers.md
index 6160096780..7b60f90904 100644
--- a/sources/academy/tutorials/node_js/optimizing_scrapers.md
+++ b/sources/academy/tutorials/node_js/optimizing_scrapers.md
@@ -5,8 +5,6 @@ sidebar_position: 14.6
 slug: /node-js/optimizing-scrapers
 ---

-# How to optimize and speed up your web scraper {#optimizing-scrapers}
-
 **We all want our scrapers to run as cost-effective as possible. Learn how to think about performance in the context of web scraping and automation.**

 ---
diff --git a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md
index bb825da96e..347acbff59 100644
--- a/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md
+++ b/sources/academy/tutorials/node_js/processing_multiple_pages_web_scraper.md
@@ -9,11 +9,11 @@ Sometimes you need to process the same URL several times, but each time with a d

 Let's illustrate a solution to this problem by creating a scraper which starts with an array of keywords and inputs each of them to Google, one by one. Then it retrieves the results.

-> This isn't an efficient solution to searching keywords on Google. You could directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD`.
+:::note Tutorial focus

-# Enqueuing start pages for all keywords
+This tutorial demonstrates how to handle a common scenario where scrapers automatically deduplicate URLs. For the most efficient Google searches in production, directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD` instead of the form-submission approach shown here.

-> Solving a common problem with scraper automatically deduplicating the same URLs.
+:::

 First, we need to start the scraper on the page from which we're going to do our enqueuing. To do that, we create one start URL with the label "enqueue" and URL "https://example.com/". Now we can proceed to enqueue all the pages. The first part of our `pageFunction` will look like this:
@@ -47,7 +47,7 @@ To set the keywords, we're using the customData scraper parameter. This is usefu

 Since we're enqueuing the same page more than once, we need to set our own uniqueKey so the page will be added to the queue (by default uniqueKey is set to be the same as the URL). The label for the next page will be "fill-form". We're passing the keyword to the next page in the userData field (this can contain any data).

-# Inputting the keyword into Google
+## Inputting the keyword into Google

 Now we come to the next page (Google). We need to retrieve the keyword and input it into the Google search bar. This will be the next part of the pageFunction:
diff --git a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
index 4222bba2a3..2e67ef3540 100644
--- a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
+++ b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md
@@ -7,8 +7,6 @@ slug: /node-js/scraping-from-sitemaps

 import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

-# How to scrape from sitemaps {#scraping-with-sitemaps}
-
 :::tip Processing sitemaps automatically with Crawlee

 Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.
diff --git a/sources/academy/tutorials/node_js/scraping_shadow_doms.md b/sources/academy/tutorials/node_js/scraping_shadow_doms.md
index bf45a76839..cff1012c1e 100644
--- a/sources/academy/tutorials/node_js/scraping_shadow_doms.md
+++ b/sources/academy/tutorials/node_js/scraping_shadow_doms.md
@@ -5,8 +5,6 @@ sidebar_position: 14.8
 slug: /node-js/scraping-shadow-doms
 ---

-# How to scrape sites with a shadow DOM {#scraping-shadow-doms}
-
 **The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.**

 ---
diff --git a/sources/academy/tutorials/node_js/submitting_form_with_file_attachment.md b/sources/academy/tutorials/node_js/submitting_form_with_file_attachment.md
index 3977545a90..f600fe66f6 100644
--- a/sources/academy/tutorials/node_js/submitting_form_with_file_attachment.md
+++ b/sources/academy/tutorials/node_js/submitting_form_with_file_attachment.md
@@ -7,8 +7,6 @@ slug: /node-js/submitting-form-with-file-attachment

 When doing web automation with Apify, it can sometimes be necessary to submit an HTML form with a file attachment. This article will cover a situation where the file is publicly accessible (e.g. hosted somewhere) and will use an Apify Actor. If it's impossible to use request-promise, it might be necessary to use [Puppeteer](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment).

-# Downloading the file to memory
-
 **How to submit a form with attachment using request-promise.**

 ---
@@ -36,7 +34,7 @@ const fileData = await request({

 In this case, fileData will be a Buffer instead of a String.

-# Submitting the form
+## Submitting the form

 When the file is ready, we can submit the form as follows:
diff --git a/sources/academy/tutorials/python/index.md b/sources/academy/tutorials/python/index.md
index 5fc75f6af4..3ddb3360e8 100644
--- a/sources/academy/tutorials/python/index.md
+++ b/sources/academy/tutorials/python/index.md
@@ -6,8 +6,6 @@ category: tutorials
 slug: /python
 ---

-# Scraping with Python
-
 **A collection of various Python tutorials to aid you in your journey to becoming a master web scraping and automation developer.**

 ---
diff --git a/sources/academy/tutorials/python/process_data_using_python.md b/sources/academy/tutorials/python/process_data_using_python.md
index 5e72eaddb9..d6030597c2 100644
--- a/sources/academy/tutorials/python/process_data_using_python.md
+++ b/sources/academy/tutorials/python/process_data_using_python.md
@@ -1,12 +1,10 @@
 ---
-title: Process scraped data with Python
+title: How to process data in Python using Pandas
 description: Learn how to process the resulting data of a web scraper in Python using the Pandas library, and how to visualize the processed data using Matplotlib.
 sidebar_position: 2 # should be after scrape_data_python.md
 slug: /python/process-data-using-python
 ---

-# How to process data in Python using Pandas
-
 **Learn how to process the resulting data of a web scraper in Python using the Pandas library, and how to visualize the processed data using Matplotlib.**

 ---
diff --git a/sources/academy/tutorials/python/scrape_data_python.md b/sources/academy/tutorials/python/scrape_data_python.md
index df8dfcdbfd..440d296958 100644
--- a/sources/academy/tutorials/python/scrape_data_python.md
+++ b/sources/academy/tutorials/python/scrape_data_python.md
@@ -1,12 +1,10 @@
 ---
-title: How to scrape and process data using Python
+title: How to scrape data in Python using Beautiful Soup
 description: Learn how to create a Python Actor and use Python libraries to scrape, process and visualize data extracted from the web.
 sidebar_position: 1
 slug: /python/scrape-data-python
 ---

-# How to scrape data in Python using Beautiful Soup
-
 **Learn how to create a Python Actor and use Python libraries to scrape, process and visualize data extracted from the web.**

 ---
diff --git a/sources/academy/tutorials/tutorials/index.md b/sources/academy/tutorials/tutorials/index.md
index 18384a75fb..459483ef0e 100644
--- a/sources/academy/tutorials/tutorials/index.md
+++ b/sources/academy/tutorials/tutorials/index.md
@@ -6,8 +6,6 @@ category: tutorials
 slug: /tutorials
 ---

-# Tutorials 📚
-
 **Learn about various different specific topics related to web-scraping and web-automation with the Apify Academy tutorial lessons!**

 ---
diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
index 1388b31329..618b605fb8 100644
--- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
+++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md
@@ -5,8 +5,6 @@ sidebar_position: 3
 slug: /advanced-web-scraping/crawling/crawling-with-search
 ---

-# Scraping websites with search
-
 In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination.

 Limiting pagination is a common practice on e-commerce sites. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.
diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md
index 5cb7f63466..2c3bf31698 100644
--- a/sources/academy/webscraping/anti_scraping/index.md
+++ b/sources/academy/webscraping/anti_scraping/index.md
@@ -6,8 +6,6 @@ category: web scraping & automation
 slug: /anti-scraping
 ---

-# Anti-scraping protections {#anti-scraping-protections}
-
 **Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues.**

 ---
@@ -113,7 +111,7 @@ Because we here at Apify scrape for a living, we have discovered many popular an

 This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address.

-> Learn more about rate limiting [here](./techniques/rate_limiting.md)
+> Learn more about rate limiting in our [rate limiting guide](./techniques/rate_limiting.md)

 ### Header checking
@@ -135,7 +133,7 @@ This technique is commonly used to entirely block the bot from accessing the web

 ### Interval analysis

-This technique is based on analyzing the time intervals of the visit of a website. If the times are very similar, the entity is added to the greylist. This method's premise is that the bot runs in regular intervals by, for example, a CRON job that starts every Monday. It is a long-term strategy, so it should be used as an extension. This technique needs only the information from the HTTP request to identify the frequency of the visits.
+This technique is based on analyzing the time intervals of the visit of a website. If the times are very similar, the entity is added to the greylist. This method's premise is that the bot runs in regular intervals by, for example, a Cron job that starts every Monday. It is a long-term strategy, so it should be used as an extension. This technique needs only the information from the HTTP request to identify the frequency of the visits.

 ### Browser fingerprinting
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
index 516f1ff3bd..789a629a50 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/cloudflare_challenge.md
@@ -5,8 +5,6 @@ sidebar_position: 3
 slug: /anti-scraping/mitigation/cloudflare-challenge.md
 ---

-# Bypassing Cloudflare browser check {#cloudflare-challenge}
-
 **Learn how to bypass Cloudflare browser challenge with Crawlee.**

 ---
@@ -28,7 +26,7 @@ const crawler = new PlaywrightCrawler({

 It's important to note that by removing default blocked status code handling, you should also add custom session retire logic on blocked pages to reduce retries. Additionally, you should add waiting logic to start the automation logic only after the Cloudflare challenge is solved and the page is redirected. This can be accomplished by waiting for a common selector that is available on all pages, such as a header logo.

-In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies you can retire the session and use another IP. However, if you have a small pool of proxies you might want to whitelist the IP. To do this, you'll need to solve the captcha to improve your IP address's reputation. You can find various captcha-solving services, such as [AntiCaptcha](https://anti-captcha.com/), that you can use for this purpose. For more info check the section about [Captchas](../techniques/captchas.md).
+In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies you can retire the session and use another IP. However, if you have a small pool of proxies you might want to whitelist the IP. To do this, you'll need to solve the captcha to improve your IP address's reputation. You can find various captcha-solving services, such as [AntiCaptcha](https://anti-captcha.com/), that you can use for this purpose. For more info check the section about [CAPTCHAs](../techniques/captchas.md).
 ![Cloudflare captcha](https://images.ctfassets.net/slt3lc6tev37/6sN2VXiUaJpjxqVfTbZEJd/9a4e13cbf08ce29797167c133c534e1f/image1.png)
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md
index 4756662235..188260bc98 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/generating_fingerprints.md
@@ -5,8 +5,6 @@ sidebar_position: 3
 slug: /anti-scraping/mitigation/generating-fingerprints
 ---

-# Generating fingerprints {#generating-fingerprints}
-
 **Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.**

 ---
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/index.md b/sources/academy/webscraping/anti_scraping/mitigation/index.md
index c61efdcebe..e1e391dab2 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/index.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/index.md
@@ -1,12 +1,10 @@
 ---
-title: Mitigation
+title: Anti-scraping mitigation
 description: After learning about the various different anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.
 sidebar_position: 3.2
 slug: /anti-scraping/mitigation
 ---

-# Anti-scraping mitigation {#anti-scraping-mitigation}
-
 **After learning about the various different anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.**

 ---
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md
index 2498f1c401..7ca6c95880 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md
@@ -5,8 +5,6 @@ sidebar_position: 1
 slug: /anti-scraping/mitigation/proxies
 ---

-# Proxies {#about-proxies}
-
 **Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.**

 ---
diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
index 819a50c6fc..f824153761 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
@@ -5,8 +5,6 @@ sidebar_position: 2
 slug: /anti-scraping/mitigation/using-proxies
 ---

-# Using proxies {#using-proxies}
-
 **Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.**

 ---
diff --git a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md
index 521d11c9c4..148a7a915f 100644
--- a/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md
+++ b/sources/academy/webscraping/anti_scraping/techniques/browser_challenges.md
@@ -5,8 +5,6 @@ sidebar_position: 5
 slug: /anti-scraping/techniques/browser-challenges
 ---

-# Browser challenges {#fingerprinting}
-
 > Learn how to navigate browser challenges like Cloudflare's to effectively scrape data from protected websites.

 ## Browser challenges
@@ -31,4 +29,4 @@ If you want to learn how to bypass Cloudflare challenge visit the [Bypassing Clo

 ## Next up {#next}

-In the [next lesson](./captchas.md), we'll be covering **captchas**, which were mentioned throughout this lesson. It's important to note that attempting to solve a captcha programmatically is the last resort - always try to avoid being presented with the captcha in the first place by using the techniques mentioned in this lesson.
+In the [next lesson](./captchas.md), we'll be covering **CAPTCHAs**, which were mentioned throughout this lesson. It's important to note that attempting to solve a captcha programmatically is the last resort - always try to avoid being presented with the captcha in the first place by using the techniques mentioned in this lesson.
diff --git a/sources/academy/webscraping/anti_scraping/techniques/captchas.md b/sources/academy/webscraping/anti_scraping/techniques/captchas.md
index b8d24fb55e..466f947a89 100644
--- a/sources/academy/webscraping/anti_scraping/techniques/captchas.md
+++ b/sources/academy/webscraping/anti_scraping/techniques/captchas.md
@@ -1,13 +1,11 @@
 ---
 title: Captchas
-description: Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.
+description: Learn about the reasons a bot might be presented a captcha, the best ways to avoid CAPTCHAs in the first place, and how to programmatically solve them.
 sidebar_position: 5
 slug: /anti-scraping/techniques/captchas
 ---

-# Captchas {#captchas}
-
-**Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.**
+**Learn about the reasons a bot might be presented a captcha, the best ways to avoid CAPTCHAs in the first place, and how to programmatically solve them.**

 ---
@@ -16,7 +14,7 @@ In general, a website will present a user (or scraper) a captcha for 2 main reas

 1. The website always does captcha checks to access the desired content.
 2. One of the website's anti-bot measures (or the [WAF](./firewalls.md)) has flagged the user as suspicious.

-## Dealing with captchas {#dealing-with-captchas}
+## Dealing with CAPTCHAs {#dealing-with-captchas}

 When you've hit a captcha, your first thought should not be how to programmatically solve it. Rather, you should consider the factors as to why you received the captcha in the first place: your bot didn't appear enough like a real user to avoid being presented the challenge.
@@ -27,17 +25,17 @@ Have you expended all of the possible options to make your scraper appear more h
 - Generating and using a custom [browser fingerprint](./fingerprinting.md)?
 - Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser?

-## Solving captchas {#solving-captchas}
+## Solving CAPTCHAs {#solving-captchas}

-If you've tried everything you can to avoid being presented the captcha and are still facing this roadblock, there are methods to programmatically solve captchas.
+If you've tried everything you can to avoid being presented the captcha and are still facing this roadblock, there are methods to programmatically solve CAPTCHAs.

-Tons of different types of captchas exist, but one of the most popular is Google's [**reCAPTCHA**](https://www.google.com/recaptcha/about/).
+Tons of different types of CAPTCHAs exist, but one of the most popular is Google's [**reCAPTCHA**](https://www.google.com/recaptcha/about/). ![Google's reCAPTCHA](https://miro.medium.com/max/1400/1*4NhFKMxr-qXodjYpxtiE0w.gif) -**reCAPTCHA**s can be solved using the [Anti Captcha Recaptcha](https://apify.com/petr_cermak/anti-captcha-recaptcha) Actor on the Apify platform (note that this method requires an account on [anti-captcha.com](https://anti-captcha.com)). +**reCAPTCHA**s can be solved using the [Anti CAPTCHA reCAPTCHA](https://apify.com/petr_cermak/anti-captcha-recaptcha) Actor on the Apify platform (note that this method requires an account on [anti-captcha.com](https://anti-captcha.com)). -Another popular captcha is the [Geetest slider captcha](https://www.geetest.com/en/adaptive-captcha-demo). You can learn how to solve these types of captchas in Puppeteer by reading this [guide on solving Geetest slider captchas](https://filipvitas.medium.com/how-to-solve-geetest-slider-captcha-with-js-ac764c4e9905). Amazon's captcha can similarly also be solved programmatically. +Another popular captcha is the [Geetest slider captcha](https://www.geetest.com/en/adaptive-captcha-demo). You can learn how to solve these types of CAPTCHAs in Puppeteer by reading this [guide on solving Geetest slider CAPTCHAs](https://filipvitas.medium.com/how-to-solve-geetest-slider-captcha-with-js-ac764c4e9905). Amazon's captcha can similarly be solved programmatically. ## Wrap up diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index df882f4e17..788de4d0e8 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -5,8 +5,6 @@ sidebar_position: 2 slug: /anti-scraping/techniques/fingerprinting --- -# Fingerprinting {#fingerprinting} - **Understand browser fingerprinting, an advanced technique used by browsers to track user data and even block bots from accessing them.** --- @@ -87,7 +85,7 @@ navigator.permissions.query('some_permission'); ### With canvases {#with-canvases} -This technique is based on rendering [WebGL](https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API) scenes to a canvas element and observing the pixels rendered. WebGL rendering is tightly connected with the hardware, and therefore provides high entropy. Here's a quick breakdown of how it works: +This technique is based on rendering [WebGL](https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API) scenes to a canvas element and observing the rendered pixels. WebGL rendering is tightly connected with the hardware, and therefore provides high entropy. Here's a quick breakdown of how it works: 1. A JavaScript script creates a [`<canvas>` element](https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API) and renders some font or a custom shape. 2. The script then gets the pixel-map from the `<canvas>` element.
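For reference, the canvas-fingerprinting steps listed in the `fingerprinting.md` hunk above are easier to picture with a short browser-side sketch. This is an illustrative sketch only, not code from the lesson: it uses the simpler 2D canvas API rather than a full WebGL scene, and real fingerprinting scripts additionally hash the pixel data and mix it with many other signals.

```js
// Illustrative sketch of canvas fingerprinting (steps 1-2 above).
// Browser-side code; not taken from the lesson.
function getCanvasFingerprint() {
    // Step 1: create a <canvas> element and render some font or a custom shape.
    const canvas = document.createElement('canvas');
    canvas.width = 240;
    canvas.height = 60;
    const ctx = canvas.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = '16px Arial';
    ctx.fillStyle = '#f60';
    ctx.fillRect(120, 5, 80, 30);
    ctx.fillStyle = '#069';
    ctx.fillText('fingerprint test', 10, 10);

    // Step 2: read the rendered pixels back out. Font rasterization and
    // antialiasing differ across hardware and drivers, so this string varies
    // between devices; that variation is what provides the entropy.
    return canvas.toDataURL();
}
```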
diff --git a/sources/academy/webscraping/anti_scraping/techniques/firewalls.md b/sources/academy/webscraping/anti_scraping/techniques/firewalls.md index 2855adca15..cf190817ab 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/firewalls.md +++ b/sources/academy/webscraping/anti_scraping/techniques/firewalls.md @@ -5,8 +5,6 @@ sidebar_position: 4 slug: /anti-scraping/techniques/firewalls --- -# Firewalls {#firewalls} - **Understand what a web-application firewall is, how they work, and the various common techniques for avoiding them altogether.** --- diff --git a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md index d2603c4ea9..1364ba58dd 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/geolocation.md +++ b/sources/academy/webscraping/anti_scraping/techniques/geolocation.md @@ -5,8 +5,6 @@ sidebar_position: 3 slug: /anti-scraping/techniques/geolocation --- -# Geolocation {#geolocation} - **Learn about the geolocation techniques to determine where requests are coming from, and a bit about how to avoid being blocked based on geolocation.** --- diff --git a/sources/academy/webscraping/anti_scraping/techniques/index.md b/sources/academy/webscraping/anti_scraping/techniques/index.md index b1dfdb3ee3..ebea926b7a 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/index.md +++ b/sources/academy/webscraping/anti_scraping/techniques/index.md @@ -5,8 +5,6 @@ sidebar_position: 3.1 slug: /anti-scraping/techniques --- -# Anti-scraping techniques {#anti-scraping-techniques} - **Understand the various common (and obscure) anti-scraping techniques used by websites to prevent bots from accessing their content.** --- @@ -23,7 +21,7 @@ This is a complete block which usually has a response status code of **403**. Us ## Captcha page {#captcha} -Probably the most common blocking method. The website gives you a chance to prove that you are not a bot by presenting you with a captcha. We'll be covering captchas within this course. +Probably the most common blocking method. The website gives you a chance to prove that you are not a bot by presenting you with a captcha. We'll be covering CAPTCHAs within this course. 
## Redirect {#redirect} diff --git a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md index b61cd06abf..1706e28983 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md @@ -5,8 +5,6 @@ sidebar_position: 1 slug: /anti-scraping/techniques/rate-limiting --- -# Rate-limiting {#rate-limiting} - **Learn about rate-limiting, a common tactic used by websites to avoid a large and non-human rate of requests coming from a single IP address.** --- diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md index a639d45512..8afd602af8 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/cookies_headers_tokens.md @@ -1,12 +1,10 @@ --- -title: Cookies, headers, and tokens +title: Dealing with headers, cookies, and tokens description: Learn about how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received. sidebar_position: 2 slug: /api-scraping/general-api-scraping/cookies-headers-tokens --- -# Dealing with headers, cookies, and tokens {#challenges} - **Learn about how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received.** --- @@ -17,8 +15,8 @@ Luckily, there are ways to retrieve and set cookies for requests prior to sendin ## Cookies {#cookies} -1. For sites that heavily rely on cookies for user-verification and request authorization, certain generic requests (such as to the website's main page, or to the target page) will return back a (or multiple) `set-cookie` header(s). -2. The `set-cookie` response header(s) can be parsed and used as the `cookie` header in the headers of a request. A great package for parsing these values from a response's headers is [`set-cookie-parser`](https://www.npmjs.com/package/set-cookie-parser). With this package, cookies can be parsed from headers like so: +1. For sites that heavily rely on cookies for user-verification and request authorization, certain generic requests (such as to the website's main page, or to the target page) will return one or more `set-cookie` headers. +2. The `set-cookie` response headers can be parsed and used as the `cookie` header in the headers of a request. A great package for parsing these values from a response's headers is [`set-cookie-parser`](https://www.npmjs.com/package/set-cookie-parser).
With this package, cookies can be parsed from headers like so: ```js import axios from 'axios'; diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 57c43b040d..ec626f9703 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -5,8 +5,6 @@ sidebar_position: 3 slug: /api-scraping/general-api-scraping/handling-pagination --- -# Handling pagination {#handling-pagination} - **Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination.** --- diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/index.md b/sources/academy/webscraping/api_scraping/general_api_scraping/index.md index d6e60950b2..075d73b39c 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/index.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/index.md @@ -5,8 +5,6 @@ sidebar_position: 4.1 slug: /api-scraping/general-api-scraping --- -# General API scraping {#general-api-scraping} - **Learn the benefits and drawbacks of API scraping, how to locate an API, how to utilize its features, and how to work around common roadblocks.** --- diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index e7c4062d1e..156ea516d3 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -5,8 +5,6 @@ sidebar_position: 1 slug: /api-scraping/general-api-scraping/locating-and-learning --- -# Locating API endpoints {#locating-endpoints} - **Learn how to effectively locate a website's API endpoints, and learn how to use them to get the data you want faster and more reliably.** --- diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md index 16bdd5f8bc..56f51716e7 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/custom_queries.md @@ -5,8 +5,6 @@ sidebar_position: 3 slug: /api-scraping/graphql-scraping/custom-queries --- -# Custom queries {#custom-queries} - **Learn how to write custom GraphQL queries, how to pass input values into GraphQL requests as variables, and how to retrieve and output the data from a scraper.** --- diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/index.md b/sources/academy/webscraping/api_scraping/graphql_scraping/index.md index e3ed00b095..16588b5b5d 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/index.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/index.md @@ -5,8 +5,6 @@ sidebar_position: 4.2 slug: /api-scraping/graphql-scraping --- -# GraphQL scraping {#graphql-scraping} - **Dig into the topic of scraping APIs which use the latest and greatest API technology - GraphQL. 
GraphQL APIs are very different from regular REST APIs.** --- diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md index 2cb6bc46de..2f4638ba71 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md @@ -5,8 +5,6 @@ sidebar_position: 2 slug: /api-scraping/graphql-scraping/introspection --- -# Introspection {#introspection} - **Understand what introspection is, and how it can help you understand a GraphQL API to take advantage of the features it has to offer before writing any code.** --- diff --git a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md index 8a5dccdd68..c90a3450b2 100644 --- a/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md +++ b/sources/academy/webscraping/api_scraping/graphql_scraping/modifying_variables.md @@ -5,8 +5,6 @@ sidebar_position: 1 slug: /api-scraping/graphql-scraping/modifying-variables --- -# Modifying variables {#modifying-variables} - **Learn how to modify the variables of a JSON format GraphQL query to use the API without needing to write any GraphQL language or create custom queries.** --- diff --git a/sources/academy/webscraping/api_scraping/index.md b/sources/academy/webscraping/api_scraping/index.md index 20df96efc0..67c1b81a08 100644 --- a/sources/academy/webscraping/api_scraping/index.md +++ b/sources/academy/webscraping/api_scraping/index.md @@ -6,8 +6,6 @@ category: web scraping & automation slug: /api-scraping --- -# API scraping - **Learn all about how the professionals scrape various types of APIs with various configurations, parameters, and requirements.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/browser.md b/sources/academy/webscraping/puppeteer_playwright/browser.md index 490c972fd8..6970210e63 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/browser import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Browser {#browser} - **Understand what the Browser object is in Puppeteer/Playwright, how to create one, and a bit about how to interact with one.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md index 8891772f17..4d2d48a029 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/browser-contexts import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Creating multiple browser contexts {#creating-browser-contexts} - **Learn what a browser context is, how to create one, how to emulate devices, and how to use browser contexts to automate multiple sessions at one time.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md index bb17676869..21f967203b 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md +++ 
b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/downloading_files.md @@ -5,8 +5,6 @@ sidebar_position: 3 slug: /puppeteer-playwright/common-use-cases/downloading-files --- -# Downloading files - **Learn how to automatically download and save files to the disk using two of the most popular web automation libraries, Puppeteer and Playwright.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md index c79c8ae338..d4fa435fac 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/index.md @@ -5,8 +5,6 @@ sidebar_position: 7.7 slug: /puppeteer-playwright/common-use-cases --- -# Common use cases {#common-use-cases} - **Learn about some of the most common use cases of Playwright and Puppeteer, and how to handle these use cases when you run into them.** --- @@ -15,9 +13,9 @@ You can do about anything with a headless browser, but, there are some extremely 1. Login flow (logging into an account) 2. Paginating through results on a website -3. Solving browser challenges (ex. captchas) +3. Solving browser challenges (e.g., CAPTCHAs) 4. More! -# Next up {#next} +## Next up {#next} The [first lesson](./logging_into_a_website.md) of this section is all about logging into a website and running multiple concurrent operations within a user's account. diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md index 6cd7ef0435..a3fad312e0 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/logging_into_a_website.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/common-use-cases/logging-into-a-website import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Logging into a website {#logging-into-a-website} - **Understand the "login flow" - logging into a website, then maintaining a logged in status within different browser contexts for an efficient automation process.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index b4833ddad6..6ded8c03b4 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/common-use-cases/paginating-through-results import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Paginating through results {#paginating-through-results} - **Learn how to paginate through results on websites that use either pagination based on page numbers or dynamic lazy loading.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md index 760baf0c49..85f218e80a 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/scraping_iframes.md @@ -5,8 +5,6 @@ sidebar_position: 5 slug:
/puppeteer-playwright/common-use-cases/scraping-iframes --- -# Scraping iFrames - **Extracting data from iFrames can be frustrating. In this tutorial, we will learn how to scrape information from iFrames using Puppeteer or Playwright.** --- @@ -17,7 +15,7 @@ Getting information from inside iFrames is a known pain, especially for new deve If you are using basic methods of page objects like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you are using with page object can be used the same way with frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use. -As a demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). +As a demonstration, we'll scrape the Twitter widget iFrame from [IMDb](https://www.imdb.com/). ```js import puppeteer from 'puppeteer'; diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md index ce4f733086..8a20b90150 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md @@ -5,8 +5,6 @@ sidebar_position: 4 slug: /puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment --- -# Submitting a form with a file attachment - **Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index 4fb52aa83a..95313816cb 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/executing-scripts/collecting-data import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Extracting data {#extracting-data} - **Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/index.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/index.md index e8e08e2546..a43bd32dcc 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/index.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/executing-scripts import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Executing scripts {#executing-scripts} - **Understand the two different contexts which your code can be run in, and how to run custom scripts in the context of the browser.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md index b132d0ac51..dfb942a3a0 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md +++ 
b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/injecting_code.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/executing-scripts/injecting-code import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Injecting code {#injecting-code} - **Learn how to inject scripts prior to a page's load (pre-injecting), as well as how to expose functions to be run at a later time on the page.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/index.md b/sources/academy/webscraping/puppeteer_playwright/index.md index 77f8781993..18343b5cea 100644 --- a/sources/academy/webscraping/puppeteer_playwright/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/index.md @@ -1,5 +1,5 @@ --- -title: Puppeteer & Playwright +title: Puppeteer and Playwright course description: Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright. sidebar_position: 3 category: web scraping & automation @@ -9,8 +9,6 @@ slug: /puppeteer-playwright import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Puppeteer & Playwright course {#puppeteer-playwright-course} - **Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/page/index.md b/sources/academy/webscraping/puppeteer_playwright/page/index.md index d96db0d003..e0f9391cfd 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/index.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/index.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/page import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Opening a page {#opening-a-page} - **Learn how to create and open a Page with a Browser, and how to use it to visit and programmatically interact with a website.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md index ec1c5d0db7..99d0ecbbbb 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/interacting_with_a_page.md @@ -8,15 +8,13 @@ slug: /puppeteer-playwright/page/interacting-with-a-page import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Interacting with a page {#interacting-with-a-page} - **Learn how to programmatically do actions on a page such as clicking, typing, and pressing keys. Also, discover a common roadblock that comes up when automating.** --- The **Page** object has a whole boat-load of functions which can be used to interact with the loaded page. We're not going to go over every single one of them right now, but we _will_ use a few of the most common ones to add some functionality to our current project. -Let's say that we want to automate searching for **hello world** on Google, then click on the first result and log the title of the page to the console, then take a screenshot and write it it to the filesystem. In order to understand how we're going to automate this, let's break down how we would do it manually: +Let's say that we want to automate searching for **hello world** on Google, then click on the first result and log the title of the page to the console, then take a screenshot and write it to the filesystem. 
In order to understand how we're going to automate this, let's break down how we would do it manually: 1. Click on the button which accepts Google's cookies policy (To see how it looks, open Google in an anonymous window.) 2. Type **hello world** into the search bar diff --git a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md index 7ca8026574..8e4a265cdf 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/page/page-methods import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Page methods {#page-methods} - **Understand that the Page object has many different methods to offer, and learn how to use two of them to capture a page's title and take a screenshot.** --- @@ -34,9 +32,9 @@ const title = await page.title(); // Log the title to the console console.log(title); ``` - + ## Screenshotting {#screenshotting} - + The `page.screenshot()` function will return a buffer which can be written to the filesystem as an image: ```js diff --git a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md index 768a13fe98..fee881f213 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/waiting.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/waiting.md @@ -1,5 +1,5 @@ --- -title: Waiting for content & events +title: Waiting for elements and events description: Learn the importance of waiting for content and events before running interaction or extraction code, as well as the best practices for doing so. sidebar_position: 2 slug: /puppeteer-playwright/page/waiting @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/page/waiting import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Waiting for elements and events {#waiting-for-elements-and-events} - **Learn the importance of waiting for content and events before running interaction or extraction code, as well as the best practices for doing so.** --- @@ -67,7 +65,7 @@ await page.click('.g a'); await page.waitForNavigation(); ``` -Though in theory this is correct, it can result in a race condition in which the page navigates quickly before the `page.waitForNavigation()` function is ever run, which means that once it is finally called, it will hang and wait forever for the [`load` event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event) event to fire even though it already fired. To solve this, we can stick the waiting logic and the clicking logic into a `Promise.all()` call (placing `page.waitForNavigation()` first). +Though in theory this is correct, it can result in a race condition in which the page navigates quickly before the `page.waitForNavigation()` function is ever run, which means that once it is finally called, it will hang and wait forever for the [`load` event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event) to fire even though it already fired. To solve this, we can stick the waiting logic and the clicking logic into a `Promise.all()` call (placing `page.waitForNavigation()` first). 
```js await Promise.all([page.waitForNavigation(), page.click('.g a')]); ``` diff --git a/sources/academy/webscraping/puppeteer_playwright/proxies.md b/sources/academy/webscraping/puppeteer_playwright/proxies.md index 60c1d04416..0e81f41db4 100644 --- a/sources/academy/webscraping/puppeteer_playwright/proxies.md +++ b/sources/academy/webscraping/puppeteer_playwright/proxies.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/proxies import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Using proxies {#using-proxies} - **Understand how to use proxies in your Puppeteer and Playwright requests, as well as a couple of the most common use cases for proxies.** --- diff --git a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md index 2d7d4d0072..cf873efa79 100644 --- a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md +++ b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md @@ -8,8 +8,6 @@ slug: /puppeteer-playwright/reading-intercepting-requests import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Reading & intercepting requests {#reading-intercepting-requests} - **You can use DevTools, but did you know that you can do all the same stuff (plus more) programmatically? Read and intercept requests in Puppeteer/Playwright.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/best_practices.md b/sources/academy/webscraping/scraping_basics_javascript/best_practices.md index b3e1540cc4..754f44ea43 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/best_practices.md +++ b/sources/academy/webscraping/scraping_basics_javascript/best_practices.md @@ -1,12 +1,10 @@ --- -title: Best practices +title: Best practices when writing scrapers description: Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code. sidebar_position: 1.5 slug: /web-scraping-for-beginners/best-practices --- -# Best practices when writing scrapers {#best-practices} - **Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/challenge/index.md b/sources/academy/webscraping/scraping_basics_javascript/challenge/index.md index 3ab9ca4ee1..ae1e8bf294 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/challenge/index.md +++ b/sources/academy/webscraping/scraping_basics_javascript/challenge/index.md @@ -5,8 +5,6 @@ sidebar_position: 1.4 slug: /web-scraping-for-beginners/challenge --- -# Challenge - **Test your knowledge acquired in the previous sections of this course by building an Amazon scraper using Crawlee's CheerioCrawler!** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/challenge/initializing_and_setting_up.md b/sources/academy/webscraping/scraping_basics_javascript/challenge/initializing_and_setting_up.md index c0cf40bc11..cc3e05ac1b 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/challenge/initializing_and_setting_up.md +++ b/sources/academy/webscraping/scraping_basics_javascript/challenge/initializing_and_setting_up.md @@ -1,12 +1,10 @@ --- -title: Initializing & setting up +title: Initialization and setup description: When you extract links from a web page, you often end up with a lot of irrelevant URLs.
Learn how to filter the links to only keep the ones you need. sidebar_position: 1 slug: /web-scraping-for-beginners/challenge/initializing-and-setting-up --- -# Initialization & setting up - **When you extract links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/challenge/modularity.md b/sources/academy/webscraping/scraping_basics_javascript/challenge/modularity.md index e6d62c7b32..da46a9d4a7 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/challenge/modularity.md +++ b/sources/academy/webscraping/scraping_basics_javascript/challenge/modularity.md @@ -5,8 +5,6 @@ sidebar_position: 2 slug: /web-scraping-for-beginners/challenge/modularity --- -# Modularity - **Before you build your first web scraper with Crawlee, it is important to understand the concept of modularity in programming.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/challenge/scraping_amazon.md b/sources/academy/webscraping/scraping_basics_javascript/challenge/scraping_amazon.md index fa82915930..2b46d66e61 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/challenge/scraping_amazon.md +++ b/sources/academy/webscraping/scraping_basics_javascript/challenge/scraping_amazon.md @@ -5,8 +5,6 @@ sidebar_position: 4 slug: /web-scraping-for-beginners/challenge/scraping-amazon --- -# Scraping Amazon - **Build your first web scraper with Crawlee. Let's extract product information from Amazon to give you an idea of what real-world scraping looks like.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/exporting_data.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/exporting_data.md index d0d4baad8d..8101d3d19c 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/exporting_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/exporting_data.md @@ -5,8 +5,6 @@ sidebar_position: 9 slug: /web-scraping-for-beginners/crawling/exporting-data --- -# Exporting data {#exporting-data} - **Learn how to export the data you scraped using Crawlee to CSV or JSON.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/filtering_links.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/filtering_links.md index 34d4961aaa..c54918496e 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/filtering_links.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/filtering_links.md @@ -8,8 +8,6 @@ slug: /web-scraping-for-beginners/crawling/filtering-links import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Filtering links {#filtering-links} - **When you extract links from a web page, you often end up with a lot of irrelevant URLs. 
Learn how to filter the links to only keep the ones you need.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/first_crawl.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/first_crawl.md index 432d06f646..557b0a4877 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/first_crawl.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/first_crawl.md @@ -5,8 +5,6 @@ sidebar_position: 5 slug: /web-scraping-for-beginners/crawling/first-crawl --- -# Your first crawl {#your-first-crawl} - **Learn how to crawl the web using Node.js, Cheerio and an HTTP client. Extract URLs from pages and use them to visit more websites.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/headless_browser.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/headless_browser.md index b57a810645..8d067b06d5 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/headless_browser.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/headless_browser.md @@ -8,8 +8,6 @@ slug: /web-scraping-for-beginners/crawling/headless-browser import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Headless browsers {#headless-browser} - **Learn how to scrape the web with a headless browser using only a few lines of code. Chrome, Firefox, Safari, Edge - all are supported.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/pro_scraping.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/pro_scraping.md index b4b1616417..7985a0da0a 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/pro_scraping.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/pro_scraping.md @@ -5,8 +5,6 @@ sidebar_position: 7 slug: /web-scraping-for-beginners/crawling/pro-scraping --- -# Professional scraping 👷 {#pro-scraping} - **Learn how to build scrapers quicker and get better and more robust results by using Crawlee, an open-source library for scraping in Node.js.** --- @@ -53,7 +51,7 @@ To use Crawlee, we have to install it from npm. Let's add it to our project from npm install crawlee ``` -After the installation completes, create a new file called **crawlee.js** and add the following code to it: +After the installation completes, create a new file called `crawlee.js` and add the following code to it: ```js title=crawlee.js import { CheerioCrawler } from 'crawlee'; diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/recap_extraction_basics.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/recap_extraction_basics.md index cdeea8cd58..fa6899b64b 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/recap_extraction_basics.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/recap_extraction_basics.md @@ -1,12 +1,10 @@ --- -title: Recap - Data extraction +title: Recap of data extraction basics description: Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data.
sidebar_position: 1 slug: /web-scraping-for-beginners/crawling/recap-extraction-basics --- -# Recap of data extraction basics {#quick-recap} - **Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/relative_urls.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/relative_urls.md index f9487c80a8..2f06cf1570 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/relative_urls.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/relative_urls.md @@ -5,8 +5,6 @@ sidebar_position: 4 slug: /web-scraping-for-beginners/crawling/relative-urls --- -# Relative URLs {#filtering-links} - **Learn about absolute and relative URLs used on web pages and how to work with them when parsing HTML with Cheerio in your scraper.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/crawling/scraping_the_data.md b/sources/academy/webscraping/scraping_basics_javascript/crawling/scraping_the_data.md index 734c637d67..6de6df472a 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/crawling/scraping_the_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript/crawling/scraping_the_data.md @@ -5,8 +5,6 @@ sidebar_position: 6 slug: /web-scraping-for-beginners/crawling/scraping-the-data --- -# Scraping data {#scraping-data} - **Learn how to add data extraction logic to your crawler, which will allow you to extract data from all the websites you crawled.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md index c4b9baf78b..eb27b7658a 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/computer_preparation.md @@ -1,12 +1,10 @@ --- -title: Computer preparation +title: Prepare your computer for programming description: Set up your computer to be able to code scrapers with Node.js and JavaScript. Download Node.js and npm and run a Hello World script. sidebar_position: 4 slug: /web-scraping-for-beginners/data-extraction/computer-preparation --- -# Prepare your computer for programming {#prepare-computer} - **Set up your computer to be able to code scrapers with Node.js and JavaScript. 
Download Node.js and npm and run a Hello World script.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/index.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/index.md index 0482b5eb38..1c4e2c2f4c 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/index.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/index.md @@ -6,8 +6,6 @@ category: courses slug: /web-scraping-for-beginners/data-extraction --- -# Basics of data extraction {#basics} - **Learn about HTML, CSS, and JavaScript, the basic building blocks of a website, and how to use them in web scraping and data extraction.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/node_continued.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/node_continued.md index 1fdb51e7e7..f5e10bed50 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/node_continued.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/node_continued.md @@ -173,6 +173,4 @@ Congratulations! You completed the **Basics of data extraction** section of the Great job! 👏🎉 -# Next up {#next} - What's next? While we were able to extract the data, it's not super useful to have it printed to the terminal. In the [next, bonus lesson](./save_to_csv.md), we will learn how to convert the data to a CSV and save it to a file. diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/project_setup.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/project_setup.md index 72b146a408..acd1a18422 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/project_setup.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/project_setup.md @@ -1,12 +1,10 @@ --- -title: Project setup +title: Setting up your project description: Create a new project with npm and Node.js. Install necessary libraries, and test that everything works before starting the next lesson. sidebar_position: 5 slug: /web-scraping-for-beginners/data-extraction/project-setup --- -# Setting up your project {#setting-up} - **Create a new project with npm and Node.js.
Install necessary libraries, and test that everything works before starting the next lesson.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/save_to_csv.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/save_to_csv.md index b6ec1b7df4..0956a0dccf 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/save_to_csv.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/save_to_csv.md @@ -5,8 +5,6 @@ sidebar_position: 8 slug: /web-scraping-for-beginners/data-extraction/save-to-csv --- -# Saving results to CSV {#saving-to-csv} - **Learn how to save the results of your scraper's collected data to a CSV file that can be opened in Excel, Google Sheets, or any other spreadsheets program.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/index.md b/sources/academy/webscraping/scraping_basics_javascript/index.md index 4ffd9c8497..1443e481b2 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/index.md +++ b/sources/academy/webscraping/scraping_basics_javascript/index.md @@ -7,8 +7,6 @@ category: web scraping slug: /web-scraping-for-beginners --- -# Web scraping basics for JavaScript devs {#welcome} - **Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.** --- diff --git a/sources/academy/webscraping/scraping_basics_javascript/introduction.md b/sources/academy/webscraping/scraping_basics_javascript/introduction.md index aff6571d1f..2aeab97644 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/introduction.md +++ b/sources/academy/webscraping/scraping_basics_javascript/introduction.md @@ -6,8 +6,6 @@ category: courses slug: /web-scraping-for-beginners/introduction --- -# Introduction {#introduction} - **Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper.** ---
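On a related note, the `save_to_csv.md` lesson touched above boils down to just a few lines of code. Below is a minimal sketch of the save-to-CSV step, assuming the `json2csv` npm package; the data and file name are illustrative, and the lesson itself may use different helpers.

```js
// Minimal sketch: convert scraped records to CSV and write them to disk.
// Assumes `npm install json2csv`; the data below is made up.
import { writeFileSync } from 'node:fs';
import { parse } from 'json2csv';

const results = [
    { title: 'Product A', price: 9.99 },
    { title: 'Product B', price: 24.5 },
];

// parse() infers the CSV header row from the objects' keys.
const csv = parse(results);
writeFileSync('results.csv', csv);
```

The resulting `results.csv` then opens in Excel, Google Sheets, or any other spreadsheet program.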