From fada6bf5cd9109b1bd4b306d831387254fcb11d5 Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Tue, 17 Sep 2024 00:18:40 +0200 Subject: [PATCH 1/9] feat(academy): add advanced crawling section with sitemaps and search --- .../node_js/scraping_from_sitemaps.md | 10 +++ .../crawling/crawling-sitemaps.md | 71 +++++++++++++++++++ .../crawling-with-search.md} | 17 +++-- .../crawling/sitemaps-vs-search.md | 55 ++++++++++++++ .../advanced_web_scraping/index.md | 29 +++++--- 5 files changed, 165 insertions(+), 17 deletions(-) create mode 100644 sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md rename sources/academy/webscraping/advanced_web_scraping/{scraping_paginated_sites.md => crawling/crawling-with-search.md} (95%) create mode 100644 sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md diff --git a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md index f8cbdb6953..e212d0a009 100644 --- a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md +++ b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md @@ -9,6 +9,16 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js'; # How to scrape from sitemaps {#scraping-with-sitemaps} +>Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code: + +```javascript +import { RobotsFile } from 'crawlee'; + +const robots = await RobotsFile.find('https://www.mysite.com'); + +const allWebsiteUrls = await robots.parseUrlsFromSitemaps(); +``` + **The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.** --- diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md new file mode 100644 index 0000000000..407c12a362 --- /dev/null +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -0,0 +1,71 @@ +--- +title: Crawling sitemaps +description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. +menuWeight: 2 +paths: +- advanced-web-scraping/crawling/crawling-sitemaps +--- + +In the previous lesson, we learned what is the utility (and dangers) of crawling sitemaps. In this lesson, we will go in-depth to how to crawl sitemaps. + +We will look at the following topics: +- How to find sitemap URLs +- How to set up HTTP requests to download sitemaps +- How to parse URLs from sitemaps +- Using Crawlee to get all URLs in a few lines of code + +## [](#how-to-find-sitemap-urls) How to find sitemap URLs +Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc. + +### [](#google) Google +You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. 
The success of this approach depends on the website telling Google to index the sitemap file itself which is rather uncommon. + +### [](#robots-txt) robots.txt +If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive. + +### [](#common-url-paths) Common URL paths +You can try to iterate over common URL paths like: +``` +/sitemap.xml +/product_index.xml +/product_template.xml +/sitemap_index.xml +/sitemaps/sitemap_index.xml +/sitemap/product_index.xml +/media/sitemap.xml +/media/sitemap/sitemap.xml +/media/sitemap/index.xml +``` + +Make also sure you test the list with `.gz`, `.tar.gz` and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`). + +Some websites also provide an HTML version, to help indexing bots find new content. Those include: + +``` +/sitemap +/category-sitemap +/sitemap.html +/sitemap_index +``` + +Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually. + +## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps +For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated so we recommend just [using Crawlee](#using-crawlee) which handles streamed and compressed sitemaps by default. + +## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps +The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. + +## [](#using-crawlee) Using Crawlee +Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code: + +```javascript +import { RobotsFile } from 'crawlee'; + +const robots = await RobotsFile.find('https://www.mysite.com'); + +const allWebsiteUrls = await robots.parseUrlsFromSitemaps(); +``` + +## [](#next) Next up +That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination. 
diff --git a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md similarity index 95% rename from sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md rename to sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index badda5bb01..d6b351016e 100644 --- a/sources/academy/webscraping/advanced_web_scraping/scraping_paginated_sites.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -1,17 +1,16 @@ --- -title: Overcoming pagination limits +title: Crawling with search description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. -sidebar_position: 1 -slug: /advanced-web-scraping/scraping-paginated-sites +menuWeight: 3 +paths: +- advanced-web-scraping/crawling/crawling-with-search --- -# Scraping websites with limited pagination +# Scraping websites with search -**Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.** +In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination. ---- - -Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results, only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic. +Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic. ![Pagination in on Google search results page](./images/pagination.png) @@ -283,7 +282,7 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md new file mode 100644 index 0000000000..50922778c4 --- /dev/null +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -0,0 +1,55 @@ +--- +title: Sitemaps vs search +description: Learn how to extract all of a website's listings even if they limit the number of results pages. 
+menuWeight: 1 +paths: +- advanced-web-scraping/crawling/sitemaps-vs-search +--- + +The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. + +Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. + +There are two main approaches to solving this problem: +- Extracting all page URLs from the website's **sitemap**. +- Using **categories, search and filters** to split the website so we get under the pagination limit. + +Both of these approaches have their pros and cons so the best solution is to **use both and combine the results**. Here we will learn why. + +## Pros and cons of sitemaps +Sitemap is usually a simple XML file that contains a list of all pages on the website. They are created and maintained mainly for search engines like Google to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson. + +### Pros +- **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code. +- **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds. +- **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website. + +### Cons +- **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs. +- **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week. +- **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all. +- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson. + +## Pros and cons of categories, search, and filters +This approach means traversing the website like a normal user do by going through categories, setting up different filters, ranges and sorting options. The goal is to traverse it is a way that ensures we covered all categories/ranges where products can be located and for each of those we stayed under the pagination limit. + +The pros and cons of this approach are pretty much the opposite of the sitemaps approach. + +### Pros +- **Directly reflects the website** - With most scraping use-cases, we want to analyze the website as the regular users see it. 
By going through the intended user flow, we ensure that we are getting the same pages as the users. +- **Updated in real-time** - The website is updated in real-time so we can be sure that we are getting all pages. +- **Often contain detailed data** - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages. + +### Cons +- **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons. +- **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found. +- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this. + +## Do we know how many products there are? +Fortunately, most websites list a total number of detail pages somewhere. It might be displayed on the home page or search results or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it. + +Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well. + +## [](#next) Next up + +First, we will look into the easier approach, the [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of the category, search and filter crawling, and build up a generic framework that we can use on any website. At last, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual controls. diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 0b20f907ef..a89fea233e 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -1,21 +1,34 @@ --- title: Advanced web scraping -description: Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers. -sidebar_position: 6 +description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers. +menuWeight: 6 category: web scraping & automation -slug: /advanced-web-scraping +paths: +- advanced-web-scraping --- # Advanced web scraping -**Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.** +In [**Web scraping for beginners**](/academy/webscraping/scraping_basics_javascript/index.md) course, we have learned the necessary basics required to create a scraper. 
In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. ---- +In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper. + +## [](#what-does-production-ready-mean) What does production-ready mean? + +To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are. -In this course, we'll be tackling some of the most challenging and advanced web-scraping cases, such as mobile-app scraping, scraping sites with limited pagination, and handling large-scale cases where millions of items are scraped. Are **you** ready to take your scrapers to the next level? + + +We will also touch on monitoring, performance, anti-scraping protections, and debugging. If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎 -## First up {#first-up} +## [](#first-up) First up + +First, we will explore [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling) that will help us to find all pages or products on the website. + -This course's [first lesson](./scraping_paginated_sites.md) dives head-first into one of the most valuable skills you can have as a scraper developer: **Scraping paginated sites**. From a6f7150fa4ead5a95a5caf0c92c0ba6d09a0051c Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Wed, 18 Sep 2024 15:34:01 +0200 Subject: [PATCH 2/9] fix(academy): try to resolve bad links and redirect in nginx --- nginx.conf | 3 ++ .../crawling/crawling-sitemaps.md | 42 ++++++++++-------- .../crawling/crawling-with-search.md | 1 - .../images/pagination-filters.png | Bin .../{ => crawling}/images/pagination.png | Bin .../crawling/sitemaps-vs-search.md | 7 +++ .../advanced_web_scraping/index.md | 2 +- 7 files changed, 35 insertions(+), 20 deletions(-) rename sources/academy/webscraping/advanced_web_scraping/{ => crawling}/images/pagination-filters.png (100%) rename sources/academy/webscraping/advanced_web_scraping/{ => crawling}/images/pagination.png (100%) diff --git a/nginx.conf b/nginx.conf index 04d15af4bf..d8c203ee21 100644 --- a/nginx.conf +++ b/nginx.conf @@ -302,6 +302,9 @@ server { rewrite ^/platform/actors/development/actor-definition/output-schema$ /platform/actors/development/actor-definition/dataset-schema permanent; rewrite ^academy/deploying-your-code/output-schema$ /academy/deploying-your-code/dataset-schema permanent; + # Academy restructuring + rewrite ^academy/advanced-web-scraping/scraping-paginated-sites$ /academy/advanced-web-scraping/crawling/crawling-with-search permanent; + # Removed pages # GPT plugins were discontinued April 9th, 2024 - https://help.openai.com/en/articles/8988022-winding-down-the-chatgpt-plugins-beta rewrite ^/platform/integrations/chatgpt-plugin$ https://blog.apify.com/add-custom-actions-to-your-gpts/ redirect; diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index 407c12a362..8912f87b92 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ 
b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -9,54 +9,59 @@ paths: In the previous lesson, we learned what is the utility (and dangers) of crawling sitemaps. In this lesson, we will go in-depth to how to crawl sitemaps. We will look at the following topics: + - How to find sitemap URLs - How to set up HTTP requests to download sitemaps - How to parse URLs from sitemaps - Using Crawlee to get all URLs in a few lines of code ## [](#how-to-find-sitemap-urls) How to find sitemap URLs + Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc. ### [](#google) Google + You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself which is rather uncommon. ### [](#robots-txt) robots.txt + If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive. ### [](#common-url-paths) Common URL paths + You can try to iterate over common URL paths like: -``` -/sitemap.xml -/product_index.xml -/product_template.xml -/sitemap_index.xml -/sitemaps/sitemap_index.xml -/sitemap/product_index.xml -/media/sitemap.xml -/media/sitemap/sitemap.xml -/media/sitemap/index.xml -``` + +- /sitemap.xml +- /product_index.xml +- /product_template.xml +- /sitemap_index.xml +- /sitemaps/sitemap_index.xml +- /sitemap/product_index.xml +- /media/sitemap.xml +- /media/sitemap/sitemap.xml +- /media/sitemap/index.xml Make also sure you test the list with `.gz`, `.tar.gz` and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`). Some websites also provide an HTML version, to help indexing bots find new content. Those include: -``` -/sitemap -/category-sitemap -/sitemap.html -/sitemap_index -``` +- /sitemap +- /category-sitemap +- /sitemap.html +- /sitemap_index Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually. ## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps + For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated so we recommend just [using Crawlee](#using-crawlee) which handles streamed and compressed sitemaps by default. ## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps -The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. + +The easiest part is to parse the actual URLs from the sitemap. 
The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. ## [](#using-crawlee) Using Crawlee + Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code: ```javascript @@ -68,4 +73,5 @@ const allWebsiteUrls = await robots.parseUrlsFromSitemaps(); ``` ## [](#next) Next up + That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination. diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index 3e971e53eb..39d6c6e520 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -285,4 +285,3 @@ await crawler.addRequests(requestsToEnqueue); And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). - diff --git a/sources/academy/webscraping/advanced_web_scraping/images/pagination-filters.png b/sources/academy/webscraping/advanced_web_scraping/crawling/images/pagination-filters.png similarity index 100% rename from sources/academy/webscraping/advanced_web_scraping/images/pagination-filters.png rename to sources/academy/webscraping/advanced_web_scraping/crawling/images/pagination-filters.png diff --git a/sources/academy/webscraping/advanced_web_scraping/images/pagination.png b/sources/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png similarity index 100% rename from sources/academy/webscraping/advanced_web_scraping/images/pagination.png rename to sources/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index 50922778c4..2aac814718 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -17,35 +17,42 @@ There are two main approaches to solving this problem: Both of these approaches have their pros and cons so the best solution is to **use both and combine the results**. Here we will learn why. ## Pros and cons of sitemaps + Sitemap is usually a simple XML file that contains a list of all pages on the website. They are created and maintained mainly for search engines like Google to help ensure that the website gets fully indexed there. 
They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson. ### Pros + - **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code. - **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds. - **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website. ### Cons + - **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs. - **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week. - **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all. - **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson. ## Pros and cons of categories, search, and filters + This approach means traversing the website like a normal user do by going through categories, setting up different filters, ranges and sorting options. The goal is to traverse it is a way that ensures we covered all categories/ranges where products can be located and for each of those we stayed under the pagination limit. The pros and cons of this approach are pretty much the opposite of the sitemaps approach. ### Pros + - **Directly reflects the website** - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users. - **Updated in real-time** - The website is updated in real-time so we can be sure that we are getting all pages. - **Often contain detailed data** - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages. ### Cons + - **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons. - **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found. - **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this. ## Do we know how many products there are? 
+ Fortunately, most websites list a total number of detail pages somewhere. It might be displayed on the home page or search results or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it. Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well. diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index a89fea233e..6bb6566a50 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -9,7 +9,7 @@ paths: # Advanced web scraping -In [**Web scraping for beginners**](/academy/webscraping/scraping_basics_javascript/index.md) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. +In [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper. From 58f576fc7a0f5ede316c5b5001f4aaf4eb1dff75 Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Wed, 18 Sep 2024 15:47:13 +0200 Subject: [PATCH 3/9] more link fixes --- .../advanced_web_scraping/crawling/crawling-sitemaps.md | 2 +- .../advanced_web_scraping/crawling/crawling-with-search.md | 2 +- .../advanced_web_scraping/crawling/sitemaps-vs-search.md | 3 ++- .../common_use_cases/paginating_through_results.md | 2 +- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index 8912f87b92..adb279f181 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -58,7 +58,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded X ## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps -The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. +The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). 
[This article](/academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. ## [](#using-crawlee) Using Crawlee diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index 39d6c6e520..e73c90d53d 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -282,6 +282,6 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index 2aac814718..8f57044e2e 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -6,11 +6,12 @@ paths: - advanced-web-scraping/crawling/sitemaps-vs-search --- -The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. +The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. There are two main approaches to solving this problem: + - Extracting all page URLs from the website's **sitemap**. - Using **categories, search and filters** to split the website so we get under the pagination limit. 
diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index 4bab157650..2342f40d4c 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. -![Amazon pagination](../../advanced_web_scraping/images/pagination.png) +![Amazon pagination](/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png) ## Page number-based pagination {#page-number-based-pagination} From 9d531da39590d371c3c3ef8efad2f46c972b7529 Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Wed, 18 Sep 2024 15:54:50 +0200 Subject: [PATCH 4/9] another try --- .../advanced_web_scraping/crawling/crawling-sitemaps.md | 2 +- .../advanced_web_scraping/crawling/crawling-with-search.md | 2 +- sources/academy/webscraping/advanced_web_scraping/index.md | 2 -- .../common_use_cases/paginating_through_results.md | 2 +- 4 files changed, 3 insertions(+), 5 deletions(-) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index adb279f181..7564518317 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -58,7 +58,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded X ## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps -The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. +The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. ## [](#using-crawlee) Using Crawlee diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index e73c90d53d..13b82386b6 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -282,6 +282,6 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant solution for a complicated problem. 
In a real project, you would want to make this a bit more robust and [save analytics data](/academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 6bb6566a50..1ba959a517 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -30,5 +30,3 @@ If you've managed to follow along with all of the courses prior to this one, the ## [](#first-up) First up First, we will explore [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling) that will help us to find all pages or products on the website. - - diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index 2342f40d4c..55c1682a05 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. -![Amazon pagination](/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png) +![Amazon pagination](/academy/advanced_web_scraping/crawling/images/pagination.png) ## Page number-based pagination {#page-number-based-pagination} From 9bae1a3a55a17737efb8f6863713e14cefe1e986 Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Sat, 1 Feb 2025 12:41:14 +0100 Subject: [PATCH 5/9] docs: address Michal's remarks --- .../node_js/scraping_from_sitemaps.md | 8 +++++-- .../crawling/crawling-sitemaps.md | 23 +++++++++---------- .../crawling/crawling-with-search.md | 5 ++-- .../crawling/sitemaps-vs-search.md | 9 ++++---- .../advanced_web_scraping/index.md | 9 ++++---- 5 files changed, 27 insertions(+), 27 deletions(-) diff --git a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md index 54ac997979..7027ee93be 100644 --- a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md +++ b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md @@ -9,9 +9,13 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js'; # How to scrape from sitemaps {#scraping-with-sitemaps} ->Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. 
If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code: +:::note -```javascript +Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code. + +::: + +```js import { RobotsFile } from 'crawlee'; const robots = await RobotsFile.find('https://www.mysite.com'); diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index 7564518317..bc8b1fce65 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -1,9 +1,8 @@ --- title: Crawling sitemaps description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. -menuWeight: 2 -paths: -- advanced-web-scraping/crawling/crawling-sitemaps +sidebar_position:: 2 +slug: /advanced-web-scraping/crawling/crawling-sitemaps --- In the previous lesson, we learned what is the utility (and dangers) of crawling sitemaps. In this lesson, we will go in-depth to how to crawl sitemaps. @@ -15,19 +14,19 @@ We will look at the following topics: - How to parse URLs from sitemaps - Using Crawlee to get all URLs in a few lines of code -## [](#how-to-find-sitemap-urls) How to find sitemap URLs +## How to find sitemap URLs Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc. -### [](#google) Google +### Google You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself which is rather uncommon. -### [](#robots-txt) robots.txt +### robots.txt {#robots-txt} If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive. -### [](#common-url-paths) Common URL paths +### Common URL paths You can try to iterate over common URL paths like: @@ -52,19 +51,19 @@ Some websites also provide an HTML version, to help indexing bots find new conte Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually. -## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps +## How to set up HTTP requests to download sitemaps For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated so we recommend just [using Crawlee](#using-crawlee) which handles streamed and compressed sitemaps by default. 
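For illustration, here is a minimal sketch of the simple case: fetching a plain, uncompressed sitemap and reading its `<loc>` entries with Cheerio. The URL is a placeholder and the snippet assumes a Node.js version with a global `fetch`; compressed or streamed sitemaps still need the extra handling mentioned above, which is why Crawlee is the easier route.

```js
// Minimal sketch: works for a plain, uncompressed XML sitemap only.
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/sitemap.xml');
const xml = await response.text();

// Parse the XML and collect every <loc> value. In a sitemap index,
// these are URLs of nested sitemaps rather than page URLs.
const $ = cheerio.load(xml, { xmlMode: true });
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get();

console.log(`Found ${urls.length} URLs`);
```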
-## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps +## How to parse URLs from sitemaps The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. -## [](#using-crawlee) Using Crawlee +## Using Crawlee Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code: -```javascript +```js import { RobotsFile } from 'crawlee'; const robots = await RobotsFile.find('https://www.mysite.com'); @@ -72,6 +71,6 @@ const robots = await RobotsFile.find('https://www.mysite.com'); const allWebsiteUrls = await robots.parseUrlsFromSitemaps(); ``` -## [](#next) Next up +## Next up That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination. diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index 13b82386b6..d982372898 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -1,9 +1,8 @@ --- title: Crawling with search description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. -menuWeight: 3 -paths: -- advanced-web-scraping/crawling/crawling-with-search +sidebar_position:: 3 +slug: /advanced-web-scraping/crawling/crawling-with-search --- # Scraping websites with search diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index 8f57044e2e..5fab8a50b0 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -1,14 +1,13 @@ --- title: Sitemaps vs search description: Learn how to extract all of a website's listings even if they limit the number of results pages. -menuWeight: 1 -paths: -- advanced-web-scraping/crawling/sitemaps-vs-search +sidebar_position:: 1 +slug: /advanced-web-scraping/crawling/sitemaps-vs-search --- The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. -Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. +Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. 
Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. There are two main approaches to solving this problem: @@ -58,6 +57,6 @@ Fortunately, most websites list a total number of detail pages somewhere. It mig Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well. -## [](#next) Next up +## Next up First, we will look into the easier approach, the [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of the category, search and filter crawling, and build up a generic framework that we can use on any website. At last, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual controls. diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 1ba959a517..ba08f7d90b 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -1,10 +1,9 @@ --- title: Advanced web scraping description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers. -menuWeight: 6 +sidebar_position:: 6 category: web scraping & automation -paths: -- advanced-web-scraping +slug: /advanced-web-scraping --- # Advanced web scraping @@ -13,7 +12,7 @@ In [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper. -## [](#what-does-production-ready-mean) What does production-ready mean? +## What does production-ready mean To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are. @@ -27,6 +26,6 @@ We will also touch on monitoring, performance, anti-scraping protections, and de If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎 -## [](#first-up) First up +## First up First, we will explore [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling) that will help us to find all pages or products on the website. 
From 009f6fddb40a20f5e9d6158440c3ae097577ef8a Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Sat, 1 Feb 2025 14:29:19 +0100 Subject: [PATCH 6/9] fix: handle all broken links --- .../advanced_web_scraping/crawling/crawling-sitemaps.md | 2 +- .../advanced_web_scraping/crawling/crawling-with-search.md | 6 +++--- sources/academy/webscraping/advanced_web_scraping/index.md | 2 +- .../general_api_scraping/handling_pagination.md | 2 +- .../common_use_cases/paginating_through_results.md | 2 +- 5 files changed, 7 insertions(+), 7 deletions(-) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index bc8b1fce65..c32992e22a 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -57,7 +57,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded X ## How to parse URLs from sitemaps -The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps. +The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps) provides code examples for parsing sitemaps. ## Using Crawlee diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index d982372898..ad61573285 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -5,9 +5,9 @@ sidebar_position:: 3 slug: /advanced-web-scraping/crawling/crawling-with-search --- -# Scraping websites with search +# Scraping websites with search -In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination. +In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination. Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic. @@ -281,6 +281,6 @@ await crawler.addRequests(requestsToEnqueue); ## Summary {#summary} -And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. +And that's it. We have an elegant solution for a complicated problem. 
In a real project, you would want to make this a bit more robust and [save analytics data](../../../platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had. Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters). diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index ba08f7d90b..1388e4b80c 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -28,4 +28,4 @@ If you've managed to follow along with all of the courses prior to this one, the ## First up -First, we will explore [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling) that will help us to find all pages or products on the website. +First, we will explore [advanced crawling section](./crawling/sitemaps-vs-search.md) that will help us to find all pages or products on the website. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 412f777e8a..1ecaebbbbf 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -198,7 +198,7 @@ Here's what the output of this code looks like: ## Final note {#final-note} -Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/scraping-paginated-sites). +Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/crawling/crawling-with-search). ## Next up {#next} diff --git a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md index 55c1682a05..b4833ddad6 100644 --- a/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md +++ b/sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md @@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem'; If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. 
-![Amazon pagination](/academy/advanced_web_scraping/crawling/images/pagination.png) +![Amazon pagination](../../advanced_web_scraping/crawling/images/pagination.png) ## Page number-based pagination {#page-number-based-pagination} From ad77d26f2d85146b626b176a7d66379b832f2263 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Luk=C3=A1=C5=A1=20K=C5=99ivka?= Date: Mon, 3 Feb 2025 23:35:48 +0100 Subject: [PATCH 7/9] Apply suggestions from code review Commit all suggestions from Honza Co-authored-by: Honza Javorek --- .../node_js/scraping_from_sitemaps.md | 2 +- .../crawling/crawling-sitemaps.md | 14 ++++++------- .../crawling/crawling-with-search.md | 2 +- .../crawling/sitemaps-vs-search.md | 20 +++++++++---------- .../advanced_web_scraping/index.md | 2 +- .../handling_pagination.md | 2 +- 6 files changed, 21 insertions(+), 21 deletions(-) diff --git a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md index 7027ee93be..4222bba2a3 100644 --- a/sources/academy/tutorials/node_js/scraping_from_sitemaps.md +++ b/sources/academy/tutorials/node_js/scraping_from_sitemaps.md @@ -9,7 +9,7 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js'; # How to scrape from sitemaps {#scraping-with-sitemaps} -:::note +:::tip Processing sitemaps automatically with Crawlee Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code. diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index c32992e22a..040d2fe4c4 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -1,7 +1,7 @@ --- title: Crawling sitemaps description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. -sidebar_position:: 2 +sidebar_position: 2 slug: /advanced-web-scraping/crawling/crawling-sitemaps --- @@ -16,7 +16,7 @@ We will look at the following topics: ## How to find sitemap URLs -Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc. +Sitemaps are commonly restricted to contain a maximum of 50k URLs so usually, there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps or the sitemaps might simply be indexed in `robots.txt` and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc. ### Google @@ -24,7 +24,7 @@ You can try your luck on Google by searching for `site:example.com sitemap.xml` ### robots.txt {#robots-txt} -If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive. +If the website has a `robots.txt` file, it often contains sitemap URLs. The sitemap URLs are usually listed under `Sitemap:` directive. 
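If you prefer not to rely on a framework, a minimal sketch of extracting the `Sitemap:` directives from `robots.txt` could look like this (the domain is illustrative, and the file may be missing entirely):

```js
// Fetch robots.txt and collect every URL declared with a `Sitemap:` directive.
const response = await fetch('https://example.com/robots.txt');
const robotsTxt = await response.text();

const sitemapUrls = robotsTxt
    .split('\n')
    .filter((line) => line.trim().toLowerCase().startsWith('sitemap:'))
    .map((line) => line.slice(line.indexOf(':') + 1).trim());

console.log(sitemapUrls);
```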
### Common URL paths @@ -49,19 +49,19 @@ Some websites also provide an HTML version, to help indexing bots find new conte - /sitemap.html - /sitemap_index -Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), that scans the URL variations automatically for you so that you don't have to check manually. +Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open source actor that scans the URL variations automatically for you so that you don't have to check them manually. ## How to set up HTTP requests to download sitemaps -For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated so we recommend just [using Crawlee](#using-crawlee) which handles streamed and compressed sitemaps by default. +For most sitemaps, you can make a single HTTP request and parse the downloaded XML text. Some sitemaps are compressed and have to be streamed and decompressed. The code can get fairly complicated, but scraping frameworks, such as [Crawlee](#using-crawlee), can do this out of the box. ## How to parse URLs from sitemaps -The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps) provides code examples for parsing sitemaps. +Use your favorite XML parser to extract the URLs from inside the `` tags. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). For specific code examples, see [our Node.js guide](/academy/node-js/scraping-from-sitemaps). ## Using Crawlee -Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev) which has rich traversing and parsing support for sitemap. Crawlee can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code: +Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), a scraping framework, which has rich traversing and parsing support for sitemap. It can traverse nested sitemaps, download, and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code: ```js import { RobotsFile } from 'crawlee'; diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index ad61573285..fbd471bb26 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -1,7 +1,7 @@ --- title: Crawling with search description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper. 
-sidebar_position:: 3 +sidebar_position: 3 slug: /advanced-web-scraping/crawling/crawling-with-search --- diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index 5fab8a50b0..90238546f6 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -1,13 +1,13 @@ --- title: Sitemaps vs search description: Learn how to extract all of a website's listings even if they limit the number of results pages. -sidebar_position:: 1 +sidebar_position: 1 slug: /advanced-web-scraping/crawling/sitemaps-vs-search --- The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. -Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. +Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. There are two main approaches to solving this problem: @@ -31,13 +31,13 @@ Sitemap is usually a simple XML file that contains a list of all pages on the we - **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs. - **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week. - **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all. -- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson. +- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework. ## Pros and cons of categories, search, and filters -This approach means traversing the website like a normal user do by going through categories, setting up different filters, ranges and sorting options. The goal is to traverse it is a way that ensures we covered all categories/ranges where products can be located and for each of those we stayed under the pagination limit. +This approach means traversing the website like a normal user does by going through categories, setting up different filters, ranges, and sorting options. 
The goal is to ensure that we cover all categories or ranges where products can be located, and that for each of those we stay under the pagination limit. -The pros and cons of this approach are pretty much the opposite of the sitemaps approach. +The pros and cons of this approach are pretty much the opposite of relying on sitemaps. ### Pros @@ -47,16 +47,16 @@ The pros and cons of this approach are pretty much the opposite of the sitemaps ### Cons -- **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons. +- **Complex to set up** - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons. - **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found. -- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this. +- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this. ## Do we know how many products there are? -Fortunately, most websites list a total number of detail pages somewhere. It might be displayed on the home page or search results or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it. +Most websites list a total number of detail pages somewhere. It might be displayed on the home page, search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it. -Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well. +Some sites, like Amazon, do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the following lessons as well. ## Next up -First, we will look into the easier approach, the [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of the category, search and filter crawling, and build up a generic framework that we can use on any website. At last, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual controls. +Next, we will look into [sitemap crawling](./crawling-sitemaps.md). After that we will go through all the intricacies of the category, search and filter crawling, and build up tools implementing a generic approach that we can use on any website. 
At last, we will combine the results of both and set up monitoring and persistence to ensure we can run this regularly without any manual controls. diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 1388e4b80c..3e41abb0a5 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -1,7 +1,7 @@ --- title: Advanced web scraping description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers. -sidebar_position:: 6 +sidebar_position: 6 category: web scraping & automation slug: /advanced-web-scraping --- diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 1ecaebbbbf..e734e7be40 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -198,7 +198,7 @@ Here's what the output of this code looks like: ## Final note {#final-note} -Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at [this short article](/academy/advanced-web-scraping/crawling/crawling-with-search). +Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at the [Crawling with search](/academy/advanced-web-scraping/crawling/crawling-with-search) article. ## Next up {#next} From ed1e9e091d3be99fcd99a6f1ac10f2c353c34cff Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Luk=C3=A1=C5=A1=20K=C5=99ivka?= Date: Wed, 5 Feb 2025 14:03:55 +0100 Subject: [PATCH 8/9] Fix formatting suggested by Michal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: MichaΕ‚ Olender <92638966+TC-MO@users.noreply.github.com> --- .../crawling/sitemaps-vs-search.md | 32 +++++++++---------- .../advanced_web_scraping/index.md | 2 +- .../handling_pagination.md | 2 +- 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index 90238546f6..b1a12896fd 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -7,14 +7,14 @@ slug: /advanced-web-scraping/crawling/sitemaps-vs-search The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. -Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. 
+Unfortunately, _most modern websites restrict pagination_ only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. There are two main approaches to solving this problem: -- Extracting all page URLs from the website's **sitemap**. +- Extracting all page URLs from the website's _sitemap_. - Using **categories, search and filters** to split the website so we get under the pagination limit. -Both of these approaches have their pros and cons so the best solution is to **use both and combine the results**. Here we will learn why. +Both of these approaches have their pros and cons so the best solution is to _use both and combine the results_. Here we will learn why. ## Pros and cons of sitemaps @@ -22,16 +22,16 @@ Sitemap is usually a simple XML file that contains a list of all pages on the we ### Pros -- **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code. -- **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds. -- **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website. +- _Quick to set up_ - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code. +- _Fast to run_ - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds. +- _Usually complete_ - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website. ### Cons -- **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs. -- **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week. -- **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all. -- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework. +- _Does not directly reflect the website_ - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs. +- _Updated in intervals_ - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week. +- _Hard to find or unavailable_ - Sitemaps are not always trivial to locate. 
They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all. +- _Streamed, compressed, and archived_ - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework. ## Pros and cons of categories, search, and filters @@ -41,15 +41,15 @@ The pros and cons of this approach are pretty much the opposite of relying on si ### Pros -- **Directly reflects the website** - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users. -- **Updated in real-time** - The website is updated in real-time so we can be sure that we are getting all pages. -- **Often contain detailed data** - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages. +- _Directly reflects the website_ - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users. +- _Updated in real-time_ - The website is updated in real-time so we can be sure that we are getting all pages. +- _Often contain detailed data_ - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages. ### Cons -- **Complex to set up** - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons. -- **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found. -- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this. +- _Complex to set up_ - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons. +- _Slow to run_ - The traversing can require a lot of requests. Some filters or categories will have products we already found. +- _Not always complete_ - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this. ## Do we know how many products there are? 
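Most sites display this total somewhere on the listing page. A minimal sketch of reading it with Cheerio, assuming a hypothetical `.results-count` element (both the URL and the selector are illustrative):

```js
import * as cheerio from 'cheerio';

// Download a category listing page and read the advertised total of results.
const response = await fetch('https://example.com/category/shoes');
const $ = cheerio.load(await response.text());

// e.g. "1,234 results" - keep only the digits.
const totalText = $('.results-count').text();
const expectedTotal = Number(totalText.replace(/[^\d]/g, ''));

console.log(`The category claims ${expectedTotal} products - compare this with how many you scraped.`);
```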
diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 3e41abb0a5..8867bc551c 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -8,7 +8,7 @@ slug: /advanced-web-scraping # Advanced web scraping -In [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. +In [Web scraping for beginners](/academy/web-scraping-for-beginners) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index e734e7be40..57c43b040d 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -196,7 +196,7 @@ Here's what the output of this code looks like: 105 ``` -## Final note {#final-note} +## Final note Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at the [Crawling with search](/academy/advanced-web-scraping/crawling/crawling-with-search) article. From 4276746e8499a4259411d86bfdf2498a5c3cf69e Mon Sep 17 00:00:00 2001 From: metalwarrior665 Date: Wed, 5 Feb 2025 14:11:31 +0100 Subject: [PATCH 9/9] apply the rest of non-suggestion feedback --- .../crawling/crawling-sitemaps.md | 30 +++++++++---------- .../crawling/crawling-with-search.md | 2 +- .../crawling/sitemaps-vs-search.md | 2 +- .../advanced_web_scraping/index.md | 4 +-- 4 files changed, 18 insertions(+), 20 deletions(-) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md index 040d2fe4c4..4a7c265eeb 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md @@ -28,26 +28,26 @@ If the website has a `robots.txt` file, it often contains sitemap URLs. 
The site ### Common URL paths -You can try to iterate over common URL paths like: - -- /sitemap.xml -- /product_index.xml -- /product_template.xml -- /sitemap_index.xml -- /sitemaps/sitemap_index.xml -- /sitemap/product_index.xml -- /media/sitemap.xml -- /media/sitemap/sitemap.xml -- /media/sitemap/index.xml +You can check some common URL paths, such as the following: + +/sitemap.xml +/product_index.xml +/product_template.xml +/sitemap_index.xml +/sitemaps/sitemap_index.xml +/sitemap/product_index.xml +/media/sitemap.xml +/media/sitemap/sitemap.xml +/media/sitemap/index.xml Make also sure you test the list with `.gz`, `.tar.gz` and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`). Some websites also provide an HTML version, to help indexing bots find new content. Those include: -- /sitemap -- /category-sitemap -- /sitemap.html -- /sitemap_index +/sitemap +/category-sitemap +/sitemap.html +/sitemap_index Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open source actor that scans the URL variations automatically for you so that you don't have to check them manually. diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md index fbd471bb26..1388b31329 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md @@ -9,7 +9,7 @@ slug: /advanced-web-scraping/crawling/crawling-with-search In this lesson, we will start with a simpler example of scraping HTML based websites with limited pagination. -Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic. +Limiting pagination is a common practice on e-commerce sites. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic. ![Pagination in on Google search results page](./images/pagination.png) diff --git a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md index b1a12896fd..943b40fa1f 100644 --- a/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md +++ b/sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md @@ -5,7 +5,7 @@ sidebar_position: 1 slug: /advanced-web-scraping/crawling/sitemaps-vs-search --- -The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course. +The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the [Web Scraping for Beginners course](/academy/web-scraping-for-beginners). 
Unfortunately, _most modern websites restrict pagination_ only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson. diff --git a/sources/academy/webscraping/advanced_web_scraping/index.md b/sources/academy/webscraping/advanced_web_scraping/index.md index 8867bc551c..44d7047cfd 100644 --- a/sources/academy/webscraping/advanced_web_scraping/index.md +++ b/sources/academy/webscraping/advanced_web_scraping/index.md @@ -6,8 +6,6 @@ category: web scraping & automation slug: /advanced-web-scraping --- -# Advanced web scraping - In [Web scraping for beginners](/academy/web-scraping-for-beginners) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face. In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper. @@ -16,7 +14,7 @@ In this course, we will take all of that knowledge, add a few more advanced conc To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are. -