Commit 9bae1a3

docs: address Michal's remarks
1 parent cd978a5 · commit 9bae1a3

5 files changed: +27 −27 lines

sources/academy/tutorials/node_js/scraping_from_sitemaps.md

Lines changed: 6 additions & 2 deletions
@@ -9,9 +9,13 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';
 
 # How to scrape from sitemaps {#scraping-with-sitemaps}
 
->Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code:
+:::note
 
-```javascript
+Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.
+
+:::
+
+```js
 import { RobotsFile } from 'crawlee';
 
 const robots = await RobotsFile.find('https://www.mysite.com');
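
Joined with its continuation (visible in the next file's diff below), the snippet this hunk truncates becomes a complete, runnable example. A minimal sketch, assuming Crawlee v3+ in an ESM context with top-level await; the domain is a placeholder:

```js
// Minimal sketch of the full snippet from the diff above.
// Assumes Crawlee v3+ and ESM with top-level await; the domain is a placeholder.
import { RobotsFile } from 'crawlee';

// Find and parse the site's robots.txt file.
const robots = await RobotsFile.find('https://www.mysite.com');

// Traverse all sitemaps referenced there and collect every URL.
const allWebsiteUrls = await robots.parseUrlsFromSitemaps();

console.log(`Collected ${allWebsiteUrls.length} URLs.`);
```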

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md

Lines changed: 11 additions & 12 deletions
@@ -1,9 +1,8 @@
 ---
 title: Crawling sitemaps
 description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
-menuWeight: 2
-paths:
-    - advanced-web-scraping/crawling/crawling-sitemaps
+sidebar_position: 2
+slug: /advanced-web-scraping/crawling/crawling-sitemaps
 ---
 
 In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will go in depth into how to crawl sitemaps.
@@ -15,19 +14,19 @@ We will look at the following topics:
 - How to parse URLs from sitemaps
 - Using Crawlee to get all URLs in a few lines of code
 
-## [](#how-to-find-sitemap-urls) How to find sitemap URLs
+## How to find sitemap URLs
 
 Sitemaps are commonly restricted to contain a maximum of 50k URLs, so usually there will be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps, or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
 
-### [](#google) Google
+### Google
 
 You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself, which is rather uncommon.
 
-### [](#robots-txt) robots.txt
+### robots.txt {#robots-txt}
 
 If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.
 
-### [](#common-url-paths) Common URL paths
+### Common URL paths
 
 You can try to iterate over common URL paths like:
 
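
The robots.txt technique described in the hunk above takes only a few lines in practice. A minimal sketch, assuming Node 18+ (for the global `fetch`) and a placeholder domain:

```js
// Sketch of the robots.txt approach: download the file and collect
// the URLs listed under `Sitemap:` directives.
// Assumes Node 18+ (global fetch); example.com is a placeholder.
const response = await fetch('https://example.com/robots.txt');
const robotsTxt = await response.text();

const sitemapUrls = robotsTxt
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('sitemap:'))
    .map((line) => line.slice('sitemap:'.length).trim());

console.log(sitemapUrls); // e.g. ['https://example.com/sitemap.xml']
```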

@@ -52,26 +51,26 @@ Some websites also provide an HTML version, to help indexing bots find new conte
 
 Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans the URL variations automatically for you so that you don't have to check manually.
 
-## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
+## How to set up HTTP requests to download sitemaps
 
 For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated, so we recommend just [using Crawlee](#using-crawlee), which handles streamed and compressed sitemaps by default.
 
-## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps
+## How to parse URLs from sitemaps
 
 The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
 
-## [](#using-crawlee) Using Crawlee
+## Using Crawlee
 
 Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), which has rich traversing and parsing support for sitemaps. Crawlee can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:
 
-```javascript
+```js
 import { RobotsFile } from 'crawlee';
 
 const robots = await RobotsFile.find('https://www.mysite.com');
 
 const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
 ```
 
-## [](#next) Next up
+## Next up
 
 That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters, and pagination.
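
The parsing step described in this hunk (extracting URLs from `<loc>` tags with Cheerio) could look roughly like the following sketch, assuming the `cheerio` package, Node 18+, and an uncompressed sitemap at a placeholder URL:

```js
// Sketch of parsing <loc> URLs from a sitemap with Cheerio.
// Assumes the `cheerio` package and an uncompressed XML sitemap;
// the sitemap URL is a placeholder.
import * as cheerio from 'cheerio';

const xml = await (await fetch('https://example.com/sitemap.xml')).text();

// xmlMode makes Cheerio parse the document as XML rather than HTML.
const $ = cheerio.load(xml, { xmlMode: true });

const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get();

// Drop sections we don't want to crawl, e.g. /about or /contact.
const wantedUrls = urls.filter((url) => !/\/(about|contact)/.test(url));
```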

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md

Lines changed: 2 additions & 3 deletions
@@ -1,9 +1,8 @@
 ---
 title: Crawling with search
 description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
-menuWeight: 3
-paths:
-    - advanced-web-scraping/crawling/crawling-with-search
+sidebar_position: 3
+slug: /advanced-web-scraping/crawling/crawling-with-search
 ---
 
 # Scraping websites with search

sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md

Lines changed: 4 additions & 5 deletions
@@ -1,14 +1,13 @@
 ---
 title: Sitemaps vs search
 description: Learn how to extract all of a website's listings even if they limit the number of results pages.
-menuWeight: 1
-paths:
-    - advanced-web-scraping/crawling/sitemaps-vs-search
+sidebar_position: 1
+slug: /advanced-web-scraping/crawling/sitemaps-vs-search
 ---
 
 The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.
 
-Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
+Unfortunately, **most modern websites restrict pagination** to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first, but there are multiple hurdles that we will explore in this lesson.
 
 There are two main approaches to solving this problem:
 
@@ -58,6 +57,6 @@ Fortunately, most websites list a total number of detail pages somewhere. It mig
 
 Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well.
 
-## [](#next) Next up
+## Next up
 
 First, we will look into the easier approach, [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of category, search, and filter crawling, and build up a generic framework that we can use on any website. Finally, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual intervention.

sources/academy/webscraping/advanced_web_scraping/index.md

Lines changed: 4 additions & 5 deletions
@@ -1,10 +1,9 @@
 ---
 title: Advanced web scraping
 description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
-menuWeight: 6
+sidebar_position: 6
 category: web scraping & automation
-paths:
-    - advanced-web-scraping
+slug: /advanced-web-scraping
 ---
 
 # Advanced web scraping
@@ -13,7 +12,7 @@ In [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course,
 
 In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.
 
-## [](#what-does-production-ready-mean) What does production-ready mean?
+## What does production-ready mean?
 
 To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages, and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.
 
@@ -27,6 +26,6 @@ We will also touch on monitoring, performance, anti-scraping protections, and de
 
 If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take on these upcoming lessons 😎
 
-## [](#first-up) First up
+## First up
 
 First, we will explore the [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling), which will help us find all pages or products on the website.
