Commit 275cf7d

feat(tutorials): add sitemap scraping tutorial
1 parent 7e50535 commit 275cf7d

1 file changed: 123 additions, 0 deletions

---
title: Scraping with sitemaps
description: The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn a much easier way to extract data from websites using Crawlee.
menuWeight: 3
paths:
    - tutorials/scraping-with-sitemaps
---

# [](#scraping-with-sitemaps) Scraping with sitemaps

Let's say we want to scrape a database of craft beers ([brewbound.com](https://brewbound.com)) before summer starts. If we are lucky, the website will contain a sitemap at [https://www.brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).

> Check out [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), which can discover sitemaps in hidden locations!

## [](#analyzing-the-sitemap) Analyzing the sitemap

The sitemap is usually located at the path **/sitemap.xml**. It is always worth trying that URL, as it is rarely linked anywhere on the site. It usually contains a list of all pages in [XML format](https://www.w3.org/standards/xml/core).

```XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
    ...
```

The URLs of breweries take this form:

```text
http://www.brewbound.com/breweries/[BREWERY_NAME]
```

And the URLs of craft beers look like this:

```text
http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]
```

They can be matched using the following regular expression:

```regexp
http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+
```

Note that both parts of the regular expression, `[^\/<]+`, contain the `<` symbol. This is because we want the match to stop before the `</loc>` tag that closes each URL in the sitemap.
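
To sanity-check the pattern before plugging it into a crawler, we can run it against a small sitemap fragment in plain JavaScript. This is just a quick sketch; the two `<loc>` entries are made-up examples, not real Brewbound URLs:

```JavaScript
const pattern = /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm;

// Hypothetical sitemap fragment: one brewery page and one beer page.
const sitemapFragment = `
    <loc>http://www.brewbound.com/breweries/example-brewery</loc>
    <loc>http://www.brewbound.com/breweries/example-brewery/example-beer</loc>
`;

// Only the beer URL matches: the brewery URL has no second path segment,
// and the `<` exclusion stops the match right before the closing </loc> tag.
console.log(sitemapFragment.match(pattern));
// -> [ 'http://www.brewbound.com/breweries/example-brewery/example-beer' ]
```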

## [](#scraping-the-sitemap) Scraping the sitemap in Crawlee

If you're scraping sitemaps (or anything else, really), [Crawlee](https://crawlee.dev) is perfect for the job.

First, let's add the beer URLs from the sitemap to the [`RequestList`](https://crawlee.dev/api/core/class/RequestList), using our regular expression to match only the (craft!!) beer URLs and not brewery pages, the contact page, and so on.

```JavaScript
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm,
}]);
```
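
Before starting a full crawl, it may be worth verifying that the regular expression actually matched something. A tiny sketch of such a check, relying on `RequestList`'s `length()` method, which reports the number of unique requests loaded:

```JavaScript
// Sanity check: how many beer URLs did the sitemap yield?
console.log(`Loaded ${requestList.length()} beer URLs from the sitemap.`);
```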

Now, let's use [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) to crawl the created `RequestList` with [Puppeteer](https://pptr.dev/) and push the scraped data to the final dataset.
65+
66+
```JavaScript
67+
const crawler = new PuppeteerCrawler({
68+
requestList,
69+
async requestHandler({ page }) {
70+
const beerPage = await page.evaluate(() => {
71+
return document.getElementsByClassName('productreviews').length;
72+
});
73+
if (!beerPage) return;
74+
75+
const data = await page.evaluate(() => {
76+
const title = document.getElementsByTagName('h1')[0].innerText;
77+
const [brewery, beer] = title.split(':');
78+
const description = document.getElementsByClassName('productreviews')[0].innerText;
79+
80+
return { brewery, beer, description };
81+
});
82+
83+
await Dataset.pushData(data);
84+
},
85+
});
86+
```

## [](#full-code) Full code

If we create a new actor using the code below on the [Apify platform]({{@link apify_platform.md}}), it returns a nicely formatted spreadsheet listing each brewery's beers along with their descriptions.

Make sure to use the **apify/actor-node-puppeteer-chrome** image in your Dockerfile, otherwise the run will fail.
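
A minimal sketch of such a Dockerfile is shown below. The file layout and the `npm start` command assume the standard Apify actor template; adjust them to match your project:

```Dockerfile
# Base image with Node.js, Puppeteer, and a matching Chrome preinstalled.
FROM apify/actor-node-puppeteer-chrome

# Install production dependencies first to make better use of layer caching.
COPY package*.json ./
RUN npm install --omit=dev

# Copy the rest of the actor's source code.
COPY . ./

# Start the actor (assumes a "start" script in package.json).
CMD npm start
```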

```JavaScript
import { Dataset, RequestList, PuppeteerCrawler } from 'crawlee';

// Load the beer URLs from the sitemap, keeping only those that
// match the brewery/beer URL pattern.
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        // Only beer detail pages contain the "productreviews" element.
        const beerPage = await page.evaluate(() => {
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            // Beer page titles have the form "Brewery name: Beer name".
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});

await crawler.run();
```
