
Commit a6f7150

fix(academy): try to resolve bad links and redirect in nginx
1 parent ddd3a3b commit a6f7150

7 files changed: +35 −20 lines

nginx.conf

Lines changed: 3 additions & 0 deletions
@@ -302,6 +302,9 @@ server {
     rewrite ^/platform/actors/development/actor-definition/output-schema$ /platform/actors/development/actor-definition/dataset-schema permanent;
     rewrite ^academy/deploying-your-code/output-schema$ /academy/deploying-your-code/dataset-schema permanent;

+    # Academy restructuring
+    rewrite ^academy/advanced-web-scraping/scraping-paginated-sites$ /academy/advanced-web-scraping/crawling/crawling-with-search permanent;
+
     # Removed pages
     # GPT plugins were discontinued April 9th, 2024 - https://help.openai.com/en/articles/8988022-winding-down-the-chatgpt-plugins-beta
     rewrite ^/platform/integrations/chatgpt-plugin$ https://blog.apify.com/add-custom-actions-to-your-gpts/ redirect;

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md

Lines changed: 24 additions & 18 deletions
@@ -9,54 +9,59 @@ paths:
 In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will go in depth on how to crawl sitemaps.

 We will look at the following topics:
+
 - How to find sitemap URLs
 - How to set up HTTP requests to download sitemaps
 - How to parse URLs from sitemaps
 - Using Crawlee to get all URLs in a few lines of code

 ## [](#how-to-find-sitemap-urls) How to find sitemap URLs
+
 Sitemaps are commonly restricted to a maximum of 50k URLs, so there will usually be a whole list of them. There can be a master sitemap containing URLs of all other sitemaps, or the sitemaps might simply be indexed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.

 ### [](#google) Google
+
 You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself, which is rather uncommon.

 ### [](#robots-txt) robots.txt
+
 If the website has a robots.txt file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.
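
The `Sitemap:` directive lookup described above is easy to script. A minimal illustrative sketch (not part of this commit's diff; it assumes Node.js 18+ with the built-in `fetch`, run as an ES module):

```javascript
// Minimal sketch: read robots.txt and collect the URLs behind `Sitemap:` directives.
// Illustrative only; assumes Node.js 18+ (built-in fetch) and an ES module context.
async function findSitemapUrls(websiteUrl) {
    const response = await fetch(new URL('/robots.txt', websiteUrl));
    if (!response.ok) return [];
    const robotsTxt = await response.text();
    return robotsTxt
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line.toLowerCase().startsWith('sitemap:'))
        .map((line) => line.slice('sitemap:'.length).trim());
}

console.log(await findSitemapUrls('https://example.com'));
```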

 ### [](#common-url-paths) Common URL paths
+
 You can try to iterate over common URL paths like:
-```
-/sitemap.xml
-/product_index.xml
-/product_template.xml
-/sitemap_index.xml
-/sitemaps/sitemap_index.xml
-/sitemap/product_index.xml
-/media/sitemap.xml
-/media/sitemap/sitemap.xml
-/media/sitemap/index.xml
-```
+
+- /sitemap.xml
+- /product_index.xml
+- /product_template.xml
+- /sitemap_index.xml
+- /sitemaps/sitemap_index.xml
+- /sitemap/product_index.xml
+- /media/sitemap.xml
+- /media/sitemap/sitemap.xml
+- /media/sitemap/index.xml

 Also make sure you test the list with `.gz`, `.tar.gz` and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`). A probing sketch follows below.
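
A minimal illustrative sketch of probing a few of these paths, including the compressed and capitalized variants just mentioned (not part of this commit's diff; assumes Node.js 18+ with the built-in `fetch`):

```javascript
// Minimal sketch: probe a site for sitemaps at common paths,
// including .gz/.tar.gz/.tgz and capitalized variants.
// Illustrative only; assumes Node.js 18+ (built-in fetch).
const COMMON_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemaps/sitemap_index.xml',
    '/media/sitemap.xml',
];
const EXTENSIONS = ['', '.gz', '.tar.gz', '.tgz'];

async function probeSitemapUrls(origin) {
    const found = [];
    for (const path of COMMON_PATHS) {
        // Try the lowercase path and a capitalized variant, e.g. /Sitemap_index.xml.tar.gz
        for (const variant of [path, path.replace('/s', '/S')]) {
            for (const extension of EXTENSIONS) {
                const url = new URL(variant + extension, origin).href;
                const response = await fetch(url, { method: 'HEAD' }).catch(() => null);
                if (response?.ok) found.push(url);
            }
        }
    }
    return found;
}

console.log(await probeSitemapUrls('https://example.com'));
```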

 Some websites also provide an HTML version to help indexing bots find new content. Those include:

-```
-/sitemap
-/category-sitemap
-/sitemap.html
-/sitemap_index
-```
+- /sitemap
+- /category-sitemap
+- /sitemap.html
+- /sitemap_index

 Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans the URL variations automatically for you so that you don't have to check them manually.

 ## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
+
 For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated, so we recommend just [using Crawlee](#using-crawlee), which handles streamed and compressed sitemaps by default.

 ## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps
-The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
+
+The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
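
A minimal illustrative sketch of the `<loc>` extraction described above (not part of this commit's diff; assumes Node.js 18+ and the `cheerio` package):

```javascript
// Minimal sketch: extract URLs from sitemap XML by reading <loc> tags.
// Illustrative only; assumes Node.js 18+ (built-in fetch) and cheerio installed.
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/sitemap.xml');
const xml = await response.text();

// xmlMode makes Cheerio parse the document as XML rather than HTML.
const $ = cheerio.load(xml, { xmlMode: true });
const urls = $('loc')
    .map((_, element) => $(element).text().trim())
    .get();

// Drop URLs we don't want to crawl, e.g. /about or /contact pages.
const wantedUrls = urls.filter((url) => !/\/(about|contact)/.test(url));
console.log(wantedUrls);
```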

 ## [](#using-crawlee) Using Crawlee
+
 Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), which has rich traversing and parsing support for sitemaps. Crawlee can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:

 ```javascript

@@ -68,4 +73,5 @@ const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
 ```

 ## [](#next) Next up
+
 That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters, and pagination.
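
The code block in the diff above is collapsed except for the line visible in the second hunk header, `const allWebsiteUrls = await robots.parseUrlsFromSitemaps();`. A minimal illustrative sketch of such a snippet, using Crawlee's `RobotsFile` helper (assumes `crawlee` v3+; not necessarily the exact code in the file):

```javascript
// Minimal sketch: get all URLs from a site's sitemaps with Crawlee.
// Illustrative only; assumes crawlee v3+ is installed.
import { RobotsFile } from 'crawlee';

// Locates robots.txt for the site and traverses the sitemaps it references,
// including nested and compressed ones.
const robots = await RobotsFile.find('https://example.com');
const allWebsiteUrls = await robots.parseUrlsFromSitemaps();

console.log(`Found ${allWebsiteUrls.length} URLs`);
```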

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md

Lines changed: 0 additions & 1 deletion
@@ -285,4 +285,3 @@ await crawler.addRequests(requestsToEnqueue);
 And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.

 Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).
-
sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md

Lines changed: 7 additions & 0 deletions
@@ -17,35 +17,42 @@ There are two main approaches to solving this problem:
 Both of these approaches have their pros and cons, so the best solution is to **use both and combine the results**. Here we will learn why.

 ## Pros and cons of sitemaps
+
 A sitemap is usually a simple XML file that contains a list of all pages on the website. Sitemaps are created and maintained mainly for search engines like Google to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson.

 ### Pros
+
 - **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
 - **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
 - **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.

 ### Cons
+
 - **Does not directly reflect the website** - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap can also contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
 - **Updated in intervals** - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
 - **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
 - **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson.

 ## Pros and cons of categories, search, and filters
+
 This approach means traversing the website like a normal user does, by going through categories and setting up different filters, ranges, and sorting options. The goal is to traverse it in a way that ensures we cover all categories/ranges where products can be located, and that for each of those we stay under the pagination limit.

 The pros and cons of this approach are pretty much the opposite of the sitemaps approach.

 ### Pros
+
 - **Directly reflects the website** - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users.
 - **Updated in real-time** - The website is updated in real-time, so we can be sure that we are getting all pages.
 - **Often contain detailed data** - While sitemaps are usually just a list of URLs, categories, searches, and filters often contain additional data like product names, prices, categories, etc., especially if available via a JSON API. This means that we can sometimes get all the data we need without going to the detail pages.

 ### Cons
+
 - **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons.
 - **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found.
 - **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this.

 ## Do we know how many products there are?
+
 Fortunately, most websites list the total number of detail pages somewhere. It might be displayed on the home page or in search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell whether our approach to scraping them all succeeded or whether we still need to refine it.

 Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well.

sources/academy/webscraping/advanced_web_scraping/index.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ paths:

 # Advanced web scraping

-In [**Web scraping for beginners**](/academy/webscraping/scraping_basics_javascript/index.md) course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face.
+In the [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, we learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us solve most of the problems we will face.

 In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

