
Commit 4276746

apply the rest of non-suggestion feedback
1 parent ed1e9e0 commit 4276746

4 files changed: +18 -20 lines


sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md

Lines changed: 15 additions & 15 deletions

@@ -28,26 +28,26 @@ If the website has a `robots.txt` file, it often contains sitemap URLs. The site

### Common URL paths

-You can try to iterate over common URL paths like:
-
-- /sitemap.xml
-- /product_index.xml
-- /product_template.xml
-- /sitemap_index.xml
-- /sitemaps/sitemap_index.xml
-- /sitemap/product_index.xml
-- /media/sitemap.xml
-- /media/sitemap/sitemap.xml
-- /media/sitemap/index.xml
+You can check some common URL paths, such as the following:
+
+/sitemap.xml
+/product_index.xml
+/product_template.xml
+/sitemap_index.xml
+/sitemaps/sitemap_index.xml
+/sitemap/product_index.xml
+/media/sitemap.xml
+/media/sitemap/sitemap.xml
+/media/sitemap/index.xml

Also make sure to test the list with the `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized variants (e.g. `/Sitemap_index.xml.tar.gz`).

Some websites also provide an HTML version to help indexing bots find new content. Those include:

-- /sitemap
-- /category-sitemap
-- /sitemap.html
-- /sitemap_index
+/sitemap
+/category-sitemap
+/sitemap.html
+/sitemap_index

Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open-source actor that scans the URL variations automatically for you so that you don't have to check them manually.
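To make the probing technique from the list above concrete, here is a minimal TypeScript sketch that checks the common paths together with the compressed and capitalized variants mentioned in the lesson. It is only an illustration under assumptions: it relies on the global `fetch()` available in Node 18+, uses plain HEAD requests, and the `findSitemaps` helper is a hypothetical name, not something taken from the docs or from the Sitemap Sniffer actor.

```ts
// Illustrative sketch only: probe the common sitemap URL paths from the list above,
// including .gz/.tar.gz/.tgz and capitalized variants. Assumes Node 18+ (global fetch).

const basePaths: string[] = [
    '/sitemap.xml',
    '/product_index.xml',
    '/product_template.xml',
    '/sitemap_index.xml',
    '/sitemaps/sitemap_index.xml',
    '/sitemap/product_index.xml',
    '/media/sitemap.xml',
    '/media/sitemap/sitemap.xml',
    '/media/sitemap/index.xml',
];

// Expand one path into its compressed and capitalized variants, e.g. /Sitemap_index.xml.tar.gz.
const withVariants = (path: string): string[] => {
    const capitalized = path.replace(/([^/]+)$/, (file) => file.charAt(0).toUpperCase() + file.slice(1));
    return [path, capitalized].flatMap((p) => [p, `${p}.gz`, `${p}.tar.gz`, `${p}.tgz`]);
};

// Hypothetical helper: returns the variants that respond with a 2xx status.
const findSitemaps = async (origin: string): Promise<string[]> => {
    const found: string[] = [];
    for (const path of basePaths.flatMap(withVariants)) {
        const url = new URL(path, origin).href;
        try {
            // A HEAD request keeps the probe cheap; no body is downloaded.
            const response = await fetch(url, { method: 'HEAD' });
            if (response.ok) found.push(url);
        } catch {
            // Treat network errors the same as a missing sitemap.
        }
    }
    return found;
};

// Example usage:
// findSitemaps('https://example.com').then((urls) => console.log(urls));
```

Note that some servers answer HEAD requests differently from GET, so a GET fallback (or the Sitemap Sniffer actor itself) may be more reliable in practice.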

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ slug: /advanced-web-scraping/crawling/crawling-with-search

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.

-Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.
+Limiting pagination is a common practice on e-commerce sites. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

![Pagination on a Google search results page](./images/pagination.png)

sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ sidebar_position: 1

slug: /advanced-web-scraping/crawling/sitemaps-vs-search
---

-The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.
+The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the [Web Scraping for Beginners course](/academy/web-scraping-for-beginners).

Unfortunately, _most modern websites restrict pagination_ to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first, but there are multiple hurdles that we will explore in this lesson.

sources/academy/webscraping/advanced_web_scraping/index.md

Lines changed: 1 addition & 3 deletions

@@ -6,8 +6,6 @@ category: web scraping & automation

slug: /advanced-web-scraping
---

-# Advanced web scraping
-

In the [Web scraping for beginners](/academy/web-scraping-for-beginners) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us solve most of the problems we will face.

In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

@@ -16,7 +14,7 @@ In this course, we will take all of that knowledge, add a few more advanced conc

To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages, and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.

-<!--
+<!-- WIP: We want to split this into crawling and data extraction
The following sections will cover the core concepts that will ensure that your scraper is production-ready:
The advanced crawling section will cover how to ensure we find all pages or products on the website.
- The advanced data extraction section will cover how to efficiently extract data from a particular page or API.
