
Commit 58f576f

more link fixes
1 parent a6f7150 commit 58f576f

4 files changed: +5 -4 lines changed

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-sitemaps.md

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded X
 ## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps

-The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
+The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.

 ## [](#using-crawlee) Using Crawlee
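
The changed paragraph describes the technique only in prose. For reference, a minimal sketch of what it names (not the linked article's actual code — the sitemap URL and the skip list below are hypothetical placeholders) could look like this:

```js
// Sketch: fetch a sitemap and extract URLs from <loc> tags with Cheerio.
// Assumes Node.js 18+ (built-in fetch) and the cheerio package.
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/sitemap.xml'); // hypothetical sitemap URL
const xml = await response.text();

// xmlMode tells Cheerio to parse the document as XML rather than HTML.
const $ = cheerio.load(xml, { xmlMode: true });

const skipPatterns = ['/about', '/contact']; // hypothetical sections to exclude
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    // Drop URLs we don't want to crawl, per the caveat in the paragraph above.
    .filter((url) => !skipPatterns.some((pattern) => url.includes(pattern)));

console.log(`Found ${urls.length} URLs to crawl`);
```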

sources/academy/webscraping/advanced_web_scraping/crawling/crawling-with-search.md

Lines changed: 1 addition & 1 deletion
@@ -282,6 +282,6 @@ await crawler.addRequests(requestsToEnqueue);
 ## Summary {#summary}

-And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.
+And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.

 Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).
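
The summary's pointer to saving analytics data is only a link; as a rough illustration (the key name and stats shape are assumptions, not the linked lesson's code), per-filter counts can be persisted to the run's key-value store with the Apify SDK:

```js
// Sketch: persist per-filter product counts so they can be inspected after the run.
// The 'FILTER-STATS' key and the stats shape are illustrative assumptions.
import { Actor } from 'apify';

await Actor.init();

const stats = { filters: {} };

// Call this wherever a filter's pagination finishes.
const recordFilter = (filterName, productCount) => {
    stats.filters[filterName] = productCount;
};

recordFilter('price-0-100', 9500); // hypothetical filter and count

// Save the stats object into the default key-value store.
await Actor.setValue('FILTER-STATS', stats);

await Actor.exit();
```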

sources/academy/webscraping/advanced_web_scraping/crawling/sitemaps-vs-search.md

Lines changed: 2 additions & 1 deletion
@@ -6,11 +6,12 @@ paths:
   - advanced-web-scraping/crawling/sitemaps-vs-search
 ---

-The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.
+The core crawling problem comes to down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the Web Scraping for Beginners course.

 Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

 There are two main approaches to solving this problem:
+
 - Extracting all page URLs from the website's **sitemap**.
 - Using **categories, search and filters** to split the website so we get under the pagination limit.

sources/academy/webscraping/puppeteer_playwright/common_use_cases/paginating_through_results.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem';
 If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content.

-![Amazon pagination](../../advanced_web_scraping/images/pagination.png)
+![Amazon pagination](/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png)

 ## Page number-based pagination {#page-number-based-pagination}
