sources/academy/webscraping/scraping_basics_python/12_framework.md (23 additions, 14 deletions)
sidebar_position: 12
slug: /scraping-basics-python/framework
---
**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**

---

Before rewriting our code, let's point out several caveats in our current solution:

- **Hard to maintain:** All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages.
- **Slow:** The program runs sequentially, which is considerate toward the target website, but downloading even two product pages in parallel could double its speed, as the sketch after this list illustrates.
- **No logging:** The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs.
- **Boilerplate code:** We implement tasks like downloading and parsing HTML or exporting to CSV with custom code that feels like [boilerplate](https://en.wikipedia.org/wiki/Boilerplate_code). We could replace it with standardized solutions.
- **Prone to anti-scraping:** If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working.
- **Browser means rewrite:** We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program.
- **No error handling:** The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors.
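
To make the speed caveat concrete, here's a minimal sketch of concurrent downloading with asyncio, assuming httpx as the HTTP client. The URLs are placeholders, and this isn't the rewrite we'll build in this lesson, just an illustration of the idea:

```py
import asyncio

import httpx

# Placeholder URLs; substitute real product pages
URLS = [
    "https://example.com/products/1",
    "https://example.com/products/2",
]

async def download(client: httpx.AsyncClient, url: str) -> str:
    response = await client.get(url)
    response.raise_for_status()
    return response.text

async def main():
    async with httpx.AsyncClient() as client:
        # Both requests run concurrently, so the total time is roughly
        # the duration of the slowest request, not the sum of both
        pages = await asyncio.gather(*(download(client, url) for url in URLS))
    for url, html in zip(URLS, pages):
        print(url, len(html))

asyncio.run(main())
```

Managing this kind of concurrency by hand gets tricky fast, which is one of the main reasons to reach for a framework.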
In this lesson, we'll tackle all the above issues by using a scraping framework while keeping the code concise.

:::info Why Crawlee and not Scrapy

From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.

We genuinely believe beginners to scraping will like it more, since it lets you create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.

:::

## Installing Crawlee
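
This section is still being drafted. As a rough preview, and not necessarily the exact steps the lesson will settle on, Crawlee can be installed with `pip install crawlee`, and a minimal crawler might look something like the sketch below. The `crawlee.beautifulsoup_crawler` module path and the example URL are assumptions based on Crawlee's docs at the time of writing and may differ in newer versions:

```py
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request in the queue
    @crawler.router.default_handler
    async def handle(context: BeautifulSoupCrawlingContext) -> None:
        # Crawlee downloads and parses the page for us, no boilerplate needed
        context.log.info(f"Scraping {context.request.url}")
        title = context.soup.title
        # push_data stores results in a dataset that Crawlee manages
        await context.push_data({
            "url": context.request.url,
            "title": title.text.strip() if title else None,
        })

    await crawler.run(["https://example.com/"])

if __name__ == "__main__":
    asyncio.run(main())
```

Notice how logging, parsing, and data storage come built in, which already addresses several of the caveats listed above.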

:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::