sources/academy/webscraping/scraping_basics_python/12_framework.md (89 additions & 1 deletion)
@@ -26,12 +26,100 @@ In this lesson, we'll tackle all the above issues by using a scraping framework
From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.
We genuinely believe beginners to scraping will like it more, since it allows you to create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
:::
## Installing Crawlee
When starting with the Crawlee framework, you first need to decide which approach to downloading and parsing you'll prefer. We want the one based on BeautifulSoup, hence we'll install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies of its own, so expect the installation to take a while.
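In a terminal with your virtual environment activated, the installation is a single pip command with the extra in brackets, which might look like this:

```text
$ pip install 'crawlee[beautifulsoup]'
```

The quotes prevent some shells from interpreting the square brackets.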
Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way we can keep peeking at the original implementation when we're working on the new one. The initial content will look like this:
```py title="newmain.py"
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main():
    # A crawler of the BeautifulSoup flavor: it downloads pages and parses them with BeautifulSoup
    crawler = BeautifulSoupCrawler()

    # Register handle_listing() as the default handler for processing downloaded pages
    @crawler.router.default_handler
    async def handle_listing(context):
        # Find the page title in the parsed soup and print its text without whitespace
        print(context.soup.title.text.strip())

    # Run the crawler on the products listing URL (the Warehouse store used throughout this course)
    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == '__main__':
    asyncio.run(main())
```
In the code, we do the following:

1. We perform imports and specify an asynchronous `main()` function.
1. Inside, we first create a crawler. The crawler object controls the scraping. This particular crawler is of the BeautifulSoup flavor.
1. In the middle, we give the crawler a nested asynchronous function `handle_listing()`. Using a Python decorator (the line starting with `@`), we tell the crawler to treat it as the default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace.
1. The function ends with running the crawler on the products listing URL. We await until the crawler finishes its work.
1. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery `asyncio` will run our `main()` function.
Don't worry if this is a lot of things you've never seen before. For now, it's not really important to know exactly how [asyncio](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:
```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.010014 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
Sales
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 1        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [1]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ 0.308998 │
│ requests_finished_per_minute  │ 185      │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.308998 │
│ requests_total                │ 1        │
│ crawler_runtime               │ 0.323721 │
└───────────────────────────────┴──────────┘
```
If our previous program didn't give us any sense of progress, Crawlee feeds us perhaps too much information for our purposes. Between all the diagnostics, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.
## Crawling product detail pages
<!--
pip install 'crawlee[beautifulsoup]'
-->
:::danger Work in progress
This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.