
Commit f5e25c6

feat: add first crawlee example
1 parent 813ca27 commit f5e25c6

File tree

1 file changed: +89 -1 lines changed

1 file changed

+89
-1
lines changed

sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 89 additions & 1 deletion
@@ -26,12 +26,100 @@ In this lesson, we'll tackle all the above issues by using a scraping framework
From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.

We genuinely believe beginners to scraping will like it more, since it allows you to create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.

:::
## Installing Crawlee
When starting with the Crawlee framework, you first need to decide which approach to downloading and parsing you prefer. We want the one based on BeautifulSoup, hence we'll install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies of its own, so expect the installation to take a while.

```text
$ pip install crawlee[beautifulsoup]
...
Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...
```
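One caveat: in some shells, such as zsh, square brackets are special globbing characters, so the extras syntax may need quoting:

```text
$ pip install 'crawlee[beautifulsoup]'
```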
## Running Crawlee
Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way we can keep peeking at the original implementation while we work on the new one. The initial content will look like this:

```py title="newmain.py"
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        print(context.soup.title.text.strip())

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```
In the code we do the following:

1. We perform imports and specify an asynchronous `main()` function.
1. Inside, we first create a crawler. The crawler object controls the scraping. This particular crawler is of the BeautifulSoup flavor.
1. In the middle, we give the crawler a nested asynchronous function, `handle_listing()`. Using a Python decorator (the line starting with `@`), we tell the crawler to treat it as the default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace. (A variant with type hints is sketched after this list.)
1. The function ends with running the crawler on the product listing URL. We await the crawler until it finishes its work.
1. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery `asyncio` will run our `main()` function.
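As noted above, the decorator and the handler deserve a second look. The `@` line is plain Python shorthand for an ordinary function call, and because Crawlee builds on type hints, you can also annotate the handler's context to get autocompletion in your editor. Here's a minimal sketch of such an annotated handler; we're assuming a `BeautifulSoupCrawlingContext` type is importable alongside the crawler class, so check the Crawlee docs for the exact name:

```py
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

# The decorator syntax below is plain Python, equivalent to the ordinary call
# handle_listing = crawler.router.default_handler(handle_listing)
@crawler.router.default_handler
async def handle_listing(context: BeautifulSoupCrawlingContext):
    # The annotation lets editors autocomplete attributes such as `context.soup`
    print(context.soup.title.text.strip())
```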
Don't worry if much of this is new to you. For now it's not really important to know exactly how [asyncio](https://docs.python.org/3/library/asyncio.html) works, or what decorators do. Let's stick to the practical side and see what the program does when executed:
```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.010014 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
Sales
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 1        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [1]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ 0.308998 │
│ requests_finished_per_minute  │ 185      │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.308998 │
│ requests_total                │ 1        │
│ crawler_runtime               │ 0.323721 │
└───────────────────────────────┴──────────┘
```
Where our previous program gave us no sense of progress at all, Crawlee perhaps feeds us too much information for our purposes. Among all the diagnostics, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.
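If you'd like to tame the diagnostics, note that the messages above come from loggers in the `crawlee` namespace, such as `crawlee._autoscaling.autoscaled_pool`. Assuming Crawlee respects levels set through Python's standard `logging` module before the crawler starts (worth verifying against the docs), a sketch like this should hide the INFO chatter:

```py
import logging

# Raise the level of the parent `crawlee` logger so that INFO messages
# from loggers like `crawlee._autoscaling.autoscaled_pool` are dropped
logging.getLogger("crawlee").setLevel(logging.WARNING)
```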
## Crawling product detail pages
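As a rough preview of where we're headed, Crawlee handlers can also enqueue further pages for crawling and route them to other handlers. The sketch below assumes the context offers an `enqueue_links()` helper and the router supports labeled handlers; both the CSS selector and the label are illustrative assumptions, so treat this as the shape of things to come rather than the lesson's final code:

```py
@crawler.router.default_handler
async def handle_listing(context):
    # Hypothetical: follow links to product detail pages found on the listing
    await context.enqueue_links(selector=".product-item__title", label="DETAIL")

@crawler.router.handler("DETAIL")
async def handle_detail(context):
    # Hypothetical: each enqueued detail page is processed by this labeled handler
    print(context.soup.title.text.strip())
```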
<!--

pip install 'crawlee[beautifulsoup]'

-->
:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
