## Crawling product detail pages
The code now features advanced Python concepts, so it's less accessible to programming beginners, and the program is about as long as it would be without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive.
As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:
```py
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main():
    crawler = BeautifulSoupCrawler()

    # the default handler processes the listing page
    @crawler.router.default_handler
    async def handle_listing(context):
        # follow links to product detail pages and label those requests as DETAIL
        await context.enqueue_links(label="DETAIL", selector=".product-item__title")

    # this handler processes only requests labeled as DETAIL
    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        # for now, just print the URL of the product detail page
        print(context.request.url)

    # start the crawl from the Sales page of the Warehouse store
    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```
First, it's necessary to inspect the page in the browser DevTools to figure out a CSS selector that allows us to locate the links to all the product detail pages. Then we use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all these requests as `DETAIL`.
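On the Warehouse store used throughout this course, each product title is a link with the `product-item__title` class, so `.product-item__title` should work as the selector (treat it as an assumption and verify it in DevTools for your page). The whole listing handler then boils down to a single call:

```py
@crawler.router.default_handler
async def handle_listing(context):
    # enqueue every link matched by the selector and label those requests DETAIL
    await context.enqueue_links(label="DETAIL", selector=".product-item__title")
```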
Below, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler by using a decorator, but this time it's not the default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL.
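In isolation, such a labeled handler is a minimal sketch like the following; the function name is just a convention, and what matters is the `"DETAIL"` label passed to the decorator:

```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    # runs for every request labeled DETAIL; for now it only prints the URL
    print(context.request.url)
```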
If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
...
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 25       │
│ requests_failed               │ 0        │
│ retry_histogram               │ [25]     │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ 0.349434 │
│ requests_finished_per_minute  │ 318      │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 8.735843 │
│ requests_total                │ 25       │
│ crawler_runtime               │ 4.713262 │
└───────────────────────────────┴──────────┘
```
In the final statistics, you can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers may differ, but regardless, it should be much faster than making the requests sequentially.
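To put a rough number on that, we can compare two rows of the statistics table: the cumulative time spent on all requests versus the wall-clock runtime of the crawl. It's only an approximation of what a sequential run would take, but it illustrates the gain:

```py
request_total_duration = 8.735843  # seconds spent on all requests combined (from the table)
crawler_runtime = 4.713262         # wall-clock seconds the whole crawl took

# rough speedup thanks to parallel requests
print(f"{request_total_duration / crawler_runtime:.1f}x")  # prints 1.9x
```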