
Commit dfd23ad

feat: crawl product detail pages
1 parent 20affee commit dfd23ad

File tree

1 file changed: +63 -0 lines changed


sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 63 additions & 0 deletions
@@ -108,6 +108,69 @@ If our previous scraper didn't give us any sense of progress, Crawlee feeds us w

## Crawling product detail pages

The code now features advanced Python concepts, so it's less accessible to programming beginners, and the program is about the same size as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive.

As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:
```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        # highlight-next-line
        await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")

    # highlight-next-line
    @crawler.router.handler("DETAIL")
    # highlight-next-line
    async def handle_detail(context):
        # highlight-next-line
        print(context.request.url)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```
First, it's necessary to inspect the page in browser DevTools to figure out a CSS selector that locates the links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all these requests as `DETAIL`.

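If you want to sanity-check a selector before wiring it into `enqueue_links()`, a short standalone script is enough. The following is just a sketch, assuming the `httpx` and `beautifulsoup4` packages are available; it's not part of the Crawlee program:

```py
import httpx
from bs4 import BeautifulSoup

# Download the listing page and try the selector on it.
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select(".product-list a.product-item__title"):
    # Each matched element should be a link to one product detail page.
    print(link.get("href"))
```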
Below, we give the crawler another asynchronous function, `handle_detail()`. Using a decorator, we again register the function as a handler, but this time it's not the default one: it will only take care of HTTP requests labeled `DETAIL`. For now, all it does is print the request URL.

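To preview where the `DETAIL` handler is heading, here is a sketch of the same program in which the handler also pulls the product title out of the parsed page. Treat the `context.soup` attribute and the `.product-meta__title` selector as assumptions to verify; this lesson hasn't introduced them yet:

```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")

    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        # Assumption: the crawling context exposes the parsed page as `context.soup`,
        # and `.product-meta__title` matches the product name on this site.
        title = context.soup.select_one(".product-meta__title")
        print(context.request.url, title.get_text(strip=True) if title else None)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```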
If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
...
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv
https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer
https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
...
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 25       │
│ requests_failed               │ 0        │
│ retry_histogram               │ [25]     │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ 0.349434 │
│ requests_finished_per_minute  │ 318      │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 8.735843 │
│ requests_total                │ 25       │
│ crawler_runtime               │ 4.713262 │
└───────────────────────────────┴──────────┘
```
In the final statistics, you can see that we made 25 requests (1 listing page plus 24 product pages) in less than 5 seconds. Your numbers may differ, but either way it should be much faster than making the requests sequentially. Notice that the `request_total_duration` of about 8.7 seconds is longer than the `crawler_runtime` of 4.7 seconds, which is only possible because the requests ran in parallel.
## Extracting data

## Saving data
