
Commit dfd23ad

feat: crawl product detail pages
1 parent 20affee commit dfd23ad

File tree

1 file changed: +63 -0 lines changed


sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 63 additions & 0 deletions
@@ -108,6 +108,69 @@ If our previous scraper didn't give us any sense of progress, Crawlee feeds us w

## Crawling product detail pages

The code now features advanced Python concepts, so it's less accessible to programming beginners, and the program is about the same size as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive.

As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:
```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        # highlight-next-line
        await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")

    # highlight-next-line
    @crawler.router.handler("DETAIL")
    # highlight-next-line
    async def handle_detail(context):
        # highlight-next-line
        print(context.request.url)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```
First, it's necessary to inspect the page in browser DevTools to figure out a CSS selector that locates the links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all these requests as `DETAIL`.

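If you want to sanity-check a selector before wiring it into `enqueue_links()`, a short standalone script is enough. The following is just a sketch, assuming the `httpx` and `beautifulsoup4` packages are available; it's not part of the Crawlee program:

```py
import httpx
from bs4 import BeautifulSoup

# Download the listing page and try the selector on it.
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select(".product-list a.product-item__title"):
    # Each matched element should be a link to one product detail page.
    print(link.get("href"))
```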
Below, we give the crawler another asynchronous function, `handle_detail()`. Using a decorator, we again register the function as a handler, but this time it's not the default one: it will only take care of HTTP requests labeled `DETAIL`. For now, all it does is print the request URL.

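To preview where the `DETAIL` handler is heading, here is a sketch of the same program in which the handler also pulls the product title out of the parsed page. Treat the `context.soup` attribute and the `.product-meta__title` selector as assumptions to verify; this lesson hasn't introduced them yet:

```py
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")

    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        # Assumption: the crawling context exposes the parsed page as `context.soup`,
        # and `.product-meta__title` matches the product name on this site.
        title = context.soup.select_one(".product-meta__title")
        print(context.request.url, title.get_text(strip=True) if title else None)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

if __name__ == '__main__':
    asyncio.run(main())
```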
If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
...
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv
https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer
https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
...
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 25       │
│ requests_failed               │ 0        │
│ retry_histogram               │ [25]     │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ 0.349434 │
│ requests_finished_per_minute  │ 318      │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 8.735843 │
│ requests_total                │ 25       │
│ crawler_runtime               │ 4.713262 │
└───────────────────────────────┴──────────┘
```
In the final statistics, you can see that we made 25 requests (1 listing page plus 24 product pages) in less than 5 seconds. Your numbers may differ, but either way it should be much faster than making the requests sequentially. Notice that the `request_total_duration` of about 8.7 seconds is longer than the `crawler_runtime` of 4.7 seconds, which is only possible because the requests ran in parallel.
## Extracting data

## Saving data
