sources/academy/webscraping/scraping_basics_python/12_framework.md
16 additions & 6 deletions
@@ -108,9 +108,9 @@ If our previous scraper didn't give us any sense of progress, Crawlee feeds us w
## Crawling product detail pages

-The code now features advanced Python concepts, so it's less accessible to beginners to programming, and the size of the program is about the same as if we worked without framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive.
+The code now features advanced Python concepts, so it's less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive.

-As we'll rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:
+As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages:

```py
import asyncio
@@ -137,16 +137,18 @@ if __name__ == '__main__':
asyncio.run(main())
```

-First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector which allows us to locate links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links, and add them to the Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`.
+First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`.
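
For a sense of how these pieces fit together, here's a minimal sketch of what the updated program could look like. The CSS selector `.product-item > a` and the starting URL are placeholders assumed for illustration; only `enqueue_links()` and the `DETAIL` label come from the text above.

```py
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # Find all product links on the listing page and add them to
        # Crawlee's internal request queue, labeling each request as DETAIL.
        # The selector below is a placeholder assumed for illustration.
        await context.enqueue_links(selector='.product-item > a', label='DETAIL')

    # Placeholder starting URL, assumed for illustration.
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales'])


if __name__ == '__main__':
    asyncio.run(main())
```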

-Below, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is that it prints the request URL.
+Below that, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL.
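
A sketch of how such a labeled handler could be registered, placed inside `main()` right after the default handler in the sketch above; the router's `handler()` decorator with the `DETAIL` label does the routing described here:

```py
    @crawler.router.handler('DETAIL')
    async def handle_detail(context: BeautifulSoupCrawlingContext):
        # Runs only for requests labeled DETAIL; for now, just print the URL.
        print(context.request.url)
```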

-If we run the code, we should see how Crawlee first downloads the listing page, and then makes parallel requests to each of the detail pages, printing their URLs on the way:
+If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:

```text
$ python newmain.py
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+┌───────────────────────────────┬──────────┐
...
+└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
-In the final statistics you can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers can differ, but regardless it should be much faster than making the requests sequentially.
+In the final statistics, you can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially.

## Extracting data

+The BeautifulSoup crawler provides handlers with the `context.soup` attribute, where we can find the parsed HTML of the handled page. This is the same as the `soup` we had in our previous program.
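
As a rough illustration, the `handle_detail()` sketch above could use `context.soup` the same way the previous program used `soup`. The `.product-meta__title` selector is a placeholder assumed for illustration, not taken from the diff:

```py
    @crawler.router.handler('DETAIL')
    async def handle_detail(context: BeautifulSoupCrawlingContext):
        # context.soup is the parsed BeautifulSoup document of the detail page,
        # so we can query it just like the `soup` object in the previous program.
        title = context.soup.select_one('.product-meta__title')
        print(context.request.url, title.text.strip() if title else None)
```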

+
+:::danger Work in progress
+
+This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.