feat: add lesson about using the platform (with webp images) (#1556)
I messed up #1424 trying to
remove PNG files from commit history. This is a new PR with (hopefully)
all the original commits correctly rewritten and cherry-picked.
---------
Co-authored-by: Michał Olender <[email protected]>
-Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way, we can keep peeking at the original implementation while working on the new one. The initial content will look like this:
+Now let's use the framework to create a new version of our scraper. Rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, create a new, empty `main.py`. The initial content will look like this:
 
-```py title="newmain.py"
+```py
 import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
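For orientation, a complete runnable version of such an initial `main.py` might look like the sketch below. It is only a sketch: the start URL and the handler are placeholders, not the lesson's actual target site, and the `BeautifulSoupCrawlingContext` import assumes it lives in `crawlee.crawlers` alongside the crawler class shown in the diff.

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request that has no more specific handler.
    @crawler.router.default_handler
    async def handle_page(context: BeautifulSoupCrawlingContext):
        # context.soup is the parsed BeautifulSoup document of the downloaded page.
        context.log.info(context.soup.title.text.strip())

    await crawler.run(['https://example.com/'])


if __name__ == '__main__':
    asyncio.run(main())
```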
@@ -74,8 +74,8 @@ In the code, we do the following:
 Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:
 
 ```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished             │ 0        │
 │ requests_failed               │ 0        │
@@ -91,7 +91,7 @@ $ python newmain.py
 [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
 Sales
 [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished             │ 1        │
 │ requests_failed               │ 0        │
@@ -122,7 +122,7 @@ For example, it takes a single line of code to extract and follow links to produ
 
 ```py
 import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
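This is the part of the lesson where a single call is enough to discover and follow the product links, with a second handler (`handle_detail()`) receiving each product page. A minimal self-contained sketch of that pattern follows; the start URL, the CSS selector, and the `DETAIL` label are assumptions made up for illustration, not the lesson's actual values.

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # One call extracts the matching links, resolves them against the current
        # URL, and enqueues them, labelling the new requests for a dedicated handler.
        await context.enqueue_links(selector='.product-link', label='DETAIL')

    @crawler.router.handler('DETAIL')
    async def handle_detail(context: BeautifulSoupCrawlingContext):
        # Runs once per product page; requests are processed in parallel.
        context.log.info(context.request.url)

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```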
@@ -152,8 +152,8 @@ Below that, we give the crawler another asynchronous function, `handle_detail()`
 If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
 
 ```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
 [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished             │ 25       │
 │ requests_failed               │ 0        │
@@ -232,7 +232,7 @@ Finally, the variants. We can reuse the `parse_variant()` function as-is, and in
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
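The `Decimal` import in the hunk above suggests prices are handled as exact decimal numbers rather than floats. A tiny stand-alone illustration of that idea; the price text and its format are made up:

```py
from decimal import Decimal

price_text = '$1,998.00'  # hypothetical text captured from a product page
# Strip the currency symbol and thousands separators, then parse exactly.
price = Decimal(price_text.replace('$', '').replace(',', ''))
print(price)  # 1998.00
```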
@@ -309,7 +309,7 @@ async def main():
         await context.push_data(item)
 ```
 
-That's it! If you run the program now, there should be a `storage` directory alongside the `newmain.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
+That's it! If you run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
 
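To inspect the results described above without any extra tooling, a small stand-alone script can read the dataset files back. The `storage/datasets/default` path comes straight from the paragraph in the diff; the script itself is just a convenience sketch using the standard library:

```py
import json
from pathlib import Path

# Each JSON file in the default dataset holds one item saved via push_data().
for path in sorted(Path('storage/datasets/default').glob('*.json')):
    item = json.loads(path.read_text())
    print(path.name, item)
```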
@@ -335,7 +335,7 @@ Crawlee gives us stats about HTTP requests and concurrency, but we don't get muc
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
@@ -398,7 +398,7 @@ if __name__ == '__main__':
 
 Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are just [standard Python loggers](https://docs.python.org/3/library/logging.html).
 
-Even with the extra logging we've added, we've managed to cut at least 20 lines of code compared to the original program. Throughout this lesson, we've been adding features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
+If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
 
 In the next lesson, we'll use a scraping platform to set up our application to run automatically every day.
 
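Since the diff states that `context.log` and `crawler.log` are ordinary Python loggers, the standard `logging` machinery also applies to the framework's own loggers, whose names (such as `crawlee._autoscaling.autoscaled_pool`) appear in the output shown earlier. A minimal sketch of dialling verbosity up or down; the exact effect depends on how logging is configured in the rest of the program:

```py
import logging

# Silence the autoscaling status lines seen in the output above...
logging.getLogger('crawlee._autoscaling.autoscaled_pool').setLevel(logging.WARNING)

# ...or make every logger under the 'crawlee' namespace more verbose instead.
logging.getLogger('crawlee').setLevel(logging.DEBUG)
```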
@@ -454,7 +454,7 @@ Hints:
 import asyncio
 from datetime import datetime
 
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
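The exercise hint above imports `datetime`; how exactly the exercise uses it isn't visible in this diff. Purely as a refresher, one common pattern is stamping each scraped item with the time it was collected (the item fields here are made up):

```py
from datetime import datetime, timezone

item = {'title': 'Example product'}  # hypothetical scraped item
# Record when the item was scraped, in UTC, as an ISO 8601 string.
item['scraped_at'] = datetime.now(timezone.utc).isoformat()
print(item)
```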
@@ -554,7 +554,7 @@ When navigating to the first search result, you might find it helpful to know th
 from urllib.parse import quote_plus
 
 from crawlee import Request
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
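The `quote_plus` import in this last hunk points at building a search URL from a query string. A tiny stand-alone illustration of what that helper does; the base URL is made up:

```py
from urllib.parse import quote_plus

query = 'wireless headphones'
# quote_plus percent-encodes the text and turns spaces into '+',
# which is the form most search endpoints expect in a query string.
url = f'https://example.com/search?q={quote_plus(query)}'
print(url)  # https://example.com/search?q=wireless+headphones
```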