
Commit 48a73d4

Browse files
honzajavorek and TC-MO authored
feat: add lesson about using the platform (with webp images) (#1556)
I messed up #1424 trying to remove PNG files from commit history. This is a new PR with (hopefully) all the original commits correctly rewritten and cherry-picked.

Co-authored-by: Michał Olender <[email protected]>
1 parent 40b4572 commit 48a73d4

9 files changed: +442 -25 lines changed


.github/styles/config/vocabularies/Docs/accept.txt

Lines changed: 2 additions & 1 deletion
@@ -88,19 +88,20 @@ preconfigured
 devs
 asyncio
-Langflow
 backlinks?
 captchas?
 Chatbot
 combinator
 deduplicating
+dev
 Fakestore
 Fandom('s)?
 IMDb
 influencers
 iPads?
 iPhones?
 jQuery
+Langflow
 learnings
 livestreams
 outro

sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 16 additions & 16 deletions
@@ -44,11 +44,11 @@ Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...
 
 ## Running Crawlee
 
-Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way, we can keep peeking at the original implementation while working on the new one. The initial content will look like this:
+Now let's use the framework to create a new version of our scraper. Rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, create a new, empty `main.py`. The initial content will look like this:
 
-```py title="newmain.py"
+```py
 import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
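
For orientation, here is a rough sketch of what the completed initial `main.py` described above might look like. Only the import, `main()`, and the `BeautifulSoupCrawler()` call are visible in this hunk; the default handler, the printed title, and the start URL are assumptions inferred from the run output further below, not code taken from this commit.

```py
import asyncio
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    # Assumption: a default handler that prints the page title, which would
    # explain the lone "Sales" line in the log output shown below.
    @crawler.router.default_handler
    async def handle_listing(context):
        print(context.soup.title.text.strip())

    # Assumption: the Warehouse store's Sales collection as the start URL.
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales'])

if __name__ == '__main__':
    asyncio.run(main())
```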
@@ -74,8 +74,8 @@ In the code, we do the following:
 Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:
 
 ```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished │ 0 │
 │ requests_failed │ 0 │
@@ -91,7 +91,7 @@ $ python newmain.py
 [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
 Sales
 [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished │ 1 │
 │ requests_failed │ 0 │
@@ -122,7 +122,7 @@ For example, it takes a single line of code to extract and follow links to produ
 
 ```py
 import asyncio
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
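
The single line that extracts and follows product links, mentioned in this hunk's context, isn't visible in the diff itself. Below is a hedged sketch of how the two handlers might fit together using Crawlee's `enqueue_links`; the CSS selector, the `DETAIL` label, and the start URL are illustrative assumptions, not code from this commit.

```py
import asyncio
from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        # Assumption: one call that finds product links on the listing page
        # and enqueues them for the DETAIL handler below.
        await context.enqueue_links(selector='.product-item__title', label='DETAIL')

    @crawler.router.handler('DETAIL')
    async def handle_detail(context):
        # Print each detail page URL, as in the run output shown below.
        print(context.request.url)

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales'])

if __name__ == '__main__':
    asyncio.run(main())
```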
@@ -152,8 +152,8 @@ Below that, we give the crawler another asynchronous function, `handle_detail()`
 If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way:
 
 ```text
-$ python newmain.py
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+$ python main.py
+[BeautifulSoupCrawler] INFO Current request statistics:
 ┌───────────────────────────────┬──────────┐
 ...
 └───────────────────────────────┴──────────┘
@@ -164,7 +164,7 @@ https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-s
 https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable
 ...
 [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
-[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+[BeautifulSoupCrawler] INFO Final request statistics:
 ┌───────────────────────────────┬──────────┐
 │ requests_finished │ 25 │
 │ requests_failed │ 0 │
@@ -232,7 +232,7 @@ Finally, the variants. We can reuse the `parse_variant()` function as-is, and in
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
@@ -309,7 +309,7 @@ async def main():
         await context.push_data(item)
 ```
 
-That's it! If you run the program now, there should be a `storage` directory alongside the `newmain.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
+That's it! If you run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
 
 ![Single dataset item](images/dataset-item.png)
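
The paragraph changed above describes where `context.push_data()` writes its output. As a quick, hedged sketch (not part of this commit), the stored items could be inspected with a few lines of standard-library Python, assuming the default on-disk layout the paragraph describes:

```py
import json
from pathlib import Path

# Crawlee's default dataset location, as described in the lesson text above.
dataset_dir = Path('storage/datasets/default')

# Each pushed item is stored as a separate JSON file.
for path in sorted(dataset_dir.glob('*.json')):
    item = json.loads(path.read_text())
    print(path.name, item)
```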

@@ -335,7 +335,7 @@ Crawlee gives us stats about HTTP requests and concurrency, but we don't get muc
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
@@ -398,7 +398,7 @@ if __name__ == '__main__':
 
 Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are just [standard Python loggers](https://docs.python.org/3/library/logging.html).
 
-Even with the extra logging we've added, we've managed to cut at least 20 lines of code compared to the original program. Throughout this lesson, we've been adding features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
+If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but the new code is still clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about, while the framework manages the rest.
 
 In the next lesson, we'll use a scraping platform to set up our application to run automatically every day.
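
Since the changed paragraph notes that `context.log` and `crawler.log` are standard Python loggers, here is a small sketch of the kind of logging tweak the text refers to; the handler, the message, and the start URL are assumptions for illustration only.

```py
import asyncio
import logging

from crawlee.crawlers import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    # crawler.log is a regular logging.Logger, so the usual API applies.
    crawler.log.setLevel(logging.INFO)

    @crawler.router.default_handler
    async def handle_listing(context):
        # context.log is also a standard logger, scoped to the current request.
        context.log.info(f"Looking at {context.request.url}")

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales'])

if __name__ == '__main__':
    asyncio.run(main())
```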

@@ -454,7 +454,7 @@ Hints:
 import asyncio
 from datetime import datetime
 
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()
@@ -554,7 +554,7 @@ When navigating to the first search result, you might find it helpful to know th
 from urllib.parse import quote_plus
 
 from crawlee import Request
-from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler
 
 async def main():
     crawler = BeautifulSoupCrawler()

0 commit comments
