
Commit 1a7abc1

feat: add type hints (#1562)
Only for the framework lesson, previously discussed here #1303 (review)
1 parent 84c9388 commit 1a7abc1

File tree

1 file changed

+40 −31 lines changed

sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 40 additions & 31 deletions
@@ -47,14 +47,15 @@ Now let's use the framework to create a new version of our scraper. Rename the `

 ```py
 import asyncio
-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_listing(context):
-        print(context.soup.title.text.strip())
+    async def handle_listing(context: BeautifulSoupCrawlingContext):
+        if title := context.soup.title:
+            print(title.text.strip())

     await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
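
The new `if title := context.soup.title:` guard is there because BeautifulSoup returns `None` when a page has no `<title>` element, so calling `.text` on it directly can raise `AttributeError`. A standalone sketch of the same pattern with plain BeautifulSoup (illustrative, not part of the commit):

```py
from bs4 import BeautifulSoup

# A document with no <title>, so soup.title evaluates to None.
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# The walrus operator assigns and tests in one expression,
# so .text is only accessed when a title element actually exists.
if title := soup.title:
    print(title.text.strip())
else:
    print("no title on this page")
```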

@@ -64,13 +65,15 @@ if __name__ == '__main__':

 In the code, we do the following:

-1. We perform imports and specify an asynchronous `main()` function.
-2. Inside, we first create a crawler. The crawler objects control the scraping. This particular crawler is of the BeautifulSoup flavor.
-3. In the middle, we give the crawler a nested asynchronous function `handle_listing()`. Using a Python decorator (that line starting with `@`), we tell it to treat it as a default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace.
-4. The function ends with running the crawler with the product listing URL. We await the crawler to finish its work.
-5. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery will run our `main()` function.
+1. We import the necessary modules and define an asynchronous `main()` function.
+2. Inside `main()`, we first create a crawler object. This object manages the scraping process. In this case, it's a BeautifulSoup-based crawler.
+3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what you can do with the object.
+4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the _default handler_ for processing HTTP responses.
+5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace.
+6. At the end of the function, we run the crawler on a product listing URL and await its completion.
+7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop.

-Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:
+Don't worry if some of this is new. You don't need to fully understand [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints just yet. Let's stick to the practical side and observe what the program does when executed:

 ```text
 $ python main.py
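
The hunk ends before the file's entry point, but its header shows the context line `if __name__ == '__main__':`. The "last two lines" mentioned in item 7 presumably follow the standard `asyncio` idiom, sketched here:

```py
# Presumed entry point; only its first line is visible in the hunk header above.
if __name__ == '__main__':
    asyncio.run(main())  # start the event loop and run main() to completion
```
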
@@ -107,9 +110,9 @@ Sales

 If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.

-:::tip Asynchronous code and decorators
+:::tip Advanced Python features

-You don't need to be an expert in asynchronous programming or decorators to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/) and [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/).
+You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/).

 :::
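
A minimal illustration of the point about editors (not from the commit): once a parameter carries a type hint, tooling knows which attributes and methods to offer.

```py
def shout(text: str) -> str:
    # Because `text` is annotated as `str`, an editor can suggest string
    # methods such as .upper() here; the same mechanism lets it suggest
    # `soup`, `request`, or `log` on a typed BeautifulSoupCrawlingContext.
    return f"{text.upper()}!"

print(shout("sales"))  # SALES!
```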

@@ -121,20 +124,20 @@ For example, it takes a single line of code to extract and follow links to produ

 ```py
 import asyncio
-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_listing(context):
+    async def handle_listing(context: BeautifulSoupCrawlingContext):
         # highlight-next-line
         await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title")

     # highlight-next-line
     @crawler.router.handler("DETAIL")
     # highlight-next-line
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         # highlight-next-line
         print(context.request.url)

@@ -189,7 +192,7 @@ async def main():
     ...

     @crawler.router.handler("DETAIL")
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         item = {
             "url": context.request.url,
             "title": context.soup.select_one(".product-meta__title").text.strip(),

@@ -198,6 +201,12 @@ async def main():
         print(item)
 ```

+:::note Fragile code
+
+The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing.
+
+:::
+
 Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper.

 The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
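
As an aside to the `Fragile code` note above: if robustness were wanted, an explicit `None` check would satisfy a type checker and fail with a clearer error. A sketch using plain BeautifulSoup (illustrative, not part of the commit; the selector is taken from the diff):

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1 class="product-meta__title"> Sofa </h1>', "html.parser")

# After the explicit check, a type checker knows `title_cell` is a Tag,
# and a missing element fails fast with a clear message instead of an
# AttributeError on None.
title_cell = soup.select_one(".product-meta__title")
if title_cell is None:
    raise ValueError("expected .product-meta__title to be present")
print(title_cell.text.strip())  # Sofa
```
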
@@ -207,7 +216,7 @@ async def main():
     ...

     @crawler.router.handler("DETAIL")
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         price_text = (
             context.soup
             # highlight-next-line

@@ -231,17 +240,17 @@ Finally, the variants. We can reuse the `parse_variant()` function as-is, and in
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_listing(context):
+    async def handle_listing(context: BeautifulSoupCrawlingContext):
         await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

     @crawler.router.handler("DETAIL")
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         price_text = (
             context.soup
             .select_one(".product-form__info-content .price")

@@ -292,7 +301,7 @@ async def main():
     ...

     @crawler.router.handler("DETAIL")
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         price_text = (
             ...
         )

@@ -334,19 +343,19 @@ Crawlee gives us stats about HTTP requests and concurrency, but we don't get muc
 ```py
 import asyncio
 from decimal import Decimal
-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_listing(context):
+    async def handle_listing(context: BeautifulSoupCrawlingContext):
         # highlight-next-line
         context.log.info("Looking for product detail pages")
         await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

     @crawler.router.handler("DETAIL")
-    async def handle_detail(context):
+    async def handle_detail(context: BeautifulSoupCrawlingContext):
         # highlight-next-line
         context.log.info(f"Product detail page: {context.request.url}")
         price_text = (

@@ -453,17 +462,17 @@ Hints:
 import asyncio
 from datetime import datetime

-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_listing(context):
+    async def handle_listing(context: BeautifulSoupCrawlingContext):
         await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")

     @crawler.router.handler("DRIVER")
-    async def handle_driver(context):
+    async def handle_driver(context: BeautifulSoupCrawlingContext):
         info = {}
         for row in context.soup.select(".common-driver-info li"):
             name = row.select_one("span").text.strip()

@@ -531,7 +540,7 @@ async def main():
     ...

     @crawler.router.default_handler
-    async def handle_netflix_table(context):
+    async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
         requests = []
         for name_cell in context.soup.select(...):
             name = name_cell.text.strip()

@@ -553,13 +562,13 @@ When navigating to the first search result, you might find it helpful to know th
 from urllib.parse import quote_plus

 from crawlee import Request
-from crawlee.crawlers import BeautifulSoupCrawler
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

 async def main():
     crawler = BeautifulSoupCrawler()

     @crawler.router.default_handler
-    async def handle_netflix_table(context):
+    async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
         requests = []
         for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
             name = name_cell.text.strip()

@@ -568,11 +577,11 @@ When navigating to the first search result, you might find it helpful to know th
         await context.add_requests(requests)

     @crawler.router.handler("IMDB_SEARCH")
-    async def handle_imdb_search(context):
+    async def handle_imdb_search(context: BeautifulSoupCrawlingContext):
         await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)

     @crawler.router.handler("IMDB")
-    async def handle_imdb(context):
+    async def handle_imdb(context: BeautifulSoupCrawlingContext):
         rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
         rating_text = context.soup.select_one(rating_selector).text.strip()
         await context.push_data({
