1. We import the necessary modules and define an asynchronous `main()` function.
2. Inside `main()`, we first create a crawler object. This object manages the scraping process. In this case, it's a BeautifulSoup-based crawler.
3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what you can do with the object.
4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the _default handler_ for processing HTTP responses.
5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace.
6. At the end of the function, we run the crawler on a product listing URL and await its completion.
7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop.
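
Putting those steps together, a minimal version of such a program could look like the sketch below. It follows Crawlee's `BeautifulSoupCrawler` API; the listing URL is a placeholder for whichever product listing page you're scraping, not one prescribed by the lesson:

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    # Create the crawler object that manages the scraping process (step 2).
    crawler = BeautifulSoupCrawler()

    # The decorator registers handle_listing() as the default handler (step 4).
    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # Extract the page title from the soup object and print it (step 5).
        if context.soup.title:
            print(context.soup.title.text.strip())

    # Run the crawler on a product listing URL and await completion (step 6).
    # The URL below is a placeholder.
    await crawler.run(["https://example.com/collections/sales"])


if __name__ == "__main__":
    # Hand main() to Python's asynchronous event loop (step 7).
    asyncio.run(main())
```
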
Don't worry if some of this is new. You don't need to fully understand [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints just yet. Let's stick to the practical side and observe what the program does when executed:

```text
$ python main.py
...
Sales
...
```

If our previous scraper didn't give us any sense of progress, Crawlee feeds us perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.

:::tip Advanced Python features
You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/).
:::

For example, it takes a single line of code to extract and follow links to products:

```py
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
# ...
```

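The rest of the program is omitted above, but the single line in question is Crawlee's `enqueue_links()` helper. As a sketch, a listing handler that follows every product link might look like this; the CSS selector is an assumption about the store's markup, not something Crawlee prescribes:

```py
crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handle_listing(context: BeautifulSoupCrawlingContext):
    # Find all product links on the listing page and enqueue them
    # for crawling. The selector is a guess; adjust it to your page.
    await context.enqueue_links(selector=".product-item a")
```
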
The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing.
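
To make the trade-off concrete, here's a small sketch of both styles; the selector and markup are made up for illustration:

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1 class="product-meta__title">Sales</h1>', "html.parser")

# Fail fast: raises AttributeError if the element is missing,
# stopping the scraper and surfacing the problem immediately.
title = soup.select_one(".product-meta__title").text.strip()

# Defensive alternative: check for None before touching the element.
node = soup.select_one(".product-meta__title")
if node is None:
    raise ValueError("expected a .product-meta__title element on the page")
title = node.text.strip()
```
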
Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper.

The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
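
A sketch of what that more specific lookup could be, assuming a hypothetical `.product-form__info-content` wrapper around the buy box; check the actual markup of your page before copying this:

```py
from decimal import Decimal

# Scope the match to the product's buy box so a price elsewhere
# on the page can't be picked up by accident.
price_text = (
    context.soup
    .select_one(".product-form__info-content .price")
    .text
    .strip()
    .replace("$", "")
    .replace(",", "")
)
price = Decimal(price_text)
```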