Commit ae5abd2

feat: section about extracting data
1 parent dcebc72 commit ae5abd2

1 file changed: +91 −4

sources/academy/webscraping/scraping_basics_python/12_framework.md

@@ -175,13 +175,100 @@ In the final statistics, you can see that we made 25 requests (1 listing page +

## Extracting data

The BeautifulSoup crawler provides handlers with the `context.soup` attribute, where we can find the parsed HTML of the handled page. This is the same as the `soup` we had in our previous program. Let's locate and extract the same data as before:

```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    item = {
        "url": context.request.url,
        "title": context.soup.select_one(".product-meta__title").text.strip(),
        "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
    }
    print(item)
```
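
Note that `select_one()` returns `None` when nothing matches the selector, so calling `.text` on the result raises an `AttributeError` if a page doesn't look as expected. The course code above doesn't guard against this, but as a minimal sketch, it could look like this:

```py
title_node = context.soup.select_one(".product-meta__title")
if title_node is None:
    # Hypothetical guard: fail loudly with the offending URL instead of
    # crashing with a less helpful AttributeError.
    raise ValueError(f"Title not found: {context.request.url}")
title = title_node.text.strip()
```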

Now the price. We won't be inventing anything new here. Let's add the `Decimal` import and copy-paste the code from our old scraper.

The only change will be in the selector. In `main.py`, we were looking for `.price` inside a `product_soup` object representing a product card. Now we're looking for `.price` inside the whole product detail page. It's safer to be more specific so that we don't match another price on the same page:
```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    price_text = (
        context.soup
        # highlight-next-line
        .select_one(".product-form__info-content .price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    item = {
        "url": context.request.url,
        "title": context.soup.select_one(".product-meta__title").text.strip(),
        "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
        "price": Decimal(price_text),
    }
    print(item)
```

Finally, variants. We can reuse the `parse_variant()` function as it is, and even the handler code will look similar to what we already had. The whole program will look like this:
```py
import asyncio
from decimal import Decimal
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        price_text = (
            context.soup
            .select_one(".product-form__info-content .price")
            .contents[-1]
            .strip()
            .replace("$", "")
            .replace(",", "")
        )
        item = {
            "url": context.request.url,
            "title": context.soup.select_one(".product-meta__title").text.strip(),
            "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
            "price": Decimal(price_text),
            "variant_name": None,
        }
        if variants := context.soup.select(".product-form__option.no-js option"):
            for variant in variants:
                print(item | parse_variant(variant))
        else:
            print(item)

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

def parse_variant(variant):
    text = variant.text.strip()
    name, price_text = text.split(" - ")
    price = Decimal(
        price_text
        .replace("$", "")
        .replace(",", "")
    )
    return {"variant_name": name, "price": price}

if __name__ == '__main__':
    asyncio.run(main())
```

If you run this scraper, you should see the same data about the 24 products as before. Crawlee has saved us a lot of work with downloading, parsing, logging, and parallelization. The code is also easier to follow with the two handlers separated and labeled.
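
For example, one of the printed items could look something like this (the exact product, URL, and values are illustrative and depend on the store's current data):

```text
{'url': 'https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker', 'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'vendor': 'JBL', 'price': Decimal('74.95'), 'variant_name': 'Black'}
```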

Crawlee doesn't help much with locating and extracting the data; that code is almost identical with or without a framework. That's because the detective work involved and the extraction itself are the main added value of custom-made scrapers. With Crawlee, you can focus on just that, and let the framework take care of the rest.

## Saving data
