Commit 75d559e

feat: add exercise

1 parent bb5b4bd

1 file changed: +85 −3 lines changed

sources/academy/webscraping/scraping_basics_python/12_framework.md
@@ -406,8 +406,90 @@ In the next lesson, we'll use a scraping platform to set up our application to r

<Exercises />

Removed:

:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::

Added:

### Build a Crawlee scraper of F1 Academy drivers

Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy) drivers listed on the official [Drivers](https://www.f1academy.com/Racing-Series/Drivers) page. Each item you push to Crawlee's default dataset should contain the following data:
- URL of the driver's f1academy.com page
- Name
- Team
- Nationality
- Date of birth (as a `date()` object)
- Instagram URL
If you export the dataset as JSON, you should see something like this:

```text
[
    {
        "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus",
        "name": "Emely De Heus",
        "team": "MP Motorsport",
        "nationality": "Dutch",
        "dob": "2003-02-10",
        "instagram_url": "https://www.instagram.com/emely.de.heus/"
    },
    {
        "url": "https://www.f1academy.com/Racing-Series/Drivers/28/Hamda-Al-Qubaisi",
        "name": "Hamda Al Qubaisi",
        "team": "MP Motorsport",
        "nationality": "Emirati",
        "dob": "2002-08-08",
        "instagram_url": "https://www.instagram.com/hamdaalqubaisi_official/"
    },
    ...
]
```
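Note that the `dob` values in the sample output are ISO strings, even though the exercise asks you to push `date()` objects. Python's `json` module can't serialize `date` objects out of the box; a minimal sketch of one common way to get ISO strings at export time (the values below are sample data, and `default=str` is just one possible fallback, not necessarily what Crawlee uses internally):

```py
import json
from datetime import date

# date() isn't JSON-serializable by default; default=str renders it
# as an ISO string, matching the "dob" values in the sample output
item = {"name": "Emely De Heus", "dob": date(2003, 2, 10)}
print(json.dumps(item, default=str))  # {"name": "Emely De Heus", "dob": "2003-02-10"}
```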
Hints:

- Use Python's native `datetime.strptime(text, "%d/%m/%Y").date()` to parse the `DD/MM/YYYY` date format. See [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) to learn more.
- Use the attribute selector `a[href*='instagram']` to locate the Instagram URL. See [docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to learn more.
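Both hints can be tried out in isolation before wiring them into a crawler. A small sketch using a hypothetical HTML snippet (the markup and handles below are made up for illustration, not taken from the real page):

```py
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a block of social links
html = """
<div class="common-social-share">
  <a href="https://www.instagram.com/example.driver/">Instagram</a>
  <a href="https://twitter.com/example_driver">Twitter</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The attribute selector matches any anchor whose href contains "instagram"
link = soup.select_one("a[href*='instagram']")
print(link.get("href"))  # https://www.instagram.com/example.driver/

# Parsing a DD/MM/YYYY string into a date object
dob = datetime.strptime("10/02/2003", "%d/%m/%Y").date()
print(dob)  # 2003-02-10
```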
<details>
<summary>Solution</summary>

```py
import asyncio
from datetime import datetime

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")

    @crawler.router.handler("DRIVER")
    async def handle_driver(context):
        info = {}
        for row in context.soup.select(".common-driver-info li"):
            name = row.select_one("span").text.strip()
            value = row.select_one("h4").text.strip()
            info[name] = value

        detail = {}
        for row in context.soup.select(".driver-detail--cta-group a"):
            name = row.select_one("p").text.strip()
            value = row.select_one("h2").text.strip()
            detail[name] = value

        await context.push_data({
            "url": context.request.url,
            "name": context.soup.select_one("h1").text.strip(),
            "team": detail["Team"],
            "nationality": info["Nationality"],
            "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(),
            "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"),
        })

    await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"])
    await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)

if __name__ == '__main__':
    asyncio.run(main())
```

</details>
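The solution's handlers turn rows of label/value pairs into lookup dicts before assembling the final item. That pattern works standalone too; a minimal sketch against a hypothetical snippet (the markup below mirrors the structure the solution expects, but the actual page's HTML may differ):

```py
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the label/value rows the solution iterates over
html = """
<ul class="common-driver-info">
  <li><span>Nationality</span><h4>Dutch</h4></li>
  <li><span>DOB</span><h4>10/02/2003</h4></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each row's label (span) to its value (h4)
info = {}
for row in soup.select(".common-driver-info li"):
    name = row.select_one("span").text.strip()
    value = row.select_one("h4").text.strip()
    info[name] = value

print(info)  # {'Nationality': 'Dutch', 'DOB': '10/02/2003'}
```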
