
Commit f947c03

feat: add exercise
1 parent 75d559e commit f947c03

2 files changed: +99 −2 lines


sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md

Lines changed: 1 addition & 1 deletion
@@ -325,7 +325,7 @@ For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprin

 Your output should look something like this:

-```text
+```py
 {'title': 'Senior Full Stack Developer',
  'company': 'Baserow',
  'url': 'https://www.python.org/jobs/7705/',
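
For context, `pp()` from the standard library's `pprint` module prints data structures as Python literals, which is why the sample output above reads as Python. A minimal standalone check, not taken from the course files:

```py
from pprint import pp

# pp() pretty-prints the dict as a Python literal, wrapping long lines,
# which matches the sample output shown in the diff above.
pp({'title': 'Senior Full Stack Developer',
    'company': 'Baserow',
    'url': 'https://www.python.org/jobs/7705/'})
```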

sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 98 additions & 1 deletion
@@ -419,7 +419,8 @@ Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Acade

 If you export the dataset as a JSON, you should see something like this:

-```text
+<!-- eslint-skip -->
+```json
 [
   {
     "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus",
@@ -493,3 +494,99 @@ Hints:

### Use Crawlee to find the ratings of the most popular Netflix films

The [Global Top 10](https://www.netflix.com/tudum/top10) page contains a table of the currently most popular Netflix films worldwide. Scrape the movie names, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and find out the film's rating. Each item you push to Crawlee's default dataset should contain the following data:

- URL of the film's imdb.com page
- Title
- Rating

If you export the dataset as a JSON, you should see something like this:

<!-- eslint-skip -->
```json
[
  {
    "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1",
    "title": "The Merry Gentlemen",
    "rating": "5.0/10"
  },
  {
    "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1",
    "title": "Hot Frosty",
    "rating": "5.4/10"
  },
  ...
]
```

For each name from the Global Top 10, you'll need to construct a `Request` object with an IMDb search URL. Take the following code snippet as a hint on how to do it:
526+
527+
```py
528+
...
529+
from urllib.parse import quote_plus
530+
531+
async def main():
532+
...
533+
534+
@crawler.router.default_handler
535+
async def handle_netflix_table(context):
536+
requests = []
537+
for name_cell in context.soup.select(...):
538+
name = name_cell.text.strip()
539+
imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
540+
requests.append(Request.from_url(imdb_search_url, label="..."))
541+
await context.add_requests(requests)
542+
543+
...
544+
...
545+
```
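
If you're unsure what `quote_plus()` does to a film name, here's a throwaway check (the title is just an example taken from the sample dataset above):

```py
from urllib.parse import quote_plus

# Spaces become '+', so the name is safe to put into the search URL.
print(quote_plus("The Merry Gentlemen"))  # The+Merry+Gentlemen
```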

When following the first search result, you may find it handy to know that `context.enqueue_links()` takes a `limit` keyword argument, which lets you specify the maximum number of requests to enqueue.
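
As a rough sketch only (the selector and labels are placeholders, not the actual solution), a handler for the search results page could look like this:

```py
@crawler.router.handler("...")
async def handle_search(context):
    # Enqueue at most one link, i.e. only the first search result.
    await context.enqueue_links(selector="...", label="...", limit=1)
```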

<details>
<summary>Solution</summary>

```py
import asyncio
from urllib.parse import quote_plus

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_netflix_table(context):
        # Turn each film name from the Netflix table into an IMDb search request
        requests = []
        for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
            name = name_cell.text.strip()
            imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
            requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
        await context.add_requests(requests)

    @crawler.router.handler("IMDB_SEARCH")
    async def handle_imdb_search(context):
        # Follow only the first search result
        await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)

    @crawler.router.handler("IMDB")
    async def handle_imdb(context):
        # Scrape the rating and title from the film's IMDb page
        rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
        rating_text = context.soup.select_one(rating_selector).text.strip()
        await context.push_data({
            "url": context.request.url,
            "title": context.soup.select_one("h1").text.strip(),
            "rating": rating_text,
        })

    await crawler.run(["https://www.netflix.com/tudum/top10"])
    await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)

if __name__ == '__main__':
    asyncio.run(main())
```
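
To sanity-check the exported file, a small follow-up script like this could print what was collected (not part of the solution; it only assumes the `dataset.json` created above):

```py
import json

# Reads the dataset.json exported by the crawler above.
with open("dataset.json") as f:
    for item in json.load(f):
        print(f"{item['title']}: {item['rating']} ({item['url']})")
```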

</details>
