Commit f88abc8

feat: exercises
1 parent e0ecf7c commit f88abc8

3 files changed: 112 additions & 6 deletions

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 3 additions & 4 deletions
@@ -142,10 +142,10 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

### Scrape Amazon

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real-world e-commerce website. For example, this page with AliExpress search results:

```text
-https://www.amazon.com/s?k=darth+vader
+https://www.aliexpress.com/w/wholesale-darth-vader.html
```

<details>

@@ -154,13 +154,12 @@ https://www.amazon.com/s?k=darth+vader
```py
import httpx

-url = "https://www.amazon.com/s?k=darth+vader"
+url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
response = httpx.get(url)
response.raise_for_status()
print(response.text)
```

-If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn about how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.
</details>

### Save downloaded HTML as a file
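
On sites with anti-scraping protections, `raise_for_status()` raises `httpx.HTTPStatusError` for responses like `503 Service Unavailable`. A minimal sketch of reporting such errors instead of crashing, reusing the exercise's URL:

```py
import httpx

url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
try:
    response = httpx.get(url)
    response.raise_for_status()
except httpx.HTTPStatusError as error:
    # e.g. a 503 caused by anti-scraping protections
    print(f"Server returned {error.response.status_code} for {url}")
else:
    print(response.text)
```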

sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 1 addition & 1 deletion
@@ -256,11 +256,11 @@ https://www.theguardian.com/sport/formulaone
Your program should print something like the following:

```text
+Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics
Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire?
Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1
Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull
PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying
-Giles Richards: Red Bull must solve Verstappen’s ‘monster’ riddle or Norris will pounce
...
```

sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md

Lines changed: 108 additions & 1 deletion
@@ -307,5 +307,112 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu

<Exercises />

-TODO
+### Build a scraper for watching Python jobs

You're able to build a scraper now, aren't you? Let's build another one, then! Python's official website features a [job board](https://www.python.org/jobs/). Scrape job postings which match the following criteria:

- Tagged as Database
- Not older than 60 days

For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:

- Job title
- Company
- URL to the job posting
- Date of posting

Your program should print something like the following:

```text
{'title': 'Senior Full Stack Developer',
 'company': 'Baserow',
 'url': 'https://www.python.org/jobs/7705/',
 'posted_on': datetime.date(2024, 9, 16)}
{'title': 'Senior Python Engineer',
 'company': 'Active Prime',
 'url': 'https://www.python.org/jobs/7699/',
 'posted_on': datetime.date(2024, 9, 5)}
...
```

In Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module you'll find everything you need for working with dates and times: `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, `timedelta()`. The sketch below shows how they fit together.
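
For instance, a minimal sketch of the 60-day check, using a made-up ISO timestamp in place of a value scraped from the page:

```py
from datetime import date, datetime, timedelta

# hypothetical timestamp standing in for one scraped from a job posting
posted_at = datetime.fromisoformat("2024-09-16T14:30:00+00:00")
posted_on = posted_at.date()

posted_ago = date.today() - posted_on
print(posted_ago <= timedelta(days=60))  # True if posted within the last 60 days
```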
<details>
<summary>Solution</summary>

After inspecting how the job board works, we notice that job postings tagged as Database have their own URL. We'll use it as the starting point, which saves us from scraping and checking the tags ourselves.

```py
from pprint import pp
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime, date, timedelta

today = date.today()
jobs_url = "https://www.python.org/jobs/type/database/"
response = httpx.get(jobs_url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for job in soup.select(".list-recent-jobs li"):
    link = job.select_one(".listing-company-name a")

    time = job.select_one(".listing-posted time")
    posted_at = datetime.fromisoformat(time["datetime"])
    posted_on = posted_at.date()
    posted_ago = today - posted_on

    if posted_ago <= timedelta(days=60):
        title = link.text.strip()
        # the company name is the last text node inside the element
        company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
        url = urljoin(jobs_url, link["href"])
        pp({"title": title, "company": company, "url": url, "posted_on": posted_on})
```

</details>

### Find the shortest CNN article which made it to the Sports homepage

Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters:

- Locate the element which holds the main content of the article.
- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to get all its content as plain text.
- Use `len()` to calculate the length (see the sketch below).
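
As a hint, measuring text length could look like this minimal sketch, using a made-up HTML snippet (the class name matches the one the solution below uses):

```py
from bs4 import BeautifulSoup

# made-up HTML standing in for a downloaded article page
html = "<div class='article__content'><p>Short story.</p><p>The end.</p></div>"
soup = BeautifulSoup(html, "html.parser")

content = soup.select_one(".article__content")  # element holding the main content
print(len(content.get_text()))  # length of the plain text, in characters
```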

Skip pages without text, e.g. those which contain only a video. Sort the results and print the URL of the shortest article which made it to the homepage.

At the time of writing this exercise, the shortest article which made it to the CNN Sports homepage is [one about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/). It's just 1,642 characters long.
<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

listing_url = "https://edition.cnn.com/sport"
listing_soup = download(listing_url)

data = []
for card in listing_soup.select(".layout__main .card"):
    link = card.select_one(".container__link")
    article_url = urljoin(listing_url, link["href"])
    article_soup = download(article_url)
    # skip pages without article content, e.g. video-only pages
    if content := article_soup.select_one(".article__content"):
        length = len(content.get_text())
        data.append((length, article_url))

data.sort()
shortest_item = data[0]
item_url = shortest_item[1]
print(item_url)
```

</details>
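
Because `data` holds `(length, url)` tuples, `data.sort()` compares lengths first, so `data[0]` is the shortest article; `min(data)` would give the same result without sorting the whole list.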
