
Commit ffb2e20

feat: add exercises
1 parent 3d315aa commit ffb2e20

1 file changed: +83 -6 lines changed


sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 83 additions & 6 deletions
@@ -302,14 +302,91 @@ Ta-da! We've managed to get links leading to the product pages. In the next less

<Exercises />

### Scrape links to countries in Africa

Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print links to the Wikipedia pages of all the states and territories mentioned in all the tables. Start with this URL:

```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```

Your program should print the following:

```text
https://en.wikipedia.org/wiki/Algeria
https://en.wikipedia.org/wiki/Angola
https://en.wikipedia.org/wiki/Benin
https://en.wikipedia.org/wiki/Botswana
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(listing_url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

# the third column of each wikitable holds the name of a state or territory as a link
for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
    link = name_cell.select_one("a")
    url = urljoin(listing_url, link["href"])
    print(url)
```
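
This solution assumes that every cell in the third column contains a link. If `select_one()` found no `a` element, `link` would be `None` and `link["href"]` would raise an error. A slightly more defensive variant of the loop, sketched here as our own addition rather than part of the exercise, could skip such cells:

```py
# reuses soup and listing_url from the solution above
for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
    link = name_cell.select_one("a")
    if link is None:
        # a cell without a link, e.g. one holding plain text
        continue
    print(urljoin(listing_url, link["href"]))
```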

</details>

### Scrape links to F1 news

Download the Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print links to all the listed articles. Start with this URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following:

```text
https://www.theguardian.com/world/2024/sep/13/africa-f1-formula-one-fans-lewis-hamilton-grand-prix
https://www.theguardian.com/sport/2024/sep/12/mclaren-lando-norris-oscar-piastri-team-orders-f1-title-race-max-verstappen
https://www.theguardian.com/sport/article/2024/sep/10/f1-designer-adrian-newey-signs-aston-martin-deal-after-quitting-red-bull
https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-undriveable-monster-how-bad-really-is-it-and-why
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

listing_url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(listing_url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for item in soup.select("#maincontent ul li"):
    # take just the first link of each card; some cards have a second link to comments
    link = item.select_one("a")
    url = urljoin(listing_url, link["href"])
    print(url)
```

Note that some cards contain two links: one leads to the article and one to the comments. If we selected all the links in the list with `#maincontent ul li a`, we would get incorrect output like this:

```text
https://www.theguardian.com/sport/article/2024/sep/02/example
https://www.theguardian.com/sport/article/2024/sep/02/example#comments
```
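
If we still wanted to iterate over the links themselves, one possible workaround is to strip the fragment from each URL and skip what we have already printed. This is a sketch of our own, reusing `soup` and `listing_url` from the solution above, not part of the exercise:

```py
from urllib.parse import urldefrag

seen = set()
for link in soup.select("#maincontent ul li a"):
    # urldefrag() splits off the trailing #comments fragment, if any
    article_url, _ = urldefrag(urljoin(listing_url, link["href"]))
    if article_url not in seen:
        seen.add(article_url)
        print(article_url)
```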

</details>
