
Commit e0ecf7c

feat: add exercises
1 parent 0749d5f commit e0ecf7c

File tree

1 file changed (+113 −1 lines)


sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 113 additions & 1 deletion
@@ -190,4 +190,116 @@ In the next lesson, we'll scrape the product detail pages so that each product v
<Exercises />

### Scrape calling codes of African countries

This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Wikipedia pages of all African states and territories. Follow the links and, for each country, extract the calling code from the info table. Print the URL and the calling code for each country. Start with this URL:
```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```
Your program should print the following:

```text
https://en.wikipedia.org/wiki/Algeria +213
https://en.wikipedia.org/wiki/Angola +244
https://en.wikipedia.org/wiki/Benin +229
https://en.wikipedia.org/wiki/Botswana +267
https://en.wikipedia.org/wiki/Burkina_Faso +226
https://en.wikipedia.org/wiki/Burundi None
https://en.wikipedia.org/wiki/Cameroon +237
...
```

Hint: Locating cells in tables is sometimes easier if you know how to [go up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup.
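As a quick sketch of what "going up" looks like, consider this simplified stand-in for an infobox row (made-up markup for illustration, not Wikipedia's actual HTML):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for an infobox row; real Wikipedia markup is more complex
html = """
<table>
  <tr><th class="infobox-label">Calling code</th><td class="infobox-data">+213</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
label = soup.select_one("th.infobox-label")
row = label.parent  # go up one level, from the <th> to its <tr>
print(row.select_one("td.infobox-data").text)  # +213
table = label.find_parent("table")  # or jump straight to a named ancestor
print(table.name)  # table
```

Once you're at the row, the data cell sits right next to the label cell, so going up first makes it easy to select.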
<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_calling_code(soup):
    # Find the info table row labeled "Calling code" and read its data cell
    for label in soup.select("th.infobox-label"):
        if label.text.strip() == "Calling code":
            data = label.parent.select_one("td.infobox-data")
            return data.text.strip()
    return None

listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
listing_soup = download(listing_url)
# The country name with its link sits in the third column of the listing table
for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
    link = name_cell.select_one("a")
    country_url = urljoin(listing_url, link["href"])
    country_soup = download(country_url)
    calling_code = parse_calling_code(country_soup)
    print(country_url, calling_code)
```

</details>
### Scrape authors of F1 news articles

This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news. Follow the link to each article and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL:
```text
https://www.theguardian.com/sport/formulaone
```
Your program should print something like the following:

```text
Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire?
Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1
Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull
PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying
Giles Richards: Red Bull must solve Verstappen’s ‘monster’ riddle or Norris will pounce
...
```

Hints:

- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on the values of their attributes.
- Notice that some articles are written by an individual, while others are contributed by a news agency.
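To illustrate attribute selectors, here is a sketch on made-up HTML (not the Guardian's actual markup, which you'll need to inspect yourself):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; inspect the real page to find its structure
html = """
<aside>
  <a rel="author" href="/profile/jane">Jane Doe</a>
  <a href="/sport">Sport</a>
</aside>
"""
soup = BeautifulSoup(html, "html.parser")
# [rel="author"] matches only elements whose rel attribute equals "author"
author_link = soup.select_one('a[rel="author"]')
print(author_link.text)  # Jane Doe
# Attribute selectors also support prefix (^=), suffix ($=), and substring (*=) matches
print([a["href"] for a in soup.select('a[href^="/profile"]')])  # ['/profile/jane']
```

This lets you target a link by its semantics (being marked as an author) rather than by its position in the page.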
<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download(url):
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_author(article_soup):
    # Individual authors are linked with rel="author"
    link = article_soup.select_one('aside a[rel="author"]')
    if link:
        return link.text.strip()
    # News agency contributions appear as a plain <address> element instead
    address = article_soup.select_one("aside address")
    if address:
        return address.text.strip()
    return None

listing_url = "https://www.theguardian.com/sport/formulaone"
listing_soup = download(listing_url)
for item in listing_soup.select("#maincontent ul li"):
    link = item.select_one("a")
    article_url = urljoin(listing_url, link["href"])
    article_soup = download(article_url)
    title = article_soup.select_one("h1").text.strip()
    author = parse_author(article_soup)
    print(f"{author}: {title}")
```

</details>
