
Commit 34cdfa9

feat: add exercises
1 parent a9caa86 commit 34cdfa9

3 files changed: 248 additions & 2 deletions

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 1 addition & 0 deletions
@@ -197,6 +197,7 @@ https://www.amazon.com/s?k=darth+vader
    sys.exit(1)
```

If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.

</details>

### Save downloaded HTML as a file

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 112 additions & 1 deletion
@@ -22,6 +22,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    print(product.text)
```

@@ -72,6 +73,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    titles = product.select(".product-item__title")
    first_title = titles[0].text

@@ -113,6 +115,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").text

@@ -156,6 +159,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").contents[-1]
@@ -181,4 +185,111 @@ Great! We have managed to use CSS selectors and walk the HTML tree to get a list

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape Wikipedia

Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print the short English names of all the states and territories mentioned in all the tables. This is the URL:

```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```

Your program should print the following:

```text
Algeria
Angola
Benin
Botswana
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for table in soup.select(".wikitable"):
    for row in table.select("tr"):
        cells = row.select("td")
        if cells:
            third_column = cells[2]
            title_link = third_column.select_one("a")
            print(title_link.text)
```

Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
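
If it helps to see why the `if cells:` check skips header rows, here is a minimal standalone sketch (the markup is made up for illustration):

```py
from bs4 import BeautifulSoup

# Header rows use <th> cells, data rows use <td> cells (made-up markup)
html = "<table><tr><th>Name</th></tr><tr><td>Algeria</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

for row in soup.select("tr"):
    # select("td") returns [] for the header row, so `if cells:` filters it out
    print(row.select("td"))
```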

</details>

### Use CSS selectors to their max

Simplify the code from the previous exercise. Use a single `for` loop and a single CSS selector. You may want to check out the following pages:

- [Descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator)
- [`:nth-child()` pseudo-class](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child)

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
    print(name_cell.select_one("a").text)
```

</details>

### Scrape F1 news

Download the Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print the titles of all the listed articles. This is the URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following:

```text
Wolff confident Mercedes are heading to front of grid after Canada improvement
Frustrated Lando Norris blames McLaren team for missed chance
Max Verstappen wins Canadian Grand Prix: F1 – as it happened
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for title in soup.select("#maincontent ul li h3"):
    print(title.text)
```

</details>

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 135 additions & 1 deletion
@@ -71,6 +71,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text

@@ -171,6 +172,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()
@@ -211,4 +213,136 @@ Well, not to spoil the excitement, but in its current form, the data isn't very

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape units in stock

Change our scraper so that it extracts how many units of each product are in stock. Your program should print the following. Note the unit amounts at the end of each line:

```text
JBL Flip 4 Waterproof Portable Bluetooth Speaker 672
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV 77
Sony SACS9 10" Active Subwoofer 7
Sony PS-HX500 Hi-Res USB Turntable 15
Klipsch R-120SW Powerful Detailed Home Speaker - Unit 0
Denon AH-C720 In-Ear Headphones 236
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    units_text = (
        product
        .select_one(".product-item__inventory")
        .text
        .removeprefix("In stock,")
        .removeprefix("Only")
        .removesuffix(" left")
        .removesuffix("units")
        .strip()
    )
    if "Sold out" in units_text:
        units = 0
    else:
        units = int(units_text)

    print(title, units)
```

</details>

### Use regular expressions

Simplify the code from the previous exercise. Use [regular expressions](https://docs.python.org/3/library/re.html) to parse the number of units. You can match digits with a character range like `[0-9]` or with the special sequence `\d`. To match one or more characters of the same kind, append `+`.
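
If you want a feel for the pattern before wiring it into the scraper, here is a quick sanity check on sample strings (the inventory texts are hypothetical):

```py
import re

# `\d+` matches a run of one or more digits anywhere in the string
print(re.search(r"\d+", "In stock, 672 units").group())  # 672
print(re.search(r"\d+", "Only 7 left").group())          # 7
print(re.search(r"\d+", "Sold out"))                     # None, no digits to match
```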

<details>
<summary>Solution</summary>

```py
import re
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    units_text = product.select_one(".product-item__inventory").text
    if re_match := re.search(r"\d+", units_text):
        units = int(re_match.group())
    else:
        units = 0

    print(title, units)
```

</details>

### Scrape publish dates of F1 news

Download the Guardian's page with the latest F1 news and use Beautiful Soup to parse it. Print the titles and publish dates of all the listed articles. This is the URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following. Note the dates at the end of each line:

```text
Wolff confident Mercedes are heading to front of grid after Canada improvement 2024-06-10
Frustrated Lando Norris blames McLaren team for missed chance 2024-06-09
Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09
...
```

Hints:

- HTML's `<time>` tag can have a `datetime` attribute, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601.
- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
- In Python, you can create `datetime` objects with `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
- To get just the date part, you can call `.date()` on any `datetime` object, as the sketch after this list shows.
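
A minimal sketch of the last two hints, assuming a hypothetical ISO 8601 timestamp like the ones a `datetime` attribute may contain:

```py
from datetime import datetime

# Hypothetical value of a <time> tag's datetime attribute
time_iso = "2024-06-10T14:05:00+00:00"

published_at = datetime.fromisoformat(time_iso)
print(published_at.date())  # 2024-06-10
```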

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from datetime import datetime

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for article in soup.select("#maincontent ul li"):
    title = article.select_one("h3").text.strip()

    time_iso = article.select_one("time")["datetime"].strip()
    published_at = datetime.fromisoformat(time_iso)
    published_on = published_at.date()

    print(title, published_on)
```

</details>
