
Commit 34cdfa9

feat: add exercises
1 parent a9caa86 commit 34cdfa9

3 files changed: 248 additions & 2 deletions

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 1 addition & 0 deletions
@@ -197,6 +197,7 @@ https://www.amazon.com/s?k=darth+vader
    sys.exit(1)
```

If you get `Server error '503 Service Unavailable'`, that's just Amazon's anti-scraping protections. You can learn how to overcome those in our [Anti-scraping protections](../anti_scraping/index.md) course.

</details>

### Save downloaded HTML as a file

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 112 additions & 1 deletion
@@ -22,6 +22,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    print(product.text)
```

@@ -72,6 +73,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    titles = product.select(".product-item__title")
    first_title = titles[0].text

@@ -113,6 +115,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").text

@@ -156,6 +159,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text
    price = product.select_one(".price").contents[-1]
@@ -181,4 +185,111 @@ Great! We have managed to use CSS selectors and walk the HTML tree to get a list

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape Wikipedia

Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print the short English names of all the states and territories mentioned in all the tables. This is the URL:

```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```

Your program should print the following:

```text
Algeria
Angola
Benin
Botswana
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for table in soup.select(".wikitable"):
    for row in table.select("tr"):
        cells = row.select("td")
        if cells:
            third_column = cells[2]
            title_link = third_column.select_one("a")
            print(title_link.text)
```

Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
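
If it helps to see why the `if cells:` check skips header rows, here is a minimal standalone sketch (the markup is made up for illustration):

```py
from bs4 import BeautifulSoup

# Header rows use <th> cells, data rows use <td> cells (made-up markup)
html = "<table><tr><th>Name</th></tr><tr><td>Algeria</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

for row in soup.select("tr"):
    # select("td") returns [] for the header row, so `if cells:` filters it out
    print(row.select("td"))
```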

</details>

### Use CSS selectors to their max

Simplify the code from the previous exercise. Use a single `for` loop and a single CSS selector. You may want to check out the following pages:

- [Descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator)
- [`:nth-child()` pseudo-class](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child)

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
    print(name_cell.select_one("a").text)
```

</details>

### Scrape F1 news

Download the Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print the titles of all the listed articles. This is the URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following:

```text
Wolff confident Mercedes are heading to front of grid after Canada improvement
Frustrated Lando Norris blames McLaren team for missed chance
Max Verstappen wins Canadian Grand Prix: F1 – as it happened
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for title in soup.select("#maincontent ul li h3"):
    print(title.text)
```

</details>

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 135 additions & 1 deletion
@@ -71,6 +71,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text

@@ -171,6 +172,7 @@ response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()
@@ -211,4 +213,136 @@ Well, not to spoil the excitement, but in its current form, the data isn't very

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

### Scrape units in stock

Change our scraper so that it extracts how many units of each product are in stock. Your program should print the following. Note the unit amounts at the end of each line:

```text
JBL Flip 4 Waterproof Portable Bluetooth Speaker 672
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV 77
Sony SACS9 10" Active Subwoofer 7
Sony PS-HX500 Hi-Res USB Turntable 15
Klipsch R-120SW Powerful Detailed Home Speaker - Unit 0
Denon AH-C720 In-Ear Headphones 236
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    units_text = (
        product
        .select_one(".product-item__inventory")
        .text
        .removeprefix("In stock,")
        .removeprefix("Only")
        .removesuffix(" left")
        .removesuffix("units")
        .strip()
    )
    if "Sold out" in units_text:
        units = 0
    else:
        units = int(units_text)

    print(title, units)
```

</details>

### Use regular expressions

Simplify the code from the previous exercise. Use [regular expressions](https://docs.python.org/3/library/re.html) to parse the number of units. You can match digits with a character range like `[0-9]` or with the special sequence `\d`. To match one or more characters of the same kind, append `+`.
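
If you want a feel for the pattern before wiring it into the scraper, here is a quick sanity check on sample strings (the inventory texts are hypothetical):

```py
import re

# `\d+` matches a run of one or more digits anywhere in the string
print(re.search(r"\d+", "In stock, 672 units").group())  # 672
print(re.search(r"\d+", "Only 7 left").group())          # 7
print(re.search(r"\d+", "Sold out"))                     # None, no digits to match
```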

<details>
<summary>Solution</summary>

```py
import re
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    units_text = product.select_one(".product-item__inventory").text
    if re_match := re.search(r"\d+", units_text):
        units = int(re_match.group())
    else:
        units = 0

    print(title, units)
```

</details>

### Scrape publish dates of F1 news

Download the Guardian's page with the latest F1 news and use Beautiful Soup to parse it. Print the titles and publish dates of all the listed articles. This is the URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following. Note the dates at the end of each line:

```text
Wolff confident Mercedes are heading to front of grid after Canada improvement 2024-06-10
Frustrated Lando Norris blames McLaren team for missed chance 2024-06-09
Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09
...
```

Hints:

- HTML's `<time>` tag can have a `datetime` attribute, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601.
- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
- In Python, you can create `datetime` objects with `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
- To get just the date part, you can call `.date()` on any `datetime` object, as the sketch after this list shows.
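
A minimal sketch of the last two hints, assuming a hypothetical ISO 8601 timestamp like the ones a `datetime` attribute may contain:

```py
from datetime import datetime

# Hypothetical value of a <time> tag's datetime attribute
time_iso = "2024-06-10T14:05:00+00:00"

published_at = datetime.fromisoformat(time_iso)
print(published_at.date())  # 2024-06-10
```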

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from datetime import datetime

url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for article in soup.select("#maincontent ul li"):
    title = article.select_one("h3").text.strip()

    time_iso = article.select_one("time")["datetime"].strip()
    published_at = datetime.fromisoformat(time_iso)
    published_on = published_at.date()

    print(title, published_on)
```

</details>
