`sources/academy/webscraping/scraping_basics_python/04_downloading_html.md` (+3 −10)
## Handling errors
Websites can return various errors, such as when the server is temporarily down, is applying anti-scraping protections, or is simply buggy. In HTTP, each response has a three-digit _status code_ that indicates whether it is an error or a success.
:::tip All status codes
If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.
:::
A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:
We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
`sources/academy/webscraping/scraping_basics_python/05_parsing_html.md` (+2 −2)

As a first step, let's try counting how many products are on the listing page.
## Processing HTML
After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes text with HTML markup and turns it into a tree of Python objects.
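As a tiny sketch of what a parser gives us, here's Beautiful Soup counting elements in a made-up HTML fragment (the class name mirrors the one used on the real listing page):

```py
from bs4 import BeautifulSoup

# A made-up fragment standing in for the downloaded page.
html = """
<div class="product-item">JBL Flip 4</div>
<div class="product-item">Sony XBR-950G</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The parser turns the markup into a tree we can query with CSS selectors.
products = soup.select(".product-item")
print(len(products))  # → 2
```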
There's still some room for improvement, but it's already much better!
## Locating a single element
Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code!
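In isolation, `.select_one()` behaves like this (the snippet is made up for illustration):

```py
from bs4 import BeautifulSoup

html = '<h1 class="product-title">JBL Flip 4</h1>'
soup = BeautifulSoup(html, "html.parser")

# Returns the first matching element...
print(soup.select_one(".product-title").text)  # → JBL Flip 4

# ...or None when nothing matches, just like document.querySelector().
print(soup.select_one(".missing"))  # → None
```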
```py
import httpx
# ...
```
If we run the scraper now, it should print prices as only amounts:
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
...
```
## Formatting output
The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function:
```py
print(title, price, sep=" | ")
```
The output is much nicer this way:
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
```
`sources/academy/webscraping/scraping_basics_python/07_extracting_data.md` (+20 −21)

Locating the right HTML elements is the first step of a successful data extraction.
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
```
The last bullet point is the most important to figure out before we start coding.
It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing.
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?
If you're not proficient in Python's string methods, [.startswith()](https://docs.python.org/3/library/stdtypes.html#str.startswith) checks the beginning of a given string, and [.removeprefix()](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) removes something from the beginning of a given string.
:::
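A sketch of how these two methods apply to our prices (the sample value is assumed from the listing page):

```py
# Sample value as it appears on the listing page.
price_text = "From $1,398.00"

if price_text.startswith("From "):
    min_price = price_text.removeprefix("From ")
else:
    min_price = price_text

print(min_price)  # → $1,398.00
```

Note that `.removeprefix()` requires Python 3.9 or newer.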
The whole program would look like this:
```py
import httpx

# ...

for product in soup.select(".product-item"):
    # ...
    min_price = price_text
    price = min_price

    print(title, min_price, price, sep=" | ")
```
## Removing white space
Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [.strip()](https://docs.python.org/3/library/stdtypes.html#str.strip):
```py
title = product.select_one(".product-item__title").text.strip()
```
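For example, `.strip()` removes the surrounding whitespace on its own (the sample string below is assumed):

```py
# Whitespace like this typically comes from the HTML indentation.
raw_title = "\n      JBL Flip 4 Waterproof Portable Bluetooth Speaker\n    "
print(raw_title.strip())  # → JBL Flip 4 Waterproof Portable Bluetooth Speaker
```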
:::info Handling strings in Beautiful Soup
Beautiful Soup offers several attributes when it comes to working with strings:
- `.string`, which often is like `.text`,
- `.strings`, which [returns a list of all nested textual nodes](https://beautiful-soup-4.readthedocs.io/en/latest/#strings-and-stripped-strings),
- `.stripped_strings`, which does the same but with whitespace removed.

These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant.
These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant.
:::
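A quick illustration of the difference between these attributes (made-up snippet):

```py
from bs4 import BeautifulSoup

html = "<p>  Hello <b>world</b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every nested text node, whitespace included.
print(list(soup.strings))
# .stripped_strings strips whitespace and skips whitespace-only nodes.
print(list(soup.stripped_strings))  # → ['Hello', 'world']
```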
## Removing dollar sign and commas
We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our Python program:
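A sketch of the problem (the sample value is assumed from the listing):

```py
price_text = "$1,398.00"

try:
    price = float(price_text)
except ValueError as error:
    # The dollar sign and comma make the string unparseable as a number.
    print(error)  # → could not convert string to float: '$1,398.00'
```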
We need to remove the dollar sign and the commas separating thousands. For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but in this case [`.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) is also sufficient:
```py
price_text = (
    ...
)
```
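Concretely, with a sample value assumed from the listing:

```py
# Chain .replace() to drop the dollar sign, then the comma.
price_text = "$1,398.00".replace("$", "").replace(",", "")
print(price_text)  # → 1398.00
```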
Great! Only if we didn't overlook an important pitfall called _floating-point error_:

```py
>>> 0.1 + 0.2
0.30000000000000004
```
These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid `float()` when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type:
```py
from decimal import Decimal

# ...

for product in soup.select(".product-item"):
    # ...
    min_price = Decimal(price_text)
    price = min_price

    print(title, min_price, price, sep=" | ")
```
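A quick comparison of the two approaches:

```py
from decimal import Decimal

# Binary floats accumulate rounding error...
print(0.1 + 0.2)  # → 0.30000000000000004

# ...while Decimal represents decimal fractions exactly.
print(Decimal("0.1") + Decimal("0.2"))  # → 0.3
```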
196
197
197
198
If we run the code above, we have nice, clean data about all the products!
198
199
199
200
```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
...
```