
Commit 1bd3d31

feat: cut on explanations
1 parent 4760e4f commit 1bd3d31

4 files changed, +43 -40 lines changed


sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 3 additions & 10 deletions
@@ -94,30 +94,23 @@ HTTP is a network protocol powering the internet. Understanding it well is an im
 ## Handling errors

-Sometimes websites return all kinds of errors, most often because:
-
-- The server is temporarily down.
-- The server breaks under a heavy load of requests.
-- The server applies anti-scraping protections.
-- The server application is buggy and just couldn't handle our request.
-
-In HTTP, each response has a three-digit _status code_, which tells us whether it's an error or a success. A robust scraper skips or retries requests on errors. It's a big task though, and it's best to use libraries or frameworks for that.
+Websites can return various errors, such as when the server is temporarily down, applies anti-scraping protections, or is simply buggy. In HTTP, each response has a three-digit _status code_ that indicates whether it is an error or a success.

 :::tip All status codes

 If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.

 :::

-For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.
+A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.

 First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:

 ```text
 https://warehouse-theme-metal.myshopify.com/does/not/exist
 ```

-We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX also provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:
+We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful:

 ```py
 import httpx
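The diff truncates the code block after its first line. For illustration, a minimal sketch of how `raise_for_status()` slots in (the `try/except` wrapper and the printed message are our own, not necessarily the commit's exact code):

```py
import httpx

url = "https://warehouse-theme-metal.myshopify.com/does/not/exist"

try:
    response = httpx.get(url)
    # Raises httpx.HTTPStatusError (a subclass of httpx.HTTPError)
    # for any 4xx or 5xx response, such as our 404
    response.raise_for_status()
except httpx.HTTPError as error:
    print(f"Request failed: {error}")
    raise
```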

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 2 additions & 2 deletions
@@ -20,9 +20,9 @@ As a first step, let's try counting how many products are on the listing page.
 ## Processing HTML

-After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. But if it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?
+After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products?

-While somewhat possible, such approach is tedious, fragile, and unreliable. To work with HTML we need a robust tool dedicated for the task. An _HTML parser_ takes a text with HTML markup and turns it into a tree of Python objects.
+While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects.

 :::info Why regex can't parse HTML
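As a concrete companion to the parser paragraph above, a minimal sketch of what parsing with Beautiful Soup looks like (our illustration; the listing URL is assumed from the course's Warehouse store, and the `.product-item` selector appears later in this diff):

```py
import httpx
from bs4 import BeautifulSoup

# Listing page of the course's sample store (assumed URL)
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

# The parser turns the HTML string into a tree of Python objects
soup = BeautifulSoup(response.text, "html.parser")

# Counting products becomes a tree query instead of string matching
print(len(soup.select(".product-item")))
```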

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 18 additions & 7 deletions
@@ -94,18 +94,14 @@ JBL Flip 4 Waterproof Portable Bluetooth Speaker
 Sale price$74.95
 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV
 Sale priceFrom $1,398.00
-Sony SACS9 10" Active Subwoofer
-Sale price$158.00
-Sony PS-HX500 Hi-Res USB Turntable
-Sale price$398.00
 ...
 ```

 There's still some room for improvement, but it's already much better!

 ## Locating a single element

-Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers a `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or none. Let's simplify our code!
+Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code!

 ```py
 import httpx
@@ -174,8 +170,23 @@ If we run the scraper now, it should print prices as only amounts:
 $ python main.py
 JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
-Sony SACS9 10" Active Subwoofer $158.00
-Sony PS-HX500 Hi-Res USB Turntable $398.00
+...
+```
+
+## Formatting output
+
+The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function:
+
+```py
+print(title, price, sep=" | ")
+```
+
+The output is much nicer this way:
+
+```text
+$ python main.py
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
 ...
 ```
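The code block shown in the hunk above is truncated after `import httpx`. A sketch of the simplified loop the lesson arrives at (our reconstruction from the fragments in this diff; the URL is assumed, and the selectors come from the surrounding hunks):

```py
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"  # assumed
response = httpx.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for product in soup.select(".product-item"):
    # .select_one() returns the first matching element or None
    title = product.select_one(".product-item__title").text
    # The last child of the price element is the amount without the label
    price = product.select_one(".price").contents[-1]
    print(title, price, sep=" | ")
```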

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 20 additions & 21 deletions
@@ -16,10 +16,8 @@ Locating the right HTML elements is the first step of a successful data extracti
 ```text
 $ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00
-Sony SACS9 10" Active Subwoofer $158.00
-Sony PS-HX500 Hi-Res USB Turntable $398.00
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
 ...
 ```

@@ -35,7 +33,7 @@ The last bullet point is the most important to figure out before we start coding
 It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing.

-Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix!
+Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?

 ```py
 price_text = product.select_one(".price").contents[-1]
@@ -54,12 +52,13 @@ else:
 price = min_price
 ```

-We're using Python's built-in string methods:
+:::tip Built-in string methods

-- `.startswith()`, a method for [checking the beginning of a string](https://docs.python.org/3/library/stdtypes.html#str.startswith).
-- `.removeprefix()`, a method for [removing something from the beginning of a string](https://docs.python.org/3/library/stdtypes.html#str.removeprefix).
+If you're not proficient in Python's string methods, [`.startswith()`](https://docs.python.org/3/library/stdtypes.html#str.startswith) checks the beginning of a given string, and [`.removeprefix()`](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) removes something from the beginning of a given string.

-The whole program now looks like this:
+:::
+
+The whole program would look like this:

 ```py
 import httpx
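Since the full program is cut off after `import httpx`, here is a standalone illustration of the two string methods from the tip (the sample value is ours):

```py
price_text = "From $1,398.00"  # hypothetical value scraped from a listing

if price_text.startswith("From "):
    # Keep the amount as a minimum price; the exact price is unknown
    min_price = price_text.removeprefix("From ")
    price = None
else:
    min_price = price_text
    price = min_price

print(min_price, price, sep=" | ")  # From $1,398.00 -> $1,398.00 | None
```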
@@ -83,31 +82,33 @@ for product in soup.select(".product-item"):
 min_price = price_text
 price = min_price

-print(title, min_price, price)
+print(title, min_price, price, sep=" | ")
 ```

 ## Removing white space

 Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.

-We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it.
-
-In Python, we have `.strip()`, a built-in string method for [removing whitespace from both the beginning and the end](https://docs.python.org/3/library/stdtypes.html#str.strip). Let's add it to our code:
+We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [`.strip()`](https://docs.python.org/3/library/stdtypes.html#str.strip):

 ```py
 title = product.select_one(".product-item__title").text.strip()

 price_text = product.select_one(".price").contents[-1].strip()
 ```

-While we're at it, let's see what Beautiful Soup offers when it comes to working with strings:
+:::info Handling strings in Beautiful Soup
+
+Beautiful Soup offers several attributes when it comes to working with strings:

 - `.string`, which often is like `.text`,
 - `.strings`, which [returns a list of all nested textual nodes](https://beautiful-soup-4.readthedocs.io/en/latest/#strings-and-stripped-strings),
 - `.stripped_strings`, which does the same but with whitespace removed.

 These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant.

+:::
+
 ## Removing dollar sign and commas

 We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our Python program:
@@ -126,7 +127,7 @@ The demonstration above is inside the Python's [interactive REPL](https://realpy
 :::

-We need to remove the dollar sign and the decimal commas. For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but our case is so simple that we can just throw in `.replace()`, Python's built-in string method for [replacing substrings](https://docs.python.org/3/library/stdtypes.html#str.replace):
+We need to remove the dollar sign and the decimal commas. For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but in this case [`.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) is also sufficient:

 ```py
 price_text = (
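The code block is truncated after its first line. A standalone illustration of the cleaning step (sample value is ours):

```py
price_text = "$1,398.00"

# Drop the dollar sign and the commas so the string parses as a number
cleaned = price_text.replace("$", "").replace(",", "")
print(cleaned)  # 1398.00
```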
@@ -159,7 +160,7 @@ Great! Only if we didn't overlook an important pitfall called [floating-point er
 0.30000000000000004
 ```

-These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid `float()` when working with money. Let's instead use Python's built-in `Decimal()`, a [type designed to represent decimal numbers exactly](https://docs.python.org/3/library/decimal.html):
+These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid `float()` when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type:

 ```py
 from decimal import Decimal
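The example is again truncated. A minimal standalone sketch of the pitfall and the fix:

```py
from decimal import Decimal

# Binary floats can't represent 0.1 or 0.2 exactly
print(0.1 + 0.2)  # 0.30000000000000004

# Decimal keeps decimal digits exact, which is what we want for money
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
print(Decimal("1398.00"))  # 1398.00
```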
@@ -191,17 +192,15 @@ for product in soup.select(".product-item"):
 min_price = Decimal(price_text)
 price = min_price

-print(title, min_price, price)
+print(title, min_price, price, sep=" | ")
 ```

 If we run the code above, we have nice, clean data about all the products!

 ```text
 $ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker 74.95 74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV 1398.00 None
-Sony SACS9 10" Active Subwoofer 158.00 158.00
-Sony PS-HX500 Hi-Res USB Turntable 398.00 398.00
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
 ...
 ```
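Pieced together from the fragments across the hunks above, the resulting program would look roughly like this (our reconstruction, with an assumed listing URL; the commit's actual file may differ in details):

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal

# Listing page of the course's sample store (assumed URL)
url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    price_text = (
        product.select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        # Products with variants list only a minimum price
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    print(title, min_price, price, sep=" | ")
```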
