Commit 7903a0a

committed
fix: improve section about min/max/exact price
Addressing #1023 (comment)
1 parent 5bd2e90 commit 7903a0a

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 15 additions & 17 deletions
@@ -27,31 +27,29 @@ Let's summarize what stands in our way if we want to have it in our Python progr
 - the number contains decimal commas for better human readability,
 - and some prices start with `From`, which reveals there is a certain complexity in how the shop deals with prices.
 
-## Representing price as an interval
+## Representing price
 
-The last bullet point is the most important to figure out before we start coding. We thought we'll be scraping numbers, but in the middle of our effort, we discovered that the price is actually an [interval](https://en.wikipedia.org/wiki/Interval_(mathematics)).
+The last bullet point is the most important to figure out before we start coding. We thought we'll be scraping numbers, but in the middle of our effort, we discovered that the price is actually a range.
 
-In such a situation, we'd normally go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the interval in the data for them?
+It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing.
 
-Maybe they'd tell us that we can just remove the `From` prefix and it's fine!
+Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix!
 
 ```py
 price_text = product.select_one(".price").contents[-1]
 price = price_text.removeprefix("From ")
 ```
 
-In other cases, they'd tell us the data must include this information. And in cases when we just don't know, the safest option is to include in the data everything we have from the input and leave the decision on what's important to later stages.
-
-We can represent interval in various ways. One approach could be having min price and max price. For regular prices, these two numbers would be the same. For prices starting with `From`, the max price would be none, because we don't know it:
+In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:
 
 ```py
 price_text = product.select_one(".price").contents[-1]
 if price_text.startswith("From "):
     min_price = price_text.removeprefix("From ")
-    max_price = None
+    price = None
 else:
     min_price = price_text
-    max_price = min_price
+    price = min_price
 ```
 
 We're using Python's built-in string methods:
@@ -78,12 +76,12 @@ for product in soup.select(".product-item"):
     price_text = product.select_one(".price").contents[-1]
     if price_text.startswith("From "):
         min_price = price_text.removeprefix("From ")
-        max_price = None
+        price = None
     else:
         min_price = price_text
-        max_price = min_price
+        price = min_price
 
-    print(title, min_price, max_price)
+    print(title, min_price, price)
 ```
 
 ## Removing white space
@@ -146,10 +144,10 @@ Now we should be able to add `float()`, so that we have the prices not as a text
 ```py
 if price_text.startswith("From "):
     min_price = float(price_text.removeprefix("From "))
-    max_price = None
+    price = None
 else:
     min_price = float(price_text)
-    max_price = min_price
+    price = min_price
 ```
 
 Great! Only if we didn't overlook an important pitfall called [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). In short, computers save `float()` numbers in a way which isn't always reliable:
@@ -186,12 +184,12 @@ for product in soup.select(".product-item"):
     )
     if price_text.startswith("From "):
         min_price = Decimal(price_text.removeprefix("From "))
-        max_price = None
+        price = None
     else:
         min_price = Decimal(price_text)
-        max_price = min_price
+        price = min_price
 
-    print(title, min_price, price)
+    print(title, min_price, price)
 ```
 
 If we run the code above, we have nice, clean data about all the products!
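
For reference, the representation this commit settles on (a minimum price plus an exact price that stays empty when unknown) can be sketched as a standalone function. This is a minimal sketch rather than the course's code: `parse_price`, `ParsedPrice`, and `_to_decimal` are hypothetical names, and the `$` and thousands-separator cleanup is an assumption about the shop's price format, since that step is not visible in the hunks above. It needs Python 3.9+ for `str.removeprefix()`.

```python
from decimal import Decimal
from typing import NamedTuple, Optional


class ParsedPrice(NamedTuple):
    """Hypothetical container mirroring the min_price/price pair in the diff."""

    min_price: Decimal
    price: Optional[Decimal]  # None when only a "From ..." minimum is known


def _to_decimal(text: str) -> Decimal:
    # Assumption: prices look like "$1,398.00", so we drop the currency
    # sign and the thousands separators before converting.
    return Decimal(text.replace("$", "").replace(",", ""))


def parse_price(price_text: str) -> ParsedPrice:
    text = price_text.strip()
    if text.startswith("From "):
        # Only the lower bound is known; the exact price stays empty.
        return ParsedPrice(_to_decimal(text.removeprefix("From ")), None)
    # A regular price: the minimum and the exact price are the same value.
    exact = _to_decimal(text)
    return ParsedPrice(exact, exact)


if __name__ == "__main__":
    print(parse_price("From $1,398.00"))
    print(parse_price("$74.95"))
```

Keeping both values lets later stages decide whether a minimum-only price is good enough, which is the trade-off the revised section describes.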
