sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
15 additions & 17 deletions
@@ -27,31 +27,29 @@ Let's summarize what stands in our way if we want to have it in our Python progr
27
27
- the number contains commas as thousands separators for better human readability,
28
28
- and some prices start with `From`, which reveals there is a certain complexity in how the shop deals with prices.
29
29
30
-
## Representing price as an interval
30
+
## Representing price
31
31
32
-
The last bullet point is the most important to figure out before we start coding. We thought we'll be scraping numbers, but in the middle of our effort, we discovered that the price is actually an [interval](https://en.wikipedia.org/wiki/Interval_(mathematics)).
32
+
The last bullet point is the most important to figure out before we start coding. We thought we'd be scraping numbers, but in the middle of our effort, we discovered that the price is actually a range.
33
33
34
-
In such a situation, we'd normally go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the interval in the data for them?
34
+
That's because some products have variants with different prices. Later in the course, we'll get to crawling, i.e. following links and scraping data from more than one page. That will allow us to get exact prices for all the products, but for now, let's extract just what's in the listing.
35
35
36
-
Maybe they'd tell us that we can just remove the `From` prefix and it's fine!
36
+
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix!
In other cases, they'd tell us the data must include this information. And in cases when we just don't know, the safest option is to include in the data everything we have from the input and leave the decision on what's important to later stages.
44
-
45
-
We can represent interval in various ways. One approach could be having min price and max price. For regular prices, these two numbers would be the same. For prices starting with `From`, the max price would be none, because we don't know it:
43
+
In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:
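A minimal sketch of that representation follows. The function name `parse_price`, the tuple layout, and the sample strings (including the `$` sign) are assumptions made for illustration, not code taken from the lesson:

```python
def parse_price(price_text):
    """Split listing text into (min_price, price).

    Assumes text such as "$74.95" or "From $1,398.00" (hypothetical samples).
    """
    text = price_text.strip()
    is_minimum = text.startswith("From")
    if is_minimum:
        text = text[len("From"):].strip()
    # Drop the currency sign and the commas that group thousands.
    amount = float(text.lstrip("$").replace(",", ""))
    # A "From" price is only a lower bound, so the exact price stays empty.
    price = None if is_minimum else amount
    return amount, price


print(parse_price("$74.95"))          # (74.95, 74.95)
print(parse_price("From $1,398.00"))  # (1398.0, None)
```

Keeping both values means later stages can still decide whether a minimum price is good enough for their purposes.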
Great! Or it would be, if we hadn't overlooked an important pitfall called [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). In short, computers store `float` numbers in a way that isn't always exact:
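The classic illustration of this pitfall is adding 0.1 and 0.2. The sketch below also shows the standard library's `decimal.Decimal` as one common way around it. Treating `Decimal` as the remedy here is an assumption, and the variable names are illustrative:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum is slightly off.
float_sum = 0.1 + 0.2
print(float_sum)         # 0.30000000000000004
print(float_sum == 0.3)  # False

# Decimal built from strings keeps the digits exactly as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum)                    # 0.3
print(decimal_sum == Decimal("0.3"))  # True
```

Note that `Decimal` is constructed from a string. Passing a float such as `Decimal(0.1)` would carry the float's inexact value along.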
@@ -186,12 +184,12 @@ for product in soup.select(".product-item"):