
Commit f5f14e0

feat: scrape prices as cents, avoid Decimal (#1861)
This makes the code simpler and brings the lessons closer to their JavaScript counterparts. Storing cents also matches real-world practice, e.g. the Stripe API and others.
1 parent 271e15a commit f5f14e0

File tree: 7 files changed, +90 −125 lines changed


sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 21 additions & 6 deletions

@@ -159,12 +159,26 @@ Great! Only if we didn't overlook an important pitfall called [floating-point er
 0.30000000000000004
 ```
 
-These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type:
+These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. We won't store dollars, but cents:
+
+```py
+price_text = (
+    product
+    .select_one(".price")
+    .contents[-1]
+    .strip()
+    .replace("$", "")
+    # highlight-next-line
+    .replace(".", "")
+    .replace(",", "")
+)
+```
+
+In this case, removing the dot from the price text is the same as if we multiplied all the numbers by 100, effectively converting dollars to cents. To convert the text to a number, we'll use `int()` instead of `float()`. This is what the whole program looks like now:
 
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 response = httpx.get(url)
@@ -182,13 +196,14 @@ for product in soup.select(".product-item"):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     print(title, min_price, price, sep=" | ")
@@ -198,8 +213,8 @@ If we run the code above, we have nice, clean data about all the products!
 
 ```text
 $ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | 7495 | 7495
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 139800 | None
 ...
 ```
 
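As a side note, the parsing approach introduced in the diff above can be sketched as a standalone function. The helper name `parse_price_cents` and the sample strings are illustrative, not taken from the commit, and the trick of dropping the dot assumes the shop always renders exactly two decimal places:

```python
def parse_price_cents(text: str) -> int:
    """Convert a price string like '$1,398.00' to integer cents.

    Assumes the input always has exactly two decimal places, so that
    deleting the dot is equivalent to multiplying by 100.
    """
    cleaned = (
        text
        .strip()
        .replace("$", "")
        .replace(".", "")  # dollars -> cents, given two decimal places
        .replace(",", "")
    )
    return int(cleaned)

# Floating-point arithmetic is why the lesson avoids float for money:
print(0.1 + 0.2)                       # 0.30000000000000004
print(parse_price_cents("$1,398.00"))  # 139800
print(parse_price_cents("74.95"))      # 7495
```

Integer cents stay exact under addition and comparison, which is the property the lesson wants for money.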
sources/academy/webscraping/scraping_basics_python/08_saving_data.md

Lines changed: 9 additions & 39 deletions

@@ -29,7 +29,6 @@ Producing results line by line is an efficient approach to handling large datase
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 
 url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
 response = httpx.get(url)
@@ -49,13 +48,14 @@ for product in soup.select(".product-item"):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     # highlight-next-line
@@ -69,7 +69,7 @@ Before looping over the products, we prepare an empty list. Then, instead of pri
 
 ```text
 $ python main.py
-[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...]
+[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': 7495, 'price': 7495}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': 139800, 'price': None}, ...]
 ```
 
 :::tip Pretty print
@@ -87,7 +87,6 @@ In Python, we can read and write JSON using the [`json`](https://docs.python.org
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 # highlight-next-line
 import json
 ```
@@ -99,39 +98,17 @@ with open("products.json", "w") as file:
     json.dump(data, file)
 ```
 
-That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
-
-```text
-$ python main.py
-Traceback (most recent call last):
-  ...
-  raise TypeError(f'Object of type {o.__class__.__name__} '
-TypeError: Object of type Decimal is not JSON serializable
-```
-
-Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:
-
-```py
-def serialize(obj):
-    if isinstance(obj, Decimal):
-        return str(obj)
-    raise TypeError("Object not JSON serializable")
-
-with open("products.json", "w") as file:
-    json.dump(data, file, default=serialize)
-```
-
-If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
+That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
 
 <!-- eslint-skip -->
 ```json title=products.json
-[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
+[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "7495", "price": "7495"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "139800", "price": null}, ...]
 ```
 
 If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
 
 ```json
-{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
+{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "15800", "price": "15800"}
 ```
 
 :::tip Pretty JSON
@@ -177,7 +154,6 @@ Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 import json
 # highlight-next-line
 import csv
@@ -186,13 +162,8 @@ import csv
 Next, let's add one more data export to the end of the source code of our scraper:
 
 ```py
-def serialize(obj):
-    if isinstance(obj, Decimal):
-        return str(obj)
-    raise TypeError("Object not JSON serializable")
-
 with open("products.json", "w") as file:
-    json.dump(data, file, default=serialize)
+    json.dump(data, file)
 
 with open("products.csv", "w") as file:
     writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
@@ -223,13 +194,12 @@ Write a new Python program that reads `products.json`, finds all products with a
 ```py
 import json
 from pprint import pp
-from decimal import Decimal
 
 with open("products.json", "r") as file:
     products = json.load(file)
 
 for product in products:
-    if Decimal(product["min_price"]) > 500:
+    if int(product["min_price"]) > 500:
        pp(product)
 ```
 
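The deleted `serialize()` helper existed only because `json` refuses `Decimal` values; plain integers need no custom `default`. A minimal sketch of the difference (the sample dicts are hypothetical):

```python
import json
from decimal import Decimal

# Integers serialize out of the box, which is why the commit can
# drop the custom default= serializer:
print(json.dumps({"min_price": 7495}))  # {"min_price": 7495}

# Decimal does not, which is what the removed traceback showed:
try:
    json.dumps({"min_price": Decimal("74.95")})
except TypeError as error:
    print(error)  # Object of type Decimal is not JSON serializable
```

This is the whole motivation for the diff in `08_saving_data.md`: with cents as `int`, the JSON export shrinks to a single `json.dump(data, file)` call.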
sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 18 additions & 33 deletions

@@ -33,7 +33,6 @@ Over the course of the previous lessons, the code of our program grew to almost
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 import json
 import csv
 
@@ -54,24 +53,20 @@ for product in soup.select(".product-item"):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     data.append({"title": title, "min_price": min_price, "price": price})
 
-def serialize(obj):
-    if isinstance(obj, Decimal):
-        return str(obj)
-    raise TypeError("Object not JSON serializable")
-
 with open("products.json", "w") as file:
-    json.dump(data, file, default=serialize)
+    json.dump(data, file)
 
 with open("products.csv", "w") as file:
     writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
@@ -103,13 +98,14 @@ def parse_product(product):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     return {"title": title, "min_price": min_price, "price": price}
@@ -119,13 +115,8 @@ Now the JSON export. For better readability of it, let's make a small change her
 
 ```py
 def export_json(file, data):
-    def serialize(obj):
-        if isinstance(obj, Decimal):
-            return str(obj)
-        raise TypeError("Object not JSON serializable")
-
     # highlight-next-line
-    json.dump(data, file, default=serialize, indent=2)
+    json.dump(data, file, indent=2)
 ```
 
 The last function we'll add will take care of the CSV export. We'll make a small change here as well. Having to specify the field names is not ideal. What if we add more field names in the parsing function? We'd always have to remember to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. One way would be to infer the field names from the dictionary keys of the first row:
@@ -151,7 +142,6 @@ Now let's put it all together:
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 import json
 import csv
 
@@ -171,24 +161,20 @@ def parse_product(product):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     return {"title": title, "min_price": min_price, "price": price}
 
 def export_json(file, data):
-    def serialize(obj):
-        if isinstance(obj, Decimal):
-            return str(obj)
-        raise TypeError("Object not JSON serializable")
-
-    json.dump(data, file, default=serialize, indent=2)
+    json.dump(data, file, indent=2)
 
 def export_csv(file, data):
     fieldnames = list(data[0].keys())
@@ -254,13 +240,13 @@ In the previous code example, we've also added the URL to the dictionary returne
 [
   {
     "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
-    "min_price": "74.95",
-    "price": "74.95",
+    "min_price": "7495",
+    "price": "7495",
     "url": "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
   },
   {
     "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
-    "min_price": "1398.00",
+    "min_price": "139800",
     "price": null,
     "url": "/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv"
   },
@@ -277,7 +263,6 @@ Browsers reading the HTML know the base address and automatically resolve such l
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 import json
 import csv
 # highlight-next-line
@@ -319,13 +304,13 @@ When we run the scraper now, we should see full URLs in our exports:
 [
   {
     "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
-    "min_price": "74.95",
-    "price": "74.95",
+    "min_price": "7495",
+    "price": "7495",
     "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
   },
   {
     "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
-    "min_price": "1398.00",
+    "min_price": "139800",
     "price": null,
     "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv"
   },
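The lesson's `export_csv` refactor above infers the CSV columns from the first row's dictionary keys instead of hard-coding them. A standalone sketch of that idea, using hypothetical sample rows rather than scraped data:

```python
import csv
import io

def export_csv(file, data):
    # Infer column names from the first row so new keys added in the
    # parsing function automatically become CSV columns.
    fieldnames = list(data[0].keys())
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)

# Hypothetical sample rows (prices already in cents):
data = [
    {"title": "Speaker", "min_price": 7495, "price": 7495},
    {"title": "TV", "min_price": 139800, "price": None},
]
buffer = io.StringIO()
export_csv(buffer, data)
print(buffer.getvalue())
```

One caveat of this approach: it assumes every row has the same keys as the first one, which holds here because all rows come from the same `parse_product` function.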

sources/academy/webscraping/scraping_basics_python/10_crawling.md

Lines changed: 7 additions & 12 deletions

@@ -18,7 +18,6 @@ Thanks to the refactoring, we have functions ready for each of the tasks, so we
 ```py
 import httpx
 from bs4 import BeautifulSoup
-from decimal import Decimal
 import json
 import csv
 from urllib.parse import urljoin
@@ -41,24 +40,20 @@ def parse_product(product, base_url):
         .contents[-1]
         .strip()
         .replace("$", "")
+        .replace(".", "")
         .replace(",", "")
     )
     if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
+        min_price = int(price_text.removeprefix("From "))
         price = None
     else:
-        min_price = Decimal(price_text)
+        min_price = int(price_text)
         price = min_price
 
     return {"title": title, "min_price": min_price, "price": price, "url": url}
 
 def export_json(file, data):
-    def serialize(obj):
-        if isinstance(obj, Decimal):
-            return str(obj)
-        raise TypeError("Object not JSON serializable")
-
-    json.dump(data, file, default=serialize, indent=2)
+    json.dump(data, file, indent=2)
 
 def export_csv(file, data):
     fieldnames = list(data[0].keys())
@@ -159,14 +154,14 @@ If we run the program now, it'll take longer to finish since it's making 24 more
 [
   {
     "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
-    "min_price": "74.95",
-    "price": "74.95",
+    "min_price": "7495",
+    "price": "7495",
     "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker",
     "vendor": "JBL"
   },
   {
     "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
-    "min_price": "1398.00",
+    "min_price": "139800",
     "price": null,
     "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv",
     "vendor": "Sony"
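The crawling lesson above keeps `from urllib.parse import urljoin` in its imports to turn the shop's relative product links into the absolute URLs seen in the export. A quick sketch of what that call does (the URLs mirror the ones in the diff):

```python
from urllib.parse import urljoin

# A relative link scraped from the listing page is resolved against
# the page's own URL, the same way a browser would resolve it:
base_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
relative = "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker"
print(urljoin(base_url, relative))
# https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
```

Because the relative link starts with `/`, `urljoin()` keeps only the scheme and host from the base URL and replaces the path.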