
Commit 0e342e8

honzajavorek authored and daveomri committed
feat: update the saving lesson of the JS2 course to be about JavaScript (apify#1762)
Part of apify#1584
1 parent eae40fc commit 0e342e8

File tree

1 file changed

+142 −151 lines changed


sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md

Lines changed: 142 additions & 151 deletions
@@ -13,9 +13,9 @@ unlisted: true
 We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically.
 
 ```text
-$ python main.py
-JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
-Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
+$ node index.js
+JBL Flip 4 Waterproof Portable Bluetooth Speaker | 7495 | 7495
+Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 139800 | null
 ...
 ```

@@ -27,220 +27,211 @@ We should use widely popular formats that have well-defined solutions for all th
 
 Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take three changes to our program:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-
-url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-response = httpx.get(url)
-response.raise_for_status()
-
-html_code = response.text
-soup = BeautifulSoup(html_code, "html.parser")
-
-# highlight-next-line
-data = []
-for product in soup.select(".product-item"):
-    title = product.select_one(".product-item__title").text.strip()
-
-    price_text = (
-        product
-        .select_one(".price")
-        .contents[-1]
-        .strip()
-        .replace("$", "")
-        .replace(",", "")
-    )
-    if price_text.startswith("From "):
-        min_price = Decimal(price_text.removeprefix("From "))
-        price = None
-    else:
-        min_price = Decimal(price_text)
-        price = min_price
-
-    # highlight-next-line
-    data.append({"title": title, "min_price": min_price, "price": price})
-
-# highlight-next-line
-print(data)
+```js
+import * as cheerio from 'cheerio';
+
+const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const response = await fetch(url);
+
+if (response.ok) {
+  const html = await response.text();
+  const $ = cheerio.load(html);
+
+  // highlight-next-line
+  const data = [];
+  $(".product-item").each((i, element) => {
+    const productItem = $(element);
+
+    const title = productItem.find(".product-item__title");
+    const titleText = title.text().trim();
+
+    const price = productItem.find(".price").contents().last();
+    const priceRange = { minPrice: null, price: null };
+    const priceText = price
+      .text()
+      .trim()
+      .replace("$", "")
+      .replace(".", "")
+      .replace(",", "");
+
+    if (priceText.startsWith("From ")) {
+      priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+    } else {
+      priceRange.minPrice = parseInt(priceText);
+      priceRange.price = priceRange.minPrice;
+    }
+
+    // highlight-next-line
+    data.push({ title: titleText, ...priceRange });
+  });
+
+  // highlight-next-line
+  console.log(data);
+} else {
+  throw new Error(`HTTP ${response.status}`);
+}
 ```
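Editor's aside for readers following along: the chained `replace()` calls in the new scraper turn a price string into an integer number of cents. A minimal standalone sketch of that logic (the sample strings below are made up, not scraped):

```javascript
// Sketch of the lesson's price parsing. Each replace() with a string
// argument removes only the first occurrence: "$", then ".", then ",".
const parsePrice = (text) => parseInt(
  text.trim().replace("$", "").replace(".", "").replace(",", "")
);

console.log(parsePrice("$74.95"));    // 7495 (cents)
console.log(parsePrice("$1,398.00")); // 139800 (cents)
```

Dropping the decimal point this way is why the program prints `7495` instead of `74.95`: the amounts end up in cents.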

-Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once.
+Before looping over the products, we prepare an empty array. Then, instead of printing each line, we append the data of each product to the array in the form of a JavaScript object. At the end of the program, we print the entire array at once.
 
 ```text
-$ python main.py
-[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...]
+$ node index.js
+[
+  {
+    title: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker',
+    minPrice: 7495,
+    price: 7495
+  },
+  {
+    title: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV',
+    minPrice: 139800,
+    price: null
+  },
+  ...
+]
 ```
 
-:::tip Pretty print
+:::tip Spread syntax
+
+The three dots in `{ title: titleText, ...priceRange }` are called [spread syntax](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax). It's the same as if we wrote the following:
 
-If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead.
+```js
+{
+  title: titleText,
+  minPrice: priceRange.minPrice,
+  price: priceRange.price,
+}
+```
 
 :::
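Editor's aside: spread syntax is easy to try outside the scraper. A minimal standalone sketch with made-up values:

```javascript
// Spread copies an object's own properties into a new object literal.
const priceRange = { minPrice: 7495, price: 7495 };
const product = { title: 'JBL Flip 4', ...priceRange };
console.log(product); // { title: 'JBL Flip 4', minPrice: 7495, price: 7495 }

// On a key collision, the property listed later wins.
const discounted = { ...product, price: 6995 };
console.log(discounted.price); // 6995
```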
 
-## Saving data as CSV
+## Saving data as JSON
 
-The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
+The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of JavaScript objects, but people now use it across programming languages.
 
-In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
+We'll begin by importing the `writeFile` function from the Node.js standard library, so that we can, well, write files:
 
-```py
->>> import csv
->>> with open("data.csv", "w") as file:
-...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
-...     writer.writeheader()
-...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
-...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
-...
+```js
+import * as cheerio from 'cheerio';
+// highlight-next-line
+import { writeFile } from "fs/promises";
 ```
 
-We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
+Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `console.log(data)` with the following:
 
-```csv title=data.csv
-name,age,hobbies
-Alice,24,"kickbox, Python"
-Bob,42,"reading, TypeScript"
+```js
+const jsonData = JSON.stringify(data);
+await writeFile('products.json', jsonData);
 ```
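Editor's aside: a quick standalone check of what `JSON.stringify()` produces for an array of objects (sample data, not the scraped products):

```javascript
// JSON.stringify() serializes arrays and objects to a compact JSON
// string; a null value survives as JSON null.
const data = [{ title: "Example product", minPrice: 7495, price: null }];
const jsonData = JSON.stringify(data);
console.log(jsonData); // [{"title":"Example product","minPrice":7495,"price":null}]
```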
 
-In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
-
-When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
-![CSV example preview](images/csv-example.png)
-
-Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
+That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
 
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-# highlight-next-line
-import csv
+<!-- eslint-skip -->
+```json title=products.json
+[{"title":"JBL Flip 4 Waterproof Portable Bluetooth Speaker","minPrice":7495,"price":7495},{"title":"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV","minPrice":139800,"price":null},...]
 ```
 
-Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:
+If you skim through the data, you'll notice that the `JSON.stringify()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
 
-```py
-with open("products.csv", "w") as file:
-    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
-    writer.writeheader()
-    for row in data:
-        writer.writerow(row)
+```json
+{"title":"Sony SACS9 10\" Active Subwoofer","minPrice":15800,"price":15800}
 ```
 
-If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.
+:::tip Pretty JSON
 
-![CSV preview](images/csv.png)
+While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can call `JSON.stringify(data, null, 2)` for prettier output. See the [documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify) for an explanation of the parameters and more examples.
 
-## Saving data as JSON
+:::
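Editor's aside: with sample data, the third argument's effect looks like this:

```javascript
// Passing 2 as the third argument indents each nested level by two spaces.
const data = [{ title: "Example", minPrice: 7495 }];
const pretty = JSON.stringify(data, null, 2);
console.log(pretty);
// [
//   {
//     "title": "Example",
//     "minPrice": 7495
//   }
// ]
```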
 
-The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.
+## Saving data as CSV
 
-In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
+The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
 
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-import csv
-# highlight-next-line
-import json
-```
+Neither JavaScript itself nor Node.js offers anything built-in to read and write CSV, so we'll need to install a library. We'll use [json2csv](https://juanjodiaz.github.io/json2csv/), a _de facto_ standard for working with CSV in JavaScript:
 
-Next, let’s append one more export to end of the source code of our scraper:
+```text
+$ npm install @json2csv/node --save
 
-```py
-with open("products.json", "w") as file:
-    json.dump(data, file)
+added 4 packages, and audited 28 packages in 1s
+...
 ```
 
-That’s it! If we run the program now, it should also create a `products.json` file in the current working directory:
+Once installed, we can add the following line to our imports:
 
-```text
-$ python main.py
-Traceback (most recent call last):
-  ...
-    raise TypeError(f'Object of type {o.__class__.__name__} '
-TypeError: Object of type Decimal is not JSON serializable
+```js
+import * as cheerio from 'cheerio';
+import { writeFile } from "fs/promises";
+// highlight-next-line
+import { AsyncParser } from '@json2csv/node';
 ```
 
-Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:
+Then, let's add one more data export near the end of the source code of our scraper:
 
-```py
-def serialize(obj):
-    if isinstance(obj, Decimal):
-        return str(obj)
-    raise TypeError("Object not JSON serializable")
+```js
+const jsonData = JSON.stringify(data);
+await writeFile('products.json', jsonData);
 
-with open("products.json", "w") as file:
-    json.dump(data, file, default=serialize)
+const parser = new AsyncParser();
+const csvData = await parser.parse(data).promise();
+await writeFile("products.csv", csvData);
 ```
 
-Now the program should work as expected, producing a JSON file with the following content:
+The program should now also produce a `products.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
 
-<!-- eslint-skip -->
-```json title=products.json
-[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
-```
+![CSV preview](images/csv.png)
 
-If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
+In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this:
 
-```json
-{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
+```csv title=products.csv
+"title","minPrice","price"
+"JBL Flip 4 Waterproof Portable Bluetooth Speaker",7495,7495
+"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",139800,
+"Sony SACS9 10"" Active Subwoofer",15800,15800
+...
+"Samsung Surround Sound Bar Home Speaker, Set of 7 (HW-NW700/ZA)",64799,64799
+...
 ```
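Editor's aside: the quoting rule the library applies can be sketched in a couple of lines of plain JavaScript. This is only an illustration of the rule; json2csv handles it for us:

```javascript
// Wrap a value in double quotes and double any quotes it already contains,
// so commas and quotes inside the value don't break the CSV structure.
const quote = (value) => `"${String(value).replaceAll('"', '""')}"`;

console.log(quote('Sony SACS9 10" Active Subwoofer')); // "Sony SACS9 10"" Active Subwoofer"
console.log(quote("Samsung Surround Sound Bar Home Speaker, Set of 7"));
```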
 
-:::tip Pretty JSON
-
-While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output.
-
-Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\\u00fan b\\u00f2 Nam B\\u00f4`.
-
-:::
-
-We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+We've built a Node.js application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
 
 ---
 
 ## Exercises
 
-In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
+In this lesson, we created export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
 
-### Process your CSV
+### Process your JSON
 
-Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
+Write a new Node.js program that reads `products.json`, finds all products with a min price greater than $500, and prints each of them.
 
 <details>
 <summary>Solution</summary>
 
-Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
-
-1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
-2. Select the header row. Go to **Data > Create filter**.
-3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
+```js
+import { readFile } from "fs/promises";
 
-![CSV in Google Sheets](images/csv-sheets.png)
+const jsonData = await readFile("products.json");
+const data = JSON.parse(jsonData);
+data
+  .filter(row => row.minPrice > 50000)
+  .forEach(row => console.log(row));
+```
 
 </details>
 
-### Process your JSON
+### Process your CSV
 
-Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).
+Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.
 
 <details>
 <summary>Solution</summary>
 
-```py
-import json
-from pprint import pp
-from decimal import Decimal
+Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
 
-with open("products.json", "r") as file:
-    products = json.load(file)
+1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+2. Select the header row. Go to **Data > Create filter**.
+3. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
 
-for product in products:
-    if Decimal(product["min_price"]) > 500:
-        pp(product)
-```
+![CSV in Google Sheets](images/csv-sheets.png)
 
 </details>

0 commit comments
