fix: re-order JSON and CSV in Python lessons (#1658)
When working on #1584 I realized
it'd be better if the lesson started with JSON and continued with CSV,
not the other way around.
In Python the order doesn't matter, but in JavaScript it's easier to start with
JSON, which is built in, and only then move to CSV, which requires an
additional library. So for the sake of having both lessons aligned, I
want to change the order in the Python lesson, too.
So most of the diff is just the two sections reversed, and the two
exercises reversed. I made only a few additional changes to the wording.
`sources/academy/webscraping/scraping_basics_python/08_saving_data.md` (81 additions, 73 deletions)
````diff
@@ -78,83 +78,28 @@ If you find the complex data structures printed by `print()` difficult to read,
 
 :::
 
-## Saving data as CSV
-
-The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
-
-In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
-
-We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
-
-```csv title=data.csv
-name,age,hobbies
-Alice,24,"kickbox, Python"
-Bob,42,"reading, TypeScript"
-```
-
-In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
-
-When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
-
-
-Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
-
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-# highlight-next-line
-import csv
-```
-
-Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:
-
-If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.
-
-
-
 ## Saving data as JSON
 
 The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.
 
-In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
+In Python, we can read and write JSON using the [`json`](https://docs.python.org/3/library/json.html) standard library module. We'll begin with imports:
 
 ```py
 import httpx
 from bs4 import BeautifulSoup
 from decimal import Decimal
-import csv
 # highlight-next-line
 import json
 ```
 
-Next, let’s append one more export to the end of the source code of our scraper:
+Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `print(data)` with the following:
 
 ```py
 with open("products.json", "w") as file:
     json.dump(data, file)
 ```
 
-That’s it! If we run the program now, it should also create a `products.json` file in the current working directory:
+That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
 
 ```text
 $ python main.py
````
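The scraped prices in this lesson are `Decimal` values, and plain `json.dump()` refuses to serialize those, which is what the `default=serialize` argument in the next hunk addresses. A minimal sketch of the problem, using hypothetical sample data shaped like the scraper's output:

```python
import json
from decimal import Decimal

# Hypothetical sample shaped like the scraper's `data` variable
data = [
    {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": Decimal("1398.00")},
]

try:
    json.dumps(data)
except TypeError as error:
    print(error)  # Decimal values are not JSON serializable by default
```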
````diff
@@ -176,7 +121,7 @@ with open("products.json", "w") as file:
     json.dump(data, file, default=serialize)
 ```
 
-Now the program should work as expected, producing a JSON file with the following content:
+If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
 
 <!-- eslint-skip -->
 ```json title=products.json
````
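The body of the `serialize` helper isn't visible in this hunk. One plausible implementation, assuming (as the lesson's data suggests) that the only non-JSON type involved is `Decimal`:

```python
import json
from decimal import Decimal

def serialize(obj):
    # Fall back to a string representation for Decimal prices
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

# Hypothetical sample shaped like the scraper's `data` variable
data = [{"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": Decimal("1398.00")}]
print(json.dumps(data, default=serialize))
```

With a helper like this in place, `json.dump(data, file, default=serialize)` writes the prices as strings such as `"1398.00"`.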
````diff
@@ -197,30 +142,76 @@ Also, if your data contains non-English characters, set `ensure_ascii=False`. By
 
 :::
 
-We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+## Saving data as CSV
 
----
+The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
 
-## Exercises
+In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:
 
-In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
+We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
 
-Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
+```csv title=data.csv
+name,age,hobbies
+Alice,24,"kickbox, Python"
+Bob,42,"reading, TypeScript"
+```
 
-<details>
-<summary>Solution</summary>
+In the CSV format, if a value contains commas, we should enclose it in quotes. When we open the file in a text editor of our choice, we can see that the writer automatically handled this.
 
-Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
 
-1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
-2. Select the header row. Go to **Data > Create filter**.
-3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
+
 
-
+Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
 
-</details>
+```py
+import httpx
+from bs4 import BeautifulSoup
+from decimal import Decimal
+import json
+# highlight-next-line
+import csv
+```
+
+Next, let's add one more data export to the end of the source code of our scraper:
+
+The program should now also produce a CSV file with the following content:
+
+
+
+We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+
+---
+
+## Exercises
+
+In this lesson, we created export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
 
 ### Process your JSON
 
````
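The REPL snippet that the CSV section refers to (between "basic usage:" and the `data.csv` listing) didn't survive the diff extraction above. Reconstructed from the description and the output shown, it would look roughly like this:

```python
import csv

# Open a new file for writing and create a DictWriter with the expected field names
with open("data.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
    writer.writeheader()  # header row first
    writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
    writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
```

The writer quotes the `hobbies` values automatically because they contain commas, matching the `data.csv` contents shown in the diff.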
````diff
@@ -243,3 +234,20 @@ Write a new Python program that reads `products.json`, finds all products with a
 ```
 
 </details>
+
+### Process your CSV
+
+Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.
+
+<details>
+<summary>Solution</summary>
+
+Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+
+1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+2. Select the header row. Go to **Data > Create filter**.
+3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
````
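For the "Process your JSON" exercise, whose full wording is truncated in the hunk header above, a solution could be sketched as below. The `title` and `min_price` field names and the $500 threshold are assumptions borrowed from the CSV exercise; adjust them to the actual exercise text.

```python
import json
from decimal import Decimal

def expensive_products(path, threshold=Decimal("500")):
    # Load the products exported by the scraper and keep the pricey ones.
    # Prices were serialized as strings, so convert them back to Decimal.
    # The `title` and `min_price` field names are assumptions.
    with open(path) as file:
        products = json.load(file)
    return [p for p in products if Decimal(p["min_price"]) > threshold]
```

For example, `expensive_products("products.json")` would return the list of products whose min price exceeds $500.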