Making this change because in Python the order doesn't matter,
and in JavaScript it's easier to start with JSON, which
is built-in, and only then move to CSV, which requires
an additional library.
sources/academy/webscraping/scraping_basics_python/08_saving_data.md
72 additions & 72 deletions
@@ -78,65 +78,11 @@ If you find the complex data structures printed by `print()` difficult to read,
 
 :::
 
-## Saving data as CSV
-
-The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
-
-In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
-
-We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
-
-```csv title=data.csv
-name,age,hobbies
-Alice,24,"kickbox, Python"
-Bob,42,"reading, TypeScript"
-```
-
-In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
-
-When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
-
-
-Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
-
-```py
-import httpx
-from bs4 import BeautifulSoup
-from decimal import Decimal
-# highlight-next-line
-import csv
-```
-
-Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:
-
-If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.
-
-
-
 ## Saving data as JSON
 
 The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.
 
-In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
+In Python, we can read and write JSON using the [`json`](https://docs.python.org/3/library/json.html) standard library module. We'll begin with imports:
 
 ```py
 import httpx
@@ -147,14 +93,14 @@ import csv
 import json
 ```
 
-Next, let’s append one more export to end of the source code of our scraper:
+Next, instead of printing the data, we'll finish the program by exporting it to JSON. Replace `print(data)` with the following:
 
 ```py
 with open("products.json", "w") as file:
     json.dump(data, file)
 ```
 
-That’s it! If we run the program now, it should also create a `products.json` file in the current working directory:
+That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
 
 ```text
 $ python main.py
@@ -176,7 +122,7 @@ with open("products.json", "w") as file:
     json.dump(data, file, default=serialize)
 ```
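The `serialize` helper passed as `default=` in the context lines above is defined outside the hunks shown here. A minimal sketch of such a helper, assuming it only needs to handle the lesson's `Decimal` prices (an assumption, since the definition isn't part of this diff):

```py
# Hypothetical sketch — the lesson's actual helper may differ.
from decimal import Decimal

def serialize(obj):
    # json.dump() calls this for values it can't encode natively,
    # such as the Decimal prices used in the lesson.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
```

If the data contains non-English characters, adding `ensure_ascii=False` to the `json.dump()` call (as the hunk below mentions) writes them as-is instead of escaping them.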
 
-Now the program should work as expected, producing a JSON file with the following content:
+If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
 
 <!-- eslint-skip -->
 ```json title=products.json
@@ -197,30 +143,67 @@ Also, if your data contains non-English characters, set `ensure_ascii=False`. By
 
 :::
 
-We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+## Saving data as CSV
 
----
+The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
 
-## Exercises
+In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
 
-In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
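The REPL snippet that the added lines introduce ("First let's try something small in the Python's interactive REPL") is collapsed in this diff view. A minimal sketch of such a `csv.DictWriter` session, reconstructed from the field names and the `data.csv` output shown below — not the lesson's verbatim code:

```py
import csv

# Open a new file for writing and create a DictWriter with the expected field names,
# then write the header row followed by two rows of data.
with open("data.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
    writer.writeheader()
    writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
    writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
```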
+We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
 
-Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
+```csv title=data.csv
+name,age,hobbies
+Alice,24,"kickbox, Python"
+Bob,42,"reading, TypeScript"
+```
 
-<details>
-<summary>Solution</summary>
+In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
 
-Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
 
-1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
-2. Select the header row. Go to **Data > Create filter**.
-3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
+
 
-
+Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
 
-</details>
+```py
+import httpx
+from bs4 import BeautifulSoup
+from decimal import Decimal
+# highlight-next-line
+import csv
+```
+
+Next, let’s append one more export to end of the source code of our scraper:
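The export snippet that follows the added line above is collapsed in this diff view. A minimal sketch of appending such a CSV export to the scraper, assuming `data` is the list of product dictionaries built earlier in the lesson and guessing the field names — not the lesson's verbatim code:

```py
# Hypothetical sketch — field names are guessed; the lesson's snippet may differ.
with open("products.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```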
+Now the program should work as expected, producing a CSV file with the following content:
+
+
+
+We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
+
+---
+
+## Exercises
+
+In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
 
 ### Process your JSON
 
@@ -243,3 +226,20 @@ Write a new Python program that reads `products.json`, finds all products with a
 ```
 
 </details>
+
+### Process your CSV
+
+Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.
+
+<details>
+<summary>Solution</summary>
+
+Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+
+1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+2. Select the header row. Go to **Data > Create filter**.
+3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
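The "Process your JSON" exercise referenced in the hunk header above asks for a new program that reads `products.json` and finds all products with a min price greater than $500; its solution code is truncated in this view. A minimal sketch of one way to do it, assuming prices were exported as strings by the `serialize` helper and that each product has a `title` key (both assumptions):

```py
# Hypothetical sketch — the lesson's reference solution may differ.
import json
from decimal import Decimal

with open("products.json") as file:
    products = json.load(file)

for product in products:
    # "title" is a guessed key; min_price is assumed to be a string
    # produced by the serialize() helper.
    if Decimal(product["min_price"]) > 500:
        print(product["title"])
```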