Commit c4d8578

style: change order, first json, then csv
Making this change because in Python it doesn't matter and in JavaScript it's easier to start with JSON, which is built-in, and only then move to CSV, which requires an additional library.
1 parent c76817e commit c4d8578

1 file changed: +72 -72 lines changed

sources/academy/webscraping/scraping_basics_python/08_saving_data.md

Lines changed: 72 additions & 72 deletions
@@ -78,65 +78,11 @@ If you find the complex data structures printed by `print()` difficult to read,
7878

7979
:::
8080

81-
## Saving data as CSV
82-
83-
The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
84-
85-
In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
86-
87-
```py
88-
>>> import csv
89-
>>> with open("data.csv", "w") as file:
90-
... writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
91-
... writer.writeheader()
92-
... writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
93-
... writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
94-
...
95-
```
96-
97-
We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
98-
99-
```csv title=data.csv
100-
name,age,hobbies
101-
Alice,24,"kickbox, Python"
102-
Bob,42,"reading, TypeScript"
103-
```
104-
105-
In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
106-
107-
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
108-
109-
![CSV example preview](images/csv-example.png)
110-
111-
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
112-
113-
```py
114-
import httpx
115-
from bs4 import BeautifulSoup
116-
from decimal import Decimal
117-
# highlight-next-line
118-
import csv
119-
```
120-
121-
Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:
122-
123-
```py
124-
with open("products.csv", "w") as file:
125-
writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
126-
writer.writeheader()
127-
for row in data:
128-
writer.writerow(row)
129-
```
130-
131-
If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.
132-
133-
![CSV preview](images/csv.png)
134-
13581
## Saving data as JSON
13682

13783
The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.
13884

139-
In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
85+
In Python, we can read and write JSON using the [`json`](https://docs.python.org/3/library/json.html) standard library module. We'll begin with imports:
14086

14187
```py
14288
import httpx
@@ -147,14 +93,14 @@ import csv
14793
import json
14894
```
14995

150-
Next, let’s append one more export to end of the source code of our scraper:
96+
Next, instead of printing the data, we'll finish the program by exporting it to JSON. Replace `print(data)` with the following:
15197

15298
```py
15399
with open("products.json", "w") as file:
154100
json.dump(data, file)
155101
```
156102

157-
Thats it! If we run the program now, it should also create a `products.json` file in the current working directory:
103+
That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
158104

159105
```text
160106
$ python main.py
@@ -176,7 +122,7 @@ with open("products.json", "w") as file:
176122
json.dump(data, file, default=serialize)
177123
```
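
The `serialize` function passed as `default=` is defined in a part of the file that this hunk does not show. As a minimal sketch of what such a hook might look like, assuming the only value `json` cannot handle is `Decimal` and that storing it as a string is acceptable (the sample row is hypothetical):

```python
import json
from decimal import Decimal


def serialize(obj):
    # Hypothetical hook: json calls it only for objects it cannot encode itself.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical sample data standing in for the scraped products.
data = [{"title": "Sample product", "min_price": Decimal("74.95"), "price": None}]

# json.dumps() takes the same default= argument as json.dump().
text = json.dumps(data, default=serialize)
```

Raising `TypeError` for anything unexpected keeps the standard behavior for types the hook does not know about.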
178124

179-
Now the program should work as expected, producing a JSON file with the following content:
125+
If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
180126

181127
<!-- eslint-skip -->
182128
```json title=products.json
@@ -197,30 +143,67 @@ Also, if your data contains non-English characters, set `ensure_ascii=False`. By
197143

198144
:::
199145

200-
We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
146+
## Saving data as CSV
201147

202-
---
148+
The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
203149

204-
## Exercises
150+
In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
205151

206-
In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
152+
```py
153+
>>> import csv
154+
>>> with open("data.csv", "w") as file:
155+
... writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
156+
... writer.writeheader()
157+
... writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
158+
... writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
159+
...
160+
```
207161

208-
### Process your CSV
162+
We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
209163

210-
Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
164+
```csv title=data.csv
165+
name,age,hobbies
166+
Alice,24,"kickbox, Python"
167+
Bob,42,"reading, TypeScript"
168+
```
211169

212-
<details>
213-
<summary>Solution</summary>
170+
In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
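
The quoting can also be checked without touching the disk. The following is only a sketch: it writes the same kind of row into an in-memory buffer (`io.StringIO`, which is not part of the lesson) and reads it back with `csv.DictReader`:

```python
import csv
import io

# In-memory stand-in for a file; newline="" leaves line endings to the csv module.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "hobbies"])
writer.writeheader()
writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})

# The raw text shows the quotes the writer added around the value with a comma.
lines = buffer.getvalue().splitlines()

# Reading the text back restores the original value as a single field.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
```

Note that the reader returns every value as a string, so the age comes back as `"24"`, not `24`.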
214171

215-
Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
172+
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.
216173

217-
1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
218-
2. Select the header row. Go to **Data > Create filter**.
219-
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
174+
![CSV example preview](images/csv-example.png)
220175

221-
![CSV in Google Sheets](images/csv-sheets.png)
176+
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
222177

223-
</details>
178+
```py
179+
import httpx
180+
from bs4 import BeautifulSoup
181+
from decimal import Decimal
182+
# highlight-next-line
183+
import csv
184+
```
185+
186+
Next, let’s append one more export to end of the source code of our scraper:
187+
188+
```py
189+
with open("products.csv", "w") as file:
190+
writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
191+
writer.writeheader()
192+
for row in data:
193+
writer.writerow(row)
194+
```
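
A self-contained variant of this export, offered as a sketch rather than the lesson's code: the sample rows and the temporary directory are hypothetical, `newline=""` follows the recommendation in the `csv` module documentation, and `writerows()` replaces the explicit loop.

```python
import csv
import tempfile
from decimal import Decimal
from pathlib import Path

# Hypothetical sample rows standing in for the scraped products.
data = [
    {"title": "Sample TV", "min_price": Decimal("1398.00"), "price": Decimal("1398.00")},
    {"title": "Sample speaker", "min_price": Decimal("74.95"), "price": None},
]

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "products.csv"

    # newline="" prevents extra blank lines on platforms that translate line endings.
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
        writer.writeheader()
        writer.writerows(data)  # writes every dictionary in the list

    # Read the file back to see what actually ended up in it.
    with open(path, newline="", encoding="utf-8") as file:
        rows = list(csv.DictReader(file))
```

The `Decimal` values are written as plain text and `None` becomes an empty field, so no custom serializer is needed for CSV.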
195+
196+
Now the program should work as expected, producing a CSV file with the following content:
197+
198+
![CSV preview](images/csv.png)
199+
200+
We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
201+
202+
---
203+
204+
## Exercises
205+
206+
In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
224207

225208
### Process your JSON
226209

@@ -243,3 +226,20 @@ Write a new Python program that reads `products.json`, finds all products with a
243226
```
244227

245228
</details>
229+
230+
### Process your CSV
231+
232+
Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.
233+
234+
<details>
235+
<summary>Solution</summary>
236+
237+
Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
238+
239+
1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
240+
2. Select the header row. Go to **Data > Create filter**.
241+
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
242+
243+
![CSV in Google Sheets](images/csv-sheets.png)
244+
245+
</details>
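
For comparison, the same filter can be sketched in Python. This is not part of the exercise's solution; the CSV text below is hypothetical sample data in the shape of `products.csv`, and prices are converted to `Decimal` because the reader returns strings:

```python
import csv
import io
from decimal import Decimal

# Hypothetical file contents; a real run would open products.csv instead.
csv_text = (
    "title,min_price,price\n"
    "Sample TV,1398.00,1398.00\n"
    "Sample speaker,74.95,\n"
)

reader = csv.DictReader(io.StringIO(csv_text, newline=""))

# Keep only the titles of products whose min price exceeds 500.
expensive = [row["title"] for row in reader if Decimal(row["min_price"]) > 500]
```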
