Commit 069abdc

Merge branch 'master' into feat/update-milvus
2 parents 26e2b92 + 54cd84c commit 069abdc

File tree

8 files changed: +251 −17 lines changed


_typos.toml

Lines changed: 1 addition & 0 deletions

````diff
@@ -2,6 +2,7 @@
 extend-ignore-re = [
     '`[^`\n]+`', # skip inline code
     '```[\s\S]*?```', # skip code blocks
+    'Bún bò Nam Bô', # otherwise "Nam" is considered as a typo of "Name"
 ]

 [default.extend-words]
````
Lines changed: 236 additions & 3 deletions

```diff
@@ -1,13 +1,246 @@
 ---
 title: Saving data with Python
 sidebar_label: Saving data
-description: TODO
+description: Lesson about building a Python application for watching prices. Using standard library to save data scraped from product listing pages in popular formats such as CSV or JSON.
 sidebar_position: 8
 slug: /scraping-basics-python/saving-data
 ---

-:::danger Work in progress
+**In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON. We'll use Python's standard library to export the files.**

-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+---
```

The remaining lines of the hunk are all additions, the new lesson content:
We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically.

```text
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
...
```

However, the format of this text is rather _ad hoc_ and does not adhere to any specific standard that others could follow. It's unclear what to do if a product title already contains the `|` character or how to represent multi-line product descriptions. No ready-made library can handle all the parsing.
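To see why the delimiter is a problem, here's a quick sketch of a naive parser choking on a hypothetical title that itself contains `|`:

```py
# A line in our ad hoc format whose title contains " | ".
# (Hypothetical example; the title is made up for illustration.)
line = 'Sony SACS9 | 10" Subwoofer | 158.00 | 158.00'

# A naive parser splits on the delimiter and gets four fields instead of three,
# so it can't tell where the title ends and the prices begin.
fields = line.split(" | ")
print(len(fields))  # → 4
```

There is no way to recover the original three fields without extra rules, which is exactly what standard formats define for us.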
We should use widely popular formats that have well-defined solutions for all the corner cases and that other programs can read without much effort. Two such formats are CSV (_Comma-separated values_) and JSON (_JavaScript Object Notation_).

## Collecting data

Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This will take three changes to our program:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

# highlight-next-line
data = []
for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    price_text = (
        product
        .select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    # highlight-next-line
    data.append({"title": title, "min_price": min_price, "price": price})

# highlight-next-line
print(data)
```
Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once.

```text
$ python main.py
[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...]
```

:::tip Pretty print

If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead.

:::
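For instance, with a hardcoded sample of our data (shortened here for illustration), `pp()` wraps each long dictionary over several lines instead of printing one long run of text:

```py
from pprint import pp

# Two sample records similar to what our scraper collects
data = [
    {"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"},
    {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": None},
]

# Unlike print(), pp() breaks structures that exceed the line width
pp(data)
```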
## Saving data as CSV

The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.

In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:

```py
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```

We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:

```csv title=data.csv
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```

In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.
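To confirm that parsers handle the quoting, we can read the file back with `csv.DictReader`, the reading counterpart of `DictWriter`. Here's a small self-contained sketch (it recreates `data.csv` first; `newline=""` follows the `csv` module docs' recommendation for opening files):

```py
import csv

# Recreate the example file from above
with open("data.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
    writer.writeheader()
    writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
    writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})

# DictReader parses the header row and un-quotes values containing commas.
# Note that every value comes back as a string, including the ages.
with open("data.csv", newline="") as file:
    rows = list(csv.DictReader(file))

print(rows[0]["hobbies"])  # → kickbox, Python
```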
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.

![CSV example preview](images/csv-example.png)

Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
# highlight-next-line
import csv
```

Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:

```py
with open("products.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.

![CSV preview](images/csv.png)
## Saving data as JSON

The JSON format is popular primarily among developers. We use it for storing data, for configuration files, or as a way to transfer data between programs (e.g., over APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.

In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
# highlight-next-line
import json
```
Next, let's append one more export to the end of our scraper's source code:

```py
with open("products.json", "w") as file:
    json.dump(data, file)
```

That's it! If we run the program now, it should also create a `products.json` file in the current working directory:

```text
$ python main.py
Traceback (most recent call last):
  ...
  raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Decimal is not JSON serializable
```
Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:

```py
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)
```
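To see the `default=` hook in isolation, here's a quick check using `json.dumps()`, which returns a string instead of writing to a file (sample value hardcoded):

```py
import json
from decimal import Decimal

def serialize(obj):
    # Fallback called by json.dump()/json.dumps() for any type
    # the encoder can't handle on its own
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

print(json.dumps({"price": Decimal("74.95")}, default=serialize))
# → {"price": "74.95"}
```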
Now the program should work as expected, producing a JSON file with the following content:

<!-- eslint-skip -->
```json title=products.json
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
```

If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:

```json
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
```

:::tip Pretty JSON

While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output.

Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\u00fan b\u00f2 Nam B\u00f4`.

:::
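A quick REPL-style check of both options, again using `json.dumps()` with a hardcoded sample string:

```py
import json

# By default, non-ASCII characters are escaped...
print(json.dumps("Bún bò Nam Bô"))
# → "B\u00fan b\u00f2 Nam B\u00f4"

# ...while ensure_ascii=False keeps them readable in the output
print(json.dumps("Bún bò Nam Bô", ensure_ascii=False))
# → "Bún bò Nam Bô"
```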
We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the minimum price, not the actual price. In the next lesson, we'll attempt to scrape more details from all the product pages.

---

## Exercises

In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.

### Process your CSV

Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
<details>
<summary>Solution</summary>

Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:

1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
2. Select the header row. Go to **Data > Create filter**.
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.

![CSV in Google Sheets](images/csv-sheets.png)

</details>
### Process your JSON

Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).

<details>
<summary>Solution</summary>

```py
import json
from pprint import pp
from decimal import Decimal

with open("products.json", "r") as file:
    products = json.load(file)

for product in products:
    if Decimal(product["min_price"]) > 500:
        pp(product)
```

</details>

sources/academy/webscraping/scraping_basics_python/index.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -31,7 +31,7 @@ TODO image of warehouse with some CVS or JSON exported, similar to sources/acade
 - Inspect pages using browser DevTools.
 - Download web pages using the HTTPX library.
 - Extract data from web pages using the Beautiful Soup library.
-- Save extracted data in various formats, e.g. CSV which MS Excel or Google Sheets can open
+- Save extracted data in various formats, e.g. CSV which MS Excel or Google Sheets can open.
 - Follow links programmatically (crawling).
 - Save time and effort with frameworks, such as Crawlee, and scraping platforms, such as Apify.

@@ -52,7 +52,7 @@ Let's explore the key reasons to take this course. What is web scraping good for

 ### Why learn scraping

-The internet is full of useful data, but most of it isn't offered in a structured way that is easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.
+The internet is full of useful data, but most of it isn't offered in a structured way that's easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.

 Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.

@@ -71,7 +71,7 @@ As a scraper developer, you are not limited by whether certain data is available

 ### Why learn with Apify

-We are [Apify](https://apify.com), a web scraping and automation platform. We did our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).
+We are [Apify](https://apify.com), a web scraping and automation platform. We do our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how scraping platforms can simplify your life, but those lessons are optional and designed to fit within our [free tier](https://apify.com/pricing).

 ## Course content
```
sources/platform/actors/development/actor_definition/input_schema/specification.md

Lines changed: 10 additions & 10 deletions

The hunk `@@ -183,16 +183,16 @@` only realigns the column widths of the Properties table in the Markdown source; the rendered cell content is unchanged:

| Property | Value | Required | Description |
| --- | --- | --- | --- |
| `editor` | One of <ul><li>`textfield`</li><li>`textarea`</li><li>`javascript`</li><li>`python`</li><li>`select`</li><li>`datepicker`</li><li>`hidden`</li></ul> | Yes | Visual editor used for <br/>the input field. |
| `pattern` | String | No | Regular expression that will be <br/>used to validate the input. <br/>If validation fails, <br/>the Actor will not run. |
| `minLength` | Integer | No | Minimum length of the string. |
| `maxLength` | Integer | No | Maximum length of the string. |
| `enum` | [String] | Required if <br/>`editor` <br/>is `select` | Using this field, you can limit values <br/>to the given array of strings. <br/>Input will be displayed as select box. |
| `enumTitles` | [String] | No | Titles for the `enum` keys described. |
| `nullable` | Boolean | No | Specifies whether `null` <br/>is an allowed value. |
| `isSecret` | Boolean | No | Specifies whether the input field<br />will be stored encrypted.<br />Only available <br />with `textfield` and `textarea` editors. |

:::note Regex escape
sources/platform/storage/dataset.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -284,7 +284,7 @@ For more information, visit our [Python SDK documentation](/sdk/python/docs/conc

 Fields in a dataset that begin with a `#` are treated as hidden. You can exclude these fields when downloading data by using either `skipHidden=1` or `clean=1` in your query parameters. This feature is useful for excluding debug information from the final dataset output.

-The following example demonstrates a dataset record with hiddent fields, including HTTP response and error details.
+The following example demonstrates a dataset record with hidden fields, including HTTP response and error details.

 ```json
 {
````

0 commit comments
