
Commit 7dcae8b

allisonhorst and Fil authored
Technique example: data loader, Python to parquet (#1422)
* python to parquet technique example * grammar * minor updates, add note div * updates for consistency, add requirements.txt * newline * Apply suggestions from code review Co-authored-by: Philippe Rivière <[email protected]> * explicitly add compression codec in write_table * Mention write_table compression, move venv setup to copyable code * save * prettier and update py dependencies section --------- Co-authored-by: Philippe Rivière <[email protected]>
1 parent 05d70c5 commit 7dcae8b

File tree

9 files changed, +128 −0 lines changed

9 files changed

+128
-0
lines changed

examples/README.md

Lines changed: 1 addition & 0 deletions

@@ -65,6 +65,7 @@
 - [`loader-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-parquet/) - Generate Apache Parquet files
 - [`loader-postgres`](https://observablehq.observablehq.cloud/framework-example-loader-postgres/) - Load data from PostgreSQL
 - [`loader-python-to-csv`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-csv/) - Generate CSV from Python
+- [`loader-python-to-parquet`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet) - Generate Apache Parquet from Python
 - [`loader-python-to-png`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-png/) - Generate PNG from Python
 - [`loader-python-to-zip`](https://observablehq.observablehq.cloud/framework-example-loader-python-to-zip/) - Generate ZIP from Python
 - [`loader-r-to-csv`](https://observablehq.observablehq.cloud/framework-example-loader-r-to-csv/) - Generate CSV from R
examples/loader-python-to-parquet/.gitignore

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
.DS_Store
/dist/
node_modules/
yarn-error.log
.venv
examples/loader-python-to-parquet/README.md

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
[Framework examples →](../)

# Python data loader to generate Apache Parquet

View live: <https://observablehq.observablehq.cloud/framework-example-loader-python-to-parquet/>
This Observable Framework example demonstrates how to write a Python data loader that outputs an Apache Parquet file using the [pyarrow](https://pypi.org/project/pyarrow/) library. The loader reads in a CSV with records for over 91,000 dams in the United States from the [National Inventory of Dams](https://nid.sec.usace.army.mil/), selects several columns, then writes the data frame as a parquet file to standard output. The data loader lives in [`src/data/us-dams.parquet.py`](./src/data/us-dams.parquet.py).
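If you want to spot-check the loader’s output outside of Framework, one option is to run it and read the bytes back with pyarrow. This is a minimal, illustrative sketch (not part of the commit), assuming you run it from the example root with the dependencies from `requirements.txt` installed:

```python
# Illustrative only: run the data loader and parse its Parquet output
# back into an Arrow table to confirm the schema and row count.
import io
import subprocess

import pyarrow.parquet as pq

# The loader writes the Parquet file to standard output.
raw = subprocess.run(
    ["python", "src/data/us-dams.parquet.py"],
    check=True,
    capture_output=True,
).stdout

table = pq.read_table(io.BytesIO(raw))
print(table.num_rows, "rows")
print(table.schema)
```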
examples/loader-python-to-parquet/observablehq.config.js

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
export default {
  root: "src"
};
examples/loader-python-to-parquet/package.json

Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
{
  "type": "module",
  "private": true,
  "scripts": {
    "clean": "rimraf src/.observablehq/cache",
    "build": "rimraf dist && observable build",
    "dev": "observable preview",
    "deploy": "observable deploy",
    "observable": "observable"
  },
  "dependencies": {
    "@observablehq/framework": "^1.7.0"
  },
  "devDependencies": {
    "rimraf": "^5.0.5"
  },
  "engines": {
    "node": ">=18"
  }
}
examples/loader-python-to-parquet/requirements.txt

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
pandas==2.2.0
pyarrow==16.1
examples/loader-python-to-parquet/src/.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
/.observablehq/cache/
examples/loader-python-to-parquet/src/data/us-dams.parquet.py

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
# Load libraries (must be installed in environment)
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sys

df = pd.read_csv("https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write DataFrame to a temporary file-like object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf, compression="snappy")

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
examples/loader-python-to-parquet/src/index.md

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
# Python data loader to generate Apache Parquet

Here’s a Python data loader that accesses records for over 91,000 dams from the [National Inventory of Dams](https://nid.sec.usace.army.mil/), limits the data to only four columns, then outputs an Apache Parquet file to standard output.

```python
# Load libraries (must be installed in environment)
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sys

df = pd.read_csv("https://nid.sec.usace.army.mil/api/nation/csv", low_memory=False, skiprows=1).loc[:, ["Dam Name", "Primary Purpose", "Primary Dam Type", "Hazard Potential Classification"]]

# Write DataFrame to a temporary file-like object
buf = pa.BufferOutputStream()
table = pa.Table.from_pandas(df)
pq.write_table(table, buf, compression="snappy")

# Get the buffer as a bytes object
buf_bytes = buf.getvalue().to_pybytes()

# Write the bytes to standard output
sys.stdout.buffer.write(buf_bytes)
```
<div class="note">

To run this data loader, you’ll need python3 and the `pandas` and `pyarrow` libraries installed and available on your `$PATH`.

</div>
<div class="tip">

We recommend using a [Python virtual environment](https://observablehq.com/framework/loaders#venv), such as with venv or uv, and managing required packages via `requirements.txt` rather than installing them globally.

</div>
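After activating the virtual environment, one quick, illustrative way (not part of the commit) to confirm the pinned dependencies are the ones in use:

```python
# Illustrative check: verify the pinned libraries are importable and
# report their versions (requirements.txt pins pandas==2.2.0, pyarrow==16.1).
import pandas
import pyarrow

print("pandas", pandas.__version__)
print("pyarrow", pyarrow.__version__)
```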
This example uses the default Snappy compression algorithm. See other [options for compression](https://parquet.apache.org/docs/file-format/data-pages/compression/) available in pyarrow’s [`write_table()`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html) function.
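For instance, switching codecs is a one-argument change to `write_table()`. A hypothetical sketch (the small in-memory table stands in for the loader’s real data, and assumes your pyarrow build includes the zstd codec):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example: write a small in-memory table with zstd
# compression instead of snappy; only the `compression` argument changes.
table = pa.table({"x": [1, 2, 3]})
buf = pa.BufferOutputStream()
pq.write_table(table, buf, compression="zstd")
```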
The above data loader lives in `data/us-dams.parquet.py`, so we can load the data using `data/us-dams.parquet`. The `FileAttachment.parquet` method parses the file and returns a promise to an Apache Arrow table.

```js echo
const dams = FileAttachment("data/us-dams.parquet").parquet();
```
We can display the table using `Inputs.table`.

```js echo
Inputs.table(dams)
```
Lastly, we can pass the table to Observable Plot to make a simple bar chart of dam counts by purpose, with color mapped to hazard classification.

```js echo
Plot.plot({
  marginLeft: 220,
  color: {legend: true, domain: ["Undetermined", "Low", "Significant", "High"]},
  marks: [
    Plot.barX(dams,
      Plot.groupY(
        {x: "count"},
        {
          y: "Primary Purpose",
          fill: "Hazard Potential Classification",
          sort: {y: "x", reverse: true}
        }
      )
    )
  ]
})
```
