This repository was archived by the owner on Jul 15, 2024. It is now read-only.

Commit fdb75aa

add example: read remote parquet using duckdb (#14)
1 parent c5cbbe9 commit fdb75aa

1 file changed

+178
-0
lines changed

@@ -0,0 +1,178 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "379407de-ed10-472c-ad81-228ba73c7d15",
6+
"metadata": {},
7+
"source": [
8+
"# Reading Parquet Files using DuckDB\n",
9+
"\n",
10+
"In this example, we will use Ibis's DuckDB backend to analyze data from a remote Parquet source using `ibis.read_parquet`.\n",
11+
"`ibis.read_parquet` can also read local Parquet files,\n",
12+
"and there are other `ibis.read_*` functions that conveniently return a table expression from a file.\n",
13+
"One such function is `ibis.read_csv`, which reads local and remote CSV files.\n",
14+
"\n",
15+
"We will be reading from the [**Global Biodiversity Information Facility (GBIF) Species Occurrences**](https://registry.opendata.aws/gbif/) dataset.\n",
16+
"It is hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`."
17+
]
18+
},
19+
{
20+
"cell_type": "markdown",
21+
"id": "4402d524-bd38-4127-a8ec-500be723711c",
22+
"metadata": {},
23+
"source": [
24+
"## Reading One Partition\n",
25+
"\n",
26+
"We can read a single partition by specifying its path.\n",
27+
"\n",
28+
"We do this by calling [`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet) on the partition we care about.\n",
29+
"\n",
30+
"So to read the first partition in this dataset, we'll call `read_parquet` on `000000` in that path:"
31+
]
32+
},
33+
{
34+
"cell_type": "code",
35+
"execution_count": null,
36+
"id": "062ba84c-1f4f-4ec7-9df5-73444c491342",
37+
"metadata": {
38+
"tags": []
39+
},
40+
"outputs": [],
41+
"source": [
42+
"import ibis\n",
43+
"\n",
44+
"t = ibis.read_parquet(\"s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000\")\n",
45+
"t"
46+
]
47+
},
48+
{
49+
"cell_type": "markdown",
50+
"id": "5440fa0f-2aca-40da-b4ed-4fde06051e10",
51+
"metadata": {},
52+
"source": [
53+
"Note that we're calling `read_parquet` and receiving a table expression without establishing a connection first.\n",
54+
"Ibis spins up a DuckDB connection (or a connection to whichever default backend you have configured) when you call `ibis.read_parquet` or `ibis.read_csv`.\n",
55+
"\n",
56+
"Since our result, `t`, is a table expression, we can now run queries against the file using Ibis expressions.\n",
57+
"For example, we can select columns, filter rows, and then view the first five rows of the result:"
58+
]
59+
},
60+
{
61+
"cell_type": "code",
62+
"execution_count": null,
63+
"id": "035e845c-761a-4728-9361-ae33f3205c45",
64+
"metadata": {
65+
"tags": []
66+
},
67+
"outputs": [],
68+
"source": [
69+
"cols = ['gbifid', 'datasetkey', 'occurrenceid', 'kingdom',\n",
70+
" 'phylum', 'class', 'order', 'family', 'genus',\n",
71+
" 'species', 'day', 'month', 'year']\n",
72+
"\n",
73+
"t.select(cols).filter(t['family'].isin(['Corvidae'])).limit(5).execute()"
74+
]
75+
},
76+
{
77+
"cell_type": "markdown",
78+
"id": "4595a5ae-0007-4b8a-8e31-803d92e7e52c",
79+
"metadata": {},
80+
"source": [
81+
"We can also count the rows in the table (here, a single partition):"
82+
]
83+
},
84+
{
85+
"cell_type": "code",
86+
"execution_count": null,
87+
"id": "bd6d8cc6-ce49-44dd-9507-bd26176127f8",
88+
"metadata": {
89+
"tags": []
90+
},
91+
"outputs": [],
92+
"source": [
93+
"t.count().execute()"
94+
]
95+
},
96+
{
97+
"cell_type": "markdown",
98+
"id": "4286d9f0-8e06-498b-a561-e75193126adc",
99+
"metadata": {},
100+
"source": [
101+
"## Reading All Partitions: Filter, Aggregate, Export\n",
102+
"We can use `read_parquet` to read the entire Parquet dataset by globbing all of its partitions:"
103+
]
104+
},
105+
{
106+
"cell_type": "code",
107+
"execution_count": null,
108+
"id": "3d2246c9-57b0-4b6c-8849-e8d2d85b29bb",
109+
"metadata": {
110+
"tags": []
111+
},
112+
"outputs": [],
113+
"source": [
114+
"t = ibis.read_parquet(\"s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*\")"
115+
]
116+
},
117+
{
118+
"cell_type": "markdown",
119+
"id": "9bd746c0-d414-4212-ab76-c5d585bafc82",
120+
"metadata": {},
121+
"source": [
122+
"Since `read_parquet` returns a table expression, we can apply selections, filters, aggregations, and exports just as we would with any other table expression:"
123+
]
124+
},
125+
{
126+
"cell_type": "code",
127+
"execution_count": null,
128+
"id": "0f92c38b-1487-464c-86a2-4b922831207e",
129+
"metadata": {
130+
"tags": []
131+
},
132+
"outputs": [],
133+
"source": [
134+
"df = (\n",
135+
" t.select(['gbifid', 'family', 'species'])\n",
136+
" .filter(t['family'].isin(['Corvidae']))\n",
137+
" # Limit to 10,000 matching rows before aggregating, to fetch a quick batch of results\n",
138+
" .limit(10000)\n",
139+
" .group_by('species')\n",
140+
" .count()\n",
141+
" .execute()\n",
142+
")\n",
143+
"\n",
144+
"print(df.shape)\n",
145+
"df.head()"
146+
]
147+
},
148+
{
149+
"cell_type": "code",
150+
"execution_count": null,
151+
"id": "aecbd689-d632-42e1-80ed-28a7f0a22d17",
152+
"metadata": {},
153+
"outputs": [],
154+
"source": []
155+
}
156+
],
157+
"metadata": {
158+
"kernelspec": {
159+
"display_name": "Python 3 (ipykernel)",
160+
"language": "python",
161+
"name": "python3"
162+
},
163+
"language_info": {
164+
"codemirror_mode": {
165+
"name": "ipython",
166+
"version": 3
167+
},
168+
"file_extension": ".py",
169+
"mimetype": "text/x-python",
170+
"name": "python",
171+
"nbconvert_exporter": "python",
172+
"pygments_lexer": "ipython3",
173+
"version": "3.10.6"
174+
}
175+
},
176+
"nbformat": 4,
177+
"nbformat_minor": 5
178+
}
