This repository was archived by the owner on Jul 15, 2024. It is now read-only.

Commit c1cdd6e

GH-9: Add Substrait example (#16)
1 parent fdb75aa commit c1cdd6e

File tree

4 files changed

+246
-1
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,4 +136,5 @@ Icon?
136136

137137
# Tutorial and examples artifacts
138138
geography.db
139+
palmer_penguins.ddb*
139140
*.log

examples/.gitignore

Lines changed: 0 additions & 1 deletion
This file was deleted.

examples/Substrait.ipynb

Lines changed: 244 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,244 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Substrait\n",
8+
"\n",
9+
"[Substrait](https://substrait.io) is a cross-language specification for data compute operations. Ibis can produce Substrait plans using the `ibis-substrait` python package. \n",
10+
"\n",
11+
"### Why Substrait?\n",
12+
"\n",
13+
"The current state of the world requires tools like Ibis to build connectors for each unique data system. This is a many-to-many relationship that grows exponentially. Substrait removes the need for connectors by introducing an Intermediate Representation (IR). Now, we can have a many-to-one relationship from frontend -> IR and a one-to-many relationship from IR -> backend. \n",
14+
"\n",
15+
"### But, how is this useful to me?\n",
16+
"\n",
17+
"Interoperability now _and in the future_. The same Substrait Plan can run anywhere that has built-in support for the Substrait specification. No need to wait for Ibis to implement the shiny new connector for your data system of choice."
18+
]
19+
},
20+
{
21+
"cell_type": "markdown",
22+
"metadata": {},
23+
"source": [
24+
"## Example\n",
25+
"\n",
26+
"Let's see Ibis Substrait in action."
27+
]
28+
},
29+
{
30+
"cell_type": "markdown",
31+
"metadata": {},
32+
"source": [
33+
"### Setup\n",
34+
"\n",
35+
"Let's build a toy example of a database server. Our example uses a local DuckDB database, but in practice we can imagine talking to a database server over the network."
36+
]
37+
},
38+
{
39+
"cell_type": "code",
40+
"execution_count": null,
41+
"metadata": {},
42+
"outputs": [],
43+
"source": [
44+
"import duckdb\n",
45+
"import os\n",
46+
"from urllib.request import urlretrieve\n",
47+
"\n",
48+
"\n",
49+
"class DatabaseServer:\n",
50+
" DB_NAME = \"palmer_penguins.ddb\"\n",
51+
" DB_URL = \"https://storage.googleapis.com/ibis-tutorial-data/palmer_penguins.ddb\"\n",
52+
" \n",
53+
" def __init__(self):\n",
54+
" if not os.path.exists(self.DB_NAME):\n",
55+
" urlretrieve(self.DB_URL, self.DB_NAME)\n",
56+
" self.db = duckdb.connect(self.DB_NAME)\n",
57+
" self.db.install_extension(\"substrait\")\n",
58+
" self.db.load_extension(\"substrait\")\n",
59+
" \n",
60+
" def execute(self, substrait):\n",
61+
" result = self.db.from_substrait(substrait)\n",
62+
" return result.fetchall()\n",
63+
" \n",
64+
"\n",
65+
"db_server = DatabaseServer()"
66+
]
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"metadata": {},
71+
"source": [
72+
"### Ibis Table\n",
73+
"\n",
74+
"We need an Ibis Table to query against. Let's define one that matches the table in our mock DB server."
75+
]
76+
},
77+
{
78+
"cell_type": "code",
79+
"execution_count": null,
80+
"metadata": {},
81+
"outputs": [],
82+
"source": [
83+
"import ibis\n",
84+
"from ibis.expr.datatypes.core import Float64, Int64, String\n",
85+
"\n",
86+
"table = ibis.table(\n",
87+
" name=\"penguins\", \n",
88+
" schema=[\n",
89+
" (\"species\", String()),\n",
90+
" (\"island\", String()),\n",
91+
" (\"bill_length_mm\", Float64()),\n",
92+
" (\"bill_depth_mm\", Float64()),\n",
93+
" (\"flipper_length_mm\", Int64()),\n",
94+
" (\"body_mass_g\", Int64()),\n",
95+
" (\"sex\", String()),\n",
96+
" (\"year\", Int64)\n",
97+
" ]\n",
98+
")\n",
99+
"\n",
100+
"print(table)"
101+
]
102+
},
103+
{
104+
"cell_type": "markdown",
105+
"metadata": {},
106+
"source": [
107+
"### Substrait Compiler\n",
108+
"\n",
109+
"The `ibis-substrait` package provides a `SubstraitCompiler` that can both compile and decompile Substrait Plans.\n",
110+
"\n",
111+
"Let's see it in action:"
112+
]
113+
},
114+
{
115+
"cell_type": "code",
116+
"execution_count": null,
117+
"metadata": {},
118+
"outputs": [],
119+
"source": [
120+
"from ibis import _\n",
121+
"from ibis_substrait.compiler.core import SubstraitCompiler\n",
122+
"\n",
123+
"compiler = SubstraitCompiler()\n",
124+
"\n",
125+
"query = (\n",
126+
" table\n",
127+
" .select(_.species)\n",
128+
" .group_by(_.species)\n",
129+
" .agg(count=_.species.count())\n",
130+
")\n",
131+
"\n",
132+
"substrait_plan = compiler.compile(query)\n",
133+
"\n",
134+
"print(substrait_plan)"
135+
]
136+
},
137+
{
138+
"cell_type": "markdown",
139+
"metadata": {},
140+
"source": [
141+
"### Substrait Execution\n",
142+
"\n",
143+
"Let's serialize the Substrait Plan to bytes that can be sent over the network and pass them to our mock DB server.\n",
144+
"\n",
145+
"The query counts the number of penguins per species."
146+
]
147+
},
148+
{
149+
"cell_type": "code",
150+
"execution_count": null,
151+
"metadata": {},
152+
"outputs": [],
153+
"source": [
154+
"plan_bytes = substrait_plan.SerializeToString()\n",
155+
"\n",
156+
"db_server.execute(substrait=plan_bytes)"
157+
]
158+
},
159+
{
160+
"cell_type": "markdown",
161+
"metadata": {},
162+
"source": [
163+
"Success! We've created an Ibis Table expression, serialized it to the Substrait IR, sent it to our DB server, and received the resulting rows back.\n",
164+
"\n",
165+
"We can iterate on our data analysis. Let's see how many of each species lives on each island."
166+
]
167+
},
168+
{
169+
"cell_type": "code",
170+
"execution_count": null,
171+
"metadata": {},
172+
"outputs": [],
173+
"source": [
174+
"query = (\n",
175+
" table\n",
176+
" .select(_.island, _.species)\n",
177+
" .group_by([_.island, _.species])\n",
178+
" .agg(num=_.species.count())\n",
179+
" .order_by([ibis.asc(_.island), ibis.asc(_.species)])\n",
180+
")\n",
181+
"\n",
182+
"plan_bytes = compiler.compile(query).SerializeToString()\n",
183+
"\n",
184+
"db_server.execute(substrait=plan_bytes)"
185+
]
186+
},
187+
{
188+
"cell_type": "markdown",
189+
"metadata": {},
190+
"source": [
191+
"Interesting! And what is the average body mass in grams for each row result?"
192+
]
193+
},
194+
{
195+
"cell_type": "code",
196+
"execution_count": null,
197+
"metadata": {},
198+
"outputs": [],
199+
"source": [
200+
"query = (\n",
201+
" table\n",
202+
" .select(_.island, _.species, _.body_mass_g)\n",
203+
" .group_by([_.island, _.species])\n",
204+
" .agg(num=_.species.count(), avg_weight=_.body_mass_g.mean())\n",
205+
" .order_by([ibis.asc(_.island), ibis.asc(_.species)])\n",
206+
")\n",
207+
"\n",
208+
"plan_bytes = compiler.compile(query).SerializeToString()\n",
209+
"\n",
210+
"db_server.execute(substrait=plan_bytes)"
211+
]
212+
},
213+
{
214+
"cell_type": "markdown",
215+
"metadata": {},
216+
"source": [
217+
"## Conclusion\n",
218+
"\n",
219+
"We saw how we can translate Ibis expressions into Substrait Plans that can theoretically run anywhere. Backend support for Substrait is growing. Checkout some compatible projects such as [DuckDB](https://duckdb.org/docs/extensions/substrait), [Apache DataFusion](https://arrow.apache.org/datafusion), and Apache Arrow's [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html)!"
220+
]
221+
}
222+
],
223+
"metadata": {
224+
"kernelspec": {
225+
"display_name": "Python 3 (ipykernel)",
226+
"language": "python",
227+
"name": "python3"
228+
},
229+
"language_info": {
230+
"codemirror_mode": {
231+
"name": "ipython",
232+
"version": 3
233+
},
234+
"file_extension": ".py",
235+
"mimetype": "text/x-python",
236+
"name": "python",
237+
"nbconvert_exporter": "python",
238+
"pygments_lexer": "ipython3",
239+
"version": "3.10.10"
240+
}
241+
},
242+
"nbformat": 4,
243+
"nbformat_minor": 2
244+
}
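
The connector-count argument in the notebook's "Why Substrait?" cell can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names `pairwise_connectors` and `ir_integrations` are hypothetical and belong to neither Ibis nor Substrait. It compares the number of integrations needed when every frontend talks to every backend directly with the number needed when both sides target a shared IR.

```python
def pairwise_connectors(frontends: int, backends: int) -> int:
    """Connectors needed when each frontend integrates with each backend directly."""
    return frontends * backends


def ir_integrations(frontends: int, backends: int) -> int:
    """Integrations needed when every frontend emits, and every backend consumes, one shared IR."""
    return frontends + backends


# Compare the two approaches for a few ecosystem sizes.
for n in (2, 5, 10):
    print(
        f"{n} frontends x {n} backends: "
        f"{pairwise_connectors(n, n)} pairwise connectors vs "
        f"{ir_integrations(n, n)} IR integrations"
    )
```

The gap widens as the ecosystem grows, which is the motivation the notebook gives for an intermediate representation.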

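The notebook's setup cell notes that, in practice, the serialized plan would travel to a database server over the network rather than to a local object. A minimal sketch of that hand-off follows, assuming a simple length-prefixed framing over a socket. The framing scheme, the helper names (`send_plan`, `recv_plan`, `recv_exact`), and the placeholder payload are all assumptions for illustration; the notebook itself does not define a wire protocol, and the placeholder bytes are not a real Substrait plan.

```python
import socket
import struct


def send_plan(sock: socket.socket, plan_bytes: bytes) -> None:
    """Send a 4-byte big-endian length header followed by the payload."""
    sock.sendall(struct.pack("!I", len(plan_bytes)) + plan_bytes)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_plan(sock: socket.socket) -> bytes:
    """Read the length header, then the payload it announces."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# A connected socket pair stands in for a real client/server connection.
client, server = socket.socketpair()

# Placeholder payload; in the notebook this would be the output of
# substrait_plan.SerializeToString().
placeholder_plan_bytes = b"\x0a\x03abc"

send_plan(client, placeholder_plan_bytes)
received = recv_plan(server)

client.close()
server.close()
```

On the receiving side, the bytes would be handed to something like the notebook's `DatabaseServer.execute`, which passes them to DuckDB's `from_substrait`.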
requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
11
notebook
22
jupyterlab == 3.4.8
33
ibis-framework[sqlite,duckdb,clickhouse]@git+https://github.com/ibis-project/ibis.git
4+
ibis-substrait
