File: daft/01_what_makes_daft_special.py
Lines changed: 30 additions & 35 deletions
@@ -8,37 +8,33 @@

 import marimo

-__generated_with = "0.13.6"
+__generated_with = "0.18.4"
 app = marimo.App(width="medium")


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     # What Makes Daft Special?

     > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.

     Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
-    """
-    )
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🎯 Introducing Daft: A Unified Data Engine

     Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.

     The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.

     Let's go ahead and `pip install daft` to see it in action!
-    """
-    )
+    """)
     return


@@ -86,8 +82,7 @@ def _(mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🦀 Built with Rust: Performance and Simplicity

     One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:
@@ -97,8 +92,7 @@ def _(mo):
     * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.

     Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
-    """
-    )
+    """)
     return


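
The cell above mentions zero-copy data sharing. As a rough illustration of the general idea only, the sketch below uses Python's standard-library `array` and `memoryview` (not Arrow, and not Daft's actual exchange mechanism) to contrast a view that shares a buffer with a copy that duplicates it. All variable names are illustrative.

```python
import array

# A contiguous buffer of 64-bit integers, standing in for one column of data.
column = array.array("q", range(5))

# Zero-copy: the memoryview exposes the same underlying buffer.
view = memoryview(column)

# Copy: a second array with its own, duplicated storage.
copied = array.array("q", column)

# Mutating the original is visible through the view but not through the copy.
column[0] = 99
view_first = view[0]
copied_first = copied[0]

# Slicing a memoryview is also zero-copy: it is a window onto the same buffer.
window = view[1:4].tolist()

print(view_first, copied_first, window)
```

The view observes the mutation because no data was duplicated, which is the property that makes handing large columns between runtimes cheap.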
@@ -118,7 +112,9 @@ def _(mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!""")
+    mo.md(r"""
+    A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!
+    """)
     return


@@ -135,7 +131,9 @@ def _(daft):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:""")
+    mo.md(r"""
+    With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:
-    mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""")
+    mo.md(r"""
+    This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.
+    """)
     return


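
To make the lazy-execution idea in the cells above concrete, here is a minimal standard-library sketch. `LazyFrame` is a hypothetical toy class, not Daft's API or implementation: its `with_column` and `filter` methods only record steps in a plan, `explain` prints that plan, and nothing is read or computed until `collect` is called. There is no optimizer in this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LazyFrame:
    """Toy lazy frame (hypothetical): records steps, runs them on collect()."""

    source: Callable[[], Iterable[dict]]
    plan: tuple = ()  # each step is (kind, label, function)

    def with_column(self, name, fn):
        # Record the step; do not touch any data yet.
        step = ("with_column", name, lambda row: {**row, name: fn(row)})
        return LazyFrame(self.source, self.plan + (step,))

    def filter(self, label, predicate):
        step = ("filter", label, predicate)
        return LazyFrame(self.source, self.plan + (step,))

    def explain(self) -> str:
        # A readable view of the recorded plan.
        return "\n".join(f"{kind}: {label}" for kind, label, _ in self.plan)

    def collect(self) -> list:
        # Only here is the source read and the plan applied, row by row.
        rows = iter(self.source())
        for kind, _, fn in self.plan:
            rows = map(fn, rows) if kind == "with_column" else filter(fn, rows)
        return list(rows)


reads = []  # records every time the source is actually read


def source():
    reads.append("read")
    return ({"x": i} for i in range(1_000))


lf = (
    LazyFrame(source)
    .with_column("double", lambda r: r["x"] * 2)
    .filter("double < 6", lambda r: r["double"] < 6)
)

reads_before_collect = list(reads)  # still empty: only a plan exists so far
plan_text = lf.explain()
result = lf.collect()  # execution happens here

print(plan_text)
print(result)
```

Defining the pipeline costs nothing regardless of how large the source is; the cost is paid only when results are requested.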
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🌐 Scale Your Work: From Laptop to Cluster

     Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:
@@ -163,24 +162,21 @@ def _(mo):
     * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines.

     This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
-    """
-    )
+    """)
     return


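
The "write once, scale anywhere" idea in the cell above can be sketched with the standard library alone. This is an analogy under stated assumptions, not how Daft or Ray work internally: `pipeline`, `run_local`, and `run_parallel` are hypothetical names, and a thread pool stands in for a cluster of workers. The point is that the processing logic is written once and only the runner changes.

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline(partition):
    """The processing logic, written once, unaware of where it runs."""
    return [x * x for x in partition if x % 2 == 0]


def run_local(partitions):
    # Sequential runner: one partition after another on the current thread.
    return [pipeline(p) for p in partitions]


def run_parallel(partitions, workers=4):
    # Parallel runner: the same pipeline, fanned out over a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(pipeline, partitions))


# Four partitions of five consecutive integers each.
partitions = [list(range(i, i + 5)) for i in range(0, 20, 5)]

local = run_local(partitions)
parallel = run_parallel(partitions)

print(local)
print(local == parallel)
```

Because `pipeline` never refers to the runner, swapping `run_local` for `run_parallel` requires no change to the logic itself.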
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🖼️ Handling More Than Just Tables: Multimodal Data Support

     Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.

     Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.

     As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
-    """
-    )
+    """)
     return


@@ -217,20 +213,23 @@ def _(daft):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""")
+    mo.md(r"""
+    > Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""")
+    mo.md(r"""
+    In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces

     Daft aims to be developer-friendly by offering flexible ways to interact with your data:
@@ -239,8 +238,7 @@ def _(mo):
     * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system.

     This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
-    """
-    )
+    """)
     return


@@ -285,8 +283,7 @@ def _(daft):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🟣 Daft's Value Proposition

     So, what makes Daft special? It's the combination of these design choices:
@@ -299,16 +296,14 @@ def _(mo):
     These elements combine to make Daft a versatile tool for tackling modern data challenges.

     And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows 🚀.