---
title: "Mastering DuckDB when you're used to pandas or Polars"
published: February 3, 2025
authors: [marco-gorelli]
description: "It's not as scary as you think"
category: [PyData ecosystem]
featuredImage:
  src: /posts/duckdb-when-used-to-frames/featured.jpg
  alt: 'Photo by <a href="https://unsplash.com/@rthiemann?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Robert Thiemann</a> on <a href="https://unsplash.com/photos/brown-and-green-mallard-duck-on-water--ZSnI9gSX1Y?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>'
hero:
  imageSrc: /posts/duckdb-when-used-to-frames/hero.jpg
  imageAlt: 'Photo by <a href="https://unsplash.com/@rthiemann?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Robert Thiemann</a> on <a href="https://unsplash.com/photos/brown-and-green-mallard-duck-on-water--ZSnI9gSX1Y?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>'

---

# Mastering DuckDB when you're used to pandas or Polars

You may have heard about DuckDB's impressive robustness and performance. Perhaps you want to try it out - BUT WAIT, you're a data scientist and are used to pandas and/or Polars, not SQL. You can use the `SELECT`, `JOIN` and `GROUP BY` commands, but not much more, and you may be wondering: is it even possible to use SQL to:

- Center a variable (i.e. subtract its mean)?
- Resample by time?
- Compute rolling statistics?

Not only are these all possible, but they're also easy. Let's learn how to implement dataframe fundamentals in SQL!

## But first - why?

Why use DuckDB / SQL at all? Aren't dataframe APIs more readable and expressive anyway? Arguably, yes. Nonetheless, I think there are some very good reasons to implement a DuckDB SQL solution if you're able to:

- Stability: dataframe APIs tend to go through deprecation cycles to make API improvements. If you write a dataframe solution today, it's unlikely that it will still work 5 years from now. A SQL one, on the other hand, probably will.
- Portability: SQL standards exist, and although implementations differ, migrating between SQL dialects is probably less painful than migrating between dataframe APIs.
- Ubiquity: analysts, engineers, and data scientists across industries are all likely familiar with SQL. They may not all rank it as their favourite language, but they can probably all read it, especially with the help of an LLM.
- Robustness: extensive SQL testing frameworks, such as [sqllogictest](https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki), have already been developed, and so DuckDB can test against them to guard against buggy query results.

Furthermore, although classic SQL tends to have some annoying rules (such as "no comma after the last expression in SELECT!"), DuckDB has innovated on the syntax side with its [Friendly SQL](https://duckdb.org/docs/sql/dialect/friendly_sql.html).
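
For example, here's a toy query of my own (not taken from the DuckDB docs) showing a few of those conveniences: a trailing comma at the end of the `SELECT` list, plus `GROUP BY ALL` and `ORDER BY ALL` so you don't have to repeat expressions:

```python
import duckdb

duckdb.sql(
    """
    SELECT
        range % 2 AS parity,
        COUNT(*) AS n,
    FROM range(10)
    GROUP BY ALL
    ORDER BY ALL
    """
)
```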

Let's now look at translating common dataframe tasks into SQL.

## Subtracting the mean

Subtracting the mean, also known as "centering", is a common data science technique performed before fitting classical regression models. In pandas or Polars, it's trivial:

```python
data = {"a": [1, 3, -1, 8]}

# pandas
import pandas as pd

df_pd = pd.DataFrame(data)
df_pd["a_centered"] = df_pd["a"] - df_pd["a"].mean()

# Polars
import polars as pl

df_pl = pl.DataFrame(data)
df_pl.with_columns(a_centered=pl.col("a") - pl.col("a").mean())
```
```
shape: (4, 2)
┌─────┬────────────┐
│ a   ┆ a_centered │
│ --- ┆ ---        │
│ i64 ┆ f64        │
╞═════╪════════════╡
│ 1   ┆ -1.75      │
│ 3   ┆ 0.25       │
│ -1  ┆ -3.75      │
│ 8   ┆ 5.25       │
└─────┴────────────┘
```

If you naively try translating to SQL, however, you'll get an error:
```python
import duckdb

duckdb.sql(
    """
    SELECT
        *,
        a - MEAN(a) AS a_centered
    FROM df_pl
    """
)
```
```
BinderException: Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.
```
SQL does not let us mix row-level columns with aggregates like this. To do so, we need to use a [window function](https://en.wikipedia.org/wiki/Window_function_(SQL)), which is a kind of function that produces a value for each row. Here we want the mean of column `'a'` over the whole table, so we use an empty `OVER ()` clause:

```python
duckdb.sql(
    """
    SELECT
        *,
        a - MEAN(a) OVER () AS a_centered
    FROM df_pl
    """
)
```

```
┌───────┬────────────┐
│   a   │ a_centered │
│ int64 │   double   │
├───────┼────────────┤
│     1 │      -1.75 │
│     3 │       0.25 │
│    -1 │      -3.75 │
│     8 │       5.25 │
└───────┴────────────┘
```
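
As an aside, the `OVER` clause needn't stay empty: to center within groups, you can partition by a grouping column. Here's a minimal sketch, using a made-up grouping column `g` that isn't part of the original data:

```python
import duckdb
import polars as pl

# Hypothetical data with a grouping column, purely to illustrate PARTITION BY.
df_groups = pl.DataFrame({"g": ["x", "x", "y", "y"], "a": [1, 3, -1, 8]})

duckdb.sql(
    """
    SELECT
        *,
        -- Subtract each group's own mean rather than the global mean.
        a - MEAN(a) OVER (PARTITION BY g) AS a_centered
    FROM df_groups
    """
)
```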

## Resampling: weekly average

Say we have unevenly spaced temporal data, such as:

```python
from datetime import datetime

dates = [
    datetime(2025, 1, 1),  # Wednesday
    datetime(2025, 1, 7),  # Tuesday
    datetime(2025, 1, 8),  # Wednesday
    datetime(2025, 1, 9),  # Thursday
    datetime(2025, 1, 16),  # Thursday
    datetime(2025, 1, 17),  # Friday
]
sales = [1, 5, 0, 4, 3, 6]
data = {"date": dates, "sales": sales}
```

We need to find the average weekly sales, where a week is defined as Wednesday to Tuesday. In pandas we'd use `resample`, in Polars `group_by_dynamic`:

```python
# pandas
import pandas as pd

df_pd = pd.DataFrame(data)
df_pd.resample("1W-Wed", on="date", closed="left", label="left")["sales"].mean()

# Polars
import polars as pl

df_pl = pl.DataFrame(data)
(
    df_pl.group_by_dynamic(
        pl.col("date").alias("week_start"), every="1w", start_by="wednesday"
    ).agg(pl.col("sales").mean())
)
```
```
shape: (3, 2)
┌─────────────────────┬───────┐
│ week_start          ┆ sales │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ f64   │
╞═════════════════════╪═══════╡
│ 2025-01-01 00:00:00 ┆ 3.0   │
│ 2025-01-08 00:00:00 ┆ 2.0   │
│ 2025-01-15 00:00:00 ┆ 4.5   │
└─────────────────────┴───────┘
```

Replicating this in DuckDB is not rocket science, but it does involve a preprocessing step:

- We use `DATE_TRUNC('week', date)` to truncate each date to the Monday at the start of its Monday-Sunday week.
- To get our weeks to start on Wednesday, we first subtract 2 days, then truncate, and then add the 2 days back: `DATE_TRUNC('week', date - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS week_start` (see the sanity check below).
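
To convince ourselves that the shift-truncate-shift trick does what we want, here's a quick sanity check of my own on two literal dates: Tuesday 2025-01-07 should map to the week starting Wednesday 2025-01-01, while Wednesday 2025-01-08 should map to itself.

```python
import duckdb

duckdb.sql(
    """
    SELECT
        d,
        -- Shift back 2 days, truncate to Monday, then shift forward 2 days.
        DATE_TRUNC('week', d - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS week_start
    FROM (VALUES (DATE '2025-01-07'), (DATE '2025-01-08')) t(d)
    """
)
```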

Putting it all together:

```python
import duckdb

duckdb.sql(
    """
    SELECT
        DATE_TRUNC('week', date - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS week_start,
        AVG(sales) AS sales
    FROM df_pl
    GROUP BY week_start
    ORDER BY week_start
    """
)
```
```
┌─────────────────────┬────────┐
│     week_start      │ sales  │
│      timestamp      │ double │
├─────────────────────┼────────┤
│ 2025-01-01 00:00:00 │    3.0 │
│ 2025-01-08 00:00:00 │    2.0 │
│ 2025-01-15 00:00:00 │    4.5 │
└─────────────────────┴────────┘
```

> **_NOTE:_** In general, we recommend saving `ORDER BY` for as late as possible in your queries, and until that point, not making any assumptions about the physical ordering of your data. You'll see in the next section how to work around physical-ordering assumptions when performing order-dependent operations.

## Rolling ~~and tumbling~~ statistics

If you work in finance, then rolling means are probably your bread and butter. For example, with data:

```python
from datetime import datetime

dates = [
    datetime(2025, 1, 1),
    datetime(2025, 1, 2),
    datetime(2025, 1, 3),
    datetime(2025, 1, 4),
    datetime(2025, 1, 5),
    datetime(2025, 1, 7),
]
sales = [2.0, 4.6, 1.32, 1.11, 9, 8]
data = {"date": dates, "sales": sales}
```

you may want to smooth out `'sales'` by taking a rolling average over the last three data points. With dataframes, it's easy:

```python
# pandas
import pandas as pd

df_pd = pd.DataFrame(data)
df_pd["sales_smoothed"] = df_pd["sales"].rolling(3).mean()

# Polars
import polars as pl

df_pl = pl.DataFrame(data)
df_pl.with_columns(sales_smoothed=pl.col("sales").rolling_mean(3))
```
```
shape: (6, 3)
┌─────────────────────┬───────┬────────────────┐
│ date                ┆ sales ┆ sales_smoothed │
│ ---                 ┆ ---   ┆ ---            │
│ datetime[μs]        ┆ f64   ┆ f64            │
╞═════════════════════╪═══════╪════════════════╡
│ 2025-01-01 00:00:00 ┆ 2.0   ┆ null           │
│ 2025-01-02 00:00:00 ┆ 4.6   ┆ null           │
│ 2025-01-03 00:00:00 ┆ 1.32  ┆ 2.64           │
│ 2025-01-04 00:00:00 ┆ 1.11  ┆ 2.343333       │
│ 2025-01-05 00:00:00 ┆ 9.0   ┆ 3.81           │
│ 2025-01-07 00:00:00 ┆ 8.0   ┆ 6.036667       │
└─────────────────────┴───────┴────────────────┘
```

We're relying on our data being sorted by `'date'`. In pandas / Polars, we often know that our data is ordered in a particular way, and that order is usually preserved across operations, so calculating a rolling mean under ordering assumptions is fine. For SQL engines, however, row order is typically undefined, although there are some limited cases where DuckDB promises to maintain it. The solution is to specify `ORDER BY` inside the window function - this tells the engine which column(s) determine the order in which to compute the rolling mean:

```python
import duckdb

duckdb.sql(
    """
    SELECT
        *,
        MEAN(sales) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sales_smoothed
    FROM df_pl
    """
)
```
```
┌─────────────────────┬────────┬────────────────────┐
│        date         │ sales  │   sales_smoothed   │
│      timestamp      │ double │       double       │
├─────────────────────┼────────┼────────────────────┤
│ 2025-01-01 00:00:00 │    2.0 │                2.0 │
│ 2025-01-02 00:00:00 │    4.6 │                3.3 │
│ 2025-01-03 00:00:00 │   1.32 │               2.64 │
│ 2025-01-04 00:00:00 │   1.11 │ 2.3433333333333333 │
│ 2025-01-05 00:00:00 │    9.0 │               3.81 │
│ 2025-01-07 00:00:00 │    8.0 │  6.036666666666666 │
└─────────────────────┴────────┴────────────────────┘
```

This gets us close to the pandas / Polars output, but it's not identical - notice how the first two rows are null in the dataframe case, but non-null in the SQL case! This is because the dataframe solution only computes the mean when there are at least `window_size` (in this case, 3) observations per window, whereas the DuckDB query computes the mean for every window. We can remedy this with a `CASE` expression (and a named window, for readability):

```python
import duckdb

duckdb.sql(
    """
    SELECT
        *,
        CASE WHEN (COUNT(sales) OVER w) >= 3
            THEN MEAN(sales) OVER w
            ELSE NULL
        END AS sales_smoothed
    FROM df_pl
    WINDOW w AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    """
)
```
```
┌─────────────────────┬────────┬────────────────────┐
│        date         │ sales  │   sales_smoothed   │
│      timestamp      │ double │       double       │
├─────────────────────┼────────┼────────────────────┤
│ 2025-01-01 00:00:00 │    2.0 │               NULL │
│ 2025-01-02 00:00:00 │    4.6 │               NULL │
│ 2025-01-03 00:00:00 │   1.32 │               2.64 │
│ 2025-01-04 00:00:00 │   1.11 │ 2.3433333333333333 │
│ 2025-01-05 00:00:00 │    9.0 │               3.81 │
│ 2025-01-07 00:00:00 │    8.0 │  6.036666666666666 │
└─────────────────────┴────────┴────────────────────┘
```

Now it matches the pandas / Polars output exactly!

## What if you don't like SQL?

If you want to use DuckDB as an engine but prefer Python APIs, some available options are:

- [SQLFrame](https://github.com/eakmanrq/sqlframe): transpiles the PySpark API to different backends, including DuckDB.
- [DuckDB's Python Relational API](https://duckdb.org/docs/api/python/relational_api.html): very strict and robust, though the documentation is quite scant. In particular, window expressions are not yet supported (but they are on the roadmap!).
- [Narwhals](https://github.com/narwhals-dev/narwhals): transpiles the Polars API to different backends. For DuckDB it uses DuckDB's Python Relational API, and so it also does not yet support window expressions.
- [Ibis](https://ibis-project.org/): transpiles its own API to different backends, with DuckDB as its default (see the sketch after this list).
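
For a taste of what that can look like, here's a rough sketch of the earlier centering example in Ibis, which runs on DuckDB by default (check the Ibis docs for the authoritative API):

```python
import ibis

# Build an in-memory table and center column 'a'; Ibis compiles this to SQL
# and executes it on its default backend, DuckDB.
t = ibis.memtable({"a": [1, 3, -1, 8]})
expr = t.mutate(a_centered=t.a - t.a.mean())
print(expr.execute())  # materialises the result as a pandas DataFrame
```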

What's more, DuckDB allows you to write queries against in-memory pandas and Polars dataframes - as we've been doing with `df_pl` throughout this post. There's nothing wrong with mixing and matching tools - in fact, that'll probably take you further than swearing by a single tool and trying to do everything with it.
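
For example, here's a minimal sketch of the round trip: query an in-memory pandas DataFrame by its variable name, then pull the result back as pandas or Polars:

```python
import duckdb
import pandas as pd

df_pd = pd.DataFrame({"a": [1, 3, -1, 8]})

# DuckDB's replacement scans let us reference `df_pd` directly by name.
result = duckdb.sql("SELECT a, a - MEAN(a) OVER () AS a_centered FROM df_pd")
print(result.df())  # materialise as a pandas DataFrame
print(result.pl())  # ... or as a Polars DataFrame
```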

## Conclusion

We've learned how to translate some common dataframe operations to SQL so that we can port them over to DuckDB. We looked at centering, resampling, and rolling statistics. Porting to SQL / DuckDB may be desirable if you would like to use the DuckDB engine, if your client and/or team prefer SQL to dataframe APIs, or if you would like to have a robust and mostly standardised solution which is unlikely to break in the future.

If you would like help implementing solutions with any of the tools covered in this post or would like to sponsor efforts toward dataframe API unification, [we can help](https://quansight.com/about-us/#bookacallform)!