|
1 | 1 | # DataFusion-DuckLake |
2 | 2 |
|
3 | | -**This is an early pre-release, that is very much so a work in progress.** |
| 3 | +**This is an early pre-release and very much a work in progress.** |
4 | 4 |
|
5 | 5 | A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage. |
6 | 6 |
|
| 7 | +The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion. |
| 8 | + |
| 9 | +--- |
| 10 | + |
7 | 11 | ## Currently Supported |
8 | 12 |
|
9 | 13 | - Read-only queries against DuckLake catalogs |
10 | 14 | - DuckDB catalog backend |
11 | 15 | - Local filesystem and S3-compatible object stores (MinIO, S3) |
12 | 16 | - Snapshot-based consistency |
13 | 17 | - Basic and decimal types |
14 | | -- Hierarchical path resolution (data_path, schema, table, file) |
15 | | -- Delete files for row-level deletion (MOR - Merge-On-Read) |
| 18 | +- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`) |
| 19 | +- Delete files for row-level deletion (MOR – Merge-On-Read) |
16 | 20 | - Parquet footer size hints for optimized I/O |
17 | 21 | - Filter pushdown to Parquet for row group pruning and page-level filtering |
18 | 22 | - Dynamic metadata lookup (no upfront catalog caching) |
19 | 23 | - SQL-queryable `information_schema` for catalog metadata (snapshots, schemas, tables, columns, files) |
20 | 24 | - DuckDB-style table functions: `ducklake_snapshots()`, `ducklake_table_info()`, `ducklake_list_files()` |
21 | 25 |
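The hierarchical path resolution listed above can be illustrated with a small, standard-library-only sketch. It is not this crate's implementation: the `Level` type and the `join` and `resolve` helpers are hypothetical names introduced for illustration. It assumes each level (schema, table, file) stores a path plus a flag saying whether that path is relative to its parent, with `data_path` as the root.

```rust
/// One level of the hierarchy (schema, table, or file).
/// Hypothetical type for illustration; not part of this crate's API.
struct Level<'a> {
    path: &'a str,
    /// Assumption: each level records whether its path is relative to its parent.
    is_relative: bool,
}

/// Join two path fragments with exactly one `/` between them.
/// Plain string handling is used so object-store URLs such as `s3://...` survive intact.
fn join(base: &str, child: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        child.trim_start_matches('/')
    )
}

/// Walk data_path -> schema -> table -> file.
/// A relative level is appended to the current base; an absolute level replaces it.
fn resolve(data_path: &str, levels: &[Level<'_>]) -> String {
    levels.iter().fold(data_path.to_string(), |base, level| {
        if level.is_relative {
            join(&base, level.path)
        } else {
            level.path.to_string()
        }
    })
}

fn main() {
    let levels = [
        Level { path: "main/", is_relative: true },
        Level { path: "my_table/", is_relative: true },
        Level { path: "part-0.parquet", is_relative: true },
    ];
    println!("{}", resolve("s3://bucket/lake/", &levels));
}
```

Because an absolute path at any level replaces everything resolved so far, a single table can live outside the catalog's `data_path` in this sketch without affecting its siblings.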
|
| 26 | +--- |
| 27 | + |
22 | 28 | ## Known Limitations |
23 | 29 |
|
24 | 30 | - Complex types (nested lists, structs, maps) have minimal support |
25 | 31 | - No write operations |
26 | | -- No filter-based file pruning (partition pruning not yet implemented) |
| 32 | +- No partition-based file pruning |
27 | 33 | - Single metadata provider implementation (DuckDB only) |
| 34 | +- No time travel support |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## Roadmap |
| 39 | + |
| 40 | +This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository. |
| 41 | + |
| 42 | +### Metadata & Catalog Improvements |
| 43 | + |
| 44 | +- Metadata caching to reduce repeated catalog lookups |
| 45 | +- Pluggable metadata providers beyond DuckDB: |
| 46 | + - PostgreSQL |
| 47 | + - SQLite |
| 48 | + - MySQL |
| 49 | +- Clear abstraction boundaries between catalog, metadata provider, and execution |
| 50 | + |
| 51 | +### Query Planning & Performance |
| 52 | + |
| 53 | +- Partition-aware file pruning |
| 54 | +- Improved predicate pushdown |
| 55 | +- Smarter Parquet I/O planning |
| 56 | +- Reduced metadata round-trips during planning |
| 57 | +- Better alignment with DataFusion optimizer rules |
| 58 | + |
| 59 | +### Write Support |
| 60 | + |
| 61 | +- Initial write support for DuckLake tables |
| 62 | + |
| 63 | +### Time Travel & Versioning |
28 | 64 |
|
29 | | -## TODO |
30 | | -- [ ] Support caching metadata |
31 | | -- [ ] Support alternative metadata databases |
32 | | - - [ ] postgres |
33 | | - - [ ] sqlite |
34 | | - - [ ] mysql |
35 | | -- [ ] Writes |
36 | | -- [ ] Timetravel |
| 65 | +- Querying historical snapshots |
| 66 | +- Explicit snapshot selection |
| 67 | + |
| 68 | +### Type System Expansion |
| 69 | + |
| 70 | +- Improved support for complex and nested types |
| 71 | +- Better alignment with DuckDB and DataFusion type semantics |
| 72 | + |
| 73 | +### Stability & Ergonomics |
| 74 | + |
| 75 | +- Expanded test coverage |
| 76 | +- Improved error messages and diagnostics |
| 77 | +- Cleaner APIs for embedding in other DataFusion-based systems |
| 78 | +- Additional documentation and examples |
| 79 | + |
| 80 | +--- |
37 | 81 |
|
38 | 82 | ## Usage |
| 83 | + |
39 | 84 | ### Example |
| 85 | + |
40 | 86 | ```bash |
41 | 87 | cargo run --example basic_query -- <catalog.db> <sql> |
| 88 | + |
42 | 89 | ``` |
43 | 90 |
|
44 | 91 | ### Integration |
45 | | - |
46 | 92 | ```rust |
47 | 93 | use datafusion::execution::runtime_env::RuntimeEnv; |
48 | 94 | use datafusion::prelude::*; |
@@ -81,4 +127,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog)); |
81 | 127 | // Query |
82 | 128 | let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?; |
83 | 129 | df.show().await?; |
| 130 | + |
| 131 | + |
84 | 132 | ``` |
| 133 | +### Project Status |
| 134 | + |
| 135 | +This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined. |
| 136 | + |
| 137 | +Feedback, issues, and contributions are welcome. |