
Commit 32c7e69

Merge pull request #12 from hotdata-dev/eddietejeda-patch-1
Refine README content and add roadmap section
2 parents 0b06715 + 1033ac8 commit 32c7e69

File tree

1 file changed: +66, -13 lines


README.md

Lines changed: 66 additions & 13 deletions
````diff
@@ -1,48 +1,94 @@
 # DataFusion-DuckLake
 
-**This is an early pre-release, that is very much so a work in progress.**
+**This is an early pre-release and very much a work in progress.**
 
 A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.
 
+The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
+
+---
+
 ## Currently Supported
 
 - Read-only queries against DuckLake catalogs
 - DuckDB catalog backend
 - Local filesystem and S3-compatible object stores (MinIO, S3)
 - Snapshot-based consistency
 - Basic and decimal types
-- Hierarchical path resolution (data_path, schema, table, file)
-- Delete files for row-level deletion (MOR - Merge-On-Read)
+- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
+- Delete files for row-level deletion (MOR Merge-On-Read)
 - Parquet footer size hints for optimized I/O
 - Filter pushdown to Parquet for row group pruning and page-level filtering
 - Dynamic metadata lookup (no upfront catalog caching)
 - SQL-queryable `information_schema` for catalog metadata (snapshots, schemas, tables, columns, files)
 - DuckDB-style table functions: `ducklake_snapshots()`, `ducklake_table_info()`, `ducklake_list_files()`
 
+---
+
 ## Known Limitations
 
 - Complex types (nested lists, structs, maps) have minimal support
 - No write operations
-- No filter-based file pruning (partition pruning not yet implemented)
+- No partition-based file pruning
 - Single metadata provider implementation (DuckDB only)
+- No time travel support
+
+---
+
+## Roadmap
+
+This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.
+
+### Metadata & Catalog Improvements
+
+- Metadata caching to reduce repeated catalog lookups
+- Pluggable metadata providers beyond DuckDB:
+  - PostgreSQL
+  - SQLite
+  - MySQL
+- Clear abstraction boundaries between catalog, metadata provider, and execution
+
+### Query Planning & Performance
+
+- Partition-aware file pruning
+- Improved predicate pushdown
+- Smarter Parquet I/O planning
+- Reduced metadata round-trips during planning
+- Better alignment with DataFusion optimizer rules
+
+### Write Support
+
+- Initial write support for DuckLake tables
+
+### Time Travel & Versioning
 
-## TODO
-- [ ] Support caching metadata
-- [ ] Support alternative metadata databases
-  - [ ] postgres
-  - [ ] sqlite
-  - [ ] mysql
-- [ ] Writes
-- [ ] Timetravel
+- Querying historical snapshots
+- Explicit snapshot selection
+
+### Type System Expansion
+
+- Improved support for complex and nested types
+- Better alignment with DuckDB and DataFusion type semantics
+
+### Stability & Ergonomics
+
+- Expanded test coverage
+- Improved error messages and diagnostics
+- Cleaner APIs for embedding in other DataFusion-based systems
+- Additional documentation and examples
+
+---
 
 ## Usage
+
 ### Example
+
 ```bash
 cargo run --example basic_query -- <catalog.db> <sql>
+
 ```
 
 ### Integration
-
 ```rust
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
````
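As a point of reference for the `information_schema` and `ducklake_*` table-function features listed in the hunk above: the sketch below shows only the generic DataFusion side of making catalog metadata SQL-queryable. It is a minimal, hypothetical example, not the extension's API; how the DuckLake-specific metadata tables and the `ducklake_snapshots()` / `ducklake_table_info()` / `ducklake_list_files()` functions are registered is not shown in this diff.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Enable DataFusion's built-in information_schema so catalog metadata
    // (catalogs, schemas, tables, columns) can be inspected with plain SQL.
    let config = SessionConfig::new().with_information_schema(true);
    let ctx = SessionContext::new_with_config(config);

    // Lists every table visible to the context; with a DuckLake catalog
    // registered, its schemas and tables would appear here as well.
    ctx.sql("SELECT table_catalog, table_schema, table_name FROM information_schema.tables")
        .await?
        .show()
        .await?;

    // With the extension's DuckDB-style table functions registered, metadata
    // queries would presumably look like:
    //   SELECT * FROM ducklake_snapshots();
    Ok(())
}
```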
````diff
@@ -81,4 +127,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog));
 // Query
 let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
 df.show().await?;
+
+
 ```
+### Project Status
+
+This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.
+
+Feedback, issues, and contributions are welcome.
````
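The hunk boundary above skips the lines where the DuckLake catalog object is actually constructed, so the Integration example is only partially visible in this diff. The sketch below fills in the surrounding DataFusion pattern with stand-ins: it registers an in-memory table, aliases the context's default catalog under the name `ducklake`, and runs a catalog-qualified query. Everything here is upstream DataFusion, not the extension's API; the real integration would supply its own `CatalogProvider` and a `main` schema instead of the default `public` one.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Stand-in data; in the real extension the table would be backed by
    // DuckLake's Parquet files, resolved through the catalog metadata.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema, vec![column])?;
    ctx.register_batch("my_table", batch)?;

    // Same registration pattern as the README's Integration example: expose a
    // CatalogProvider under the name "ducklake". Here the default catalog is
    // simply re-registered as a placeholder for the DuckLake catalog.
    let stand_in = ctx.catalog("datafusion").expect("default catalog exists");
    ctx.register_catalog("ducklake", stand_in);

    // Catalog-qualified query; the real extension uses ducklake.main.my_table.
    let df = ctx.sql("SELECT * FROM ducklake.public.my_table").await?;
    df.show().await?;
    Ok(())
}
```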
