Merge pull request #12 from hotdata-dev/eddietejeda-patch-1

eddietejeda · web-flow · commit 32c7e69febda · 2025-12-17T10:45:44.000-08:00
Refine README content and add roadmap section
diff --git a/README.md b/README.md
@@ -1,48 +1,94 @@
 # DataFusion-DuckLake
 
-**This is an early pre-release, that is very much so a work in progress.**
+**This is an early pre-release and very much a work in progress.**
 
 A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.
 
+The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
+
+---
+
 ## Currently Supported
 
 - Read-only queries against DuckLake catalogs
 - DuckDB catalog backend
 - Local filesystem and S3-compatible object stores (MinIO, S3)
 - Snapshot-based consistency
 - Basic and decimal types
-- Hierarchical path resolution (data_path, schema, table, file)
-- Delete files for row-level deletion (MOR - Merge-On-Read)
+- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
+- Delete files for row-level deletion (MOR – Merge-On-Read)
 - Parquet footer size hints for optimized I/O
 - Filter pushdown to Parquet for row group pruning and page-level filtering
 - Dynamic metadata lookup (no upfront catalog caching)
 - SQL-queryable `information_schema` for catalog metadata (snapshots, schemas, tables, columns, files)
 - DuckDB-style table functions: `ducklake_snapshots()`, `ducklake_table_info()`, `ducklake_list_files()`
 
+---
+
 ## Known Limitations
 
 - Complex types (nested lists, structs, maps) have minimal support
 - No write operations
-- No filter-based file pruning (partition pruning not yet implemented)
+- No partition-based file pruning
 - Single metadata provider implementation (DuckDB only)
+- No time travel support
+
+---
+
+## Roadmap
+
+This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.
+
+### Metadata & Catalog Improvements
+
+- Metadata caching to reduce repeated catalog lookups
+- Pluggable metadata providers beyond DuckDB:
+  - PostgreSQL
+  - SQLite
+  - MySQL
+- Clear abstraction boundaries between catalog, metadata provider, and execution
+
+### Query Planning & Performance
+
+- Partition-aware file pruning
+- Improved predicate pushdown
+- Smarter Parquet I/O planning
+- Reduced metadata round-trips during planning
+- Better alignment with DataFusion optimizer rules
+
+### Write Support
+
+- Initial write support for DuckLake tables
+
+### Time Travel & Versioning
 
-## TODO
-- [ ] Support caching metadata
-- [ ] Support alternative metadata databases
-  - [ ] postgres
-  - [ ] sqlite
-  - [ ] mysql
-- [ ] Writes
-- [ ] Timetravel
+- Querying historical snapshots
+- Explicit snapshot selection
+
+### Type System Expansion
+
+- Improved support for complex and nested types
+- Better alignment with DuckDB and DataFusion type semantics
+
+### Stability & Ergonomics
+
+- Expanded test coverage
+- Improved error messages and diagnostics
+- Cleaner APIs for embedding in other DataFusion-based systems
+- Additional documentation and examples
+
+---
 
 ## Usage
+
 ### Example
+
 ```bash
 cargo run --example basic_query -- <catalog.db> <sql>
+
 ```
 
 ### Integration
-
 ```rust
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
@@ -81,4 +127,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog));
 // Query
 let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
 df.show().await?;
+
+
 ```
+### Project Status
+
+This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.
+
+Feedback, issues, and contributions are welcome.