2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -31,7 +31,7 @@ jobs:
strategy:
matrix:
os:
- ubuntu-latest
- ubuntu-latest-m
- macos-latest
rust: [stable]
steps:
12 changes: 7 additions & 5 deletions CLAUDE.md
@@ -44,10 +44,11 @@ The codebase follows a layered architecture with clear separation of concerns:

2. **DataFusion Integration Layer** (`src/catalog.rs`, `src/schema.rs`, `src/table.rs`)
- Bridges DuckLake concepts to DataFusion's catalog system
- `DuckLakeCatalog`: Implements `CatalogProvider`, uses dynamic metadata lookup (queries on every call to `schema()` and `schema_names()`)
- `DuckLakeCatalog`: Implements `CatalogProvider`, uses dynamic metadata lookup with configurable snapshot resolution
- `DuckLakeSchema`: Implements `SchemaProvider`, uses dynamic metadata lookup (queries on every call to `table()` and `table_names()`)
- `DuckLakeTable`: Implements `TableProvider`, caches table structure and file lists at creation time
- **No HashMaps**: Catalog and schema providers query metadata on-demand rather than caching
- **Snapshot Resolution**: Configurable TTL (time-to-live) for balancing freshness and performance

3. **Path Resolution** (`src/path_resolver.rs`)
- Centralized utilities for parsing object store URLs and resolving hierarchical paths
@@ -77,7 +78,8 @@ The catalog uses a **pure dynamic lookup** approach with no caching at the catal
- **DuckLakeCatalog** (`catalog.rs`):
- `schema_names()`: Queries `list_schemas()` on every call
- `schema()`: Queries `get_schema_by_name()` on every call
- `new()`: O(1) - only fetches snapshot ID and data_path
- `new()`: O(1) - only fetches data_path
- **Snapshot Resolution**: Configurable via `SnapshotConfig`

- **DuckLakeSchema** (`schema.rs`):
- `table_names()`: Queries `list_tables()` on every call
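
As a rough illustration of the dynamic-lookup pattern described above, the sketch below keeps no schema map and forwards every lookup to the metadata layer. The `MetadataProvider` trait and `FixedMetadata` type here are hypothetical stand-ins rather than the crate's actual API; only the on-demand shape of `schema_names()` and `schema()` follows the description.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the metadata layer (the real crate talks to a
/// read-only DuckDB catalog); only the method names mirror the text above.
trait MetadataProvider: Send + Sync {
    fn list_schemas(&self) -> Vec<String>;
    fn get_schema_by_name(&self, name: &str) -> Option<String>;
}

/// Catalog that keeps no schema HashMap: every lookup goes back to metadata.
struct DynamicCatalog {
    provider: Arc<dyn MetadataProvider>,
}

impl DynamicCatalog {
    fn new(provider: Arc<dyn MetadataProvider>) -> Self {
        Self { provider }
    }

    /// Mirrors `schema_names()`: queries `list_schemas()` on every call.
    fn schema_names(&self) -> Vec<String> {
        self.provider.list_schemas()
    }

    /// Mirrors `schema()`: queries `get_schema_by_name()` on every call,
    /// returning a freshly resolved handle instead of a cached one.
    fn schema(&self, name: &str) -> Option<String> {
        self.provider.get_schema_by_name(name)
    }
}

/// Dummy provider so the sketch compiles and runs on its own.
struct FixedMetadata;

impl MetadataProvider for FixedMetadata {
    fn list_schemas(&self) -> Vec<String> {
        vec!["main".to_string()]
    }
    fn get_schema_by_name(&self, name: &str) -> Option<String> {
        (name == "main").then(|| "main".to_string())
    }
}

fn main() {
    let catalog = DynamicCatalog::new(Arc::new(FixedMetadata));
    assert_eq!(catalog.schema_names(), vec!["main".to_string()]);
    assert!(catalog.schema("main").is_some());
}
```

The same shape applies one level down to `DuckLakeSchema`, whose `table_names()` and `table()` likewise query the metadata provider on every call.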
@@ -92,12 +94,12 @@ The catalog uses a **pure dynamic lookup** approach with no caching at the catal
**Benefits**:
- O(1) memory usage regardless of catalog size
- Fast catalog startup (no upfront schema/table listing)
- Always fresh metadata (no stale cache issues)
- Simple implementation (no cache invalidation logic)
- Configurable freshness vs performance trade-off
- Simple implementation (no complex cache invalidation logic)

**Trade-offs**:
- Small query overhead per metadata lookup (acceptable for read-only DuckDB connections)
- Future optimization: Add optional caching layer via wrapper implementation
- Snapshot resolution adds one SQL query per catalog operation (configurable via TTL)
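
The TTL behaviour referenced in the trade-off above (and in the `SnapshotConfig { ttl_seconds }` options shown in the example further down) can be sketched as follows. This is an assumed illustration of the described semantics, not the crate's internal implementation.

```rust
use std::time::{Duration, Instant};

/// Mirrors the `SnapshotConfig { ttl_seconds }` shape used in the example.
#[derive(Clone, Copy)]
struct SnapshotConfig {
    ttl_seconds: Option<u64>,
}

struct SnapshotCache {
    config: SnapshotConfig,
    cached: Option<(u64, Instant)>, // (snapshot id, time it was fetched)
}

impl SnapshotCache {
    fn new(config: SnapshotConfig) -> Self {
        Self { config, cached: None }
    }

    /// Returns a snapshot id, refreshing according to the configured TTL:
    /// - Some(0): always refresh (one extra query per catalog operation)
    /// - Some(n): reuse the cached id for up to n seconds
    /// - None:    fetch once and reuse forever
    fn current_snapshot(&mut self, fetch: impl Fn() -> u64) -> u64 {
        let expired = match (self.config.ttl_seconds, &self.cached) {
            (_, None) => true,
            (Some(ttl), Some((_, at))) => at.elapsed() >= Duration::from_secs(ttl),
            (None, Some(_)) => false,
        };
        if expired {
            self.cached = Some((fetch(), Instant::now()));
        }
        self.cached.expect("just populated").0
    }
}

fn main() {
    // Always-fresh behaviour: every call invokes the fetch closure.
    let mut fresh = SnapshotCache::new(SnapshotConfig { ttl_seconds: Some(0) });
    assert_eq!(fresh.current_snapshot(|| 1), 1);
    assert_eq!(fresh.current_snapshot(|| 2), 2);

    // Frozen behaviour: the first fetched id is reused forever.
    let mut frozen = SnapshotCache::new(SnapshotConfig { ttl_seconds: None });
    assert_eq!(frozen.current_snapshot(|| 1), 1);
    assert_eq!(frozen.current_snapshot(|| 2), 1);
}
```

With `Some(0)` every catalog operation pays the one extra SQL query noted above; `None` trades that freshness away by freezing the snapshot at catalog creation.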

### Data Flow

20 changes: 17 additions & 3 deletions examples/basic_query.rs
@@ -2,8 +2,9 @@
//!
//! This example demonstrates how to:
//! 1. Create a DuckLake catalog from a DuckDB catalog file
//! 2. Register it with DataFusion
//! 3. Execute a simple SELECT query
//! 2. Configure snapshot resolution with TTL (time-to-live)
//! 3. Register it with DataFusion
//! 4. Execute a simple SELECT query
//!
//! To run this example, you need:
//! - A DuckDB database file with DuckLake tables
@@ -14,6 +15,8 @@
use datafusion::execution::runtime_env::RuntimeEnv;
use datafusion::prelude::*;
use datafusion_ducklake::{DuckLakeCatalog, DuckdbMetadataProvider};
// Uncomment when using custom snapshot config:
// use datafusion_ducklake::SnapshotConfig;
use object_store::ObjectStore;
use object_store::aws::AmazonS3Builder;
use std::env;
@@ -56,9 +59,20 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
);
runtime.register_object_store(&Url::parse("s3://ducklake-data/")?, s3);

// Create the DuckLake catalog
// Configure snapshot resolution behavior
//
// Option 1: Default configuration (TTL=0) - Always fresh, queries snapshot on every access
let ducklake_catalog = DuckLakeCatalog::new(provider)?;

// Option 2: Custom TTL - Balance freshness and performance
// Caches snapshot for 5 seconds, then refreshes
// let config = SnapshotConfig { ttl_seconds: Some(5) };
// let ducklake_catalog = DuckLakeCatalog::new_with_config(provider, config)?;

// Option 3: Cache forever - Maximum performance, snapshot frozen at catalog creation
// let config = SnapshotConfig { ttl_seconds: None };
// let ducklake_catalog = DuckLakeCatalog::new_with_config(provider, config)?;

println!("✓ Connected to DuckLake catalog");

let config = SessionConfig::new().with_default_catalog_and_schema("ducklake", "main");