Skip to content

Conversation

@sanujbasu
Copy link
Collaborator

@sanujbasu sanujbasu commented Nov 19, 2025

feat: Add CreateTable API with typestate pattern

What changes are proposed in this pull request?

This PR introduces a complete CreateTable API for Delta Kernel Rust, enabling programmatic creation of Delta tables with proper metadata tracking, validation, and compile-time safety guarantees.

Motivation

Currently, Delta Kernel Rust lacks a public API for creating new Delta tables. Users need to:

  • Manually construct Protocol and Metadata actions
  • Handle low-level serialization and commit logic
  • Manage table initialization without type safety
  • Use test utilities or external tools for table creation

This PR provides a high-level, type-safe API similar to the Java Kernel's CreateTableTransactionBuilder, making it easy to create Delta tables programmatically while ensuring protocol compliance.

Design: Typestate Pattern

The API uses a typestate pattern to enforce correct usage at compile time:

1. CreateTableBuilder (Configuration Stage)

pub struct CreateTableBuilder { /* ... */ }- Configure table properties, schema, partition columns

  • Builder methods: with_table_properties(), with_partition_columns()
  • Validates configuration before transitioning to transaction
  • Returns DeltaResult<CreateTableTransaction>

2. CreateTableTransaction (Commit Stage)

pub struct CreateTableTransaction { /* ... */ }- Commits Protocol, Metadata, and CommitInfo actions

  • Returns Transaction for future data operations
  • Enforces that metadata is committed before data operations

This pattern is extensible for future table features:

  • Add builder methods for clustering, constraints, invariants
  • Add table feature flags to Protocol action
  • Extend validation in build() method
  • Maintain type safety throughout

Changes

New Files

  • kernel/src/table_manager.rs: Entry point with static factory method TableManager::create_table()
  • kernel/src/transaction/create_table.rs : Core implementation of CreateTableBuilder and CreateTableTransaction
  • kernel/tests/create_table.rs: Comprehensive test suite.

Modified Files

  • kernel/src/actions/mod.rs: Added protocol version constants
    • TABLE_FEATURES_MIN_READER_VERSION = 3
    • TABLE_FEATURES_MIN_WRITER_VERSION = 7
  • kernel/src/lib.rs: Exported new API (TableManager, CreateTableBuilder, CreateTableTransaction)
  • kernel/src/transaction/mod.rs: Added create_table module export
  • kernel/examples/write-table/src/main.rs: Updated to use new CreateTable API

Key Implementation Details

  1. Protocol Version Constants: Uses table features protocol (3/7) for forward compatibility
  2. CommitInfo Generation: Properly records operation metadata (operation: "CREATE TABLE", engineInfo, timestamp, txnId, kernelVersion)
  3. Table Existence Check: Validates table doesn't already exist before creation
  4. In-Commit Timestamp Support: Compatible with ICT-enabled workflows

How was this change tested?

Comprehensive three-layer test coverage:

1. High-Level Integration Tests (Snapshot Validation)

  • test_create_simple_table: Basic table creation with events schema
  • test_create_table_with_properties: Financial table with Delta properties
  • test_create_table_already_exists: Duplicate creation prevention
  • test_create_table_multiple_properties: Builder pattern chaining

Coverage: Verifies tables load correctly and schemas are preserved

2. Mid-Level Specification Tests (Log File Structure)

  • test_commit_info_is_written_to_log: Validates all 3 actions present
    • Parses actual JSON log file (00000000000000000000.json)
    • Verifies Protocol (minReaderVersion=3, minWriterVersion=7)
    • Verifies Metadata (id, schemaString, createdTime)
    • Verifies CommitInfo (timestamp, operation, engineInfo, txnId, kernelVersion)
  • test_log_action_order: Ensures correct action ordering (Protocol → Metadata → CommitInfo)

Coverage: Guarantees Delta protocol compliance and complete audit trail

3. Low-Level Validation Tests (Constraint Enforcement)

  • test_create_table_empty_schema: Schema validation
  • Realistic schemas: DECIMAL types, TIMESTAMP, DATE, BOOLEAN

Coverage: Prevents invalid table configurations

Example Usage

use delta_kernel::table_manager::TableManager;
use delta_kernel::schema::{StructType, StructField, DataType};
use delta_kernel::committer::FileSystemCommitter;
use std::sync::Arc;
use std::collections::HashMap;

// Define schema
let schema = Arc::new(StructType::try_new(vec![
    StructField::new("user_id", DataType::LONG, false),
    StructField::new("event_type", DataType::STRING, false),
    StructField::new("timestamp", DataType::TIMESTAMP, false),
])?);

// Configure table properties
let mut properties = HashMap::new();
properties.insert("delta.appendOnly".to_string(), "true".to_string());
properties.insert("delta.checkpointInterval".to_string(), "10".to_string());

// Create table with typestate pattern
let create_txn = TableManager::create_table(
    "/path/to/table",
    schema,
    "MyApplication/1.0"
)
.with_table_properties(properties)
.build(&engine)?;

// Commit metadata (creates _delta_log/00000000000000000000.json)
let txn = create_txn.commit_metadata(&engine, Box::new(FileSystemCommitter::new()))?;

// Optional: Use returned transaction for immediate data operations
// txn.add_files(data);
// txn.commit(&engine)?;### Breaking Changes

This PR affects the following public APIs

Introduces new CreateTableTransaction and CreateTableBuilder APIs.

Future Work

  • Add support for partition columns in builder
  • Add clustering configuration
  • Add check constraints and invariants
  • Support column mapping modes
  • Enable schema evolution from CreateTableTransaction
  • Support custom Protocol/Metadata fields

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Nov 19, 2025
This commit introduces a complete CreateTable API for Delta Kernel Rust,
enabling programmatic creation of Delta tables with proper metadata tracking
and validation.

The API uses a typestate pattern to enforce correct usage at compile time:

1. **CreateTableBuilder** (configuration stage)
   - Configure table properties, schema, partition columns
   - Builder methods: with_table_properties(), with_partition_columns()
   - Validates configuration before transitioning

2. **CreateTableTransaction** (commit stage)
   - Commits Protocol, Metadata, and CommitInfo actions
   - Returns Transaction for future data operations
   - Enforces that metadata is committed before data operations

This pattern is extensible for future table features:
- Add builder methods for clustering, constraints, invariants
- Add table feature flags to Protocol action
- Extend validation in build() method
- Maintain type safety throughout

Example extensibility:
```rust
CreateTableBuilder::new(path, schema, engine_info)
    .with_table_properties(props)
    .with_partition_columns(vec!["date", "region"])  // ✓ Supported
    .with_clustering(vec!["user_id"])                // Future
    .with_check_constraints(constraints)             // Future
    .build(engine)?
```

- test_create_simple_table: Basic table creation with events schema
- test_create_table_with_properties: Financial table with Delta properties
- test_create_table_already_exists: Duplicate creation prevention
- test_create_table_multiple_properties: Builder pattern chaining

**Coverage**: Verifies tables load correctly and schemas are preserved

- test_commit_info_is_written_to_log: Validates all 3 actions present
  * Parses actual JSON log file
  * Verifies Protocol (minReaderVersion=3, minWriterVersion=7)
  * Verifies Metadata (id, schemaString, createdTime)
  * Verifies CommitInfo (timestamp, operation, engineInfo, txnId, kernelVersion)
- test_log_action_order: Ensures correct action ordering

**Coverage**: Guarantees Delta protocol compliance and complete audit trail

- test_create_table_empty_schema: Schema validation
- Realistic schemas: DECIMAL types, TIMESTAMP, DATE, BOOLEAN

**Coverage**: Prevents invalid table configurations

```rust
use delta_kernel::table_manager::TableManager;
use delta_kernel::schema::{StructType, StructField, DataType};
use delta_kernel::committer::FileSystemCommitter;
use std::sync::Arc;
use std::collections::HashMap;

// Define schema
let schema = Arc::new(StructType::try_new(vec![
    StructField::new("user_id", DataType::LONG, false),
    StructField::new("event_type", DataType::STRING, false),
    StructField::new("timestamp", DataType::TIMESTAMP, false),
])?);

// Configure table properties
let mut properties = HashMap::new();
properties.insert("delta.appendOnly".to_string(), "true".to_string());
properties.insert("delta.checkpointInterval".to_string(), "10".to_string());

// Create table with typestate pattern
let create_txn = TableManager::create_table(
    "/path/to/table",
    schema,
    "MyApplication/1.0"
)
.with_table_properties(properties)
.build(&engine)?;

// Commit metadata (creates _delta_log/00000000000000000000.json)
let txn = create_txn.commit_metadata(&engine, Box::new(FileSystemCommitter::new()))?;

// Future: Use returned transaction for data operations
// txn.add_files(data);
// txn.commit(&engine)?;
```

- kernel/src/actions/mod.rs: Protocol version constants
- kernel/src/transaction/create_table.rs: CommitInfo generation
- kernel/tests/create_table.rs: Comprehensive test suite (7 tests)

None - this is a new API addition.

- Add support for partition columns in builder
- Add clustering configuration
- Add check constraints and invariants
- Support column mapping modes
- Enable schema evolution from CreateTableTransaction
@codecov
Copy link

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.66%. Comparing base (74e75d3) to head (7b507b8).

Files with missing lines Patch % Lines
kernel/src/transaction/create_table.rs 82.40% 9 Missing and 13 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1504      +/-   ##
==========================================
- Coverage   84.67%   84.66%   -0.01%     
==========================================
  Files         122      124       +2     
  Lines       32741    32873     +132     
  Branches    32741    32873     +132     
==========================================
+ Hits        27722    27832     +110     
- Misses       3674     3683       +9     
- Partials     1345     1358      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sanujbasu sanujbasu marked this pull request as ready for review November 20, 2025 01:44
@sanujbasu sanujbasu changed the title [WIP] Createtable feat: Add CreateTable API with typestate pattern Nov 20, 2025
/// # Ok(())
/// # }
/// ```
pub fn create_table(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fly by comment:

  • Kernel Java: We have TableManager as our one entry point into Kernel. That is: all Snapshot loading (which is used for reads and writes), all table creation, and all streaming goes through this API (it exposes APIs to other classes, like SnapshotBuiler, CreateTableTransactionBuilder, etc.)
  • Kernel Rust: No TableManager, instead has APIs on underlying classes/traits/impls e.g. SnapshotBuilder

So I encourage: either fully adopting TableManager, or not adopting TableManager, and deleting this TableManager.create_table API

path: impl AsRef<str>,
schema: SchemaRef,
engine_info: impl Into<String>,
) -> CreateTableBuilder {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Kernel Rust have TransctionBuilders yet?

Though verbose, Kernel Java's CreateTableTransactionBuilder naming convention makes it clear that: this is a builder for a transaction (yields a Transaction) that should be used to create a table.

The callout being: my_create_table_instance_builder.build() might make a caller think that that has "built" the table -- wheras instead all it has done is "built" the Transaction instance

/// # Ok(())
/// # }
/// ```
pub fn commit_metadata(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the API flow we want.

For example: In general, I believe the two kernels have aligned on: I have a Transaction, I commit, and then I can get a PostCommitSnapshot (just a Snapshot). From there I can do whatever want! Perhaps use that postCommitSnapshot to do some maintence operations (publish, write CRC, write checkpoint). Perhaps use it to start a new transaction.

Yet this API hardcodes: We write 0000.json, and then load a Snapshot, and then return a Transaction.

This means I can't write 000.crc ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see in the PR description:

Returns Transaction for future data operations
Enforces that metadata is committed before data operations

I don't think we want this?

CREATE TABLE AS SELECT -- we want to either compelte or fail -- if we have created the table ... but didn't insert the data actions ... and then the data actions txn fails ... then that SQL statement was not atomic? it is half-complete?

Comment on lines +247 to +265
let protocol = Protocol::try_new(
TABLE_FEATURES_MIN_READER_VERSION,
TABLE_FEATURES_MIN_WRITER_VERSION,
Some(Vec::<String>::new()), // readerFeatures (empty for now)
Some(Vec::<String>::new()), // writerFeatures (empty for now)
)?;

// Get current timestamp
let created_time = current_time_ms()?;

// Create Metadata action
let metadata = Metadata::try_new(
None, // name
None, // description
(*self.schema).clone(),
self.partition_columns,
created_time,
self.table_properties,
)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing the tableProperties -> protocol parsing.

See, for example, Kernel Java :: https://github.com/delta-io/delta/blob/765a91688893b07c1e38835ee427b1eb9aa6357b/kernel/kernel-api/src/main/java/io/delta/kernel/internal/TransactionMetadataFactory.java#L93

For example, what if my table properties has:

  • delta.feature.XXX = supported --> this should turn on support for table feature XXX -- right? // @zachschuermann and @OussamaSaoudi -- have you all aligned on this?
  • delta.appendOnly = true --> this should add appendOnly to the protocol
  • delta.enableInCommitTimestamps = true --> this should add ICT to the protocol, and also upgrade the writer version to 7

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see in the PR description:

Add table feature flags to Protocol action

Can we add checks to at least throw? So we explicitly only support a very limited initial use case, and we can safely add more support later?

) -> DeltaResult<Transaction> {
// Create Protocol action with table features support
let protocol = Protocol::try_new(
TABLE_FEATURES_MIN_READER_VERSION,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this to be the default. Typically, if I just do a simpel CREATE, it creates the table with delta version (1, 2). This gives it the maximum possible allowed reader and writer set (of connectors).

Now, if some catalog has an opinion, perhaps it can inject some defaults. Kernel Java's Committer supports that. We can chat offline.

partition_columns: Vec<String>,
}

impl CreateTableTransaction {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to think about: How can CREATE and DDL (ALTER TABLE SET TBLPROPERTIES) share, or even DML (a MERGE that writes a new wider type and so we auto-enable type widening) --- how do all of these code paths share common schema, protocol, metadata, table property validation?

/// // Future: txn.add_files(...); txn.commit(engine)?;
/// # Ok(())
/// # }
/// ```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also curious (as we chatted offline): How will this eventually extend to support CTAS?

  • Kernel Rust has its own transaction logic/impl/class today for data writes
  • This is now doing create here (separately from data write logic)

Where will the logic go that does both? (I ask with a worry about duplication / consolidation of logic) 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants