thevar1able
Member

No description provided.

scovich and others added 30 commits February 14, 2025 07:37
## What changes are proposed in this pull request?

Starting in arrow-53.3, the parquet reader no longer computes NULL masks
for non-nullable leaf columns -- even if they have nullable ancestors.
This breaks row visitors, which rely on each leaf column having a fully
accurate NULL mask.

The quick-fix solution is to manually fix up the null masks of every
`RecordBatch` that comes from the parquet reader.
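
A minimal sketch of the idea behind such a fix-up, using plain `bool` slices instead of arrow buffers. The function name `effective_validity` and the slice-based representation are assumptions made for illustration; this is not the kernel's implementation.

```rust
/// Combine a nullable ancestor's validity mask with a leaf's optional mask.
/// `true` means "valid", `false` means NULL. A leaf without its own mask
/// (e.g. a non-nullable column) simply inherits the ancestor's mask.
pub fn effective_validity(parent_valid: &[bool], child_valid: Option<&[bool]>) -> Vec<bool> {
    match child_valid {
        Some(child) => {
            assert_eq!(parent_valid.len(), child.len(), "masks must have equal length");
            // A row is valid in the leaf only if it is valid at every level.
            parent_valid.iter().zip(child).map(|(p, c)| *p && *c).collect()
        }
        // No mask on the leaf: every NULL comes from the ancestor.
        None => parent_valid.to_vec(),
    }
}
```

With this shape, a row visitor can look at the leaf mask alone and still see NULLs that originate from a nullable parent struct.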

Fixes delta-io#691

## How was this change tested?

New unit test that checks whether parquet reads produce properly nested
NULL masks. The test also leverages (and verifies) the JSON parser, so
we can reliably detect any unwelcome behavior changes to JSON parsing
that might land in the future.
…o#700)

Fixes delta-io#698

## What changes are proposed in this pull request?

Updates the `DataSkippingFilter` to treat all columns as nullable for
the purpose of parsing stats, as suggested in
delta-io#698 (comment).

This is particularly important for partition columns, which won't have
values present in stats. But stats are also only usually stored for the
first 32 columns, so we shouldn't rely on stats being present for
non-partition fields either.
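
To illustrate why missing stats must be tolerated, here is a small sketch of a skipping check for a predicate `col = value`. The `ColumnStats` type and `may_contain` function are hypothetical names, not kernel APIs; the point is only that an absent stat means "cannot skip".

```rust
/// Hypothetical per-column stats; every field may be missing.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Returns false only when the stats prove that no row can match `col = value`.
pub fn may_contain(stats: Option<ColumnStats>, value: i64) -> bool {
    // No stats at all (e.g. a partition column or column number 33+): keep the file.
    let Some(s) = stats else { return true };
    let below_min = s.min.map_or(false, |min| value < min);
    let above_max = s.max.map_or(false, |max| value > max);
    !(below_min || above_max)
}
```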

## How was this change tested?

I've added a new unit test.

I've also tested building duckdb-delta with this change (cherry-picked
onto 0.6.1) and verified that the code in delta-io#698 now works.
…_snapshot_schema (delta-io#683)

## What changes are proposed in this pull request?
When given a schema (e.g. in `global_scan_state`) the engine needs a way
to visit this schema. This introduces a new API `visit_schema` to allow
engines to visit any schema over FFI. An API called `visit_schema`
previously existed but visited the schema of a given _snapshot_; this
has now been renamed to `visit_snapshot_schema`.


### This PR affects the following public APIs
Renamed `visit_schema` to `visit_snapshot_schema` and now `visit_schema`
takes `SharedSchema` as an argument instead of a snapshot.


## How was this change tested?
updated read_table test
…o#654)

This change introduces arrow_53 and arrow_54 feature flags on kernel
which are _required_ when using default-engine or sync-engine.
Fundamentally we must push users of the crate to select their arrow
major version through flags since Cargo _will_ include multiple major
versions in the dependency tree which can cause ABI breakages when
passing around symbols such as `RecordBatch`

See delta-io#640

---------

Signed-off-by: R. Tyler Croy <[email protected]>
Our previous write protocol check was too strict. Now we just ensure that the protocol makes sense given what features are present/specified.

Made all existing `write.rs` tests also write to a protocol 1/1 table, and they all work.
Use new transform functionality to transform data over FFI. This lets us get rid of all the gross partition adding code in c :)

In particular: 
- remove `add_partition_columns` in `arrow.c`, we don't need it anymore
- expose ffi methods to get an expression evaluator and evaluate an
expression from `c`
- use the above to add an `apply_transform` function in `arrow.c`

## How was this change tested?

- existing tests
…ta-io#709)

## What changes are proposed in this pull request?
This PR removes the old `visit_snapshot_schema` introduced in delta-io#683 - we
should just go ahead and do the 'right thing' with having a
`visit_schema` (introduced in the other PR) and a `logical_schema()`
function (added here) in order to facilitate visiting the snapshot
schema. Additionally I've moved the schema-related things up from `scan`
module to top-level in ffi crate. Exact changes listed below; this PR
updates tests/examples to leverage the new changes.

### This PR affects the following public APIs

1. Remove `visit_snapshot_schema()` API
2. Add a new `logical_schema(snapshot)` API so you can get the schema of
a snapshot and use the `visit_schema` directly
3. Renames `free_global_read_schema` to just `free_schema`
4. Moves `SharedSchema` and `free_schema` up from `mod scan` into
top-level `ffi` crate.


## How was this change tested?
existing UT
If we try and have a `need_arrow` flag we can make that include the code
line: `pub use arrow_53::*` but we _cannot_ have it actually pull in the
dependency. Pulling in the dependency is purely expressed in
`Cargo.toml`, so the `use` just fails because we don't _have_ an
arrow_53 dependency in that case.

We can do some gross stuff in `build.rs` to inject the dependency, but
even that doesn't apply to crates that depend on us so it only works if
just compiling `delta-kernel` but isn't actually helpful for the use
case we want.

So I kept the `need-arrow` dep as a way for us to express that something
needs arrow enabled, but rather than trying to do the import, it just
issues a `compile_error` asking you to pick an arrow version.

Perhaps we can get something more clever in the future, but for now
let's unblock things.

Also have test-utils depend on the arrow feature
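
A rough sketch of the `compile_error` approach described above. The exact `cfg` expression is an assumption based on the feature names mentioned in this PR list, and `arrow_version_selected` is a hypothetical helper added only to make the snippet checkable.

```rust
// Assumed feature names: `need-arrow`, `arrow_53`, `arrow_54`.
// If something needs arrow but no version was picked, fail the build with a hint.
#[cfg(all(
    feature = "need-arrow",
    not(any(feature = "arrow_53", feature = "arrow_54"))
))]
compile_error!("Arrow support is required: enable the `arrow_53` or `arrow_54` feature");

/// Hypothetical helper: reports whether an arrow version was selected at compile time.
pub fn arrow_version_selected() -> bool {
    cfg!(any(feature = "arrow_53", feature = "arrow_54"))
}
```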
## What changes are proposed in this pull request?
Current chrono 0.4.40 breaks building arrow, pin chrono to a prior
version that does not break arrow.

## How was this change tested?
CI/CD

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

The FFI expression visitor code incorrectly passes a `(u64, u64)` pair
to `visit_literal_decimal` callback, representing the upper and lower
half of an `i128` decimal value. It should actually be `(i64, u64)` to
preserve signedness.

### This PR affects the following public APIs

The expression visitor callback `visit_literal_decimal` now takes `i64`
for the upper half of a 128-bit int value.
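
As an illustration of the `(i64, u64)` representation, the following sketch splits an `i128` into a signed upper half and an unsigned lower half and joins them again. The function names are made up for this example.

```rust
/// Split a 128-bit decimal value into (signed upper half, unsigned lower half).
pub fn split_decimal(value: i128) -> (i64, u64) {
    let hi = (value >> 64) as i64; // arithmetic shift: the sign lives in the upper half
    let lo = value as u64; // truncation keeps the low 64 bits
    (hi, lo)
}

/// Rebuild the original value from the two halves.
pub fn join_decimal(hi: i64, lo: u64) -> i128 {
    ((hi as i128) << 64) | (lo as i128)
}
```

A consumer on the other side of the FFI boundary can then tell a negative value from the sign of the upper half alone.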


## How was this change tested?

Updated the example code.
)

## What changes are proposed in this pull request?

Make `get_partition_column_count` and `get_partition_columns` take a
snapshot so engines can work this out at planning time without creating
a scan.

The previous methods to get this info out of a scan have been removed.

### This PR affects the following public APIs

The old functions that took a scan have been removed

## How was this change tested?

New unit test
## What changes are proposed in this pull request?
The `Url` crate now has an MSRV of rustc `1.81` with its default unicode
backend. Instead of fighting this, we will just bump our MSRV from `1.80` to
`1.81`, seeing as (1) a large part of the ecosystem (datafusion, polars
(though it still states `1.80` in its README), delta-rs, etc.) is already on
`1.81` and (2) `Url` is a rather foundational crate, so if it bumps, it
seems reasonable to assume that many consumers will too.

### This PR affects the following public APIs
bumping MSRV from `1.80` to `1.81`

## How was this change tested?
MSRV test
…ent futures (delta-io#711)

## What changes are proposed in this pull request?
The original `FileStream` API, though intended to concurrently make GET
requests to the object store, actually made serial requests and relied
on a hand-written poll function in order to implement `Stream`. This PR
aims to make a minimal change in order to (1) increase performance for
the JSON reader by issuing concurrent GET requests and (2) simplify the
code by removing the need for a custom `Stream` and instead leverage
existing functions/adapters to convert the files to read into a `Stream`
and issue concurrent requests through the
[`futures::stream::buffered`](https://docs.rs/futures/latest/futures/stream/struct.Buffered.html)
adapter.

This is effectively a similar improvement as in delta-io#595 but for the JSON
reader.

### Specifically, the changes are:
1. replace the `FileStream::new_async_read_iterator()` call (the
manually-implemented `Stream`) with an inline implementation of
converting the files slice into a Stream (via `stream::iter`) and use
the
[`futures::stream::buffered`](https://docs.rs/futures/latest/futures/stream/struct.Buffered.html)
adapter to concurrently execute file opening futures. It then sends
results across an `mpsc` channel to bridge the async/sync gap.
2. `JsonOpener` no longer implements `FileOpener` (which requires a
synchronous `fn open()`) and instead directly exposes an `async fn
open()` for easier/simpler use above. This removes all reliance on
`FileStream`/`FileOpener` in the JSON reader.
3. adds a custom `ObjectStore` implementation: `OrderedGetStore` to
deterministically control the ordering in which GET request futures are
resolved

### This PR affects the following public APIs
- `DefaultJsonHandler::with_readahead()` renamed to
`DefaultJsonHandler::with_buffer_size()`
- DefaultJsonHandler's default buffer size: 10 => 1000
- DefaultJsonHandler's default batch size: 1024 => 1000


## How was this change tested?
Added a test with a new `OrderedGetStore` which resolves the GET
requests in a jumbled order, while we expect the test to return results
in the natural order of the requests. Additionally, manually validated
that we went from serial JSON file reads to concurrent reads.
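
The "concurrent requests, results in input order" behaviour can be sketched with the standard library alone. Note that this swaps in OS threads and a bounded `mpsc` channel for the `futures` adapter used in the PR, and `read_files_in_order` is a hypothetical name, so treat it as an analogy rather than the actual reader code.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Open up to `buffer_size` files concurrently, but deliver results in input order.
pub fn read_files_in_order<F>(files: Vec<String>, buffer_size: usize, open: F) -> Receiver<String>
where
    F: Fn(String) -> String + Send + Clone + 'static,
{
    let window = buffer_size.max(1);
    let (tx, rx) = sync_channel(window);
    thread::spawn(move || {
        let mut pending = files.into_iter();
        let mut in_flight: VecDeque<JoinHandle<String>> = VecDeque::new();
        loop {
            // Keep the window of in-flight requests full.
            while in_flight.len() < window {
                let Some(file) = pending.next() else { break };
                let open = open.clone();
                in_flight.push_back(thread::spawn(move || open(file)));
            }
            // Always wait on the oldest request first, which preserves input order.
            let Some(oldest) = in_flight.pop_front() else { break };
            let Ok(contents) = oldest.join() else { break };
            // The channel bridges the concurrent producer and the synchronous consumer.
            if tx.send(contents).is_err() {
                break;
            }
        }
    });
    rx
}
```

Even if a later file finishes first, the consumer sees results in the order the files were listed, which is the property the `OrderedGetStore` test checks for.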
…elta-io#679)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-incubator/delta-kernel-rs/blob/main/CONTRIBUTING.md
2. Run `cargo t --all-features --all-targets` to get started testing,
and run `cargo fmt`.
  3. Ensure you have added or run the appropriate tests for your PR.
4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  5. Be sure to keep the PR description updated to reflect all changes.
-->

<!--
PR title formatting:
This project uses conventional commits:
https://www.conventionalcommits.org/

Each PR corresponds to a commit on the `main` branch, with the title of
the PR (typically) being
used for the commit message on main. In order to ensure proper
formatting in the CHANGELOG please
ensure your PR title adheres to the conventional commit specification.

Examples:
- new feature PR: "feat: new API for snapshot.update()"
- bugfix PR: "fix: correctly apply DV in read-table example"
-->

## What changes are proposed in this pull request?
<!--
Please clarify what changes you are proposing and why the changes are
needed.
The purpose of this section is to outline the changes, why they are
needed, and how this PR fixes the issue.
If the reason for the change is already explained clearly in an issue,
then it does not need to be restated here.
1. If you propose a new API or feature, clarify the use case for a new
API or feature.
  2. If you fix a bug, you can clarify why it is a bug.
-->

### Summary

This PR introduces foundational changes required for V2 checkpoint read
support. The high-level changes required for v2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment `replay` functionality to retrieve actions
from sidecar files if need be.

This PR specifically adds support for Item 2.

This PR **does not introduce full v2Checkpoints reader/writer support**
as we are missing support for Item 1, meaning log segments can never
have V2 checkpoint files in the first place. That functionality will be
completed in [PR
delta-io#685](delta-io#685) which is
stacked on top of this PR. However, the changes to log `replay` done
here are compatible with tables using V1 checkpoints, allowing us to
safely merge the changes here.
### Changes

For each batch of `EngineData` from a checkpoint file:
1. Use the new `SidecarVisitor` to scan each batch for sidecar file
paths embedded in sidecar actions.
2. If sidecar file paths exist:
    - Read the corresponding sidecar files.
    - Generate an iterator over batches of actions within the sidecar files.
    - Insert the sidecar batches that contain the add actions necessary to
      reconstruct the table’s state into the top-level iterator.
    - **Note: the original checkpoint batch is still included in the
      iterator.**
3. If no sidecar file paths exist, move to the next batch and leave the
original checkpoint batch in the iterator.
  
Notes:
- If the `checkpoint_read_schema` does not have file actions, we do not
need to scan the batch with the `SidecarVisitor` and can leave the batch
as-is in the top-level iterator.
- Multi-part checkpoints do not have sidecar actions, so we do not need
to scan the batch with the `SidecarVisitor` and can leave the batch
as-is in the top-level iterator.
- A batch may not include add actions, but other actions (like txn,
metadata, protocol). This is safe to leave in the iterator as the
non-file actions will be ignored.
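
The flow described above can be sketched roughly as follows. `Batch`, `expand_sidecars`, and the `read_sidecar` callback are hypothetical stand-ins (the real code works on `EngineData` and a `SidecarVisitor`), so this only shows the shape of the iterator.

```rust
/// Hypothetical stand-in for a batch of actions read from a checkpoint or sidecar file.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub actions: Vec<String>,
    /// Sidecar file paths found in this batch (empty if there are none).
    pub sidecar_paths: Vec<String>,
}

/// Yield every checkpoint batch, followed by the batches of any sidecars it references.
pub fn expand_sidecars<'a, I, R>(
    checkpoint_batches: I,
    read_sidecar: R,
) -> impl Iterator<Item = Result<Batch, String>> + 'a
where
    I: Iterator<Item = Batch> + 'a,
    R: Fn(&str) -> Result<Vec<Batch>, String> + 'a,
{
    checkpoint_batches.flat_map(move |batch| {
        let paths = batch.sidecar_paths.clone();
        // The original checkpoint batch always stays in the output.
        let mut out = vec![Ok(batch)];
        for path in &paths {
            match read_sidecar(path) {
                Ok(batches) => out.extend(batches.into_iter().map(Ok)),
                Err(e) => {
                    // A missing or unreadable sidecar surfaces as an error.
                    out.push(Err(e));
                    break;
                }
            }
        }
        out
    })
}
```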

resolves delta-io#670

<!--
Uncomment this section if there are any changes affecting public APIs:
### This PR affects the following public APIs

If there are breaking changes, please ensure the `breaking-changes`
label gets added by CI, and describe why the changes are needed.

Note that _new_ public APIs are not considered breaking.
-->


## How was this change tested?
<!--
Please make sure to add test cases that check the changes thoroughly
including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please
clarify how you tested, ideally via a reproducible test documented in
the PR description.
-->
Although log segments can not yet have V2 checkpoints, we can easily
mock batches that include sidecar actions that we can encounter in V2
checkpoints.

- `test_sidecar_to_filemeta_valid_paths`
  - Tests handling of sidecar paths that can be any of:
    - A relative path within the `_delta_log/_sidecars` directory that is
      just a file name.
    - A relative path that has a parent (i.e. a directory component).
    - An absolute path.
**Unit tests for process_single_checkpoint_batch:**

- `test_checkpoint_batch_with_no_sidecars_returns_none`
  - Verifies that if no sidecar actions are present, the checkpoint batch
    is returned unchanged.
- `test_checkpoint_batch_with_sidecars_returns_sidecar_batches`
  - Ensures that when sidecars are present, the corresponding sidecar
    files are read, and their batches are returned.
- `test_checkpoint_batch_with_sidecar_files_that_do_not_exist`
  - Tests behavior when sidecar files referenced in the checkpoint batch
    do not exist, ensuring an error is returned.
- `test_reading_sidecar_files_with_predicate`
  - Tests that sidecar files that do not match the passed predicate are
    skipped correctly.
  
**Unit tests for create_checkpoint_stream:**

- `test_create_checkpoint_stream_errors_when_schema_has_remove_but_no_sidecar_action`
  - Validates that if the schema includes the remove action, it must also
    contain the sidecar column.
- `test_create_checkpoint_stream_errors_when_schema_has_add_but_no_sidecar_action`
  - Validates that if the schema includes the add action, it must also
    contain the sidecar column.
- `test_create_checkpoint_stream_returns_checkpoint_batches_as_is_if_schema_has_no_file_actions`
  - Checks that if the schema has no file actions, the checkpoint batches
    are returned unchanged.
- `test_create_checkpoint_stream_returns_checkpoint_batches_if_checkpoint_is_multi_part`
  - Ensures that for multi-part checkpoints, the batch is not visited, and
    checkpoint batches are returned as-is.
- `test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars`
  - Tests reading a Parquet checkpoint batch and verifying it matches the
    expected result.
- `test_create_checkpoint_stream_reads_json_checkpoint_batch_without_sidecars`
  - Verifies that JSON checkpoint batches are read correctly.
- `test_create_checkpoint_stream_reads_checkpoint_batch_with_sidecar`
  - Test ensuring that checkpoint files containing sidecar references
    return the additional corresponding sidecar batches correctly.

## What changes are proposed in this pull request?

### Summary

This PR introduces foundational changes required for V2 checkpoint read
support. The high-level changes required for v2 checkpoint support are:
Item 1. Allow log segments to be built with V2 checkpoint files
Item 2. Allow log segment replay functionality to retrieve actions from
sidecar files if need be.

This PR specifically adds support for Item 1.

This PR enables support for the `v2Checkpoints` reader/writer table
feature for delta kernel rust by
1. Allowing snapshots to now leverage UUID-named checkpoints as part of
their log segment.
2. Adding the `v2Checkpoints` feature to the list of supported reader
features.

- This PR is stacked on Item 2
[here](delta-io#679). Golden
table tests are included in this PR.

- More integration tests will be introduced in a follow-up PR tracked
here: delta-io#671

- This PR stacks changes on top of
delta-io#679. For the correct
file diff view, [please only review these
commits](https://github.com/delta-io/delta-kernel-rs/pull/685/files/501c675736dd102a691bc2132c6e81579cf4a1a6..3dcd0859be048dc05f3e98223d0950e460633b60)

resolves delta-io#688


### Changes

We already have the capability to recognize UUID-named checkpoint files
with the variant `LogPathFileType::UuidCheckpoint(uuid)`. This PR does
the following:
- Adds `LogPathFileType::UuidCheckpoint(_)` to the list of valid
checkpoint file types that are collected during log listing
  - This addition allows V2 checkpoints to be included in log segments.
- Adds `ReaderFeatures::V2Checkpoint` to the list of supported reader
features
- This addition allows protocol & metadata validation to pass for tables
with the `v2Checkpoints` reader feature
- Adds the `UnsupportedFeature` reader/writer feature for testing
purposes.
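
For orientation, here is a simplified sketch of classifying `_delta_log` file names, including UUID-named checkpoints. The `LogFileKind` enum and `classify` function are hypothetical and cover only a few name shapes; they are not the kernel's `LogPathFileType` parser.

```rust
/// Hypothetical, reduced set of log file kinds.
#[derive(Debug, PartialEq)]
pub enum LogFileKind {
    Commit,
    ClassicCheckpoint,
    UuidCheckpoint(String),
    Unknown,
}

/// Very loose UUID shape check: 8-4-4-4-12 hex digits.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.char_indices().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// Parse `<20-digit version>.<rest>` into a version and a file kind.
pub fn classify(file_name: &str) -> Option<(u64, LogFileKind)> {
    let parts: Vec<&str> = file_name.split('.').collect();
    let version_str = parts.first()?;
    if version_str.len() != 20 || !version_str.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let version: u64 = version_str.parse().ok()?;
    let kind = match &parts[1..] {
        ["json"] => LogFileKind::Commit,
        ["checkpoint", "parquet"] => LogFileKind::ClassicCheckpoint,
        // V2 checkpoints are named with a UUID and may be JSON or Parquet.
        ["checkpoint", uuid, "json" | "parquet"] if looks_like_uuid(uuid) => {
            LogFileKind::UuidCheckpoint(uuid.to_string())
        }
        _ => LogFileKind::Unknown,
    };
    Some((version, kind))
}
```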


## How was this change tested?

Test coverage for the changes required to support building log segments
with V2 checkpoints:

- `test_uuid_checkpoint_patterns` (already exists, small update)
  - Verifies the behavior of parsing log file paths that follow the
    UUID-naming scheme
- `test_v2_checkpoint_supported`
  - Tests the `ensure_read_supported()` func appropriately validates
    protocol with `ReaderFeatures::V2Checkpoint`
- `build_snapshot_with_uuid_checkpoint_json`
- `build_snapshot_with_uuid_checkpoint_parquet` (already exists)
- `build_snapshot_with_correct_last_uuid_checkpoint`

Golden table tests:
- `v2-checkpoint-json`
- `v2-checkpoint-parquet`

Potential todos:
- is it worth introducing a preference for V2 checkpoints vs V1
checkpoints if both are present in the log for a version
- what about a preference for checkpoints referenced by
_last_checkpoint?
Updates HDFS dependencies to newest versions according to [compatibility
matrix](https://github.com/datafusion-contrib/hdfs-native-object-store?tab=readme-ov-file#compatibility).
## How was this change tested?

I expect current CI pipeline to cover this since there is a [HDFS
integration
test](delta-io@1f57962).
Also, I have run tests successfully (apart from code coverage due to
missing CI secret) on [my
fork](rzepinskip@d87922d).

---------

Co-authored-by: Nick Lanham <[email protected]>
…enabled (delta-io#664)

## What changes are proposed in this pull request?
This PR adds two functions to TableConfiguration: 
1) check whether appendOnly table feature is supported
2) check whether appendOnly table feature is enabled

It also enables writes on tables with the `AppendOnly` writer feature.
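
The difference between "supported" and "enabled" could look roughly like this. The `TableConfig` struct and its fields are hypothetical, and the rules encoded here (legacy writer versions 2 to 6 imply support, version 7 requires the feature to be listed, and `delta.appendOnly=true` enables it) are my reading of the Delta protocol rather than the code added in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced view of a table's protocol and properties.
pub struct TableConfig {
    pub min_writer_version: i32,
    pub writer_features: Option<Vec<String>>,
    pub properties: HashMap<String, String>,
}

impl TableConfig {
    /// Supported: the protocol allows the feature to be used at all.
    pub fn is_append_only_supported(&self) -> bool {
        match self.min_writer_version {
            // With table features, the feature must be listed explicitly.
            7 => self
                .writer_features
                .as_ref()
                .is_some_and(|f| f.iter().any(|name| name == "appendOnly")),
            // Assumed legacy behaviour: versions 2..=6 imply support.
            v => (2..7).contains(&v),
        }
    }

    /// Enabled: supported, and switched on through the table property.
    pub fn is_append_only_enabled(&self) -> bool {
        self.is_append_only_supported()
            && self
                .properties
                .get("delta.appendOnly")
                .is_some_and(|v| v == "true")
    }
}
```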

## How was this change tested?
I check that write is supported on Protocol with
`WriterFeatures::AppendOnly`.

---------

Co-authored-by: Zach Schuermann <[email protected]>

## What changes are proposed in this pull request?

This PR is part of building support for reading V2 checkpoints.
delta-io#498

This PR ports over existing delta‑spark tests and the tables they
create. This test coverage is necessary to ensure that V2 checkpoint
files - whether written in JSON or Parquet, with or without sidecars -
are read correctly and reliably.

This PR stacks changes on top of
delta-io#685
resolves delta-io#671

### How are these tests generated?

The test cases are derived from `delta-spark`'s
[CheckpointSuite](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48)
which creates known valid tables, reads them, and asserts correctness.
The process for adapting these tests is as follows:

1. I modified specific test cases of interest in `delta-spark` to
persist their generated tables.
2. These tables were then compressed into `.tar.zst` archives and copied
over to delta-kernel-rs.
3. Each test in this PR loads a stored table, scans it, and asserts that
the returned table state matches the expected state (derived from the
corresponding table insertions in `delta-spark`).

E.g., this `delta-spark` test:
```
     // Append operations and assertions on checkpoint versions
      spark.range(1).repartition(1).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 1)
      assert(getV2CheckpointProvider(deltaLog).sidecarFileStatuses.size == 1)
      assert(getNumFilesInSidecarDirectory() == 1)

      spark.range(30).repartition(9).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 2)
      assert(getNumFilesInSidecarDirectory() == 3)

      spark.range(100).repartition(9).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 3)
      assert(getNumFilesInSidecarDirectory() == 5)

      spark.range(100).repartition(11).write.format("delta").mode("append").save(path)
      assert(getV2CheckpointProvider(deltaLog).version == 4)
      assert(getNumFilesInSidecarDirectory() == 9)
    }

```
Translates to an expected table state in the kernel:
```
    let mut expected = [
        header,
        vec!["| 0   |".to_string(); 3],
        generate_rows(30),
        generate_rows(100),
        generate_rows(100),
        generate_rows(1000),
        vec!["+-----+".to_string()],
    ]
```

## How was this change tested?
Tables from test-cases of interest in delta-spark's
[`CheckpointSuite`](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48)
have been compressed into `.tar.zst` archives. They are read by the
kernel and the resulting tables are asserted for correctness.
- `v2_checkpoints_json_with_sidecars`
- `v2_checkpoints_parquet_with_sidecars`
- `v2_checkpoints_json_without_sidecars`
- `v2_checkpoints_parquet_without_sidecars`
- `v2_classic_checkpoint_json`
- `v2_classic_checkpoint_parquet`
- `v2_checkpoints_parquet_with_last_checkpoint`
- `v2_checkpoints_json_with_last_checkpoint`
## What changes are proposed in this pull request?

Add basic support for partition pruning by combining two pieces of
existing infra:
1. The log replay row visitor already needs to parse partition values
and already filters out unwanted rows
2. The default predicate evaluator works directly with scalars

Result: partition pruning gets applied during log replay, just before
deduplication so we don't have to remember pruned files.
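
A compact sketch of "prune, then deduplicate" during replay, for a single hard-coded predicate `column >= min_value` on an integer partition column. `AddAction`, `keep_file`, and `replay` are hypothetical names; the real visitor and predicate evaluator are more general.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical add action with string-encoded partition values.
pub struct AddAction {
    pub path: String,
    pub partition_values: HashMap<String, Option<String>>,
}

/// Evaluate `column >= min_value` against the file's partition value.
pub fn keep_file(add: &AddAction, column: &str, min_value: i64) -> Result<bool, String> {
    match add.partition_values.get(column) {
        Some(Some(raw)) => {
            // An incompatibly typed value is a hard error, as noted in the warning below.
            let value: i64 = raw
                .parse()
                .map_err(|e| format!("cannot parse partition value {raw:?} for {column}: {e}"))?;
            Ok(value >= min_value)
        }
        // Absent or null value: this sketch conservatively keeps the file.
        _ => Ok(true),
    }
}

/// Replay add actions (newest first), pruning before deduplication.
pub fn replay(adds: Vec<AddAction>, column: &str, min_value: i64) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut selected = Vec::new();
    for add in adds {
        if !keep_file(&add, column, min_value)? {
            continue; // pruned files never enter `seen`
        }
        if seen.insert(add.path.clone()) {
            selected.push(add.path);
        }
    }
    Ok(selected)
}
```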

WARNING: The implementation currently has a flaw, in case the history
contains a table-replace that affected partition columns. For example,
changing a value column into a non-nullable partition column, or an
incompatible type change to a partition column. In such cases, the
remove actions generated by the table-replace operation (for old files)
would have the wrong type or even be entirely absent. While the code can
handle an absent partition value, an incompatibly typed value would
cause a parsing error that fails the whole query. Note that stats-based
data skipping already has the same flaw, so we are not making the
problem worse. We will fix the problem for both as a follow-up item,
tracked by delta-io#712

NOTE: While this is a convenient way to achieve partition pruning in the
immediate term, Delta
[checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1)
can provide strongly-typed `stats_parsed` and `partitionValues_parsed`
columns, which would need a completely different access path.
* For `stats` vs. `stats_parsed`, the likely solution is simple enough
because we already json-parse `stats` into a strongly-typed nested
struct in order to evaluate the data skipping predicate over its record
batch. We just avoid the parsing overhead if `stats_parsed` is already
available.
* The `partitionValues` field poses a bigger challenge, because it's a
string-string map, not a JSON literal. In order to turn it into a
strongly-typed nested struct, we would need a SQL expression that can
extract the string values and try-cast them to the desired types. That's
ugly enough we might prefer to keep completely different code paths for
parsed vs. string partition values, but then there's a risk that
partition pruning behavior changes depending on which path got invoked.

## How was this change tested?

New unit tests, and adjusted one unit test that assumed no partition
pruning.
## What changes are proposed in this pull request?
Add `DeletionVectors` to supported writer features. We trivially support
DVs since we never write DVs. Note as we implement DML in the future we
need to ensure it correctly handles DVs

## How was this change tested?
modified UT
)

## What changes are proposed in this pull request?
Support writer version 2 and `Invariant` table (writer) feature. Note
that we don't _actually_ support invariants, rather we enable writing to
tables **without invariants** with version=2 or Invariant feature
enabled.


### This PR affects the following public APIs

Enable writes to version=2/Invariant enabled.

## How was this change tested?
new UTs

resolves delta-io#706
## What changes are proposed in this pull request?

`MetadataValue::Number(i32)` was the previous representation for numeric
metadata values, but identity columns only use longs, so
`MetadataValue::Number` is now `MetadataValue::Number(i64)` instead.


## How was this change tested?

I ran the tests, this doesn't change any existing functionality only the
type.

---------

Co-authored-by: Robert Pack <[email protected]>
## What changes are proposed in this pull request?

The `actions-rs/toolchain` action is deprecated in favor of
`actions-rust-lang/setup-rust-toolchain`. This PR updates the usages of
the respective actions in the github workflows. The new action already
includes an integration with the rust-cache action, so no need to set
that up separately anymore.

This also sets up a dependabot configuration for `cargo` and
`github-actions` which we may or may not choose to keep.

## How was this change tested?

no code changes.

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?
Split arrow_expression.rs into a dedicated submodule.
Move tests and utility functions to separate files for better
organization and maintainability.
This solves the [issue
745](delta-io#745).

## How was this change tested?
existing UT

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

Small code readability improvement -- instead of a triple-nested match,
flatten it out to a single level.

## How was this change tested?

Existing unit tests. No new functionality.
…y(MapType)` (delta-io#757)

title. fixing a panic: return error instead.
## What changes are proposed in this pull request?

Our `list_from` implementation for the object_store based filesystem
client is currently broken, since it does not behave as documented /
required for that function. Specifically, we should list all files in the
parent folder, using the given path as the offset to list from.

In a follow up PR we then need to lift the assumption that all URLs will
always be under the same store to get proper URL handling.

### This PR affects the following public APIs

- `DefaultEngine::new` no longer requires a `table_root` parameter.
- `list_from` consistently returns keys greater than (`>`) the offset;
previously the `sync-engines` client returned keys greater than or equal to (`>=`) the offset
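
To illustrate the intended `list_from` semantics, here is a minimal sketch over an in-memory list of keys. The free function and its signature are hypothetical and are not the kernel's actual trait method; the sketch also treats everything under the parent prefix as a candidate rather than only direct children.

```rust
/// Hypothetical stand-in for `list_from`: return every key under the parent
/// folder of `offset` that sorts strictly after `offset`.
fn list_from<'a>(keys: &'a [&'a str], offset: &str) -> Vec<&'a str> {
    // Everything up to and including the last '/' is the parent folder.
    let parent = match offset.rfind('/') {
        Some(idx) => &offset[..=idx],
        None => "",
    };
    let mut listed: Vec<&str> = keys
        .iter()
        .copied()
        // Strictly greater than the offset: the offset itself is excluded.
        .filter(|key| key.starts_with(parent) && *key > offset)
        .collect();
    listed.sort_unstable();
    listed
}
```

Because the comparison is strict, passing an existing file as the offset skips that file, while passing a bare version prefix still returns the file for that version.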

## How was this change tested?

Additional tests asserting consistent `list_from` behavior for all our
file system client implementations.

---------

Signed-off-by: Robert Pack <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Zach Schuermann <[email protected]>
roeap and others added 28 commits March 24, 2025 14:14
## What changes are proposed in this pull request?

The current checks for a pre-signed URL in our readers may misclassify
object store URLs that do not use store-specific schemes (`s3://`,
`az://` ...) as pre-signed URLs. We extend the checks to also
consider the presence of a query string and define a helper trait so
the check is maintained in one place.
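
A minimal sketch of such a helper trait, using only string handling from the standard library. The trait name `UrlExt`, the method name `is_presigned`, and the exact rule (an `http`/`https` scheme plus a non-empty query string) are assumptions made for illustration; the actual check in the PR may differ.

```rust
/// Hypothetical helper trait so the pre-signed check lives in one place.
trait UrlExt {
    fn is_presigned(&self) -> bool;
}

impl UrlExt for str {
    fn is_presigned(&self) -> bool {
        // Without a scheme separator this is not a URL we can classify.
        let Some((scheme, rest)) = self.split_once("://") else {
            return false;
        };
        let is_http =
            scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https");
        // Drop any fragment, then look for a non-empty query string.
        let without_fragment = rest.split('#').next().unwrap_or(rest);
        let has_query = without_fragment
            .split_once('?')
            .map_or(false, |(_, query)| !query.is_empty());
        // Assumption: a pre-signed URL is an HTTP(S) URL that carries a query string.
        is_http && has_query
    }
}
```

With this rule, a plain `https://` object URL without a query string is no longer treated as pre-signed.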

## How was this change tested?

additional test to validate url matches.
…elta-io#766)

small change: `replay` generally refers to the work we do to resolve
actions during log replay. I've renamed `LogSegment::replay` to
`LogSegment::read_actions` since this method does just that: reads
actions, but doesn't actually do tangible 'replay' work (Add/Remove
matching, etc.).

just docs/rename. no material changes.
…expression pub(crate) (delta-io#767)

## What changes are proposed in this pull request?

These were never intended to be public, so just fixing that.


### This PR affects the following public APIs
- `visit_expression_internal`: is private
- `unwrap_kernel_expression`: is `pub(crate)`

## How was this change tested?
Existing tests
We don't want to just have all `action` types pub. This moves everything
to `pub(crate)` but turns them `pub` with dev-visibility.

Breaking change: action types are now all private (`Metadata`,
`Protocol`, `Add`, etc.)
…o#772)

## What changes are proposed in this pull request?
According to the [arrow docs], we should pass `-C
target-cpu=native` to rustc so that it generates the best available
instructions (SIMD, etc.) for our architecture.

[arrow docs]: https://crates.io/crates/arrow
## What changes are proposed in this pull request?
Adds a new required method: `new_null` API for creating a new single-row
null literal `EngineData`. Then, we provide the `create_one` API for
creating single-row `EngineData` by implementing a `SchemaTransform`
(`LiteralExpressionTransform`) to transform the given schema + leaf
values into an `Expression` which evaluates to literal values at the
leaves of the schema. (implemented in a new private
`ExpressionHandlerExtension` trait)
1. Adds the new required `fn new_null` to our `ExpressionHandler` trait
(breaking)
2. Adds the new provided `fn create_one` to an
`ExpressionHandlerExtension` trait
3. Implements `new_null` for `ArrowExpressionHandler`

additionally, adds a new `fields_len()` method to `StructType`.

### This PR affects the following public APIs

1. breaking: new `new_null` API for `ExpressionHandler`
2. breaking: new `LiteralExpressionTransformError`

## How was this change tested?
Bunch of new unit tests. For the nullability tests of our new
`SchemaTransform` we came up with a set of 24 exhaustive test cases:

```
test cases: x, a, b are nullable (n) or not-null (!). we have 6 interesting nullability
combinations:
1. n { n, n }
2. n { n, ! }
3. n { !, ! }
4. ! { n, n }
5. ! { n, ! }
6. ! { !, ! }

and for each we want to test the four combinations of values ("a" and "b" just chosen as
arbitrary scalars):

1. (a, b)
2. (N, b)
3. (a, N)
4. (N, N)

here's the full list of test cases with expected output:

n { n, n }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> x (a, N)
4. (N, N) -> x (N, N)

n { n, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> Err
4. (N, N) -> x NULL

n { !, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> Err
3. (a, N) -> Err
4. (N, N) -> x NULL

! { n, n }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> x (a, N)
4. (N, N) -> x (N, N)

! { n, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> Err
4. (N, N) -> NULL

! { !, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> Err
3. (a, N) -> Err
4. (N, N) -> NULL
```
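
The expected outputs above follow a small rule, which can be sketched as a standalone function. This is not the `LiteralExpressionTransform` itself: the `Outcome` enum and `resolve` function are hypothetical, and the sketch collapses `x NULL` and `NULL` into a single variant because the table gives the same result for nullable and non-nullable `x` apart from that notation.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The struct is present; each leaf is either a value or NULL.
    Struct(Option<i32>, Option<i32>),
    /// The struct itself is NULL.
    Null,
}

/// Resolve one test case for a struct `x { a, b }` given leaf nullability
/// and leaf values (`None` stands for the `N` in the table).
fn resolve(
    a_nullable: bool,
    b_nullable: bool,
    a: Option<i32>,
    b: Option<i32>,
) -> Result<Outcome, String> {
    let a_violates = a.is_none() && !a_nullable;
    let b_violates = b.is_none() && !b_nullable;
    if !a_violates && !b_violates {
        // Every NULL leaf is allowed to be NULL: keep the struct as given.
        return Ok(Outcome::Struct(a, b));
    }
    if a.is_none() && b.is_none() {
        // All leaves are NULL and at least one may not be: the struct is NULL.
        return Ok(Outcome::Null);
    }
    // A non-nullable leaf is NULL while a sibling carries a value.
    Err("NULL value for a non-nullable leaf".to_string())
}
```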
…to embeddable `FileActionsDeduplicator` (delta-io#769)

<!--
Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, please read our contributor guidelines:
https://github.com/delta-incubator/delta-kernel-rs/blob/main/CONTRIBUTING.md
2. Run `cargo t --all-features --all-targets` to get started testing,
and run `cargo fmt`.
  3. Ensure you have added or run the appropriate tests for your PR.
4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]
Your PR title ...'.
  5. Be sure to keep the PR description updated to reflect all changes.
-->

<!--
PR title formatting:
This project uses conventional commits:
https://www.conventionalcommits.org/

Each PR corresponds to a commit on the `main` branch, with the title of
the PR (typically) being
used for the commit message on main. In order to ensure proper
formatting in the CHANGELOG please
ensure your PR title adheres to the conventional commit specification.

Examples:
- new feature PR: "feat: new API for snapshot.update()"
- bugfix PR: "fix: correctly apply DV in read-table example"
-->

## What changes are proposed in this pull request?
<!--
Please clarify what changes you are proposing and why the changes are
needed.
The purpose of this section is to outline the changes, why they are
needed, and how this PR fixes the issue.
If the reason for the change is already explained clearly in an issue,
then it does not need to be restated here.
1. If you propose a new API or feature, clarify the use case for a new
API or feature.
  2. If you fix a bug, you can clarify why it is a bug.
-->

**No behavioral changes were introduced, this is purely a refactoring
effort**

This PR extracts the core deduplication logic from the
`AddRemoveDedupVisitor` in order to be shared with the incoming
`CheckpointVisitor`. For a bigger picture view on how this refactor is
helpful, please take a look at the following PR which implements the
`CheckpointVisitor` with an embedded `FileActionsDeduplicator` that will
rebase this PR once merged. [[link to
PR]](delta-io#738).

This `FileActionsDeduplicator` lives in the new top-level `log_replay`
mod as it will be leveraged in the nested `scan/log_replay` mod and the
incoming `checkpoints/log_replay` mod. There are also additional traits
& structs that the two `log_replay` implementations will share via this
new top-level mod. For an even wider view of the implementation of the
`checkpoints` mod and the component re-use, please have a look at the
following PR. [[link to
PR]](delta-io#744)

## Summary of refactor
1. New `log_replay` mod
2. Moved `FileActionKey` definition from `scan/log_replay` to the new
`log_replay` mod
3. New `FileActionDeduplicator` in the new `log_replay` mod
   - Includes the `check_and_record_seen` method, which was simply **moved**
     from the `AddRemoveDedupVisitor`
   - Includes the `extract_file_action` method and the private
     `extract_dv_unique_id` method. These may look like new concepts, but
     both contain functionality pulled from the `AddRemoveDedupVisitor` so
     that it can be shared with the incoming `CheckpointVisitor`



<!--
Uncomment this section if there are any changes affecting public APIs:
### This PR affects the following public APIs

If there are breaking changes, please ensure the `breaking-changes`
label gets added by CI, and describe why the changes are needed.

Note that _new_ public APIs are not considered breaking.
-->


## How was this change tested?
<!--
Please make sure to add test cases that check the changes thoroughly
including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please
clarify how you tested, ideally via a reproducible test documented in
the PR description.
-->

All existing tests pass ✅
)


## What changes are proposed in this pull request?
This PR refactors test helper functions by moving them to the test_utils
module:
- string_array_to_engine_data
- parse_json_batch
- action_batch

This change is a preparatory step for delta-io#738, which will leverage these
functions in a new checkpoint module.


## How was this change tested?

No behavioral changes, all current tests pass ✅

## What changes are proposed in this pull request?

This PR introduces the `CheckpointMetadata` action.

This action is only allowed in checkpoints following [V2
spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#v2-spec).
For more information: [[link to
protocol]](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoint-metadata)

This PR is part of the on-going effort to implement single-file
checkpoint write support. For reference, [[link to write API
proposal]](delta-io#779)

A struct named `CheckpointMetadata` already exists
[[link]](https://github.com/delta-io/delta-kernel-rs/blob/9290930bbeb1100e7af98c228dbd339eea38143a/kernel/src/snapshot.rs#L149);
it represents the `_last_checkpoint` file. This PR renames that
struct to `LastCheckpointHint`.




## How was this change tested?

- `test_checkpoint_metadata_schema`: tests schema projection
…tCheckpointHint` (delta-io#789)


## What changes are proposed in this pull request?

This PR renames `CheckpointMetadata` to `LastCheckpointHint` in order
not to clash with the incoming `CheckpointMetadata` [[link to
PR]](delta-io#781).


Moved to another PR:
~~This PR also changes the types of fields in the `LastCheckpointHint`.
This was done to unblock the in-flight work for single-file checkpoint
write support, which includes the creation of the `_last_checkpoint`
file. As `usize` and `u64` primitives are not supported, and we want to
avoid a mismatch between the field types we read vs. write, we are updating the
types to `i64`. There is an overarching GitHub issue for converting
field types to unsigned integers (u64, usize) where semantically
appropriate (such as for `LastCheckpointHint.version`... ) and adding
proper support for these data type primitives throughout the codebase.
delta-io#786~~




## How was this change tested?

No new behavioral changes. All current tests pass ✅
…io#800)

Builds are currently broken due to rustc 1.86 changing an error message and
adding new clippy lints. This PR updates the expected output to the new error
and fixes two clippy lints:
1. `next_back()` instead of `last()` on iterators
2. docstring indentation updates
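
For context on the first lint, a small illustration (the function name is made up for this example): on a `DoubleEndedIterator`, `next_back()` takes one item from the end, whereas `last()` consumes the iterator from the front.

```rust
/// Return the final segment of a '/'-separated path.
fn last_segment(path: &str) -> Option<&str> {
    // `str::split` with a `char` pattern yields a DoubleEndedIterator, so
    // `next_back()` reads from the end instead of walking every segment.
    path.split('/').next_back()
}
```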
…o#802)

## What changes are proposed in this pull request?
Renamed `ReaderFeatures` and `WriterFeatures` to `ReaderFeature` and
`WriterFeature`


## How was this change tested?
`cargo test`
…als (delta-io#803)

## What changes are proposed in this pull request?

The code for `Expression::references` uses an ad hoc expression traversal
instead of an expression transform, and the default parquet reader's
`compute_field_indices` method uses an ad hoc expression traversal
instead of invoking `Expression::references`. Fix them both.

## How was this change tested?

Existing unit tests.
…essor` trait (delta-io#774)


**No behavioral changes were introduced, this is purely a refactoring
effort**

## What changes are proposed in this pull request?

This PR refactors the newly-named `ScanLogReplayProcessor` to implement
the newly introduced `LogReplayProcessor` trait in order to unify
components for the incoming `CheckpointLogReplayProcessor`. For a bigger
picture view of how this refactor is helpful, please have a look at
delta-io#744, which introduces the `CheckpointLogReplayProcessor` (an
implementation of the `LogReplayProcessor` trait) and will be rebased
onto this PR once it is merged.


### Summary of changes
- Introduced the `LogReplayProcessor` trait in the top-level
`log_replay` mod
- Renamed `LogReplayScanner` in scan/log_replay ->
`ScanLogReplayProcessor`
- Updated `ScanLogReplayProcessor` to implement the `LogReplayProcessor`
trait
  - This requires pushing down the `add_transform`, `logical_schema`, and
    `transform` fields to the `ScanLogReplayProcessor` so they are accessible
    in the `process_actions_batch` trait method, as the trait method
    implementation has fixed parameters.
- Simplified the `scan_action_iter` function to use the trait's
`apply_to_iterator` method



## How was this change tested?

All existing tests pass ✅
## What changes are proposed in this pull request?

Rust doesn't encourage the `get_` prefix for getters because it's
redundant and anyway a getter is allowed to have the same name as the
field it exposes. Remove the prefix from the various engine interfaces.
Additionally, rename `get_evaluator` as `new_expression_evaluator` to
accurately reflect that it is _NOT_ a getter at all, but actually
creates a new expression evaluator.

Finally, we also rename `ExpressionHandler` to `EvaluationHandler`
because that trait is used to create expression _evaluators_, not
expressions. Additionally, future work will differentiate generic
"expressions" from "predicates" (boolean-valued expressions with special
evaluation semantics), and that will likely necessitate defining a
`EvaluationHandler::new_predicate_evaluator` method alongside the
`new_expression_evaluator` method.
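
A toy illustration of the naming convention. `ToyEngine`, `ToyHandler`, `Evaluator`, and the string-based expression are hypothetical stand-ins; only the method names `evaluation_handler` and `new_expression_evaluator` mirror the rename described above, and the real signatures may differ.

```rust
use std::sync::Arc;

/// Stand-in for an expression evaluator.
struct Evaluator {
    expression: String,
}

trait EvaluationHandler {
    /// Builds a fresh evaluator on every call, hence `new_` rather than `get_`.
    fn new_expression_evaluator(&self, expression: &str) -> Evaluator;
}

struct ToyHandler;

impl EvaluationHandler for ToyHandler {
    fn new_expression_evaluator(&self, expression: &str) -> Evaluator {
        Evaluator {
            expression: expression.to_string(),
        }
    }
}

struct ToyEngine {
    evaluation_handler: Arc<dyn EvaluationHandler>,
}

impl ToyEngine {
    /// Getter named after the field it exposes, with no `get_` prefix.
    fn evaluation_handler(&self) -> Arc<dyn EvaluationHandler> {
        Arc::clone(&self.evaluation_handler)
    }
}
```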

### This PR affects the following public APIs

All the methods and traits we change are public.

## How was this change tested?

Rename-only operation, no functional changes. Compilation suffices.
…io#782)


## What changes are proposed in this pull request?

This PR introduces the helper methods:
- `new_uuid_parquet_checkpoint` which creates a new
`ParsedCheckpointPath<Url>` for a uuid-named parquet checkpoint file at
the specified version. The UUID-naming scheme looks like:
`n.checkpoint.u.parquet`, where u is a UUID and n is the snapshot
version that this checkpoint represents.
- `new_classic_parquet_checkpoint` which creates a new
`ParsedCheckpointPath<Url>` for a classic-named parquet checkpoint file
at the specified version. The classic-naming scheme looks like:
`n.checkpoint.parquet`, where n is the snapshot version that this
checkpoint represents.
- **Updates the `uuid` dependency to always include the `v4` and `fast-rng`
features:**
    - This ensures that `Uuid::new_v4()` is always available.
    - The `fast-rng` feature improves performance when generating UUIDs.
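
As a rough sketch of the two naming schemes, the file names can be built as below. The function names are hypothetical and return plain strings rather than `ParsedCheckpointPath<Url>`; the 20-digit zero padding of the version follows the Delta protocol's file naming convention, and the UUID is passed in as a string so the sketch needs no external crate.

```rust
/// Classic scheme: `n.checkpoint.parquet`, with `n` zero-padded to 20 digits.
fn classic_checkpoint_name(version: u64) -> String {
    format!("{version:020}.checkpoint.parquet")
}

/// UUID scheme: `n.checkpoint.u.parquet`, where `u` is a UUID string.
fn uuid_checkpoint_name(version: u64, uuid: &str) -> String {
    format!("{version:020}.checkpoint.{uuid}.parquet")
}
```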




For more information on the two checkpoint naming-schemes:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#uuid-named-checkpoint

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#classic-checkpoint

This PR is part of the on-going effort to implement single-file
checkpoint write support. For reference, [[link to write API
proposal]](delta-io#779)



## How was this change tested?

- `test_new_uuid_parquet_checkpoint` - verifies UUID-named Parquet
checkpoint creation with proper attributes.
- `test_new_classic_parquet_checkpoint` - verifies classic-named Parquet
checkpoint creation with proper attributes.
## What changes are proposed in this pull request?

Continuing the cleanup started by
delta-io#804, rename the class
to eliminate the last vestiges of (ancient) "client" nomenclature.

### This PR affects the following public APIs

The renamed class (and module) are public.

## How was this change tested?

Rename-only operation, no functional changes. Compilation suffices.
This PR enables incremental snapshot updates. This is done with a new
`Snapshot::try_new_from(...)` which takes an `Arc<Snapshot>` and an
optional version (None = latest version) to incrementally create a new
snapshot from the existing one. The heuristic is as follows:
1. if the new version == existing version, just return the existing
snapshot
2. if the new version < existing version, error since the engine
shouldn't really be here
3. list from (existing checkpoint version + 1, or version 1 if no
checkpoint) onward (create a new 'incremental' `LogSegment`)
4. if no new commits/checkpoint, return existing snapshot (if requested
version matches), else create new `LogSegment`
5. check for a checkpoint:
    a. if new checkpoint is found: just create a new snapshot from that
    checkpoint (and commits after it)
    b. if no new checkpoint is found: do lightweight P+M replay on the
    latest commits
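
The heuristic above can be sketched as a pure decision function. `UpdatePlan`, `Listing`, and `plan_update` are hypothetical names, not part of the `Snapshot` API, and the listing result is reduced to two booleans. Returning an error when a specific version is requested but nothing new is listed is an assumption; the description only says the existing snapshot is returned if the requested version matches.

```rust
#[derive(Debug, PartialEq)]
enum UpdatePlan {
    /// Steps 1 and 4: nothing to do, hand back the existing snapshot.
    ReuseExisting,
    /// Step 5a: build a new snapshot from the new checkpoint and later commits.
    RebuildFromCheckpoint,
    /// Step 5b: lightweight protocol + metadata replay on the latest commits.
    ReplayProtocolAndMetadata,
}

/// What the listing in step 3 found after the existing checkpoint.
struct Listing {
    has_new_commits: bool,
    has_new_checkpoint: bool,
}

fn plan_update(
    existing_version: u64,
    requested_version: Option<u64>,
    listing: &Listing,
) -> Result<UpdatePlan, String> {
    if let Some(version) = requested_version {
        if version == existing_version {
            return Ok(UpdatePlan::ReuseExisting); // step 1
        }
        if version < existing_version {
            // step 2: incremental updates never go backwards
            return Err(format!(
                "requested version {version} is older than existing version {existing_version}"
            ));
        }
    }
    if !listing.has_new_commits && !listing.has_new_checkpoint {
        return match requested_version {
            None => Ok(UpdatePlan::ReuseExisting), // step 4
            // Assumption: a newer version that is not in the log is an error.
            Some(version) => Err(format!("version {version} was not found in the log")),
        };
    }
    if listing.has_new_checkpoint {
        Ok(UpdatePlan::RebuildFromCheckpoint) // step 5a
    } else {
        Ok(UpdatePlan::ReplayProtocolAndMetadata) // step 5b
    }
}
```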

In addition to the 'main' `Snapshot::try_new_from()` API, the following
incremental APIs were introduced to support the above implementation:
1. `TableConfiguration::try_new_from(...)`
2. splitting `LogSegment::read_metadata()` into
`LogSegment::read_metadata()` and `LogSegment::protocol_and_metadata()`
3. new `LogSegment.checkpoint_version` field

resolves delta-io#489
## What changes are proposed in this pull request?
Instead of `Protocol` retaining lists of `String`s that represent
`ReaderFeature`s/`WriterFeature`s, we now embed the fully parsed
`ReaderFeature`s and `WriterFeature`s. This changes the
`Protocol.reader_features` and `Protocol.writer_features` fields
from `Option<Vec<String>>` to `Option<Vec<ReaderFeature>>` and
`Option<Vec<WriterFeature>>` respectively. Critically, this includes a new
`Unknown` variant which allows us to parse all possible strings in the protocol
into `ReaderFeature`s and `WriterFeature`s but later detect if unknown
features are present and fail when ensuring reader/writer features are
supported.
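
A minimal sketch of the parse-now, fail-later idea. The enum below lists only two known features for brevity, and `ensure_supported` is a hypothetical helper; the kernel's real enum has more variants and its own parsing and validation code.

```rust
#[derive(Debug, Clone, PartialEq)]
enum ReaderFeature {
    ColumnMapping,
    DeletionVectors,
    /// Any feature name this build does not recognize.
    Unknown(String),
}

impl From<&str> for ReaderFeature {
    fn from(name: &str) -> Self {
        match name {
            "columnMapping" => Self::ColumnMapping,
            "deletionVectors" => Self::DeletionVectors,
            // Parsing never fails: unrecognized names are preserved as-is.
            other => Self::Unknown(other.to_string()),
        }
    }
}

/// Hypothetical support check: fail only when the features are actually needed.
fn ensure_supported(features: &[ReaderFeature]) -> Result<(), String> {
    for feature in features {
        if let ReaderFeature::Unknown(name) = feature {
            return Err(format!("unsupported reader feature: {name}"));
        }
    }
    Ok(())
}
```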

### This PR affects the following public APIs
Breaking: new `ReaderFeature::Unknown(String)` and
`WriterFeature::Unknown(String)` variants

(note that `Protocol` and fields are `pub(crate)`)

## How was this change tested?
UT modification

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?
Rename `ScanData` to `ScanMetadata` and `Scan::scan_data` to
`Scan::scan_metadata` (and corresponding FFI). Additionally, renames
`TableChangesScanData` to `TableChangesScanMetadata`. Additional
docs/refactor coming in delta-io#768

### This PR affects the following public APIs

breaking changes:
1. rename `ScanData` to `ScanMetadata`
2. rename `Scan::scan_data()` to `Scan::scan_metadata()`
3. (ffi) rename `free_kernel_scan_data()` to `free_scan_metadata_iter()`
4. (ffi) rename `kernel_scan_data_next()` to `scan_metadata_next()`
5. (ffi) rename `visit_scan_data()` to `visit_scan_metadata()`
6. (ffi) rename `kernel_scan_data_init()` to `scan_metadata_iter_init()`
7. (ffi) rename `KernelScanDataIterator` to `ScanMetadataIterator`
8. (ffi) rename `SharedScanDataIterator` to `SharedScanMetadataIterator`


## How was this change tested?
existing

resolves delta-io#816
…ta` type (delta-io#768)

## What changes are proposed in this pull request?

1. Updated `ScanMetadata` from a typed tuple to a struct. `ScanMetadata` is now
a struct with fields:
    - `filtered_data`: A `FilteredEngineData` instance.
    - `transforms`: A vector of transformations to be applied to the data read
    from the files

2. Introduction of the `FilteredEngineData` type:
Couples `EngineData` with a selection vector indicating which rows to
process.
This type is returned from the `scan_metadata` API and the incoming
`checkpoint` API

3. Updates `visit_scan_files` parameters to accept `ScanMetadata` to
avoid de-structuring.

4. Corresponding FFI changes for `visit_scan_files` to accept
`ScanMetadata` param
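
A simplified sketch of the shape described above. The real types wrap `EngineData` and kernel transform expressions; here a generic `Vec<T>` and `Option<String>` stand in for them, the `selected` helper is hypothetical, and treating rows past the end of the selection vector as selected is an assumption of this sketch.

```rust
/// Data plus a selection vector marking which rows should be processed.
struct FilteredEngineData<T> {
    data: Vec<T>,
    selection_vector: Vec<bool>,
}

impl<T> FilteredEngineData<T> {
    /// Rows whose selection entry is `true`.
    fn selected(&self) -> Vec<&T> {
        self.data
            .iter()
            .enumerate()
            // Assumption: rows beyond the selection vector count as selected.
            .filter(|(i, _)| self.selection_vector.get(*i).copied().unwrap_or(true))
            .map(|(_, row)| row)
            .collect()
    }
}

/// Struct replacing the former tuple.
struct ScanMetadata<T> {
    filtered_data: FilteredEngineData<T>,
    /// One optional transform per row; `String` is a placeholder type.
    transforms: Vec<Option<String>>,
}
```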

All current tests pass.

## What changes are proposed in this pull request?

### Key changes

resolves delta-io#737. 

This PR implements the `CheckpointVisitor` necessary for filtering a
stream of actions into a stream of actions to be included in a
checkpoint file. This leverages the `FileActionDeduplicator` [[link to
PR]](delta-io#769).

This PR introduces the `checkpoint` mod, and implements the visitor in
the new `checkpoint/log_replay` mod.

Comprehensive module docs are included in the new modules. They
provide an overview of the incoming code additions, along with their
goals.


### Checkpoint Content 

A **complete V1 checkpoint** encapsulates:
1. All FILE actions that make up the state of a version of a table:
    - Add actions (after action reconciliation)
    - Unexpired remove actions ([remove
    tombstones](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file))
2. All NON-FILE actions that make up the state of a version of a table:
    - Protocol action
    - Metadata action
    - Txn actions

A **single-file V2 checkpoint** is simply a superset of the actions
included in the V1 checkpoint schema, with the addition of the
`CheckpointMetadata` action (which must be generated on every write).
Since single-file V2 checkpoints will also leverage this visitor, we
have chosen to give it the general name `CheckpointVisitor`.
    
Note: 
- CDC, CommitInfo, Sidecar, and CheckpointMetadata actions are NOT part
of the **V1** checkpoint schema.
- Sidecar and CheckpointMetadata actions are part of the **V2**
checkpoint schema.
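
The schema membership described above can be summarized as a small lookup. This is only an illustrative sketch: `ActionKind`, `CheckpointSchema`, and `in_checkpoint_schema` are hypothetical names invented here, not kernel types, and the exclusion of CDC and CommitInfo from V2 is inferred from the "superset plus `CheckpointMetadata`" wording rather than stated explicitly.

```rust
/// Hypothetical tags for the action types mentioned in this description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActionKind {
    Add,
    Remove,
    Protocol,
    Metadata,
    Txn,
    Cdc,
    CommitInfo,
    Sidecar,
    CheckpointMetadata,
}

/// Hypothetical tag for the two checkpoint schemas discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckpointSchema {
    V1,
    V2,
}

/// Returns true if `kind` is part of the given checkpoint schema,
/// following the lists in this PR description.
fn in_checkpoint_schema(kind: ActionKind, schema: CheckpointSchema) -> bool {
    use ActionKind::*;
    match kind {
        // File and non-file actions shared by V1 and V2.
        Add | Remove | Protocol | Metadata | Txn => true,
        // Only part of the V2 schema.
        Sidecar | CheckpointMetadata => schema == CheckpointSchema::V2,
        // Assumption: excluded from both schemas (only V1 is stated explicitly).
        Cdc | CommitInfo => false,
    }
}
```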

### The new `CheckpointVisitor`
This visitor selects the **FILE** actions for a V1 spec checkpoint via a
selection vector:
1. Processes add/remove actions with proper deduplication based on path
and deletion vector ID pairs
2. Optimization: tracks already-seen file paths only for **commit
files**, because checkpoint batches are the last to be processed and
actions within checkpoint files do not conflict with one another.
3. Applies tombstone expiration logic by filtering out remove actions
with deletion timestamps older than the minimum file retention timestamp
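
The three file-action rules above can be sketched roughly as follows. This is a simplified model, not the visitor added in this PR: `FileAction` and `FileActionSelector` are hypothetical stand-ins that work on plain structs instead of columnar batches, and treating a tombstone whose timestamp exactly equals the threshold as expired is an assumption of the sketch.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for one file action row.
#[derive(Debug, Clone)]
enum FileAction {
    Add { path: String, dv_id: Option<String> },
    Remove { path: String, dv_id: Option<String>, deletion_timestamp: i64 },
}

/// Illustrative selector; not the kernel's `CheckpointVisitor`.
struct FileActionSelector {
    /// (path, deletion vector ID) pairs already seen in commit batches.
    seen: HashSet<(String, Option<String>)>,
    min_file_retention_timestamp: i64,
}

impl FileActionSelector {
    fn new(min_file_retention_timestamp: i64) -> Self {
        Self { seen: HashSet::new(), min_file_retention_timestamp }
    }

    /// Builds a selection vector for one batch. Batches are assumed to
    /// arrive newest-first, with checkpoint batches last.
    fn select(&mut self, batch: &[FileAction], is_log_batch: bool) -> Vec<bool> {
        let mut selection = Vec::with_capacity(batch.len());
        for action in batch {
            let (path, dv_id) = match action {
                FileAction::Add { path, dv_id } => (path, dv_id),
                FileAction::Remove { path, dv_id, .. } => (path, dv_id),
            };
            let key = (path.clone(), dv_id.clone());
            // Rule 1: a newer action for the same (path, DV ID) pair wins.
            if self.seen.contains(&key) {
                selection.push(false);
                continue;
            }
            // Rule 2: only commit (log) batches need to be remembered.
            if is_log_batch {
                self.seen.insert(key);
            }
            // Rule 3: drop expired tombstones. The `>` boundary is an assumption.
            let keep = match action {
                FileAction::Add { .. } => true,
                FileAction::Remove { deletion_timestamp, .. } => {
                    *deletion_timestamp > self.min_file_retention_timestamp
                }
            };
            selection.push(keep);
        }
        selection
    }
}
```

Note that in this sketch an expired tombstone is still recorded as seen, so an older add for the same file is not resurrected when it appears in a later batch.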

This visitor also selects the **NON-FILE** actions for a V1 spec
checkpoint via a selection vector:
1. Ensures exactly one protocol action is included (the newest one
encountered)
2. Ensures exactly one metadata action is included (the newest one
encountered)
3. Deduplicates transaction (txn) actions by app ID to include only the
newest action for each app ID
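
A rough sketch of the non-file rules, under the assumption that batches are replayed newest-first so that the first protocol, metadata, or txn action encountered is also the newest. `NonFileAction` and `NonFileActionSelector` are hypothetical names used only for illustration; the real visitor operates on columnar data rather than an enum.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for one non-file action row.
enum NonFileAction {
    Protocol,
    Metadata,
    Txn { app_id: String },
}

/// Illustrative selector; not the kernel's `CheckpointVisitor`.
#[derive(Default)]
struct NonFileActionSelector {
    seen_protocol: bool,
    seen_metadata: bool,
    seen_txn_apps: HashSet<String>,
}

impl NonFileActionSelector {
    /// Builds a selection vector for one batch, keeping only the first
    /// (assumed newest) protocol, metadata, and per-app txn action.
    fn select(&mut self, batch: &[NonFileAction]) -> Vec<bool> {
        let mut selection = Vec::with_capacity(batch.len());
        for action in batch {
            let keep = match action {
                // `replace` returns the previous flag, so only the first hit is kept.
                NonFileAction::Protocol => !std::mem::replace(&mut self.seen_protocol, true),
                NonFileAction::Metadata => !std::mem::replace(&mut self.seen_metadata, true),
                // `insert` returns true only the first time an app ID is seen.
                NonFileAction::Txn { app_id } => self.seen_txn_apps.insert(app_id.clone()),
            };
            selection.push(keep);
        }
        selection
    }
}
```

Because the selector keeps its state across calls, feeding it an older batch after a newer one only selects txn actions for app IDs that have not appeared yet.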



## How was this change tested?

`test_checkpoint_visitor` - Tests basic functionality with both file and
non-file actions, verifying correct counts and selection vector.

`test_checkpoint_visitor_boundary_cases_for_tombstone_expiration` -
Tests how tombstone expiration handles threshold boundary conditions.

`test_checkpoint_visitor_conflicting_file_actions_in_log_batch` -
Verifies duplicate path handling in log batches (keeping first, skipping
second).

`test_checkpoint_visitor_file_actions_in_checkpoint_batch` - Tests that
duplicate file actions are included in checkpoint batches.

`test_checkpoint_visitor_conflicts_with_deletion_vectors` - Tests file
deduplication with deletion vectors to ensure uniqueness.

`test_checkpoint_visitor_already_seen_non_file_actions` - Verifies that
pre-populated actions are skipped correctly.

`test_checkpoint_visitor_duplicate_non_file_actions` - Tests
deduplication of non-file actions (protocol, metadata, transactions).
## What changes are proposed in this pull request?
delta-io#699 removed `table_root` from the params. This PR updates the
docs to match.


Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

The `predicates` module has a confusing name because it sits alongside
the `expressions` module but does not actually define predicates (=
boolean-valued expressions) at all. Instead, it contains kernel's
implementation of predicate evaluation, used for e.g. data skipping and
partition pruning.

Rename the module to `kernel_predicates` and rename corresponding
classes like `PredicateEvaluator` to `KernelPredicateEvaluator`, to make
their purpose clearer. This also helps clear the way for us to split
out predicates as a separate concept from normal expressions (see
delta-io#765).

## How was this change tested?

Module and class renames only. Compilation suffices.