forked from delta-io/delta-kernel-rs
Bump lockfile #8
Open: thevar1able wants to merge 61 commits into main from bump-lockfile
## What changes are proposed in this pull request?

Starting in arrow-53.3, the parquet reader no longer computes NULL masks for non-nullable leaf columns, even if they have nullable ancestors. This breaks row visitors, which rely on each leaf column having a fully accurate NULL mask. The quick-fix solution is to manually fix up the null masks of every `RecordBatch` that comes from the parquet reader.

Fixes delta-io#691

## How was this change tested?

New unit test that checks whether parquet reads produce properly nested NULL masks. The test also leverages (and verifies) the JSON parser, so we can reliably detect any unwelcome behavior changes to JSON parsing that might land in the future.
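To make the null-mask fix-up concrete, here is a minimal sketch of the idea using a simplified stand-in for a nested column. The `Column` type and `fix_nested_null_masks` function are hypothetical and do not use the arrow API; the actual fix operates on `RecordBatch` data from the parquet reader.

```rust
/// Simplified stand-in for a nested Arrow column (hypothetical type):
/// a validity mask (`true` = valid) plus child columns of the same length.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    validity: Vec<bool>,
    children: Vec<Column>,
}

/// Push each parent's NULLs down into its descendants, so that every leaf
/// mask also reflects the NULLs of all of its ancestors.
fn fix_nested_null_masks(column: &mut Column, parent_validity: Option<&[bool]>) {
    if let Some(parent) = parent_validity {
        for (valid, parent_valid) in column.validity.iter_mut().zip(parent) {
            // A row is valid only if it is valid here AND in every ancestor.
            *valid = *valid && *parent_valid;
        }
    }
    // Clone the (now fixed) mask so the children can be borrowed mutably.
    let validity = column.validity.clone();
    for child in &mut column.children {
        fix_nested_null_masks(child, Some(&validity));
    }
}
```

Calling `fix_nested_null_masks(&mut root, None)` on a struct column whose second row is NULL marks the second row of every descendant as NULL too, which is the property the row visitors depend on.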
…o#700)

Fixes delta-io#698

## What changes are proposed in this pull request?

Updates the `DataSkippingFilter` to treat all columns as nullable for the purpose of parsing stats, as suggested in delta-io#698 (comment). This is particularly important for partition columns, which won't have values present in stats. But stats are usually stored only for the first 32 columns, so we shouldn't rely on stats being present for non-partition fields either.

## How was this change tested?

I've added a new unit test. I've also tested building duckdb-delta with this change (cherry-picked onto 0.6.1) and verified that the code in delta-io#698 now works.
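A rough sketch of what "treat all columns as nullable" means for a stats schema, assuming a toy schema model. `Field`, `DataType`, and `as_nullable` are hypothetical names, not the kernel's schema types.

```rust
/// Toy schema model (hypothetical, not the kernel's types).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    String,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Return a copy of `field` in which the field and all nested fields are
/// nullable, so that missing stats parse as NULL instead of failing.
fn as_nullable(field: &Field) -> Field {
    let data_type = match &field.data_type {
        DataType::Struct(children) => {
            DataType::Struct(children.iter().map(as_nullable).collect())
        }
        other => other.clone(),
    };
    Field {
        name: field.name.clone(),
        data_type,
        nullable: true,
    }
}
```

With a schema relaxed this way, a partition column or a column beyond the stats limit simply yields NULL stats, which a data skipping predicate can treat as "unknown".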
…_snapshot_schema (delta-io#683) ## What changes are proposed in this pull request? When given a schema (e.g. in `global_scan_state`) the engine needs a way to visit this schema. This introduces a new API `visit_schema` to allow engines to visit any schema over FFI. An API called `visit_schema` previously existed but visited the schema of a given _snapshot_; this has now been renamed to `visit_snapshot_schema`. ### This PR affects the following public APIs Renamed `visit_schema` to `visit_snapshot_schema` and now `visit_schema` takes `SharedSchema` as an argument instead of a snapshot. ## How was this change tested? updated read_table test
…o#654) This change introduces arrow_53 and arrow_54 feature flags on kernel which are _required_ when using default-engine or sync-engine. Fundamentally we must push users of the crate to select their arrow major version through flags since Cargo _will_ include multiple major versions in the dependency tree which can cause ABI breakages when passing around symbols such as `RecordBatch` See delta-io#640 --------- Signed-off-by: R. Tyler Croy <[email protected]>
Our previous write protocol check was too strict. Now we just ensure that the protocol makes sense given what features are present/specified. Made all existing `write.rs` tests also write to a protocol 1/1 table, and they all work.
Use new transform functionality to transform data over FFI. This lets us get rid of all the gross partition adding code in c :) In particular: - remove `add_partition_columns` in `arrow.c`, we don't need it anymore - expose ffi methods to get an expression evaluator and evaluate an expression from `c` - use the above to add an `apply_transform` function in `arrow.c` ## How was this change tested? - existing tests
…ta-io#709) ## What changes are proposed in this pull request? This PR removes the old `visit_snapshot_schema` introduced in delta-io#683 - we should just go ahead and do the 'right thing' with having a `visit_schema` (introduced in the other PR) and a `logical_schema()` function (added here) in order to facilitate visiting the snapshot schema. Additionally I've moved the schema-related things up from `scan` module to top-level in ffi crate. Exact changes listed below; this PR updates tests/examples to leverage the new changes. ### This PR affects the following public APIs 1. Remove `visit_snapshot_schema()` API 2. Add a new `logical_schema(snapshot)` API so you can get the schema of a snapshot and use the `visit_schema` directly 3. Renames `free_global_read_schema` to just `free_schema` 4. Moves `SharedSchema` and `free_schema` up from `mod scan` into top-level `ffi` crate. ## How was this change tested? existing UT
If we try and have a `need_arrow` flag we can make that include the code line: `pub use arrow_53::*` but we _cannot_ have it actually pull in the dependency. Pulling in the dependency is purely expressed in `Cargo.toml`, so the `use` just fails because we don't _have_ an arrow_53 dependency in that case. We can do some gross stuff in `build.rs` to inject the dependency, but even that doesn't apply to crates that depend on us so it only works if just compiling `delta-kernel` but isn't actually helpful for the use case we want. So I kept the `need-arrow` dep as a way for us to express that something needs arrow enabled, but rather than trying to do the import, it just issues a `compile_error` asking you to pick an arrow version. Perhaps we can get something more clever in the future, but for now let's unblock things. Also have test-utils depend on the arrow feature
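The `compile_error` approach described above might look roughly like the following. The feature names are taken from the commit messages, but the exact `cfg` condition, the message text, and the `arrow_version_selected` helper are assumptions for illustration.

```rust
// Fail the build when something needs arrow but no arrow version feature
// was selected. Feature names follow the commit text; the condition itself
// is an assumed sketch.
#[cfg(all(
    feature = "need-arrow",
    not(any(feature = "arrow_53", feature = "arrow_54"))
))]
compile_error!("Arrow is required: enable either the `arrow_53` or the `arrow_54` feature");

/// Hypothetical helper: true when an arrow version feature is enabled in
/// this build.
pub fn arrow_version_selected() -> bool {
    cfg!(any(feature = "arrow_53", feature = "arrow_54"))
}
```

The point of this design is that the dependency itself can only be selected in `Cargo.toml`, so the code can do no more than detect the missing selection and report it early.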
release 0.7.0
## What changes are proposed in this pull request?

The current chrono release, 0.4.40, breaks building arrow. Pin chrono to a prior version that does not break arrow.

## How was this change tested?

CI/CD

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request? The FFI expression visitor code incorrectly passes a `(u64, u64)` pair to `visit_literal_decimal` callback, representing the upper and lower half of an `i128` decimal value. It should actually be `(i64, u64)` to preserve signedness. ### This PR affects the following public APIs The expression visitor callback `visit_literal_decimal` now takes `i64` for the upper half of a 128-bit int value. ## How was this change tested? Updated the example code.
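As an illustration of why the upper half should be signed, here is a small sketch of splitting an `i128` into an `(i64, u64)` pair and joining it again. `split_decimal` and `join_decimal` are hypothetical helper names, not the FFI functions themselves.

```rust
/// Split a 128-bit decimal value into a signed upper half and an unsigned
/// lower half (hypothetical helper).
fn split_decimal(value: i128) -> (i64, u64) {
    let hi = (value >> 64) as i64; // arithmetic shift keeps the sign
    let lo = value as u64; // truncates to the low 64 bits
    (hi, lo)
}

/// Reassemble the original value from the two halves.
fn join_decimal(hi: i64, lo: u64) -> i128 {
    ((hi as i128) << 64) | (lo as i128)
}
```

For a negative value such as `-1`, the upper half is `-1` as an `i64`. Declared as a `u64` it would read as `18446744073709551615`, which is the signedness problem this change addresses.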
)

## What changes are proposed in this pull request?

Make `get_partition_column_count` and `get_partition_columns` take a snapshot so engines can work this out at planning time without creating a scan. The previous methods to get this info out of a scan have been removed.

### This PR affects the following public APIs

The old functions that took a scan have been removed.

## How was this change tested?

New unit test
## What changes are proposed in this pull request?

The `Url` crate, with its default unicode backend, now has an MSRV of rustc `1.81`. Instead of fighting this, we will just bump our MSRV from `1.80` to `1.81`, seeing as (1) a large part of the ecosystem (datafusion, polars (though its README still states `1.80`), delta-rs, etc.) is already on `1.81` and (2) `Url` is a rather foundational crate, so if they bump it seems reasonable to assume that many consumers will too.

### This PR affects the following public APIs

Bumps MSRV from `1.80` to `1.81`.

## How was this change tested?

MSRV test
…ent futures (delta-io#711)

## What changes are proposed in this pull request?

The original `FileStream` API, though intended to make concurrent GET requests to the object store, actually made serial requests and relied on a hand-written poll function in order to implement `Stream`. This PR aims to make a minimal change in order to (1) increase performance for the JSON reader by issuing concurrent GET requests and (2) simplify the code by removing the need for a custom `Stream`, instead leveraging existing functions/adapters to convert the files to read into a `Stream` and issuing concurrent requests through the [`futures::stream::buffered`](https://docs.rs/futures/latest/futures/stream/struct.Buffered.html) adapter.

This is effectively a similar improvement to delta-io#595, but for the JSON reader.

### Specifically, the changes are:

1. Replace the `FileStream::new_async_read_iterator()` call (the manually-implemented `Stream`) with an inline implementation that converts the files slice into a `Stream` (via `stream::iter`) and uses the [`futures::stream::buffered`](https://docs.rs/futures/latest/futures/stream/struct.Buffered.html) adapter to concurrently execute file-opening futures. It then sends results across an `mpsc` channel to bridge the async/sync gap.
2. `JsonOpener` no longer implements `FileOpener` (which requires a synchronous `fn open()`) and instead directly exposes an `async fn open()` for easier/simpler use above. This removes all reliance on `FileStream`/`FileOpener` in the JSON reader.
3. Add a custom `ObjectStore` implementation, `OrderedGetStore`, to deterministically control the order in which GET request futures are resolved.

### This PR affects the following public APIs

- `DefaultJsonHandler::with_readahead()` renamed to `DefaultJsonHandler::with_buffer_size()`
- DefaultJsonHandler's default buffer size: 10 => 1000
- DefaultJsonHandler's default batch size: 1024 => 1000

## How was this change tested?

Added a test with a new `OrderedGetStore`, which resolves the GET requests in a jumbled order; the test expects results to be returned in the natural order of the requests. Additionally, manually validated that we went from serial JSON file reads to concurrent reads.
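The "issue requests concurrently, yield results in request order, bridge to sync code over a channel" pattern can be sketched with the standard library alone. Note that this swaps in OS threads for the `futures::stream::buffered` adapter the PR actually uses, and it does not model the buffer-size limit; `read_in_order` and its `fetch` parameter are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Start one "GET" per path concurrently and deliver the results over a
/// channel in the original request order (a std-only analogue, not the
/// PR's async implementation).
fn read_in_order(paths: Vec<String>, fetch: fn(&str) -> String) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Kick off every request up front so they run concurrently.
        let handles: Vec<_> = paths
            .into_iter()
            .map(|path| thread::spawn(move || fetch(&path)))
            .collect();
        // Join in submission order: a slow first request delays delivery
        // of later results but never reorders them.
        for handle in handles {
            let body = handle.join().expect("fetch thread panicked");
            if tx.send(body).is_err() {
                break; // the receiver was dropped, stop early
            }
        }
    });
    rx
}
```

A synchronous caller can then iterate the returned receiver like any other iterator, which is the role the `mpsc` channel plays in the change described above.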
release 0.8.0
…elta-io#679)

## What changes are proposed in this pull request?

### Summary

This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for V2 checkpoint support are:

- Item 1. Allow log segments to be built with V2 checkpoint files
- Item 2. Allow log segment `replay` functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 2.

This PR **does not introduce full v2Checkpoints reader/writer support**, as we are missing support for Item 1, meaning log segments can never have V2 checkpoint files in the first place. That functionality will be completed in [PR delta-io#685](delta-io#685), which is stacked on top of this PR. However, the changes to log `replay` done here are compatible with tables using V1 checkpoints, allowing us to safely merge the changes here.

### Changes

For each batch of `EngineData` from a checkpoint file:

1. Use the new `SidecarVisitor` to scan each batch for sidecar file paths embedded in sidecar actions.
2. If sidecar file paths exist:
   - Read the corresponding sidecar files.
   - Generate an iterator over batches of actions within the sidecar files.
   - Insert the sidecar batches that contain the add actions necessary to reconstruct the table's state into the top-level iterator.
   - **Note: the original checkpoint batch is still included in the iterator.**
3. If no sidecar file paths exist, move to the next batch and leave the original checkpoint batch in the iterator.

Notes:

- If the `checkpoint_read_schema` does not have file actions, we do not need to scan the batch with the `SidecarVisitor` and can leave the batch as-is in the top-level iterator.
- Multi-part checkpoints do not have sidecar actions, so we do not need to scan the batch with the `SidecarVisitor` and can leave the batch as-is in the top-level iterator.
- A batch may not include add actions, but only other actions (like txn, metadata, protocol). This is safe to leave in the iterator, as the non-file actions will be ignored.

resolves delta-io#670

## How was this change tested?

Although log segments cannot yet have V2 checkpoints, we can easily mock batches that include the sidecar actions we can encounter in V2 checkpoints.

- `test_sidecar_to_filemeta_valid_paths`
  - Tests handling of sidecar paths, which can be any of:
    - A relative path within the `_delta_log/_sidecars` directory that is just a file name
    - A relative path that has a parent (i.e. a directory component)
    - An absolute path

**Unit tests for process_single_checkpoint_batch:**

- `test_checkpoint_batch_with_no_sidecars_returns_none`
  - Verifies that if no sidecar actions are present, the checkpoint batch is returned unchanged.
- `test_checkpoint_batch_with_sidecars_returns_sidecar_batches`
  - Ensures that when sidecars are present, the corresponding sidecar files are read, and their batches are returned.
- `test_checkpoint_batch_with_sidecar_files_that_do_not_exist`
  - Tests behavior when sidecar files referenced in the checkpoint batch do not exist, ensuring an error is returned.
- `test_reading_sidecar_files_with_predicate`
  - Tests that sidecar files that do not match the passed predicate are skipped correctly.

**Unit tests for create_checkpoint_stream:**

- `test_create_checkpoint_stream_errors_when_schema_has_remove_but_no_sidecar_action`
  - Validates that if the schema includes the remove action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_errors_when_schema_has_add_but_no_sidecar_action`
  - Validates that if the schema includes the add action, it must also contain the sidecar column.
- `test_create_checkpoint_stream_returns_checkpoint_batches_as_is_if_schema_has_no_file_actions`
  - Checks that if the schema has no file actions, the checkpoint batches are returned unchanged.
- `test_create_checkpoint_stream_returns_checkpoint_batches_if_checkpoint_is_multi_part`
  - Ensures that for multi-part checkpoints, the batch is not visited, and checkpoint batches are returned as-is.
- `test_create_checkpoint_stream_reads_parquet_checkpoint_batch_without_sidecars`
  - Tests reading a Parquet checkpoint batch and verifying it matches the expected result.
- `test_create_checkpoint_stream_reads_json_checkpoint_batch_without_sidecars`
  - Verifies that JSON checkpoint batches are read correctly.
- `test_create_checkpoint_stream_reads_checkpoint_batch_with_sidecar`
  - Ensures that checkpoint files containing sidecar references return the additional corresponding sidecar batches correctly.
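The per-batch steps described in this commit can be sketched as follows, using toy types in place of `EngineData` and the `SidecarVisitor`. `Action`, `Batch`, `expand_checkpoint_batch`, and the `read_sidecar` callback are all hypothetical names for illustration.

```rust
/// Toy action model (hypothetical); the kernel works on `EngineData`.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add(String),
    Sidecar(String),
    Other(String),
}

type Batch = Vec<Action>;

/// Expand one checkpoint batch: keep the original batch, then append the
/// batches read from every sidecar file it references. A sidecar file that
/// cannot be read is an error.
fn expand_checkpoint_batch(
    batch: Batch,
    read_sidecar: &dyn Fn(&str) -> Result<Vec<Batch>, String>,
) -> Result<Vec<Batch>, String> {
    // Step 1: scan the batch for sidecar paths (the visitor's job).
    let sidecar_paths: Vec<String> = batch
        .iter()
        .filter_map(|action| match action {
            Action::Sidecar(path) => Some(path.clone()),
            _ => None,
        })
        .collect();
    // The original checkpoint batch always stays in the output.
    let mut out = vec![batch];
    // Step 2: read each referenced sidecar file and append its batches.
    for path in &sidecar_paths {
        out.extend(read_sidecar(path)?);
    }
    Ok(out)
}
```

A batch without sidecar actions passes through unchanged as a single-element result, matching the "no sidecar file paths exist" case above.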
## What changes are proposed in this pull request?

### Summary

This PR introduces foundational changes required for V2 checkpoint read support. The high-level changes required for V2 checkpoint support are:

- Item 1. Allow log segments to be built with V2 checkpoint files
- Item 2. Allow log segment replay functionality to retrieve actions from sidecar files if need be.

This PR specifically adds support for Item 1.

This PR enables support for the `v2Checkpoints` reader/writer table feature for delta kernel rust by:

1. Allowing snapshots to now leverage UUID-named checkpoints as part of their log segment.
2. Adding the `v2Checkpoints` feature to the list of supported reader features.

- This PR is stacked on Item 2 [here](delta-io#679). Golden table tests are included in this PR.
- More integration tests will be introduced in a follow-up PR tracked here: delta-io#671
- This PR stacks changes on top of delta-io#679. For the correct file diff view, [please only review these commits](https://github.com/delta-io/delta-kernel-rs/pull/685/files/501c675736dd102a691bc2132c6e81579cf4a1a6..3dcd0859be048dc05f3e98223d0950e460633b60)

resolves delta-io#688

### Changes

We already have the capability to recognize UUID-named checkpoint files with the variant `LogPathFileType::UuidCheckpoint(uuid)`. This PR does the following:

- Adds `LogPathFileType::UuidCheckpoint(_)` to the list of valid checkpoint file types that are collected during log listing
  - This addition allows V2 checkpoints to be included in log segments.
- Adds `ReaderFeatures::V2Checkpoint` to the list of supported reader features
  - This addition allows protocol & metadata validation to pass for tables with the `v2Checkpoints` reader feature
- Adds the `UnsupportedFeature` reader/writer feature for testing purposes.

## How was this change tested?

Test coverage for the changes required to support building log segments with V2 checkpoints:

- `test_uuid_checkpoint_patterns` (already exists, small update)
  - Verifies the behavior of parsing log file paths that follow the UUID-naming scheme
- `test_v2_checkpoint_supported`
  - Tests that the `ensure_read_supported()` func appropriately validates a protocol with `ReaderFeatures::V2Checkpoint`
- `build_snapshot_with_uuid_checkpoint_json`
- `build_snapshot_with_uuid_checkpoint_parquet` (already exists)
- `build_snapshot_with_correct_last_uuid_checkpoint`

Golden table tests:

- `v2-checkpoint-json`
- `v2-checkpoint-parquet`

Potential todos:

- Is it worth introducing a preference for V2 checkpoints vs V1 checkpoints if both are present in the log for a version?
- What about a preference for checkpoints referenced by `_last_checkpoint`?
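For readers unfamiliar with the UUID naming scheme, here is a rough sketch of recognizing such a file name, assuming the `<20-digit version>.checkpoint.<uuid>.<json|parquet>` layout from the Delta protocol. `parse_uuid_checkpoint`, `looks_like_uuid`, and `CheckpointFormat` are hypothetical names; the kernel's own parsing lives behind `LogPathFileType`.

```rust
#[derive(Debug, PartialEq)]
enum CheckpointFormat {
    Json,
    Parquet,
}

/// Loose UUID shape check: 36 characters, hyphens at the usual positions,
/// hex digits everywhere else.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Parse `<version>.checkpoint.<uuid>.<ext>` into its parts, or return
/// `None` if the name does not follow the assumed layout.
fn parse_uuid_checkpoint(file_name: &str) -> Option<(u64, String, CheckpointFormat)> {
    let parts: Vec<&str> = file_name.split('.').collect();
    if parts.len() != 4 || parts[1] != "checkpoint" {
        return None;
    }
    let version = parts[0];
    if version.len() != 20 || !version.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    if !looks_like_uuid(parts[2]) {
        return None;
    }
    let format = match parts[3] {
        "json" => CheckpointFormat::Json,
        "parquet" => CheckpointFormat::Parquet,
        _ => return None,
    };
    Some((version.parse().ok()?, parts[2].to_string(), format))
}
```

A classic single-file checkpoint name such as `00000000000000000010.checkpoint.parquet` has only three dot-separated parts, so this sketch returns `None` for it.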
Updates HDFS dependencies to newest versions according to [compatibility matrix](https://github.com/datafusion-contrib/hdfs-native-object-store?tab=readme-ov-file#compatibility). ## How was this change tested? I expect current CI pipeline to cover this since there is a [HDFS integration test](delta-io@1f57962). Also, I have run tests successfully (apart from code coverage due to missing CI secret) on [my fork](rzepinskip@d87922d). --------- Co-authored-by: Nick Lanham <[email protected]>
…enabled (delta-io#664)

## What changes are proposed in this pull request?

This PR adds two functions to TableConfiguration:

1. check whether the appendOnly table feature is supported
2. check whether the appendOnly table feature is enabled

It also enables writes on tables with the `AppendOnly` writer feature.

## How was this change tested?

I checked that write is supported on a Protocol with `WriterFeatures::AppendOnly`.

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

This PR is part of building support for reading V2 checkpoints. delta-io#498

This PR ports over existing delta-spark tests and the tables they create. This test coverage is necessary to ensure that V2 checkpoint files, whether written in JSON or Parquet, with or without sidecars, are read correctly and reliably.

This PR stacks changes on top of delta-io#685

resolves delta-io#671

## How are these tests generated?

The test cases are derived from `delta-spark`'s [CheckpointSuite](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48), which creates known valid tables, reads them, and asserts correctness. The process for adapting these tests is as follows:

1. I modified specific test cases of interest in `delta-spark` to persist their generated tables.
2. These tables were then compressed into `.tar.zst` archives and copied over to delta-kernel-rs.
3. Each test in this PR loads a stored table, scans it, and asserts that the returned table state matches the expected state (derived from the corresponding table insertions in `delta-spark`).

For example, this `delta-spark` test:

```
// Append operations and assertions on checkpoint versions
spark.range(1).repartition(1).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 1)
assert(getV2CheckpointProvider(deltaLog).sidecarFileStatuses.size == 1)
assert(getNumFilesInSidecarDirectory() == 1)

spark.range(30).repartition(9).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 2)
assert(getNumFilesInSidecarDirectory() == 3)

spark.range(100).repartition(9).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 3)
assert(getNumFilesInSidecarDirectory() == 5)

spark.range(100).repartition(11).write.format("delta").mode("append").save(path)
assert(getV2CheckpointProvider(deltaLog).version == 4)
assert(getNumFilesInSidecarDirectory() == 9)
}
```

translates to an expected table state in the kernel:

```
let mut expected = [
    header,
    vec!["| 0 |".to_string(); 3],
    generate_rows(30),
    generate_rows(100),
    generate_rows(100),
    generate_rows(1000),
    vec!["+-----+".to_string()],
]
```

## How was this change tested?

Tables from test cases of interest in delta-spark's [`CheckpointSuite`](https://github.com/delta-io/delta/blob/1a0c9a8f4232d4603ba95823543f1be8a96c1447/spark/src/test/scala/org/apache/spark/sql/delta/CheckpointsSuite.scala#L48) have been compressed into `.tar.zst` archives. They are read by the kernel and the resulting tables are asserted for correctness.

- `v2_checkpoints_json_with_sidecars`
- `v2_checkpoints_parquet_with_sidecars`
- `v2_checkpoints_json_without_sidecars`
- `v2_checkpoints_parquet_without_sidecars`
- `v2_classic_checkpoint_json`
- `v2_classic_checkpoint_parquet`
- `v2_checkpoints_parquet_with_last_checkpoint`
- `v2_checkpoints_json_with_last_checkpoint`
## What changes are proposed in this pull request? Add basic support for partition pruning by combining two pieces of existing infra: 1. The log replay row visitor already needs to parse partition values and already filters out unwanted rows 2. The default predicate evaluator works directly with scalars Result: partition pruning gets applied during log replay, just before deduplication so we don't have to remember pruned files. WARNING: The implementation currently has a flaw, in case the history contains a table-replace that affected partition columns. For example, changing a value column into a non-nullable partition column, or an incompatible type change to a partition column. In such cases, the remove actions generated by the table-replace operation (for old files) would have the wrong type or even be entirely absent. While the code can handle an absent partition value, an incompatibly typed value would cause a parsing error that fails the whole query. Note that stats-based data skipping already has the same flaw, so we are not making the problem worse. We will fix the problem for both as a follow-up item, tracked by delta-io#712 NOTE: While this is a convenient way to achieve partition pruning in the immediate term, Delta [checkpoints](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1) can provide strongly-typed `stats_parsed` and `partitionValues_parsed` columns which would have a completely different access. * For `stats` vs. `stats_parsed`, the likely solution is simple enough because we already json-parse `stats` into a strongly-typed nested struct in order to evaluate the data skipping predicate over its record batch. We just avoid the parsing overhead if `stats_parsed` is already available. * The `partitionValues` field poses a bigger challenge, because it's a string-string map, not a JSON literal. 
In order to turn it into a strongly-typed nested struct, we would need a SQL expression that can extract the string values and try-cast them to the desired types. That's ugly enough we might prefer to keep completely different code paths for parsed vs. string partition values, but then there's a risk that partition pruning behavior changes depending on which path got invoked. ## How was this change tested? New unit tests, and adjusted one unit test that assumed no partition pruning.
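A minimal sketch of pruning on a string-string `partitionValues` map, assuming a single predicate of the form `column > literal` over an integer partition column. `keep_file` is a hypothetical name, and keeping files with an absent value is an assumed conservative policy rather than the kernel's documented behavior.

```rust
use std::collections::HashMap;

/// Decide whether a file survives the predicate `column > min_exclusive`.
/// Partition values arrive as strings and must be parsed before comparing.
fn keep_file(
    partition_values: &HashMap<String, String>,
    column: &str,
    min_exclusive: i64,
) -> Result<bool, String> {
    match partition_values.get(column) {
        // Absent value: the predicate cannot be evaluated, so do not prune.
        None => Ok(true),
        Some(raw) => {
            // An incompatibly typed value is a parse error that fails the
            // query, mirroring the flaw called out in the WARNING above.
            let value: i64 = raw
                .parse()
                .map_err(|e| format!("bad partition value {raw:?}: {e}"))?;
            Ok(value > min_exclusive)
        }
    }
}
```

Applying a check like this during log replay, before deduplication, means pruned files never need to be remembered, which is the design choice the commit message describes.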
## What changes are proposed in this pull request? Add `DeletionVectors` to supported writer features. We trivially support DVs since we never write DVs. Note as we implement DML in the future we need to ensure it correctly handles DVs ## How was this change tested? modified UT
) ## What changes are proposed in this pull request? Support writer version 2 and `Invariant` table (writer) feature. Note that we don't _actually_ support invariants, rather we enable writing to tables **without invariants** with version=2 or Invariant feature enabled. ### This PR affects the following public APIs Enable writes to version=2/Invariant enabled. ## How was this change tested? new UTs resolves delta-io#706
## What changes are proposed in this pull request?

`MetadataValue::Number(i32)` was the previous type for numeric metadata values, but identity columns are only longs, so `MetadataValue::Number` is now `MetadataValue::Number(i64)` instead.

## How was this change tested?

I ran the tests. This doesn't change any existing functionality, only the type.

---------

Co-authored-by: Robert Pack <[email protected]>
## What changes are proposed in this pull request?

The `actions-rs/toolchain` action is deprecated in favor of `actions-rust-lang/setup-rust-toolchain`. This PR updates the usages of the respective actions in the GitHub workflows. The new action already includes an integration with the rust-cache action, so there is no need to set that up separately anymore.

This also sets up a dependabot configuration for `cargo` and `github-actions`, which we may or may not choose to keep.

## How was this change tested?

No code changes.

---------

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request? Split arrow_expression.rs into a dedicated submodule. Move tests and utility functions to separate files for better organization and maintainability. This solves the [issue 745](delta-io#745). ## How was this change tested? existing UT Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request? Small code readability improvement -- instead of a triple-nested match, flatten it out to a single level. ## How was this change tested? Existing unit tests. No new functionality.
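The commit does not show the code it touched, so the following is only a generic illustration of flattening a triple-nested `match` into a single level by matching on a tuple. Both functions are hypothetical and unrelated to the kernel's actual code.

```rust
/// Triple-nested form: each level handles its own `None` case.
fn sum_nested(a: Option<i32>, b: Option<i32>, c: Option<i32>) -> Option<i32> {
    match a {
        Some(x) => match b {
            Some(y) => match c {
                Some(z) => Some(x + y + z),
                None => None,
            },
            None => None,
        },
        None => None,
    }
}

/// Flattened form: one match over a tuple, with a single fallback arm.
fn sum_flat(a: Option<i32>, b: Option<i32>, c: Option<i32>) -> Option<i32> {
    match (a, b, c) {
        (Some(x), Some(y), Some(z)) => Some(x + y + z),
        _ => None,
    }
}
```

The two functions behave identically; the flat version just makes the one interesting case and the shared fallback visible at a glance.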
…y(MapType)` (delta-io#757) title. fixing a panic: return error instead.
## What changes are proposed in this pull request? Our `list_from` implementation for the object_store based filesystem client is currently broken, since it does not behave as documented / required for that function. Specifically we should list all files in the parent folder for using the path as offset to list from. In a follow up PR we then need to lift the assumption that all URLs will always be under the same store to get proper URL handling. ### This PR affects the following public APIs - `DefaultEngine::new` no longer requires a `table_root` parameter. - `list_from` consistently returns keys greater than (`>`) the offset, previously the `sync-engines` client returned all keys (`>=`) ## How was this change tested? Additional tests asserting consistent `list_from` behavior for all our file system client implementations. --------- Signed-off-by: Robert Pack <[email protected]> Co-authored-by: Ryan Johnson <[email protected]> Co-authored-by: Zach Schuermann <[email protected]>
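The strict `>` contract for `list_from` can be illustrated with a small in-memory model. This is only a sketch: the `BTreeSet` stands in for an object store listing, and the free function below is a hypothetical stand-in, not the kernel's filesystem client API.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Returns every key that sorts strictly after `offset`, in lexicographic order.
/// The offset itself is never returned, even if it exists in `keys`.
fn list_from<'a>(keys: &'a BTreeSet<String>, offset: &str) -> Vec<&'a str> {
    keys.range::<str, _>((Bound::Excluded(offset), Bound::Unbounded))
        .map(String::as_str)
        .collect()
}
```

Using an excluded lower bound is what distinguishes this from the old `>=` behavior of the `sync-engines` client: a caller that passes the last key it has already seen gets only newer keys back.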
## What changes are proposed in this pull request? Current checks for a pre-signed URL in our readers may misclassify object store URLs that do not use store-specific schemes (`s3://`, `az://`, ...) as pre-signed URLs. We extend the current checks to also consider the presence of a query string, and define a helper trait to ensure we can maintain the check in one place. ## How was this change tested? Additional test to validate URL matches.
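A minimal sketch of such a check, assuming the rule is "plain `http(s)` scheme plus a non-empty query string". The trait name `PresignedCheck`, the method `looks_presigned`, and the exact rule are illustrative assumptions; the real helper trait may differ and likely operates on a parsed URL type rather than on `str`.

```rust
/// Hypothetical helper trait so the pre-signed check lives in one place.
trait PresignedCheck {
    fn looks_presigned(&self) -> bool;
}

impl PresignedCheck for str {
    fn looks_presigned(&self) -> bool {
        // Split "scheme://rest"; anything without a scheme is not a URL we classify.
        let (scheme, rest) = match self.split_once("://") {
            Some(parts) => parts,
            None => return false,
        };
        let is_http =
            scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https");
        // Ignore any fragment, then look for a non-empty query string.
        let before_fragment = rest.split('#').next().unwrap_or(rest);
        let has_query = match before_fragment.split_once('?') {
            Some((_, query)) => !query.is_empty(),
            None => false,
        };
        is_http && has_query
    }
}
```

With this rule, an `https://` object store URL without a query string is no longer treated as pre-signed, which is the misclassification the PR describes.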
…elta-io#766) small change: `replay` generally refers to the work we do to resolve actions during log replay. I've renamed `LogSegment::replay` to `LogSegment::read_actions` since this method does just that: reads actions, but doesn't actually do tangible 'replay' work (Add/Remove matching, etc.). just docs/rename. no material changes.
…expression pub(crate) (delta-io#767) ## What changes are proposed in this pull request? These were never intended to be public, so just fixing that. ### This PR affects the following public APIs - `visit_expression_internal`: is private - `unwrap_kernel_expression`: is `pub(crate)` ## How was this change tested? Existing tests
We don't want to just have all `action` types pub. This moves everything to `pub(crate)` but turns them `pub` with dev-visibility. Breaking change: actions types are now all private (`Metadata`, `Protocol`, `Add`, etc.)
…o#772) ## What changes are proposed in this pull request? According to [arrow docs] we should ensure that we have `-C target-cpu=native` for rustc to generate best available instructions (SIMD, etc.) for our architecture [arrow docs]: https://crates.io/crates/arrow
## What changes are proposed in this pull request?

Adds a new required method, `new_null`, for creating a new single-row null literal `EngineData`. Then, we provide the `create_one` API for creating single-row `EngineData` by implementing a `SchemaTransform` (`LiteralExpressionTransform`) to transform the given schema + leaf values into an `Expression` which evaluates to literal values at the leaves of the schema (implemented in a new private `ExpressionHandlerExtension` trait).

1. Adds the new required `fn new_null` to our `ExpressionHandler` trait (breaking)
2. Adds the new provided `fn create_one` to an `ExpressionHandlerExtension` trait
3. Implements `new_null` for `ArrowExpressionHandler`

Additionally, adds a new `fields_len()` method to `StructType`.

### This PR affects the following public APIs

1. breaking: new `new_null` API for `ExpressionHandler`
2. breaking: new `LiteralExpressionTransformError`

## How was this change tested?

Bunch of new unit tests. For the nullability tests of our new `SchemaTransform` we came up with a set of 24 exhaustive test cases:

```
test cases: x, a, b are nullable (n) or not-null (!). we have 6 interesting nullability
combinations:
1. n { n, n }
2. n { n, ! }
3. n { !, ! }
4. ! { n, n }
5. ! { n, ! }
6. ! { !, ! }

and for each we want to test the four combinations of values ("a" and "b" just chosen as
arbitrary scalars):
1. (a, b)
2. (N, b)
3. (a, N)
4. (N, N)

here's the full list of test cases with expected output:

n { n, n }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> x (a, N)
4. (N, N) -> x (N, N)

n { n, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> Err
4. (N, N) -> x NULL

n { !, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> Err
3. (a, N) -> Err
4. (N, N) -> x NULL

! { n, n }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> x (a, N)
4. (N, N) -> x (N, N)

! { n, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> x (N, b)
3. (a, N) -> Err
4. (N, N) -> NULL

! { !, ! }
1. (a, b) -> x (a, b)
2. (N, b) -> Err
3. (a, N) -> Err
4. (N, N) -> NULL
```
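The table above can be condensed into one rule, sketched below as a standalone function. This is a model of the expected outputs only, not the `LiteralExpressionTransform` implementation; the `Outcome` enum and `evaluate` function are hypothetical names, and reading `x NULL` as "the struct `x` is null" and bare `NULL` as "the null propagates past the non-nullable `x`" is an interpretation of the table.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    /// `x (a, b)`: the struct is present with these leaf values.
    Present(Option<char>, Option<char>),
    /// `x NULL`: the nullable struct `x` itself becomes null.
    NullStruct,
    /// `NULL`: `x` is non-nullable, so the null propagates above it.
    NullRow,
    /// `Err`: a non-nullable leaf is null while its sibling has a value.
    Err,
}

fn evaluate(
    x_nullable: bool,
    a_nullable: bool,
    b_nullable: bool,
    a: Option<char>,
    b: Option<char>,
) -> Outcome {
    let a_violates = a.is_none() && !a_nullable;
    let b_violates = b.is_none() && !b_nullable;
    if !a_violates && !b_violates {
        // Every null sits in a nullable leaf, so the values are kept as given.
        return Outcome::Present(a, b);
    }
    if a.is_none() && b.is_none() {
        // All leaves are null: treat the whole struct as null instead of failing.
        if x_nullable {
            Outcome::NullStruct
        } else {
            Outcome::NullRow
        }
    } else {
        // Only some leaves are null, so the struct cannot be null as a whole.
        Outcome::Err
    }
}
```

For example, `n { n, ! }` with `(N, N)` yields `NullStruct`, while the same schema with `(a, N)` yields `Err`, matching rows 4 and 3 of that group.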
…to embeddable `FileActionsDeduplicator` (delta-io#769)

## What changes are proposed in this pull request?

**No behavioral changes were introduced; this is purely a refactoring effort.**

This PR extracts the core deduplication logic from the `AddRemoveDedupVisitor` so that it can be shared with the incoming `CheckpointVisitor`. For a bigger-picture view of how this refactor is helpful, please take a look at the following PR, which implements the `CheckpointVisitor` with an embedded `FileActionDeduplicator` and will be rebased onto this PR once it is merged: [[link to PR]](delta-io#738).

This `FileActionDeduplicator` lives in the new top-level `log_replay` mod, as it will be leveraged in the nested `scan/log_replay` mod and the incoming `checkpoints/log_replay` mod. There are also additional traits & structs that the two `log_replay` implementations will share via this new top-level mod. For an even wider view of the implementation of the `checkpoints` mod and the component re-use, please have a look at the following PR: [[link to PR]](delta-io#744)

## Summary of refactor

1. New `log_replay` mod
2. Moved the `FileActionKey` definition from `scan/log_replay` to the new `log_replay` mod
3. New `FileActionDeduplicator` in the new `log_replay` mod
   - Includes the `check_and_record_seen` method, which was simply **moved** from the `AddRemoveDedupVisitor`
   - Includes the `extract_file_action` method and the private `extract_dv_unique_id` method. These may look like new concepts, but both contain functionality pulled out of the `AddRemoveDedupVisitor` so that it can be shared with the incoming `CheckpointVisitor`

## How was this change tested?

All existing tests pass ✅
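To make the shared deduplication step concrete, here is a minimal sketch of what a `check_and_record_seen`-style method might look like. The struct layout, field names, and the choice to record keys only for commit (log) batches are assumptions made for illustration; only the method name and the `FileActionKey` concept come from the description above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `FileActionKey`: a file path plus an optional
/// deletion vector id.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct FileKey {
    path: String,
    dv_unique_id: Option<String>,
}

/// Hypothetical deduplicator that borrows a "seen" set shared across batches.
struct Deduplicator<'seen> {
    seen: &'seen mut HashSet<FileKey>,
    is_log_batch: bool,
}

impl Deduplicator<'_> {
    /// Returns `true` if `key` was already seen in a newer batch.
    /// New keys are recorded only while processing commit (log) batches.
    fn check_and_record_seen(&mut self, key: FileKey) -> bool {
        if self.seen.contains(&key) {
            return true;
        }
        if self.is_log_batch {
            self.seen.insert(key);
        }
        false
    }
}
```

Because the set is borrowed rather than owned, a scan visitor and a checkpoint visitor could each embed a deduplicator like this while the caller keeps the set alive across batches.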
)

## What changes are proposed in this pull request?

This PR refactors test helper functions by moving them to the `test_utils` module:

- `string_array_to_engine_data`
- `parse_json_batch`
- `action_batch`

This change is a preparatory step for delta-io#738, which will leverage these functions in a new checkpoint module.

## How was this change tested?

No behavioral changes, all current tests pass ✅
## What changes are proposed in this pull request?

This PR introduces the `CheckpointMetadata` action. This action is only allowed in checkpoints following the [V2 spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#v2-spec). For more information: [[link to protocol]](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoint-metadata)

This PR is part of the ongoing effort to implement single-file checkpoint write support. For reference, [[link to write API proposal]](delta-io#779)

There already exists a struct named `CheckpointMetadata` [[link]](https://github.com/delta-io/delta-kernel-rs/blob/9290930bbeb1100e7af98c228dbd339eea38143a/kernel/src/snapshot.rs#L149), which represents the `_last_checkpoint` file. This PR renames that struct to `LastCheckpointHint`.

## How was this change tested?

- `test_checkpoint_metadata_schema`: tests schema projection
…tCheckpointHint` (delta-io#789)

## What changes are proposed in this pull request?

This PR renames `CheckpointMetadata` to `LastCheckpointHint` so that it does not clash with the incoming `CheckpointMetadata` [[link to PR]](delta-io#781).

Moved to another PR: ~~This PR also changes the types of fields in the `LastCheckpointHint`. This was done to unblock the in-flight work for single-file checkpoint write support, which includes the creation of the `_last_checkpoint` file. As `usize` and `u64` primitives are not supported, and we want to avoid a mismatch between the field types we read vs. write, we are updating the types to `i64`. There is an overarching GitHub issue for converting field types to unsigned integers (`u64`, `usize`) where semantically appropriate (such as for `LastCheckpointHint.version`) and adding proper support for these primitive data types throughout the codebase: delta-io#786~~

## How was this change tested?

No new behavioral changes. All current tests pass ✅
…io#800) builds currently broken due to rustc 1.86 changing an error message + new clippy lints. this PR updates expected output to the new error and does two clippy lints: 1. `next_back()` instead of `last()` on iterators 2. docstring indentation updates
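The first lint is about double-ended iterators: `last()` walks the whole iterator from the front, while `next_back()` reads the final element directly. A minimal illustration, using a hypothetical `file_name` helper that is not taken from the PR:

```rust
/// Returns the final `/`-separated segment of `path`.
fn file_name(path: &str) -> Option<&str> {
    // `split` on a `char` yields a `DoubleEndedIterator`, so `next_back()`
    // takes the last segment without consuming the iterator from the front.
    path.split('/').next_back()
}
```

Both spellings return the same value here; the lint only asks for the cheaper one.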
…o#802) ## What changes are proposed in this pull request? Renamed ReaderFeatures and WriterFeatures to ReaderFeature and WriterFeature ## How was this change tested? `cargo test`
…als (delta-io#803) ## What changes are proposed in this pull request? The code for `Expression::references` uses an adhoc expression traversal instead of an expression transform, and the default parquet reader's `compute_field_indices` method uses an adhoc expression traversal instead of invoking `Expression::references`. Fix them both. ## How was this change tested? Existing unit tests.
…essor` trait (delta-io#774)

**No behavioral changes were introduced; this is purely a refactoring effort.**

## What changes are proposed in this pull request?

This PR refactors the newly renamed `ScanLogReplayProcessor` to implement the newly introduced `LogReplayProcessor` trait, in order to unify components with the incoming `CheckpointLogReplayProcessor`. For a bigger-picture view of how this refactor is helpful, please have a look at delta-io#744, which introduces the `CheckpointLogReplayProcessor` (an implementation of the `LogReplayProcessor` trait) and will be rebased onto this PR once it is merged.

### Summary of changes

- Introduced the `LogReplayProcessor` trait in the top-level `log_replay` mod
- Renamed `LogReplayScanner` in `scan/log_replay` to `ScanLogReplayProcessor`
- Updated `ScanLogReplayProcessor` to implement the `LogReplayProcessor` trait
  - This requires pushing down the `add_transform`, `logical_schema`, and `transform` fields to the `ScanLogReplayProcessor` so that they are accessible in the `process_actions_batch` trait method, since the trait method has fixed parameters.
- Simplified the `scan_action_iter` function to use the trait's `apply_to_iterator` method

## How was this change tested?

All existing tests pass ✅
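The shape of this refactor (one required per-batch method, one provided driver method) can be sketched as follows. The method names `process_actions_batch` and `apply_to_iterator` come from the description above; the associated types, signatures, eager `Vec` return, and the toy `PathCounter` processor are all simplifying assumptions and not the kernel's actual trait.

```rust
use std::collections::HashSet;

/// Simplified, hypothetical version of a log-replay processor trait.
trait LogReplayProcessor {
    type Batch;
    type Output;

    /// Processes one batch of actions. The parameters are fixed by the trait,
    /// so any extra state must live in fields of the implementor.
    fn process_actions_batch(
        &mut self,
        batch: Self::Batch,
        is_log_batch: bool,
    ) -> Result<Self::Output, String>;

    /// Provided driver: feeds every batch through `process_actions_batch`.
    fn apply_to_iterator<I>(&mut self, batches: I) -> Result<Vec<Self::Output>, String>
    where
        I: IntoIterator<Item = (Self::Batch, bool)>,
    {
        batches
            .into_iter()
            .map(|(batch, is_log_batch)| self.process_actions_batch(batch, is_log_batch))
            .collect()
    }
}

/// Toy processor: counts the paths in each batch that were not seen before.
struct PathCounter {
    seen: HashSet<String>,
}

impl LogReplayProcessor for PathCounter {
    type Batch = Vec<String>;
    type Output = usize;

    fn process_actions_batch(
        &mut self,
        batch: Vec<String>,
        _is_log_batch: bool,
    ) -> Result<usize, String> {
        Ok(batch
            .into_iter()
            .filter(|path| self.seen.insert(path.clone()))
            .count())
    }
}
```

The fixed parameter list is why the PR pushes `add_transform`, `logical_schema`, and `transform` down into the processor struct: the trait method cannot take them as arguments.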
## What changes are proposed in this pull request? Rust doesn't encourage the `get_` prefix for getters because it's redundant and anyway a getter is allowed to have the same name as the field it exposes. Remove the prefix from the various engine interfaces. Additionally, rename `get_evaluator` as `new_expression_evaluator` to accurately reflect that it is _NOT_ a getter at all, but actually creates a new expression evaluator. Finally, we also rename `ExpressionHandler` to `EvaluationHandler` because that trait is used to create expression _evaluators_, not expressions. Additionally, future work will differentiate generic "expressions" from "predicates" (boolean-valued expressions with special evaluation semantics), and that will likely necessitate defining a `EvaluationHandler::new_predicate_evaluator` method alongside the `new_expression_evaluator` method. ### This PR affects the following public APIs All the methods and traits we change are public. ## How was this change tested? Rename-only operation, no functional changes. Compilation suffices.
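The naming convention behind these renames can be shown with a tiny example. The `Table` type and its methods are hypothetical and exist only to contrast a true getter with a method that creates something new:

```rust
struct Table {
    version: u64,
}

impl Table {
    /// A getter shares its name with the field it exposes; no `get_` prefix.
    fn version(&self) -> u64 {
        self.version
    }

    /// Not a getter: this builds a fresh value on every call, so a `new_`
    /// prefix describes it more accurately than `get_` would.
    fn new_version_label(&self) -> String {
        format!("v{}", self.version)
    }
}
```

This mirrors the distinction drawn above between plain accessors and `new_expression_evaluator`, which constructs an evaluator rather than returning an existing one.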
…io#782)

## What changes are proposed in this pull request?

This PR introduces the following helper methods:

- `new_uuid_parquet_checkpoint`, which creates a new `ParsedCheckpointPath<Url>` for a UUID-named parquet checkpoint file at the specified version. The UUID naming scheme looks like `n.checkpoint.u.parquet`, where `u` is a UUID and `n` is the snapshot version that this checkpoint represents.
- `new_classic_parquet_checkpoint`, which creates a new `ParsedCheckpointPath<Url>` for a classic-named parquet checkpoint file at the specified version. The classic naming scheme looks like `n.checkpoint.parquet`, where `n` is the snapshot version that this checkpoint represents.

It also **updates the `uuid` dependency to always include the `v4` and `fast-rng` features:**

- This ensures that `Uuid::new_v4()` is always available.
- The `fast-rng` feature improves performance when generating UUIDs.

For more information on the two checkpoint naming schemes:

- https://github.com/delta-io/delta/blob/master/PROTOCOL.md#uuid-named-checkpoint
- https://github.com/delta-io/delta/blob/master/PROTOCOL.md#classic-checkpoint

This PR is part of the ongoing effort to implement single-file checkpoint write support. For reference, [[link to write API proposal]](delta-io#779)

## How was this change tested?

- `test_new_uuid_parquet_checkpoint`: verifies UUID-named Parquet checkpoint creation with proper attributes.
- `test_new_classic_parquet_checkpoint`: verifies classic-named Parquet checkpoint creation with proper attributes.
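The two naming schemes can be sketched as plain string formatting. The function names below are hypothetical (the PR's helpers return `ParsedCheckpointPath<Url>`, not `String`), the UUID is passed in as text so the sketch needs no external crate, and the 20-digit zero padding of the version follows the Delta protocol's file naming rather than anything stated in this PR description.

```rust
/// Classic scheme: `n.checkpoint.parquet`, with `n` zero-padded to 20 digits.
fn classic_checkpoint_name(version: u64) -> String {
    format!("{version:020}.checkpoint.parquet")
}

/// UUID scheme: `n.checkpoint.u.parquet`, where `u` is a UUID in text form.
fn uuid_checkpoint_name(version: u64, uuid: &str) -> String {
    format!("{version:020}.checkpoint.{uuid}.parquet")
}
```

In real code the UUID would come from something like `Uuid::new_v4()`, which is why the PR makes the `v4` feature unconditional.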
## What changes are proposed in this pull request? Continuing the cleanup started by delta-io#804, rename the class to eliminate the last vestiges of (ancient) "client" nomenclature. ### This PR affects the following public APIs The renamed class (and module) are public. ## How was this change tested? Rename-only operation, no functional changes. Compilation suffices.
This PR enables incremental snapshot updates. This is done with a new `Snapshot::try_new_from(...)` which takes an `Arc<Snapshot>` and an optional version (None = latest version) to incrementally create a new snapshot from the existing one. The heuristic is as follows: 1. if the new version == existing version, just return the existing snapshot 2. if the new version < existing version, error since the engine shouldn't really be here 3. list from (existing checkpoint version + 1, or version 1 if no checkpoint) onward (create a new 'incremental' `LogSegment`) 4. if no new commits/checkpoint, return existing snapshot (if requested version matches), else create new `LogSegment` 5. check for a checkpoint: a. if new checkpoint is found: just create a new snapshot from that checkpoint (and commits after it) b. if no new checkpoint is found: do lightweight P+M replay on the latest commits In addition to the 'main' `Snapshot::try_new_from()` API, the following incremental APIs were introduced to support the above implementation: 1. `TableConfiguration::try_new_from(...)` 2. splitting `LogSegment::read_metadata()` into `LogSegment::read_metadata()` and `LogSegment::protocol_and_metadata()` 3. new `LogSegment.checkpoint_version` field resolves delta-io#489
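The decision steps of this heuristic can be sketched as two small pure functions. Everything here is hypothetical scaffolding (the enums, function names, and error type are not the kernel's API), and the "if requested version matches" nuance of step 4 is left out for brevity.

```rust
#[derive(Debug, PartialEq)]
enum ListingPlan {
    /// Step 1: the requested version is the one we already have.
    ReuseExisting,
    /// Step 3: list the log starting from this version.
    ListFrom(u64),
}

/// Steps 1-3: decide whether a listing is needed and where it starts.
fn plan_listing(
    existing_version: u64,
    checkpoint_version: Option<u64>,
    requested: Option<u64>,
) -> Result<ListingPlan, String> {
    match requested {
        Some(v) if v == existing_version => return Ok(ListingPlan::ReuseExisting),
        // Step 2: asking for an older version than the existing snapshot is an error.
        Some(v) if v < existing_version => {
            return Err(format!(
                "requested version {v} is older than snapshot version {existing_version}"
            ))
        }
        _ => {}
    }
    // Checkpoint version + 1, or version 1 when there is no checkpoint.
    Ok(ListingPlan::ListFrom(checkpoint_version.map_or(1, |v| v + 1)))
}

#[derive(Debug, PartialEq)]
enum Refresh {
    ReuseExisting,
    RebuildFromCheckpoint,
    ReplayNewCommits,
}

/// Steps 4-5: decide how to build the new snapshot from the listing result.
fn choose_refresh(new_commits: usize, new_checkpoint: bool) -> Refresh {
    if new_checkpoint {
        Refresh::RebuildFromCheckpoint // 5a: start over from the new checkpoint
    } else if new_commits == 0 {
        Refresh::ReuseExisting // 4: nothing new was found
    } else {
        Refresh::ReplayNewCommits // 5b: lightweight P+M replay on new commits
    }
}
```

Keeping the decisions separate from the I/O makes it easy to see that only case 5b depends on the existing snapshot's state; the other outcomes either reuse it as-is or ignore it.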
## What changes are proposed in this pull request? Instead of `Protocol` retaining lists of `String`s that represent `ReaderFeature`s/`WriterFeature`s, we now embed the fully parsed `ReaderFeature`s and `WriterFeature`s. This changes the `Protocol.reader_features` and `Protocol.writer_features` fields from `Option<Vec<String>>` to optional vecs of `ReaderFeature` and `WriterFeature`, respectively. Critically, this includes a new `Unknown` variant, which allows us to parse all possible strings in the protocol into `ReaderFeature`s and `WriterFeature`s but later detect whether unknown features are present and fail when ensuring reader/writer features are supported. ### This PR affects the following public APIs Breaking: new `ReaderFeature::Unknown(String)` and `WriterFeature::Unknown(String)` variants (note that `Protocol` and its fields are `pub(crate)`) ## How was this change tested? UT modification --------- Co-authored-by: Zach Schuermann <[email protected]>
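The "parse everything, reject later" pattern enabled by the `Unknown` variant can be sketched like this. The enum below lists only two known features for brevity, and the `From<&str>` impl and `ensure_supported` function are illustrative assumptions rather than the kernel's actual parsing or validation code.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum ReaderFeature {
    ColumnMapping,
    DeletionVectors,
    /// Any feature name this build does not recognize.
    Unknown(String),
}

impl From<&str> for ReaderFeature {
    /// Parsing never fails: unrecognized names are preserved in `Unknown`.
    fn from(name: &str) -> Self {
        match name {
            "columnMapping" => ReaderFeature::ColumnMapping,
            "deletionVectors" => ReaderFeature::DeletionVectors,
            other => ReaderFeature::Unknown(other.to_string()),
        }
    }
}

/// Fails only at validation time, and only if an unknown feature is present.
fn ensure_supported(features: &[ReaderFeature]) -> Result<(), String> {
    for feature in features {
        if let ReaderFeature::Unknown(name) = feature {
            return Err(format!("unsupported reader feature: {name}"));
        }
    }
    Ok(())
}
```

Separating parsing from validation means a protocol action with unfamiliar features can still be read and inspected; the failure is deferred to the point where support actually matters.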
## What changes are proposed in this pull request? Rename `ScanData` to `ScanMetadata` and `Scan::scan_data` to `Scan::scan_metadata` (and corresponding FFI). Additionally, renames `TableChangesScanData` to `TableChangesScanMetadata`. Additional docs/refactor coming in delta-io#768 ### This PR affects the following public APIs breaking changes: 1. rename `ScanData` to `ScanMetadata` 2. rename `Scan::scan_data()` to `Scan::scan_metadata()` 3. (ffi) rename `free_kernel_scan_data()` to `free_scan_metadata_iter()` 4. (ffi) rename `kernel_scan_data_next()` to `scan_metadata_next()` 5. (ffi) rename `visit_scan_data()` to `visit_scan_metadata()` 6. (ffi) rename `kernel_scan_data_init()` to `scan_metadata_iter_init()` 7. (ffi) rename `KernelScanDataIterator` to `ScanMetadataIterator` 8. (ffi) rename `SharedScanDataIterator` to `SharedScanMetadataIterator` ## How was this change tested? existing resolves delta-io#816
…ta` type (delta-io#768) ## What changes are proposed in this pull request? 1. Updated `ScanMetadata` from a typed tuple to a struct. `ScanMetadata` is now a struct with fields: - `filtered_data`: A `FilteredEngineData` instance. - `transforms`: A vector of transformations to be applied to the data read from the files. 2. Introduction of the `FilteredEngineData` type: couples `EngineData` with a selection vector indicating which rows to process. This type is returned from the `scan_metadata` API and the incoming `checkpoint` API. 3. Updates `visit_scan_files` parameters to accept `ScanMetadata` to avoid destructuring. 4. Corresponding FFI changes for `visit_scan_files` to accept a `ScanMetadata` param. All current tests pass.
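The idea of coupling data with a selection vector can be sketched with a generic stand-in. `FilteredRows` below is hypothetical: the real `FilteredEngineData` wraps opaque `EngineData` rather than a `Vec<T>`, and this sketch assumes the selection vector has exactly one entry per row.

```rust
/// Hypothetical analogue of `FilteredEngineData`: rows plus a parallel mask.
struct FilteredRows<T> {
    rows: Vec<T>,
    /// `true` means the row at the same index should be processed.
    selection_vector: Vec<bool>,
}

impl<T> FilteredRows<T> {
    /// Iterates over only the selected rows, in their original order.
    fn selected(&self) -> impl Iterator<Item = &T> + '_ {
        self.rows
            .iter()
            .zip(self.selection_vector.iter())
            .filter(|(_, keep)| **keep)
            .map(|(row, _)| row)
    }
}
```

Keeping the mask next to the data, instead of passing the two as a tuple, is what lets APIs such as `visit_scan_files` accept a single argument without destructuring.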
release 0.9.0
## What changes are proposed in this pull request?

### Key changes

resolves delta-io#737. This PR implements the `CheckpointVisitor` necessary for filtering a stream of actions into a stream of actions to be included in a checkpoint file. This leverages the `FileActionDeduplicator` [[link to PR]](delta-io#769). This PR introduces the `checkpoint` mod, and implements the visitor in the new `checkpoint/log_replay` mod.
Comprehensive module documents are included in the new modules which provide an overview of the incoming code additions, along with it's goal. ### Checkpoint Content A **complete V1 checkpoint** encapsulates: 1. All FILE actions that make up the state of a version of a table: - Add actions (after action reconciliation) - Unexpired remove actions ([remove tombstones](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file)) 2. All NON-FILE actions that make up the state of a version of a table: - Protocol action - Metadata action - Txn actions A **single-file V2 checkpoint** is simply a super-set of the actions included in the V1 checkpoint schema, with the addition of the `CheckpointMetadata` action (which must be generated on every write). Since single-file v2 checkpoints will also leverage this visitor, we have chosen to name it the general `CheckpointVisitor` Note: - CDC, CommitInfo, Sidecar, and CheckpointMetadata actions are NOT part of the **V1** checkpoint schema. - Sidecar and CheckpointMetadata actions are part of the **V2** checkpoint schema. ### The new `CheckpointVisitor` This visitor selects the **FILE** actions for a V1 spec checkpoint via a selection vector: 1. Processes add/remove actions with proper deduplication based on path and deletion vector ID pairs 2. Optimization: Only tracks already seen file paths in **commit files**, as actions in checkpoint files are the last batches to be processed, and do not conflict with other actions in checkpoint files. 3. Applies tombstone expiration logic by filtering out remove actions with deletion timestamps older than the minimum file retention timestamp This visitor also selects the **NON-FILE** actions for a V1 spec checkpoint via a selection vector: 1. Ensures exactly one protocol action is included (the newest one encountered) 2. Ensures exactly one metadata action is included (the newest one encountered) 3. 
Deduplicates transaction (txn) actions by app ID to include only the newest action for each app ID <!-- Uncomment this section if there are any changes affecting public APIs: ### This PR affects the following public APIs If there are breaking changes, please ensure the `breaking-changes` label gets added by CI, and describe why the changes are needed. Note that _new_ public APIs are not considered breaking. --> ## How was this change tested? <!-- Please make sure to add test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested, ideally via a reproducible test documented in the PR description. --> `test_checkpoint_visitor` - Tests basic functionality with both file and non-file actions, verifying correct counts and selection vector. `test_checkpoint_visitor_boundary_cases_for_tombstone_expiration` - Tests how tombstone expiration handles threshold boundary conditions. `test_checkpoint_visitor_conflicting_file_actions_in_log_batch` - Verifies duplicate path handling in log batches (keeping first, skipping second). `test_checkpoint_visitor_file_actions_in_checkpoint_batch` - Tests that duplicate file actions are included in checkpoint batches. `test_checkpoint_visitor_conflicts_with_deletion_vectors` - Tests file deduplication with deletion vectors to ensure uniqueness. `test_checkpoint_visitor_already_seen_non_file_actions` - Verifies that pre-populated actions are skipped correctly. `test_checkpoint_visitor_duplicate_non_file_actions` - Tests deduplication of non-file actions (protocol, metadata, transactions).
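The selection rules described in this PR can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the `Action` enum, `CheckpointVisitorSketch`, and its method names are hypothetical, the real visitor reads columnar engine data instead of an enum, and treating a tombstone whose timestamp equals the threshold as expired is an assumption made here. Batches are assumed to arrive newest first, with commit (log) batches before checkpoint batches.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified action type used only for this sketch.
enum Action {
    Add { path: String, dv_id: Option<String> },
    Remove { path: String, dv_id: Option<String>, deletion_timestamp: i64 },
    Protocol,
    Metadata,
    Txn { app_id: String },
    CommitInfo, // not part of the V1 checkpoint schema
}

#[derive(Default)]
struct CheckpointVisitorSketch {
    seen_files: HashSet<(String, Option<String>)>,
    seen_protocol: bool,
    seen_metadata: bool,
    seen_txns: HashSet<String>,
    minimum_file_retention_timestamp: i64,
}

impl CheckpointVisitorSketch {
    /// Returns a selection vector with one entry per action in the batch.
    fn visit_batch(&mut self, batch: &[Action], is_log_batch: bool) -> Vec<bool> {
        batch.iter().map(|a| self.select(a, is_log_batch)).collect()
    }

    fn select(&mut self, action: &Action, is_log_batch: bool) -> bool {
        match action {
            Action::Add { path, dv_id } => self.is_new_file(path, dv_id, is_log_batch),
            Action::Remove { path, dv_id, deletion_timestamp } => {
                // Record the key first, so that even an expired tombstone
                // still shadows older actions for the same file.
                let is_new = self.is_new_file(path, dv_id, is_log_batch);
                // Assumption: a timestamp equal to the threshold counts as expired.
                is_new && *deletion_timestamp > self.minimum_file_retention_timestamp
            }
            // Keep only the first (newest) protocol and metadata actions.
            Action::Protocol => !std::mem::replace(&mut self.seen_protocol, true),
            Action::Metadata => !std::mem::replace(&mut self.seen_metadata, true),
            // `insert` returns true only the first time an app ID is seen.
            Action::Txn { app_id } => self.seen_txns.insert(app_id.clone()),
            Action::CommitInfo => false,
        }
    }

    /// Deduplicates on the (path, deletion vector ID) pair. Keys are recorded
    /// only for commit batches; checkpoint batches are checked but not tracked.
    fn is_new_file(&mut self, path: &str, dv_id: &Option<String>, is_log_batch: bool) -> bool {
        let key = (path.to_string(), dv_id.clone());
        if is_log_batch {
            self.seen_files.insert(key)
        } else {
            !self.seen_files.contains(&key)
        }
    }
}
```

Because checkpoint batches are processed last, skipping the bookkeeping for them saves memory without changing which actions are selected, which is the optimization the PR description mentions.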
## What changes are proposed in this pull request?

`table_root` was removed from the params in delta-io#699; this PR fixes the docs to match that change.

Co-authored-by: Zach Schuermann <[email protected]>
## What changes are proposed in this pull request?

The `predicates` module has a confusing name because it sits alongside the `expressions` module but does not actually define predicates (= boolean-valued expressions) at all. Instead, it contains kernel's implementation of predicate evaluation, used for e.g. data skipping and partition pruning.

Rename the module to `kernel_predicates` and rename corresponding classes like `PredicateEvaluator` to `KernelPredicateEvaluator`, to make their purpose more clear. This also helps clear the way for us to split out predicates as a separate concept from normal expressions, see delta-io#765.

## How was this change tested?

Module and class renames only. Compilation suffices.
Konstantin Bogdanov seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
No description provided.