
Read metadata and protocol information from Delta checksum files#28381

Open
adam-richardson-openai wants to merge 4 commits into trinodb:master from adam-richardson-openai:dev/as3richa/delta-checksum-upstream

Conversation

adam-richardson-openai commented Feb 20, 2026

Description

Read metadata and protocol information from Delta checksum files, when configured and where available

Compliant writers of Delta tables may optionally write "checksum" files alongside each commit. These checksum files contain a variety of (optional) useful information, including the Delta table metadata and protocol information. See https://github.com/delta-io/delta/blob/488c916931ca9d210f4cadd2d5520e0274d26b04/PROTOCOL.md#version-checksum-file for the full checksum file spec.

Trino needs to load the table metadata and protocol information at planning time. Today, this is done by identifying and loading the latest table checkpoint, as well as replaying all intervening commits up to the latest. This can be extremely slow and expensive, as checkpoints can be enormous and there may be many intervening commits.

Instead, we can simply determine the latest commit in the table, load the corresponding checksum file (if it exists), and parse the metadata and protocol information (if available in the checksum file). This takes only a single listing operation and a single load of a small JSON file, as opposed to potentially loading many files in the Delta log-based approach (some of which may be extremely large depending on the size and configuration of the table).
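As a rough illustration of the naming convention involved (a sketch, not Trino's actual code): per the Delta spec, commit files in _delta_log are named with the zero-padded 20-digit version, and a commit's optional checksum file uses the same padded version with a .crc extension. So once the latest commit version is known from a single listing, the candidate checksum file name is derivable directly:

```java
// Sketch only: deriving the expected checksum file name for a commit version.
// The class and method names here are illustrative, not Trino's API.
public class ChecksumFileName {
    // Delta log files use zero-padded 20-digit versions, e.g.
    // 00000000000000000004.json and (optionally) 00000000000000000004.crc
    static String checksumFileName(long commitVersion) {
        return String.format("%020d.crc", commitVersion);
    }

    public static void main(String[] args) {
        System.out.println(checksumFileName(4));
    }
}
```

If that file exists and parses, the metadata and protocol can be read from it without touching any checkpoint.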

If there is no checksum file for the latest eligible commit in the table, or if the checksum file doesn't capture both the metadata and the protocol information for the table, we fall back to the existing approach of scanning the Delta log. (Checksum files are considered optional under the Delta spec, as are all fields therein.)
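To make the fallback rules concrete, here is a minimal, self-contained Java sketch of the decision logic described above. The names (readChecksumFile, loadFromTransactionLog, MetadataAndProtocol) and the simulated versions are illustrative stand-ins, not Trino's actual API:

```java
import java.util.Optional;

public class ChecksumFirstLoading {
    record MetadataAndProtocol(String metadata, String protocol) {}

    // Simulated checksum lookup: version 4 has a complete checksum file,
    // version 3 has one that lacks the protocol entry, other versions have none.
    static Optional<MetadataAndProtocol> readChecksumFile(long version) {
        if (version == 4) {
            return Optional.of(new MetadataAndProtocol("meta", "proto"));
        }
        if (version == 3) {
            return Optional.of(new MetadataAndProtocol("meta", null));
        }
        return Optional.empty();
    }

    static MetadataAndProtocol loadFromTransactionLog(long version) {
        // Stand-in for the existing (expensive) checkpoint + log replay path
        return new MetadataAndProtocol("meta-from-log", "proto-from-log");
    }

    static MetadataAndProtocol load(long latestVersion) {
        // Use the checksum file only if it carries BOTH entries; otherwise
        // fall back to scanning the Delta log, as described above.
        return readChecksumFile(latestVersion)
                .filter(e -> e.metadata() != null && e.protocol() != null)
                .orElseGet(() -> loadFromTransactionLog(latestVersion));
    }

    public static void main(String[] args) {
        System.out.println(load(4).protocol());
        System.out.println(load(3).protocol());
        System.out.println(load(7).protocol());
    }
}
```

Versions 3 and 7 exercise the two fallback cases (incomplete checksum, missing checksum); only version 4 is served from the checksum path.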

This new behavior is gated behind a session property, load_metadata_from_checksum_file, which in turn defaults to the value of the delta.load-metadata-from-checksum-file configuration. The config value itself defaults to true, since we expect this change to be a straightforward performance optimization in the overwhelming majority of cases.

This optimization is particularly effective for tables using the v1 checkpoint spec, since v1 checkpoint files may be very large and heavy.

We ran internal performance tests using queries like

SELECT 1 FROM <table> LIMIT 1

where <table> is a large table using the v1 checkpoint spec. We observed that time spent in analysis fell from 10s on average to well under 500ms.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Delta
* Support reading metadata and protocol information from Delta checksum files where available (configurable via session property `load_metadata_from_checksum_file` and configuration `delta.load-metadata-from-checksum-file`)

cla-bot commented Feb 20, 2026

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

github-actions bot added the delta-lake (Delta Lake connector) label on Feb 20, 2026
adam-richardson-openai (Author)


I emailed my signed CLA to cla@trino.io moments ago

adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from 08bb9c5 to fd9787c on February 20, 2026 at 02:56

adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from fd9787c to de2bcd7 on February 20, 2026 at 06:10

Copilot AI left a comment

Pull request overview

This pull request adds support for reading Delta table metadata and protocol information from checksum files (.crc files) when available, providing a significant performance optimization for tables with large v1 checkpoints. The feature is controlled by a new configuration property delta.load_metadata_from_checksum_file (defaulting to true) and corresponding session property load_metadata_from_checksum_file.

Changes:

  • Added support for reading metadata and protocol information from Delta checksum files, falling back gracefully to transaction log scanning when checksum files are unavailable or incomplete
  • Introduced configuration and session properties to control the new checksum file loading behavior
  • Enhanced test coverage with comprehensive unit and integration tests for checksum file parsing, fallback behavior, and error handling

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Summary per file:

  • DeltaLakeConfig.java: Added new configuration property load_metadata_from_checksum_file (defaults to true)
  • DeltaLakeSessionProperties.java: Added corresponding session property for checksum metadata loading
  • DeltaLakeVersionChecksum.java: New class representing the structure of Delta checksum files with metadata and protocol entries
  • TransactionLogParser.java: Added methods getLatestCommitVersion and readVersionChecksumFile to support checksum file operations
  • DeltaLakeMetadata.java: Refactored getTableHandle to attempt loading from checksum files first before falling back to transaction log
  • DeltaLakeTableMetadataScheduler.java: Refactored isSameTransactionVersion method to accept version directly, supporting both snapshot and version checks
  • TestTransactionLogParser.java: Added comprehensive tests for checksum file reading and parsing edge cases
  • TestDeltaLakeMetadata.java: Added integration tests for checksum loading, fallback behavior, and error handling scenarios
  • TestDeltaLakeConfig.java: Updated test to validate default value of new configuration property
  • TestDeltaLakeFileOperations.java: Updated file operation tracking to account for checksum file reads
  • TestDeltaLakeBasic.java: Updated error message assertions to accommodate new error messages from checksum loading path
  • TestDeltaLakeAlluxio*.java: Updated Alluxio cache operation tests to include checksum file interactions
  • TestDeltaLakeActiveFilesCache.java: Updated to disable checksum loading for reproducing specific cache staleness issues


adam-richardson-openai (Author)

Just for clarity/posterity -- I force-pushed this branch a couple of times with additional changes to address test failures, to avoid thrashing the commit history and since there had been no ongoing review. Now that reviewers are engaged, I'll put subsequent fixes in their own commits!


adam-richardson-openai (Author)

Based on Copilot's feedback, I went from snake_case to kebab-case for the configuration property. I have updated the PR description to reflect this change, but have not yet updated the original commit to avoid thrashing the history. The commit message must be updated prior to merge

findinpath (Contributor) left a comment
Great observation @adam-richardson-openai

Looking forward to you addressing the comments

cla-bot added the cla-signed label on Feb 21, 2026
github-actions bot added the docs label on Feb 22, 2026
adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch 2 times, most recently from 5e4421e to aab2cf1 on February 23, 2026 at 01:03
adam-richardson-openai changed the title from "Read metadata and protocol information from Delta checksum files, when configured and where available" to "Read metadata and protocol information from Delta checksum files" on Feb 23, 2026
adam-richardson-openai (Author)

I substantially reworked the new tests in aab2cf1. Summary:

  • I eliminated all cases of mocking or writing of synthetic files, in favor of new fixtures generated using Spark
  • I added several new tests relating to fallback logic to TestDeltaLakeFileOperations. These mostly replace old tests in TestDeltaLakeMetadata that were excessively mock-heavy and that have since been deleted
  • I preserved a few basic smoke-test/sanity-check style tests in TestDeltaLakeMetadata, using fixtures rather than munging the table metadata in-band. While I think these tests are useful, I want to flag that they required some additional complexity to support referencing fixture tables in the context of the existing suite, so I'm also okay with removing them if preferred

adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch 2 times, most recently from 929e33c to 49f7d80 on February 24, 2026 at 20:43
adam-richardson-openai (Author)

I believe the CI failure is an unrelated network or infra flake, not a consequence of my change

wendigo (Contributor) commented Feb 24, 2026

@adam-richardson-openai I've restarted failed jobs

adam-richardson-openai (Author)

@adam-richardson-openai I've restarted failed jobs

Thanks! Passed.

adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from 49f7d80 to 0032a60 on February 24, 2026 at 23:22
findinpath (Contributor) left a comment

Assuming that the nit comments are being addressed, @chenjian2664 please review the contribution

adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from 0032a60 to ddaa09b on February 25, 2026 at 00:08
Compliant Delta writers may emit optional checksum files alongside
commits containing metadata and protocol information. Instead of
loading the latest checkpoint and replaying intervening commits (which
can be expensive, especially for large v1 checkpoints), Trino can read
the latest commit’s checksum file to obtain this information with a
single listing and small JSON read. Ref.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#version-checksum-file

If the checksum file is missing or does not contain both metadata and
protocol, we fall back to the existing Delta log scanning approach.

Behavior is gated by session property load_metadata_from_checksum_file
(defaulting to config delta.load_metadata_from_checksum_file, which
defaults to true). Internal testing reduced analysis time for large
v1-checkpoint tables from ~10s to <500ms.

Co-authored-by: Eric Hwang <eh@openai.com>
Co-authored-by: Fred Liu <fredliu@openai.com>
adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from ddaa09b to 12a5cd1 on February 25, 2026 at 00:11
adam-richardson-openai (Author) commented Feb 25, 2026

Addressed all the remaining nit comments. Thank you @findinpath!

I'll confess that my coding style is pretty comment-heavy. Per Marius' feedback, I removed all the comments that I'd consider obvious or noisy, but did leave a few in places that I think are relevant. If we still want to remove any or all of the remaining comments, I'm more than happy to make that change

chenjian2664 (Contributor) left a comment

Haven't looked at the tests yet; the major concern is the listing of the transaction log dir. Please share more about the tests.

throws IOException
{
long latestCommitVersion = -1;
FileIterator files = fileSystem.listFiles(Location.of(getTransactionLogDir(tableLocation)));

@adam-richardson-openai this will list _delta_log every time, and the request doesn't have a cache (unlike the TableSnapshot). Could you share the size of the tables you tested (how many log files under _delta_log), and whether you found any regression?
It seems this is only really suitable if the writer writes correct checksum files very frequently.

Comment on lines +852 to +856
MetadataEntry metadataEntry = checksum.getMetadata();
ProtocolEntry protocolEntry = checksum.getProtocol();
if (metadataEntry == null || protocolEntry == null) {
return Optional.empty();
}

Since MetadataEntry and ProtocolEntry are trustworthy, it is reasonable to pass their results along when loading descriptors from logs; we can refine this as a follow-up.

adam-richardson-openai (Author) commented Feb 26, 2026

I want to make sure I understand this comment. Are you saying that we should assume that metadata and protocol are always present if a checksum file exists?

If this is what you mean, then this discussion is related: https://github.com/trinodb/trino/pull/28381/changes#r2840656385. Basically, I feel we should err towards being somewhat permissive, to defend against non-compliant writes. At minimum, if we wanted to be stricter here, we would need to regenerate some old noncompliant fixtures!

Please let me know if you were making a different point


Because the current logic returns empty when either metadata or protocol is null, we could reuse the computed entry when only one is present, which avoids recomputation later. But this is an unlikely edge case, only relevant for non-compliant clients, so we could do it as a follow-up.

Read metadata and protocol information from Delta checksum files

Compliant Delta writers may emit optional checksum files alongside
commits containing metadata and protocol information. Instead of
loading the latest checkpoint and replaying intervening commits (which
can be expensive, especially for large v1 checkpoints), Trino can read
the latest commit’s checksum file to obtain this information with a
single listing and small JSON read. Ref.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#version-checksum-file

If the checksum file is missing or does not contain both metadata and
protocol, we fall back to the existing Delta log scanning approach.

Behavior is gated by session property load_metadata_from_checksum_file
(defaulting to config delta.load_metadata_from_checksum_file, which
defaults to false). Internal testing reduced analysis time for large
v1-checkpoint tables from ~10s to <500ms.

Co-authored-by: Eric Hwang <eh@openai.com>
Co-authored-by: Fred Liu <fredliu@openai.com>
adam-richardson-openai force-pushed the dev/as3richa/delta-checksum-upstream branch from 7263d40 to 4e1c515 on February 26, 2026 at 22:11
adam-richardson-openai (Author)

please share more about the tests

We added the following fixtures:

  • checksum: Table with valid checksum files for every commit
  • checksum_missing_latest: Valid table containing some checksum files, but missing the checksum file for the latest commit
  • checksum_without_metadata: Table with checksum files, but where the checksum file for the latest commit is missing metadata (technically invalid per the Delta spec, but supported by our implementation)
  • checksum_invalid_json/checksum_invalid_json_mapping/checksum_trailing_json_content: Tables with various shapes of invalid or corrupted checksum files for the latest commit

We use these files to power the following tests:

  • TestDeltaLakeBasic

    • testLoadMetadataFromChecksumFileMatchesTransactionLog: verifies that checksum-enabled and checksum-disabled reads produce the same visible table metadata (DESCRIBE, SHOW CREATE TABLE, $properties) for valid, missing-latest-checksum, and checksum-without-metadata fixtures
    • testLoadMetadataFromChecksumFileFallsBackForMalformedChecksum: verifies that malformed checksum files fall back cleanly to transaction-log loading and still produce the same visible metadata as the checksum-disabled path
  • TestDeltaLakeFileOperations

    • (Note: we use DESCRIBE in several tests here to cleanly test the metadata/Delta log interactions separate from data file interactions)
    • testDescribeWithLoadMetadataFromChecksumFileEnabledDoesNotReadTransactionLogOrCheckpoint: verifies that DESCRIBE on the checksum fixture uses only the checksum path, not the transaction log (when enabled)
    • testDescribeWithLoadMetadataFromChecksumFileDisabledReadsTransactionLogAndCheckpoint: verifies that DESCRIBE on the checksum fixture uses the transaction log rather than checksums (when disabled)
    • testDescribeWithLoadMetadataFromChecksumFileMissingChecksumFallsBackToTransactionLogAndCheckpoint: verifies that a missing latest checksum file causes DESCRIBE to fall back to checkpoint and transaction log reads
    • testDescribeWithLoadMetadataFromChecksumFileWithoutMetadataFallsBackToTransactionLogAndCheckpoint: verifies that a checksum file lacking required metadata causes DESCRIBE to fall back to checkpoint and transaction log reads
    • testSelectFromChecksumTable: verifies file accesses for a SELECT query on the checksum fixture, with checksum loading enabled
    • testSelectFromChecksumTableVersionTimeTravel: verifies file accesses for a time travel SELECT query on the checksum fixture, with checksum loading enabled
    • testSelectFromChecksumTableTimestampTimeTravel: verifies file accesses for a timestamp-based time travel SELECT query on the checksum fixture, with checksum loading enabled
  • TestTransactionLogParser

    • (All unit tests)
    • testFindLatestCommitVersionChecksumFileInfo: verifies that findLatestCommitVersionChecksumFileInfo correctly identifies the latest commit and determines whether it has a corresponding checksum file
    • testReadVersionChecksumFile: verifies that readVersionChecksumFile returns the expected result for a valid version checksum file
    • testReadVersionChecksumFileMissingFile: verifies that attempting to read a nonexistent checksum file returns empty rather than failing
    • testReadVersionChecksumFile{InvalidJson,InvalidJsonMapping,JsonWithTrailingContent}: verifies that various kinds of invalid checksum JSON are treated as a soft failure and return empty
  • TestDeltaLakeAlluxioCacheFileOperations

    • testCacheFileOperationsWithChecksumFilesEnabled/Disabled: verifies the Alluxio cache trace for a simple table scan over a fixture with checkpoints, when checksum-based metadata loading is enabled/disabled

adam-richardson-openai (Author)

I have addressed all of the outstanding requests for changes. I sanitized and reworked the tests slightly, mostly to reflect the new default of "false" for the checksum feature, and have shared a summary in a previous comment of all new tests and their coverage of the change.

These comment threads are outstanding:

I will circle back with additional data on our internal performance testing within the next couple days

endVersion.isPresent());
}

private DeltaLakeTableDescriptor loadDescriptor(ConnectorSession session, SchemaTableName tableName, DeltaMetastoreTable table, TrinoFileSystem fileSystem, String tableLocation, Optional<ConnectorTableVersion> startVersion, Optional<ConnectorTableVersion> endVersion)

Split the parameters onto multiple lines. Also, the SchemaTableName parameter is redundant; we could get it from DeltaMetastoreTable.


private MaterializedResult loadVisibleTableMetadata(String tableName, Session session, @Language("SQL") String sql)
{
assertUpdate("CALL system.flush_metadata_cache(schema_name => CURRENT_SCHEMA, table_name => '%s')".formatted(tableName));

Could you clarify the reason for this line here, why do we need to execute it for every statement?


Probably this is linked to emptying TransactionLogAccess cache.
Maybe call the method computeActualWithEmptyCache

copyDirectoryContents(Path.of(getResourceLocation(fixture).toURI()), tableLocation);
assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')".formatted(tableName, tableLocation.toUri()));
try {
assertMetadataAndProtocolQueriesMatch(tableName, loadMetadataFromChecksumFileEnabledSession, loadMetadataFromChecksumFileDisabledSession);

How about adding a method to verify only one statement at a time, i.e. (just FYI):

private void assertReadingMetadataAndProtocolFromChecksum(String tableName, String sql)
    {
        MaterializedResult checksumEnabledResult = loadVisibleTableMetadata(tableName, loadMetadataFromChecksumFileSession(true), sql);
        MaterializedResult checksumDisabledResult = loadVisibleTableMetadata(tableName, loadMetadataFromChecksumFileSession(false), sql);
        assertThat(checksumEnabledResult).isEqualTo(checksumDisabledResult);
    }

Wrapping all three statements into one method doesn't increase the readability.

* @see deltalake.checksum
*/
@Test
public void testDescribeWithLoadMetadataFromChecksumFileEnabledDoesNotReadTransactionLogOrCheckpoint()

the name is quite long, nit, how about:
testLoadMetadataFromChecksumFileForDescribe

* @see deltalake.checksum
*/
@Test
public void testDescribeWithLoadMetadataFromChecksumFileDisabledReadsTransactionLogAndCheckpoint()
chenjian2664 (Contributor) commented Feb 27, 2026

nit: rename to testLoadMetadataFromChecksumFileDisabled or testLoadMetadataFromChecksumFileDisabledForDescribe

* @see deltalake.checksum_missing_latest
*/
@Test
public void testDescribeWithLoadMetadataFromChecksumFileMissingChecksumFallsBackToTransactionLogAndCheckpoint()

nit, rename to testLoadMetadataFromMissingLatestChecksumFileForDescribe

* @see deltalake.checksum_without_metadata
*/
@Test
public void testDescribeWithLoadMetadataFromChecksumFileWithoutMetadataFallsBackToTransactionLogAndCheckpoint()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, ditto, use a shorter name if possible; once you're done, remember to update the tableName in the test as well

Comment on lines +1320 to +1323
String catalog = getSession().getCatalog().orElseThrow();
Session session = Session.builder(getSession())
.setCatalogSessionProperty(catalog, "load_metadata_from_checksum_file", "true")
.build();

It is likely that we could extract the logic as a method

.add(new FileOperation(CHECKPOINT, "00000000000000000001.checkpoint.parquet", "InputFile.length"))
.add(new FileOperation(CHECKPOINT, "00000000000000000001.checkpoint.parquet", "InputFile.newInput"))
.add(new FileOperation(TRANSACTION_LOG_JSON, "00000000000000000002.json", "InputFile.length"))
.build());

remember to drop the table

assertThat(checksumEnabledProperties).isEqualTo(checksumDisabledProperties);
}

private MaterializedResult loadVisibleTableMetadata(String tableName, Session session, @Language("SQL") String sql)

The method name is also rather misleading: the method executes a generic query, while its name is loadVisibleTableMetadata.

adam-richardson-openai (Author) commented Mar 2, 2026

Due to competing internal priorities, I will put this PR down for a few days and circle back on the remaining comments later. Thank you for all the review thus far! All of the recent minor feedback items make sense to me

For this PR to be a uniform performance win, we'll need to ship the listFilesStartingFrom optimization as discussed in https://github.com/trinodb/trino/pull/28381/changes#r2869932531. I have all the relevant code changes in our internal fork, but will need a few hours of work to get those changes in a state where they can be merged to the upstream

In parallel, I actually found a compelling alternative implementation strategy for this PR that may simplify the change overall (e.g. this alternative strategy doesn't require any listFiles calls). The one-sentence summary is that we can extend TransactionLogAccess#getMetadataAndProtocolEntry to read the version checksum file instead, rather than conditionally picking between different access patterns in DeltaLakeMetadata. I would like to do some basic performance testing on this revised take, which will dictate whether to continue with the original approach or to pivot to the simplified approach

Thankfully, much of the review and iteration effort is shared between the two approaches, since much of it was focused on tests and fixtures

I will share more information on this in the next few days

findinpath (Contributor)

The one-sentence summary is that we can extend TransactionLogAccess#getMetadataAndProtocolEntry to read the version checksum file instead, rather than conditionally picking between different access patterns in DeltaLakeMetadata.

I would argue that using listFilesStartingFrom from last_checkpoint is cheaper than actually loading all the transaction log files in TableSnapshot.
I'm looking forward though to see your new implementation.
