Read metadata and protocol information from Delta checksum files #28381
adam-richardson-openai wants to merge 4 commits into trinodb:master
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
I emailed my signed CLA to cla@trino.io moments ago.
Force-pushed 08bb9c5 to fd9787c
Force-pushed fd9787c to de2bcd7
Pull request overview
This pull request adds support for reading Delta table metadata and protocol information from checksum files (.crc files) when available, providing a significant performance optimization for tables with large v1 checkpoints. The feature is controlled by a new configuration property delta.load_metadata_from_checksum_file (defaulting to true) and corresponding session property load_metadata_from_checksum_file.
Changes:
- Added support for reading metadata and protocol information from Delta checksum files, falling back gracefully to transaction log scanning when checksum files are unavailable or incomplete
- Introduced configuration and session properties to control the new checksum file loading behavior
- Enhanced test coverage with comprehensive unit and integration tests for checksum file parsing, fallback behavior, and error handling
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| DeltaLakeConfig.java | Added new configuration property load_metadata_from_checksum_file (defaults to true) |
| DeltaLakeSessionProperties.java | Added corresponding session property for checksum metadata loading |
| DeltaLakeVersionChecksum.java | New class representing the structure of Delta checksum files with metadata and protocol entries |
| TransactionLogParser.java | Added methods getLatestCommitVersion and readVersionChecksumFile to support checksum file operations |
| DeltaLakeMetadata.java | Refactored getTableHandle to attempt loading from checksum files first before falling back to transaction log |
| DeltaLakeTableMetadataScheduler.java | Refactored isSameTransactionVersion method to accept version directly, supporting both snapshot and version checks |
| TestTransactionLogParser.java | Added comprehensive tests for checksum file reading and parsing edge cases |
| TestDeltaLakeMetadata.java | Added integration tests for checksum loading, fallback behavior, and error handling scenarios |
| TestDeltaLakeConfig.java | Updated test to validate default value of new configuration property |
| TestDeltaLakeFileOperations.java | Updated file operation tracking to account for checksum file reads |
| TestDeltaLakeBasic.java | Updated error message assertions to accommodate new error messages from checksum loading path |
| TestDeltaLakeAlluxio*.java | Updated Alluxio cache operation tests to include checksum file interactions |
| TestDeltaLakeActiveFilesCache.java | Updated to disable checksum loading for reproducing specific cache staleness issues |
Just for clarity/posterity -- I force-pushed this branch a couple times with additional changes to address test failures, to avoid trashing the commit history and since there had been no ongoing review. Now that reviewers are engaged, I'll put subsequent fixes in their own commits!
Based on Copilot's feedback, I went from snake_case to kebab-case for the configuration property. I have updated the PR description to reflect this change, but have not yet updated the original commit to avoid thrashing the history. The commit message must be updated prior to merge.
findinpath left a comment:
Great observation @adam-richardson-openai. Looking forward to you addressing the comments.
Force-pushed 5e4421e to aab2cf1
I substantially reworked the new tests in aab2cf1. Summary:
Force-pushed 929e33c to 49f7d80
I believe the CI failure is an unrelated network or infra flake, not a consequence of my change.

@adam-richardson-openai I've restarted failed jobs.

Thanks! Passed.
Force-pushed 49f7d80 to 0032a60
findinpath left a comment:
Assuming that the nit comments are being addressed, @chenjian2664 please review the contribution.
Force-pushed 0032a60 to ddaa09b
Compliant Delta writers may emit optional checksum files alongside commits containing metadata and protocol information. Instead of loading the latest checkpoint and replaying intervening commits (which can be expensive, especially for large v1 checkpoints), Trino can read the latest commit's checksum file to obtain this information with a single listing and a small JSON read. Ref. https://github.com/delta-io/delta/blob/master/PROTOCOL.md#version-checksum-file

If the checksum file is missing or does not contain both metadata and protocol, we fall back to the existing Delta log scanning approach. Behavior is gated by session property load_metadata_from_checksum_file (defaulting to config delta.load_metadata_from_checksum_file, which defaults to true). Internal testing reduced analysis time for large v1-checkpoint tables from ~10s to <500ms.

Co-authored-by: Eric Hwang <eh@openai.com>
Co-authored-by: Fred Liu <fredliu@openai.com>
Force-pushed ddaa09b to 12a5cd1
Addressed all the remaining nit comments. Thank you @findinpath! I'll confess that my coding style is pretty comment-heavy. Per Marius' feedback, I removed all the comments that I'd consider obvious or noisy, but did leave a few in places that I think are relevant. If we still want to remove any or all of the remaining comments, I'm more than happy to make that change.
chenjian2664 left a comment:
I haven't looked at the tests yet; the major concern is the listing of the transaction log dir. Please share more about the tests.
```java
            throws IOException
    {
        long latestCommitVersion = -1;
        FileIterator files = fileSystem.listFiles(Location.of(getTransactionLogDir(tableLocation)));
```
@adam-richardson-openai this will list `_delta_log` every time, and the request doesn't have a cache (unlike the TableSnapshot). Could you share the size of the tables you tested (how many log files under `_delta_log`), and whether you found any regression? It seems most suitable when the writer writes the correct checksum file very frequently.
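To make the listing concern concrete, here is a minimal, self-contained sketch (hypothetical class and method names, not the PR's actual implementation) of how the latest commit version can be derived from a single `_delta_log` listing by parsing the zero-padded `<version>.json` file names:

```java
import java.util.List;
import java.util.OptionalLong;

public class LatestCommitVersion
{
    // Returns the highest commit version found in a single _delta_log listing,
    // i.e. the largest N among files named "<zero-padded N>.json". Checkpoint
    // files (which may also end in .json for v2 checkpoints) are skipped.
    static OptionalLong latestCommitVersion(List<String> fileNames)
    {
        long latest = -1;
        for (String name : fileNames) {
            if (name.endsWith(".json") && !name.contains(".checkpoint.")) {
                String stem = name.substring(0, name.length() - ".json".length());
                try {
                    latest = Math.max(latest, Long.parseLong(stem));
                }
                catch (NumberFormatException ignored) {
                    // not a commit file (e.g. "_last_checkpoint")
                }
            }
        }
        return latest < 0 ? OptionalLong.empty() : OptionalLong.of(latest);
    }

    public static void main(String[] args)
    {
        List<String> listing = List.of(
                "00000000000000000000.json",
                "00000000000000000001.json",
                "00000000000000000001.checkpoint.parquet",
                "00000000000000000002.json",
                "00000000000000000002.crc",
                "_last_checkpoint");
        System.out.println(latestCommitVersion(listing)); // OptionalLong[2]
    }
}
```

In the real code the names come from a `TrinoFileSystem` `FileIterator` rather than a `List<String>`; the point of the approach is that one listing suffices, with no per-commit reads.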
```java
        MetadataEntry metadataEntry = checksum.getMetadata();
        ProtocolEntry protocolEntry = checksum.getProtocol();
        if (metadataEntry == null || protocolEntry == null) {
            return Optional.empty();
        }
```
Since MetadataEntry and ProtocolEntry are trustworthy, it is reasonable to pass their results along when loading descriptors from logs; we can refine it as a follow-up.
I want to make sure I understand this comment. Are you saying that we should assume that metadata and protocol are always present if a checksum file exists?

If this is what you mean, then this discussion is related: https://github.com/trinodb/trino/pull/28381/changes#r2840656385. Basically I feel we should err towards being somewhat permissive, to defend against non-compliant writes. At minimum, if we wanted to be stricter here, we would need to regenerate some old noncompliant fixtures!

Please let me know if you were making a different point.
Because the current logic returns empty when either metadata or protocol is null, we can reuse the computed entry when only one is present, which avoids recomputation later. But this is an unlikely edge case, only relevant for non-compliant clients; we could do it as a follow-up.
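As a sketch of the fallback rule under discussion (simplified stand-in types, not the actual Trino classes): the checksum is used for planning only when both entries are present, and otherwise Trino falls back to scanning the transaction log:

```java
import java.util.Optional;

public class ChecksumFallback
{
    // Simplified stand-ins for Trino's MetadataEntry and ProtocolEntry
    record MetadataEntry(String schemaJson) {}
    record ProtocolEntry(int minReaderVersion, int minWriterVersion) {}
    record Checksum(Optional<MetadataEntry> metadata, Optional<ProtocolEntry> protocol) {}

    // The checksum is only usable for planning when BOTH entries are present;
    // otherwise the reader falls back to scanning the transaction log.
    static boolean usableForPlanning(Checksum checksum)
    {
        return checksum.metadata().isPresent() && checksum.protocol().isPresent();
    }

    public static void main(String[] args)
    {
        Checksum complete = new Checksum(
                Optional.of(new MetadataEntry("{\"fields\":[]}")),
                Optional.of(new ProtocolEntry(1, 2)));
        Checksum metadataOnly = new Checksum(
                Optional.of(new MetadataEntry("{\"fields\":[]}")),
                Optional.empty());
        System.out.println(usableForPlanning(complete));     // true
        System.out.println(usableForPlanning(metadataOnly)); // false -> fall back to log scan
    }
}
```

The reviewer's follow-up suggestion would relax this: when only one entry is present, reuse it and compute only the missing one from the log.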
Read metadata and protocol information from Delta checksum files

Compliant Delta writers may emit optional checksum files alongside commits containing metadata and protocol information. Instead of loading the latest checkpoint and replaying intervening commits (which can be expensive, especially for large v1 checkpoints), Trino can read the latest commit's checksum file to obtain this information with a single listing and a small JSON read. Ref. https://github.com/delta-io/delta/blob/master/PROTOCOL.md#version-checksum-file

If the checksum file is missing or does not contain both metadata and protocol, we fall back to the existing Delta log scanning approach. Behavior is gated by session property load_metadata_from_checksum_file (defaulting to config delta.load_metadata_from_checksum_file, which defaults to false). Internal testing reduced analysis time for large v1-checkpoint tables from ~10s to <500ms.

Co-authored-by: Eric Hwang <eh@openai.com>
Co-authored-by: Fred Liu <fredliu@openai.com>
Force-pushed 7263d40 to 4e1c515
We added the following fixtures:
We use these files to power the following tests:
I have addressed all of the outstanding requests for changes. I sanitized and reworked the tests slightly, mostly to reflect the new default of "false" for the checksum feature, and have shared a summary in a previous comment on all new tests and their coverage over the change. These comment threads are outstanding:

I will circle back with additional data on our internal performance testing within the next couple of days.
```java
                endVersion.isPresent());
    }

    private DeltaLakeTableDescriptor loadDescriptor(ConnectorSession session, SchemaTableName tableName, DeltaMetastoreTable table, TrinoFileSystem fileSystem, String tableLocation, Optional<ConnectorTableVersion> startVersion, Optional<ConnectorTableVersion> endVersion)
```
Split the parameters onto multiple lines; also, the SchemaTableName is redundant, we could get it from DeltaMetastoreTable.
```java
    private MaterializedResult loadVisibleTableMetadata(String tableName, Session session, @Language("SQL") String sql)
    {
        assertUpdate("CALL system.flush_metadata_cache(schema_name => CURRENT_SCHEMA, table_name => '%s')".formatted(tableName));
```
Could you clarify the reason for this line here? Why do we need to execute it for every statement?

Probably this is linked to emptying the TransactionLogAccess cache. Maybe call the method computeActualWithEmptyCache.
```java
        copyDirectoryContents(Path.of(getResourceLocation(fixture).toURI()), tableLocation);
        assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')".formatted(tableName, tableLocation.toUri()));
        try {
            assertMetadataAndProtocolQueriesMatch(tableName, loadMetadataFromChecksumFileEnabledSession, loadMetadataFromChecksumFileDisabledSession);
```
How about adding a method that verifies only one statement at a time, i.e. (just FYI):

```java
private void assertReadingMetadataAndProtocolFromChecksum(String tableName, String sql)
{
    MaterializedResult checksumEnabledResult = loadVisibleTableMetadata(tableName, loadMetadataFromChecksumFileSession(true), sql);
    MaterializedResult checksumDisabledResult = loadVisibleTableMetadata(tableName, loadMetadataFromChecksumFileSession(false), sql);
    assertThat(checksumEnabledResult).isEqualTo(checksumDisabledResult);
}
```

Wrapping all three statements into one method doesn't increase the readability.
```java
     * @see deltalake.checksum
     */
    @Test
    public void testDescribeWithLoadMetadataFromChecksumFileEnabledDoesNotReadTransactionLogOrCheckpoint()
```
The name is quite long. Nit: how about testLoadMetadataFromChecksumFileForDescribe?
```java
     * @see deltalake.checksum
     */
    @Test
    public void testDescribeWithLoadMetadataFromChecksumFileDisabledReadsTransactionLogAndCheckpoint()
```
Nit: rename to testLoadMetadataWithFromChecksumFileDisabled or testLoadMetadataWithFromChecksumFileDisabledForDescribe.
```java
     * @see deltalake.checksum_missing_latest
     */
    @Test
    public void testDescribeWithLoadMetadataFromChecksumFileMissingChecksumFallsBackToTransactionLogAndCheckpoint()
```
Nit: rename to testLoadMetadataFromMissingLatestChecksumFileForDescribe.
```java
     * @see deltalake.checksum_without_metadata
     */
    @Test
    public void testDescribeWithLoadMetadataFromChecksumFileWithoutMetadataFallsBackToTransactionLogAndCheckpoint()
```
Nit: ditto, use a shorter name if possible. Once you're done, remember to update the tableName in the test as well.
```java
        String catalog = getSession().getCatalog().orElseThrow();
        Session session = Session.builder(getSession())
                .setCatalogSessionProperty(catalog, "load_metadata_from_checksum_file", "true")
                .build();
```
It is likely that we could extract this logic as a method.
```java
                .add(new FileOperation(CHECKPOINT, "00000000000000000001.checkpoint.parquet", "InputFile.length"))
                .add(new FileOperation(CHECKPOINT, "00000000000000000001.checkpoint.parquet", "InputFile.newInput"))
                .add(new FileOperation(TRANSACTION_LOG_JSON, "00000000000000000002.json", "InputFile.length"))
                .build());
```
Remember to drop the table.
```java
        assertThat(checksumEnabledProperties).isEqualTo(checksumDisabledProperties);
    }

    private MaterializedResult loadVisibleTableMetadata(String tableName, Session session, @Language("SQL") String sql)
```
The method name is also rather misleading: the method executes a generic query, while it is named loadVisibleTableMetadata.
Due to competing internal priorities, I will put this PR down for a few days and circle back on the remaining comments later. Thank you for all the review thus far! All of the recent minor feedback items make sense to me.

For this PR to be a uniform performance win, we'll need to ship the

In parallel, I actually found a compelling alternative implementation strategy for this PR that may simplify the change overall (e.g. this alternative strategy doesn't require any

Thankfully, much of the review and iteration effort is shared between the two approaches, since much of it was focused on tests and fixtures.

I will share more information on this in the next few days.
I would argue that using
Description
Read metadata and protocol information from Delta checksum files, when configured and where available
Compliant writers of Delta tables may optionally write "checksum" files alongside each commit. These checksum files contain a variety of (optional) useful information, including the Delta table metadata and protocol information. See https://github.com/delta-io/delta/blob/488c916931ca9d210f4cadd2d5520e0274d26b04/PROTOCOL.md#version-checksum-file for the full checksum file spec.
Trino needs to load the table metadata and protocol information at planning time. Today, this is done by identifying and loading the latest table checkpoint, as well as replaying all intervening commits up to the latest. This can be extremely slow and expensive, as checkpoints can be enormous and there may be many intervening commits.

Instead, we can simply determine the latest commit in the table, load the corresponding checksum file (if it exists), and parse the metadata and protocol information (if available in the checksum file). This takes only a single listing operation and a single load of a small JSON file, as opposed to potentially loading many files in the Delta log based approach (some of which may be extremely large depending on the size and configuration of the table).

If there is no checksum file for the latest eligible commit in the table, or if the checksum file doesn't capture both the metadata and the protocol information for the table, we fall back to the existing approach of scanning the Delta log. (Checksum files are considered optional under the Delta spec, as are all fields therein.)
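For illustration, a minimal sketch (hypothetical helper; assumes version checksum files follow the same zero-padded 20-digit naming as commit files, per the Delta protocol) of how the checksum file name for a given commit is derived:

```java
public class ChecksumFileName
{
    // A version checksum file sits in _delta_log next to its commit, named with
    // the same zero-padded 20-digit version number and a ".crc" extension.
    static String checksumFileName(long commitVersion)
    {
        return String.format("%020d.crc", commitVersion);
    }

    public static void main(String[] args)
    {
        System.out.println(checksumFileName(2)); // 00000000000000000002.crc
    }
}
```

So once the latest commit version is known from the listing, locating the candidate checksum file requires no further listing, only a single small file read (which may miss, triggering the fallback described above).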
This new behavior is gated behind a session property, load_metadata_from_checksum_file, which in turn defaults to the value of the delta.load-metadata-from-checksum-file configuration. The config value itself defaults to true, since we expect this change to be a straightforward performance optimization in the overwhelming majority of cases.

This optimization is particularly effective for tables using the v1 checkpoint spec, since v1 checkpoint files may be very large and heavy.
We drove internal performance testing, using queries like

where <table> is a large table using the v1 checkpoint spec. We observed that time spent in analysis fell from 10s on average to well under 500ms.

Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: