[Iceberg v3] Row lineage #27836

Open
dain wants to merge 7 commits into trinodb:master from dain:row-lineage
Conversation

@dain
Member

@dain dain commented Jan 3, 2026

Description

This PR adds comprehensive support for Iceberg v3 row lineage, enabling Trino to read and preserve row identity metadata ($row_id and $last_updated_sequence_number).

Changes include:

  • Reading lineage columns: Expose $row_id and $last_updated_sequence_number as queryable columns for v3 tables
  • OPTIMIZE is now enabled for v3 tables and preserves lineage data
  • UPDATE and MERGE operations preserve the original $row_id for modified rows, maintaining row identity across updates
  • Enable cleanup procedures (expire_snapshots, remove_orphan_files) on v3 tables
  • Run Iceberg connector tests against both v2 and v3 format versions

Release notes

(X) Release notes are required, with the following suggested text:

## Iceberg
* Add support for Iceberg v3 row lineage, including reading `$row_id` and `$last_updated_sequence_number` columns, preserving row identity during UPDATE/MERGE operations, and OPTIMIZE support. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jan 3, 2026
@github-actions github-actions bot added iceberg Iceberg connector lakehouse labels Jan 3, 2026
@dain dain force-pushed the row-lineage branch 2 times, most recently from 6782e91 to 39782c8 Compare January 3, 2026 05:38
@dain dain marked this pull request as ready for review January 3, 2026 07:17
@chenjian2664
Contributor

chenjian2664 commented Jan 6, 2026

Is the second commit "add support for Iceberg v3 deletion vectors" added intentionally?

@dain
Member Author

dain commented Jan 9, 2026

Is the second commit "add support for Iceberg v3 deletion vectors" added intentionally?

Yep. The first two commits are the base from another PR. Row lineage needs deletions to work to fully test that rowid works for update commands.

@dain dain changed the title from "Add Iceberg v3 row lineage support" to "[Iceberg v3] Row lineage" Jan 11, 2026
@chenjian2664 chenjian2664 self-requested a review January 16, 2026 07:58
Contributor

@chenjian2664 chenjian2664 left a comment

Reviewed: "Add support for reading $row_id and $last_updated_sequence_number"

@@ -647,6 +647,7 @@ void testV3InsertProducesRowLineageMetadata()
assertUpdate("CREATE TABLE " + tableName + " (id INTEGER, v VARCHAR) WITH (format = 'PARQUET', format_version = 3)");
Contributor

Ideally, we should also test the AVRO and ORC formats.

Member Author

I don't think this is necessary. These columns are synthetic and handled by the engine directly. I don't think file format has any impact on this feature.

Contributor

Where is the test case that exercises the IcebergPageSourceProvider changes (for AVRO and ORC)?

Member Author

I added that comment after I pushed. I just finally got through applying all of the comments below. This is now covered.

Comment on lines +1070 to +1075
if (column.isLastUpdatedSequenceNumberColumn()) {
transforms.transform(new DataSequenceNumberTransform(dataSequenceNumber, ordinal));
}
else if (column.isRowIdColumn() && fileFirstRowId.isPresent()) {
appendRowNumberColumn = true;
transforms.transform(new RowIdTransform(fileFirstRowId.get(), ordinal));
Contributor

ditto

Contributor

@chenjian2664 chenjian2664 left a comment

Reviewed: "Allow table procedures to declare which columns to read"

Optional<TableLayout> layout = metadata.getLayoutForTableExecute(session, executeHandle);

List<Symbol> symbols = visibleFields(tableScanPlan);
Set<String> expectedColumnNames = metadata.getColumnNamesForTableExecute(session, executeHandle);
Contributor

What about returning Optional<List<ColumnHandle>>?

Member Author

This API is required and null is not an allowed response. The connector metadata has a default implementation so this is backwards compatible.

Contributor

@Praveen2112 Please take a look

Member Author

Actually, @electrum pointed out that we could be using ColumnHandles here instead of names. I was reacting to the optional part, not the column handle part.
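
For readers following this thread, the backwards-compatibility argument above (a required, non-null method with a default implementation) can be sketched roughly as follows. This is a minimal illustration only: `ExampleMetadata`, `visibleColumns`, and `columnsForTableExecute` are hypothetical names, not Trino's SPI, and the choice of "all visible columns" as the default is an assumption, since the PR's actual default is not shown in this conversation.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TableExecuteColumnsSketch
{
    // Hypothetical stand-in for a connector metadata interface; not Trino's SPI.
    public interface ExampleMetadata
    {
        List<String> visibleColumns();

        // Assumed default: read every visible column, so connectors that
        // never heard of this method keep their existing behavior.
        default Set<String> columnsForTableExecute()
        {
            return new LinkedHashSet<>(visibleColumns());
        }
    }

    // A connector written before the method existed; it does not override it.
    public static class LegacyConnector
            implements ExampleMetadata
    {
        @Override
        public List<String> visibleColumns()
        {
            return List.of("id", "v");
        }
    }

    // A connector that additionally requests the hidden lineage columns.
    public static class LineageAwareConnector
            implements ExampleMetadata
    {
        @Override
        public List<String> visibleColumns()
        {
            return List.of("id", "v");
        }

        @Override
        public Set<String> columnsForTableExecute()
        {
            Set<String> columns = new LinkedHashSet<>(visibleColumns());
            columns.add("$row_id");
            columns.add("$last_updated_sequence_number");
            return columns;
        }
    }
}
```

Because the method always returns a set, callers never need a null or `Optional` check; a connector that has nothing special to request simply inherits the default.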

String procedureName,
Map<String, Object> executeProperties);

Set<String> getColumnNamesForTableExecute(Session session, TableExecuteHandle tableExecuteHandle);
Contributor

Involve @Praveen2112 @ebyhr
Please help to review this commit

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Feb 10, 2026
@dain dain added the stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. label Feb 17, 2026
@dain dain force-pushed the row-lineage branch 6 times, most recently from f6e342b to 72789e0 Compare February 27, 2026 02:31
else if (!fileColumnsByIcebergId.containsKey(column.getBaseColumnIdentity().getId())) {
Object initialDefault = getInitialDefault(tableSchema, column.getBaseColumnIdentity().getId());
transforms.constantValue(nativeValueToBlock(column.getType(), initialDefault));
if (column.isLastUpdatedSequenceNumberColumn()) {
Member

These should probably be else if on the outer level, after the other if (column.isXxx)

Member Author

Actually, this is correct. This handles the case where the row ID or LUSN is not present in the file and we must synthesize it; otherwise we must read it from the file.
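
The synthesize-or-read rule described in this reply can be sketched as below. It assumes the Iceberg v3 inheritance rules: a missing row ID is derived as the file's first row ID plus the row's position in the file, and a missing last-updated sequence number falls back to the file's data sequence number. `RowLineageResolver` and its method names are hypothetical and are not the PR's `RowIdTransform` or `DataSequenceNumberTransform`.

```java
import java.util.OptionalLong;

public class RowLineageResolver
{
    // Use the value stored in the data file when present; otherwise
    // synthesize it from the file's first row ID and the row position.
    public static long resolveRowId(OptionalLong storedRowId, long fileFirstRowId, long rowPosition)
    {
        return storedRowId.isPresent() ? storedRowId.getAsLong() : fileFirstRowId + rowPosition;
    }

    // Use the stored value when present; otherwise inherit the data
    // sequence number of the file that contains the row.
    public static long resolveLastUpdatedSequenceNumber(OptionalLong storedValue, long fileDataSequenceNumber)
    {
        return storedValue.orElse(fileDataSequenceNumber);
    }
}
```

For example, `resolveRowId(OptionalLong.empty(), 2, 1)` yields 3, while `resolveRowId(OptionalLong.of(1), 4, 0)` keeps the stored value 1, which is the behavior a rewritten row needs in order to retain its identity.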

}
}
}

Contributor

Add a check to verify that pendingSourceRowId is null at the end, so that we don't leave any unhandled row IDs.

Member Author

I rewrote this

private static Block createRowIdBlock(Page inputPage, int dataColumnCount, int[] additionPositions, int additionCount)
{
// For V3, we need to extract source_row_id from the merge row ID for UPDATE_INSERT rows.
// UPDATE_DELETE is immediately followed by UPDATE_INSERT, so we track pending source row IDs.
Contributor

"UPDATE_DELETE is immediately followed by UPDATE_INSERT": where is the logic that guarantees this?

Member Author

I rewrote this with more explicit handling and verification. Generally, none of these update systems actually verify this stuff, but I am happy to add it.
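
One way to picture the explicit verification discussed in this thread is the sketch below. It is not the PR's code: `PendingRowIdTracker`, `MergeRow`, and `Operation` are hypothetical names, and the model assumes each merge row carries an operation tag plus the source row ID (ignored for plain inserts). It checks both points raised by the reviewers: every UPDATE_DELETE must be immediately followed by an UPDATE_INSERT, and no pending row ID may be left over at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

public class PendingRowIdTracker
{
    public enum Operation { INSERT, DELETE, UPDATE_DELETE, UPDATE_INSERT }

    // sourceRowId is only meaningful for UPDATE_DELETE rows in this sketch.
    public record MergeRow(Operation operation, long sourceRowId) {}

    // For each written row (INSERT or UPDATE_INSERT), returns the row ID to
    // store: the preserved source row ID for updates, empty for new rows.
    public static List<OptionalLong> rowIdsForWrittenRows(List<MergeRow> rows)
    {
        List<OptionalLong> result = new ArrayList<>();
        OptionalLong pending = OptionalLong.empty();
        for (MergeRow row : rows) {
            switch (row.operation()) {
                case UPDATE_DELETE -> {
                    requireNoPending(pending);
                    pending = OptionalLong.of(row.sourceRowId());
                }
                case UPDATE_INSERT -> {
                    if (pending.isEmpty()) {
                        throw new IllegalStateException("UPDATE_INSERT without a preceding UPDATE_DELETE");
                    }
                    result.add(pending);
                    pending = OptionalLong.empty();
                }
                case INSERT -> {
                    requireNoPending(pending);
                    result.add(OptionalLong.empty());
                }
                case DELETE -> requireNoPending(pending);
            }
        }
        // Final check: no UPDATE_DELETE may be left without its UPDATE_INSERT.
        requireNoPending(pending);
        return result;
    }

    private static void requireNoPending(OptionalLong pending)
    {
        if (pending.isPresent()) {
            throw new IllegalStateException("Unhandled source row ID: " + pending.getAsLong());
        }
    }
}
```

Failing fast with `IllegalStateException` turns a silently lost row identity into an immediate error, which is the trade-off this sketch makes in exchange for a few extra checks per row.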

Member

@ebyhr ebyhr left a comment

I don't think this PR is ready for merge yet. Please request another review round before you merge this PR. The repeated UPDATE scenario is broken:

CREATE TABLE test (name varchar) WITH (format_version = 3);
INSERT INTO test VALUES 'alice', 'bob';
INSERT INTO test VALUES 'carol', 'david';
SELECT name, "$row_id", "$last_updated_sequence_number" FROM test;
 name  | $row_id | $last_updated_sequence_number
-------+---------+-------------------------------
 alice |       0 |                             2
 carol |       2 |                             3
 bob   |       1 |                             2
 david |       3 |                             3

UPDATE test SET name = 'BOB' WHERE name = 'bob';
SELECT name, "$row_id", "$last_updated_sequence_number" FROM test;
 name  | $row_id | $last_updated_sequence_number
-------+---------+-------------------------------
 carol |       2 |                             3
 david |       3 |                             3
 BOB   |       1 |                             4
 alice |       0 |                             2

UPDATE test SET name = 'BOB1' WHERE name = 'BOB';
SELECT name, "$row_id", "$last_updated_sequence_number" FROM test;
 name  | $row_id | $last_updated_sequence_number
-------+---------+-------------------------------
 carol |       2 |                             3
 BOB1  |       4 |                             5
 alice |       0 |                             2
 david |       3 |                             3

In the last result, the BOB1 row should return 1 for the $row_id column.
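
The expectation behind this report is that an UPDATE never introduces a new row ID, so the set of row IDs should be identical before and after. A small, self-contained check of that invariant, using the values from the transcripts above, might look like this. `RowIdInvariant` and `unexpectedRowIds` are hypothetical helper names and are not part of the PR or of Trino's test framework.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RowIdInvariant
{
    // Row IDs that appear after the statement but did not exist before it.
    // For an UPDATE that preserves row identity this set must be empty.
    public static Set<Long> unexpectedRowIds(Collection<Long> before, Collection<Long> after)
    {
        Set<Long> unexpected = new TreeSet<>(after);
        unexpected.removeAll(before);
        return unexpected;
    }

    public static void main(String[] args)
    {
        List<Long> initial = List.of(0L, 2L, 1L, 3L);
        List<Long> afterFirstUpdate = List.of(2L, 3L, 1L, 0L);
        List<Long> afterSecondUpdate = List.of(2L, 4L, 0L, 3L);

        System.out.println(unexpectedRowIds(initial, afterFirstUpdate));           // prints []
        System.out.println(unexpectedRowIds(afterFirstUpdate, afterSecondUpdate)); // prints [4]
    }
}
```

The first UPDATE passes the check, while the second one surfaces row ID 4 as an identity that should not have been created.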

protected TimeUnit storageTimePrecision;

- protected BaseIcebergConnectorTest(IcebergFileFormat format)
+ protected BaseIcebergConnectorTest(IcebergFileFormat format, int formatVersion)
Member

Please test SHOW STATS with the new metadata columns. I believe it returns incorrect results.

Member

Also, please add a materialized view (MV) test case:

    @Test
    void testRowLineageWithMaterializedViews()
    {
        try (TestTable table = newTrinoTable("test_materialized_views", "(id int, name varchar) WITH (format_version = 3)")) {
            assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, 'Alice'), (2, 'Bob')", 2);

            String materializedViewName = "test_materialized_view_" + randomNameSuffix();
            assertUpdate("CREATE MATERIALIZED VIEW " + materializedViewName + " AS SELECT id, name, \"$row_id\", \"$last_updated_sequence_number\" FROM " + table.getName());

            assertUpdate("REFRESH MATERIALIZED VIEW " + materializedViewName, 2);

            assertThat(query("SELECT id, name, \"$row_id\", \"$last_updated_sequence_number\" FROM " + materializedViewName))
                    .matches("""
                            VALUES (1, VARCHAR 'Alice', BIGINT '0', BIGINT '2'),
                                   (2, 'Bob', BIGINT '1', BIGINT '2')
                            """);

            assertUpdate("UPDATE " + table.getName() + " SET name = 'Alice Updated' WHERE id = 1", 1);

            assertUpdate("REFRESH MATERIALIZED VIEW " + materializedViewName, 2);

            assertThat(query("SELECT id, name, \"$row_id\", \"$last_updated_sequence_number\" FROM " + materializedViewName))
                    .matches("""
                            VALUES (1, VARCHAR 'Alice Updated', BIGINT '0', BIGINT '3'),
                                   (2, 'Bob', BIGINT '1', BIGINT '2')
                            """);

            assertUpdate("DROP MATERIALIZED VIEW " + materializedViewName);
        }
    }

Column name and variable were not updated to be merge specific when the
merge PR was being reviewed.
@dain
Member Author

dain commented Mar 4, 2026

@ebyhr I believe this is ready to go now

@ebyhr
Member

ebyhr commented Mar 4, 2026

@dain Thanks for addressing the comments. I'll review this PR again tomorrow or shortly after.

@dain
Member Author

dain commented Mar 5, 2026

@ebyhr any updates?

@ebyhr
Member

ebyhr commented Mar 6, 2026

@dain Sorry, I had to work on a different issue yesterday. It looks like there’s still a bug with the Avro format.

CREATE TABLE test (name varchar) WITH (format_version = 3, format = 'AVRO');
INSERT INTO test VALUES 'alice', 'bob';
INSERT INTO test VALUES 'carol', 'david';
UPDATE test SET name = 'BOB' WHERE name = 'bob';
SELECT name, "$row_id", "$last_updated_sequence_number" FROM test;
 name  | $row_id | $last_updated_sequence_number
-------+---------+-------------------------------
 carol |       2 |                             3
 david |       3 |                             3
 alice |       0 |                             2
 BOB   |       4 |                             4

BOB should return 1 as $row_id.

@ebyhr
Member

ebyhr commented Mar 6, 2026

Partitioned tables also have a bug, regardless of the file format:

CREATE TABLE test (name varchar, x bigint) WITH (format_version = 3, partitioning = ARRAY['x']);
INSERT INTO test VALUES ('alice', 1), ('bob', 2);
INSERT INTO test VALUES ('carol', 1), ('david', 2);
UPDATE test SET name = 'BOB' WHERE name = 'bob';
SELECT name, "$row_id" FROM test;
 name  | $row_id
-------+---------
 BOB   |       4
 carol |       2
 david |       3
 alice |       1
(4 rows)

BOB should return 0 as $row_id.

Labels

cla-signed iceberg Iceberg connector lakehouse stale stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.

5 participants