Commit 99b971c

[fix](parquet) Fix struct column reading error when all queried fields are missing after schema evolution (#59586)
### What problem does this PR solve?

- Related PR: #57204

**Problem Summary:**

When querying struct fields in Iceberg tables after schema evolution, if all queried struct fields are missing in old Parquet files, the code fails with the error:

```
File column name 'removed' not found in struct children
```

**Root Cause:**

When all queried struct sub-fields are missing in the old Parquet file (e.g., newly added fields after schema evolution), the code needs to find a reference column from the file schema to get repetition level (RL) and definition level (DL) information. However, if the reference column (e.g., `removed`) was dropped from the table schema, calling `root_node->get_children_node_by_file_column_name()` will fail because the column doesn't exist in `root_node`.

**Scenario:**

1. Create table with struct containing: `removed`, `rename`, `keep`, `drop_and_add`
2. Insert data (creates Parquet file with these fields)
3. Perform schema evolution: DROP `a_struct.removed`, DROP then ADD `a_struct.drop_and_add` (gets new field ID), ADD `a_struct.added`
4. Query `struct_element(a_struct, 'drop_and_add')` or `struct_element(a_struct, 'added')` on the old file
5. The query fails because:
   - All queried fields (`drop_and_add`, `added`) are missing in the old file
   - Code tries to use `removed` as reference column (it exists in file but was dropped from table schema)
   - Accessing `removed` via `root_node` fails because it doesn't exist in table schema

### Solution:

Use `TableSchemaChangeHelper::ConstNode::get_instance()` instead of looking up from `root_node` for the reference column. Since the reference column is only used to get RL/DL information (not for schema mapping), using `ConstNode` is safe and avoids the issue where the reference column doesn't exist in `root_node`.

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
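The root cause and the fix can be illustrated with a small model. The sketch below is not the Doris implementation: `Node`, `child_by_file_name`, `const_node`, and `make_root_after_evolution` are hypothetical stand-ins, and the set of children is assumed from the scenario above. It only shows why a lookup keyed on the table schema cannot find a dropped column, while a shared constant node never needs the lookup.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for a schema-mapping node (not the Doris class).
struct Node {
    // Children reachable from the table schema, keyed by file column name.
    std::map<std::string, std::shared_ptr<Node>> children;

    // Returns the mapped child, or nullptr when the table schema has no such column.
    std::shared_ptr<Node> child_by_file_name(const std::string& name) const {
        auto it = children.find(name);
        if (it == children.end()) {
            return nullptr;
        }
        return it->second;
    }
};

// Stand-in for a shared "no schema mapping needed" node, in the spirit of ConstNode.
std::shared_ptr<Node> const_node() {
    static const std::shared_ptr<Node> instance = std::make_shared<Node>();
    return instance;
}

// Assumed mapping after schema evolution: 'removed' was dropped from the table,
// and the re-added 'drop_and_add' has a new field ID, so neither maps to the old file.
Node make_root_after_evolution() {
    Node root;
    root.children["rename"] = std::make_shared<Node>();
    root.children["keep"] = std::make_shared<Node>();
    return root;
}

// Walks through the scenario; the asserts document the expected outcome.
void run_scenario() {
    Node root = make_root_after_evolution();
    // Before the fix: the reference column is resolved through root and is not found.
    assert(root.child_by_file_name("removed") == nullptr);
    // A column still present in the table schema resolves normally.
    assert(root.child_by_file_name("keep") != nullptr);
    // After the fix: the reference column gets the shared constant node, no lookup.
    assert(const_node() != nullptr);
}
```

Calling `run_scenario()` from a `main` function exercises both paths. The design point is that the reference column contributes only level information, so it does not need an entry in the table-schema mapping.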
1 parent b625efd commit 99b971c

4 files changed: +514 -4 lines changed

be/src/vec/exec/format/parquet/vparquet_column_reader.cpp

Lines changed: 8 additions & 4 deletions
@@ -932,10 +932,14 @@ Status StructColumnReader::read_column_data(
         size_t field_rows = 0;
         bool field_eof = false;
 
-        // Use root_node to get the correct child node for the reference column
-        // reference_file_column_name is the file column name, use get_children_node_by_file_column_name
-        auto ref_child_node =
-                root_node->get_children_node_by_file_column_name(reference_file_column_name);
+        // Use ConstNode for the reference column instead of looking up from root_node.
+        // The reference column is only used to get RL/DL information for determining the number
+        // of elements in the struct. It may be a column that has been dropped from the table
+        // schema (e.g., 'removed' field), but still exists in older parquet files.
+        // Since we don't need schema mapping for this column (we just need its RL/DL levels),
+        // using ConstNode is safe and avoids the issue where the reference column doesn't exist
+        // in root_node (because it was dropped from table schema).
+        auto ref_child_node = TableSchemaChangeHelper::ConstNode::get_instance();
         not_missing_orig_column_size = temp_column->size();
 
         RETURN_IF_ERROR((*reference_reader)
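The comment in the hunk above says the reference column is read only for its RL/DL levels. As a rough illustration of what those levels provide, the sketch below derives the row count and struct nullness from a reference column's definition levels and treats every queried field as NULL. It assumes the simplest layout (an optional top-level struct with optional scalar children and no repetition, so each level entry is one row); `MissingFieldColumn` and `fill_missing_field` are hypothetical names, not Doris APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of materializing a struct field that is absent from the file.
// Every field value is NULL; only the struct-level nullness varies per row.
struct MissingFieldColumn {
    std::vector<bool> struct_is_null;  // one entry per row
    std::size_t num_rows = 0;          // number of struct values produced
};

// ref_def_levels: definition levels of any child column that exists in the file.
// struct_def_level: level at which the struct itself is defined
// (1 for an optional top-level struct, which is the assumption here).
MissingFieldColumn fill_missing_field(const std::vector<int16_t>& ref_def_levels,
                                      int16_t struct_def_level) {
    assert(struct_def_level >= 0);
    MissingFieldColumn out;
    out.num_rows = ref_def_levels.size();
    out.struct_is_null.reserve(ref_def_levels.size());
    for (int16_t dl : ref_def_levels) {
        // A level below the struct's own level means the struct is NULL in this row.
        out.struct_is_null.push_back(dl < struct_def_level);
    }
    return out;
}
```

With levels `{2, 0, 1}` and `struct_def_level = 1`, the sketch yields three rows in which only the second struct is NULL. The values of the reference column are never consulted, which is why no schema mapping is needed for it.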
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+use demo.test_db;
+
+DROP TABLE IF EXISTS test_struct_evolution;
+
+-- Test case for struct schema evolution bug
+-- Bug scenario: When querying a struct field after schema evolution, if all queried fields are missing
+-- in old Parquet files, the code tries to find a reference column from file schema. However, if the
+-- reference column (e.g., 'removed') was dropped from table schema, accessing it via root_node will fail.
+--
+-- Steps to reproduce:
+-- 1. Create table with struct containing: removed, rename, keep, drop_and_add
+-- 2. Insert data (creates Parquet file with these fields)
+-- 3. DROP a_struct.removed - removes field from table schema
+-- 4. DROP a_struct.drop_and_add then ADD a_struct.drop_and_add - gets new field ID
+-- 5. ADD a_struct.added - adds new field
+-- 6. Query struct_element(a_struct, 'drop_and_add') or struct_element(a_struct, 'added')
+--    -> This will fail because all queried fields are missing in old file, and the reference
+--       column 'removed' doesn't exist in root_node (it was dropped from table schema)
+
+-- Step 1: Create table
+CREATE TABLE test_struct_evolution (
+    id BIGINT,
+    a_struct STRUCT<removed: BIGINT, rename: BIGINT, keep: BIGINT, drop_and_add: BIGINT>
+) USING ICEBERG
+TBLPROPERTIES ('write.format.default' = 'parquet', 'format-version' = 2);
+
+-- Step 2: Insert data (creates Parquet file with original schema)
+INSERT INTO test_struct_evolution
+SELECT 1, named_struct('removed', 10, 'rename', 11, 'keep', 12, 'drop_and_add', 13);
+
+-- Step 3: Schema evolution - drop removed field
+ALTER TABLE test_struct_evolution DROP COLUMN a_struct.removed;
+
+-- Step 4: Rename field (field ID stays the same)
+ALTER TABLE test_struct_evolution RENAME COLUMN a_struct.rename TO renamed;
+
+-- Step 5: Drop and add drop_and_add (new field ID)
+ALTER TABLE test_struct_evolution DROP COLUMN a_struct.drop_and_add;
+ALTER TABLE test_struct_evolution ADD COLUMN a_struct.drop_and_add BIGINT;
+
+-- Step 6: Add new field
+ALTER TABLE test_struct_evolution ADD COLUMN a_struct.added BIGINT;
+
+-- Step 7: Insert new data after schema evolution (creates new Parquet file)
+INSERT INTO test_struct_evolution
+SELECT 2, named_struct('renamed', 21, 'keep', 22, 'drop_and_add', 23, 'added', 24);
+
+-- Now the table contains two Parquet files:
+-- - Old file: contains removed, rename, keep, drop_and_add (old field ID)
+-- - New file: contains renamed, keep, drop_and_add (new field ID), added
+--
+-- Querying struct_element(a_struct, 'drop_and_add') or struct_element(a_struct, 'added')
+-- on the old file will trigger the bug
+
+-- ============================================================
+-- ORC format test table (for completeness, though ORC doesn't have the same bug)
+-- ============================================================
+DROP TABLE IF EXISTS test_struct_evolution_orc;
+
+-- Create ORC format table with same schema evolution scenario
+CREATE TABLE test_struct_evolution_orc (
+    id BIGINT,
+    a_struct STRUCT<removed: BIGINT, rename: BIGINT, keep: BIGINT, drop_and_add: BIGINT>
+) USING ICEBERG
+TBLPROPERTIES ('write.format.default' = 'orc', 'format-version' = 2);
+
+-- Insert initial data (creates ORC file with original schema)
+INSERT INTO test_struct_evolution_orc
+SELECT 1, named_struct('removed', 10, 'rename', 11, 'keep', 12, 'drop_and_add', 13);
+
+-- Schema evolution - same operations as Parquet table
+ALTER TABLE test_struct_evolution_orc DROP COLUMN a_struct.removed;
+ALTER TABLE test_struct_evolution_orc RENAME COLUMN a_struct.rename TO renamed;
+ALTER TABLE test_struct_evolution_orc DROP COLUMN a_struct.drop_and_add;
+ALTER TABLE test_struct_evolution_orc ADD COLUMN a_struct.drop_and_add BIGINT;
+ALTER TABLE test_struct_evolution_orc ADD COLUMN a_struct.added BIGINT;
+
+-- Insert new data after schema evolution (creates new ORC file)
+INSERT INTO test_struct_evolution_orc
+SELECT 2, named_struct('renamed', 21, 'keep', 22, 'drop_and_add', 23, 'added', 24);
+
+-- ============================================================
+-- Case sensitivity test table (mixed case field names)
+-- ============================================================
+DROP TABLE IF EXISTS test_struct_evolution_case;
+
+-- Test case for struct schema evolution with mixed case field names
+-- This tests that case sensitivity is handled correctly when:
+-- - Field names have mixed case (e.g., REMOVED, rename, keep, drop_and_add)
+-- - Schema evolution operations are performed
+-- - Querying struct fields with different case patterns
+
+-- Step 1: Create table with mixed case field names
+CREATE TABLE test_struct_evolution_case (
+    id BIGINT,
+    a_struct STRUCT<REMOVED: BIGINT, rename: BIGINT, keep: BIGINT, drop_and_add: BIGINT>
+) USING ICEBERG
+TBLPROPERTIES ('write.format.default' = 'parquet', 'format-version' = 2);
+
+-- Step 2: Insert data (creates Parquet file with original schema)
+INSERT INTO test_struct_evolution_case
+SELECT 1, named_struct('REMOVED', 10, 'rename', 11, 'keep', 12, 'drop_and_add', 13);
+
+-- Step 3: Schema evolution - drop REMOVED field (uppercase)
+ALTER TABLE test_struct_evolution_case DROP COLUMN a_struct.REMOVED;
+
+-- Step 4: Rename field (field ID stays the same)
+ALTER TABLE test_struct_evolution_case RENAME COLUMN a_struct.rename TO renamed;
+
+-- Step 5: Drop and add drop_and_add with case change (new field ID)
+-- Initial: drop_and_add (lowercase), after re-add: DROP_AND_ADD (uppercase)
+ALTER TABLE test_struct_evolution_case DROP COLUMN a_struct.drop_and_add;
+ALTER TABLE test_struct_evolution_case ADD COLUMN a_struct.DROP_AND_ADD BIGINT;
+
+-- Step 6: Add new field
+ALTER TABLE test_struct_evolution_case ADD COLUMN a_struct.added BIGINT;
+
+-- Step 7: Insert new data after schema evolution (creates new Parquet file)
+-- Note: Use DROP_AND_ADD (uppercase) in the new data
+INSERT INTO test_struct_evolution_case
+SELECT 2, named_struct('renamed', 21, 'keep', 22, 'DROP_AND_ADD', 23, 'added', 24);
+
+-- ============================================================
+-- ORC format test table with mixed case (for completeness)
+-- ============================================================
+DROP TABLE IF EXISTS test_struct_evolution_case_orc;
+
+-- Create ORC format table with same schema evolution scenario and mixed case
+CREATE TABLE test_struct_evolution_case_orc (
+    id BIGINT,
+    a_struct STRUCT<REMOVED: BIGINT, rename: BIGINT, keep: BIGINT, drop_and_add: BIGINT>
+) USING ICEBERG
+TBLPROPERTIES ('write.format.default' = 'orc', 'format-version' = 2);
+
+-- Insert initial data (creates ORC file with original schema)
+INSERT INTO test_struct_evolution_case_orc
+SELECT 1, named_struct('REMOVED', 10, 'rename', 11, 'keep', 12, 'drop_and_add', 13);
+
+-- Schema evolution - same operations as Parquet table
+ALTER TABLE test_struct_evolution_case_orc DROP COLUMN a_struct.REMOVED;
+ALTER TABLE test_struct_evolution_case_orc RENAME COLUMN a_struct.rename TO renamed;
+-- Drop and add with case change: drop_and_add (lowercase) -> DROP_AND_ADD (uppercase)
+ALTER TABLE test_struct_evolution_case_orc DROP COLUMN a_struct.drop_and_add;
+ALTER TABLE test_struct_evolution_case_orc ADD COLUMN a_struct.DROP_AND_ADD BIGINT;
+ALTER TABLE test_struct_evolution_case_orc ADD COLUMN a_struct.added BIGINT;
+
+-- Insert new data after schema evolution (creates new ORC file)
+-- Note: Use DROP_AND_ADD (uppercase) in the new data
+INSERT INTO test_struct_evolution_case_orc
+SELECT 2, named_struct('renamed', 21, 'keep', 22, 'DROP_AND_ADD', 23, 'added', 24);
+
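The SQL comments above note that a rename keeps the field ID while a drop followed by an add produces a new one. The sketch below models that with a lookup keyed by field ID, to show why `renamed` still resolves in the old file while `drop_and_add` and `added` do not. The concrete ID values and the helper names `old_file_fields` and `resolve_in_file` are assumptions made for illustration; the real IDs are assigned by Iceberg and are not shown in this commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Field IDs written into the old data file (hypothetical values).
std::map<int, std::string> old_file_fields() {
    return {{3, "removed"}, {4, "rename"}, {5, "keep"}, {6, "drop_and_add"}};
}

// Returns the file column holding the data for a table field ID,
// or an empty string when the old file has no column with that ID.
std::string resolve_in_file(const std::map<int, std::string>& file_fields_by_id,
                            int table_field_id) {
    auto it = file_fields_by_id.find(table_field_id);
    if (it == file_fields_by_id.end()) {
        return std::string();
    }
    return it->second;
}

// Table schema after evolution, with assumed IDs:
//   renamed -> 4 (same ID as the old 'rename'), keep -> 5,
//   drop_and_add -> 7 (new ID after drop + add), added -> 8.
void run_field_id_scenario() {
    const std::map<int, std::string> file = old_file_fields();
    assert(resolve_in_file(file, 4) == "rename");  // rename keeps its data
    assert(resolve_in_file(file, 5) == "keep");
    assert(resolve_in_file(file, 7).empty());      // re-added field: missing in old file
    assert(resolve_in_file(file, 8).empty());      // newly added field: missing in old file
}
```

A query that touches only IDs 7 and 8 finds no matching column in the old file, which is the "all queried fields are missing" case the fix targets.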
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
+-- This file is automatically generated. You should know what you did if you want to edit this
+-- !desc --
+id bigint Yes true \N
+a_struct struct<renamed:bigint,keep:bigint,drop_and_add:bigint,added:bigint> Yes true \N
+
+-- !select_all --
+1 {"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+2 {"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !struct_keep --
+12
+22
+
+-- !struct_renamed --
+11
+21
+
+-- !struct_drop_and_add --
+\N
+23
+
+-- !struct_added --
+\N
+24
+
+-- !struct_full --
+{"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+{"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !struct_predicate_1 --
+1
+
+-- !struct_predicate_2 --
+1
+
+-- !struct_predicate_3 --
+1
+
+-- !struct_predicate_4 --
+2
+
+-- !struct_multi --
+11 12 \N \N
+21 22 23 24
+
+-- !struct_distinct --
+11 \N 12
+21 24 22
+
+-- !orc_desc --
+id bigint Yes true \N
+a_struct struct<renamed:bigint,keep:bigint,drop_and_add:bigint,added:bigint> Yes true \N
+
+-- !orc_select_all --
+1 {"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+2 {"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !orc_struct_keep --
+12
+22
+
+-- !orc_struct_renamed --
+11
+21
+
+-- !orc_struct_drop_and_add --
+\N
+23
+
+-- !orc_struct_added --
+\N
+24
+
+-- !orc_struct_full --
+{"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+{"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !orc_struct_multi --
+11 12 \N \N
+21 22 23 24
+
+-- !case_desc --
+id bigint Yes true \N
+a_struct struct<renamed:bigint,keep:bigint,drop_and_add:bigint,added:bigint> Yes true \N
+
+-- !case_select_all --
+1 {"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+2 {"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !case_struct_keep --
+12
+22
+
+-- !case_struct_renamed --
+11
+21
+
+-- !case_struct_drop_and_add --
+\N
+23
+
+-- !case_struct_added --
+\N
+24
+
+-- !case_struct_full --
+{"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+{"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !case_struct_predicate_1 --
+1
+
+-- !case_struct_predicate_2 --
+1
+
+-- !case_struct_predicate_3 --
+1
+
+-- !case_struct_predicate_4 --
+2
+
+-- !case_struct_multi --
+11 12 \N \N
+21 22 23 24
+
+-- !case_struct_distinct --
+11 \N 12
+21 24 22
+
+-- !case_orc_desc --
+id bigint Yes true \N
+a_struct struct<renamed:bigint,keep:bigint,drop_and_add:bigint,added:bigint> Yes true \N
+
+-- !case_orc_select_all --
+1 {"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+2 {"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !case_orc_struct_keep --
+12
+22
+
+-- !case_orc_struct_renamed --
+11
+21
+
+-- !case_orc_struct_drop_and_add --
+\N
+23
+
+-- !case_orc_struct_added --
+\N
+24
+
+-- !case_orc_struct_full --
+{"renamed":11, "keep":12, "drop_and_add":null, "added":null}
+{"renamed":21, "keep":22, "drop_and_add":23, "added":24}
+
+-- !case_orc_struct_multi --
+11 12 \N \N
+21 22 23 24
+
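In the expected output above, the `case_desc` and `case_orc_desc` results list `drop_and_add` in lowercase even though the mixed-case tables re-add the field as `DROP_AND_ADD`. One way to get that behavior is to compare field names case-insensitively. The sketch below shows such a comparison; it is an assumption about the general technique, not a description of how Doris normalizes names, and `to_lower_ascii` and `same_field_name` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Lowercases ASCII letters; other bytes are passed through unchanged.
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Case-insensitive comparison of two struct field names.
bool same_field_name(const std::string& a, const std::string& b) {
    return to_lower_ascii(a) == to_lower_ascii(b);
}

// The names used by the mixed-case test tables.
void run_case_scenario() {
    assert(same_field_name("DROP_AND_ADD", "drop_and_add"));
    assert(same_field_name("REMOVED", "removed"));
    assert(!same_field_name("added", "renamed"));
}
```

Name comparison alone does not decide whether a field has data in an old file: the re-added `DROP_AND_ADD` still has a new field ID, so the first row reads it as `\N` in both the Parquet and ORC results.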