Skip to content

Antalya 26.1 Backport of #95476, #98360 - enable prewhere for iceberg (and fixes)#1581

Open
mkmkme wants to merge 4 commits intoantalya-26.1from
backports/antalya-26.1/95476
Open

Antalya 26.1 Backport of #95476, #98360 - enable prewhere for iceberg (and fixes)#1581
mkmkme wants to merge 4 commits intoantalya-26.1from
backports/antalya-26.1/95476

Conversation

@mkmkme
Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme commented Mar 26, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Enable PREWHERE optimization for Iceberg tables. (ClickHouse#95476 by @scanhex12)
Fix LOGICAL_ERROR exception in the Parquet V3 reader when PREWHERE references a column not present in the Parquet file (ClickHouse#98360 by @alexey-milovidov)

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 26, 2026

Workflow [PR], commit [aca8923]

…ere-external-columns

Fix exception in Parquet PREWHERE when column is not in file
@mkmkme mkmkme changed the title Antalya 26.1 Backport of #95476 - enable prewhere for iceberg Antalya 26.1 Backport of #95476, #98360 - enable prewhere for iceberg (and fixes) Mar 26, 2026
@mkmkme
Copy link
Copy Markdown
Collaborator Author

mkmkme commented Mar 26, 2026

AI audit note: This review comment was generated by AI (claude-4.6-sonnet-medium-thinking).

Audit update for PR #1581 (Antalya 26.1 backport of ClickHouse#95476, ClickHouse#98360 – enable PREWHERE for Iceberg):

Confirmed defects:

Medium: PREWHERE crashes on Iceberg tables that have undergone schema evolution (column renames)

  • Impact: Any SELECT with a filterable WHERE condition (auto-promoted to PREWHERE by the planner) on an Iceberg table whose Parquet files predate a column rename will throw a runtime exception at read time, making those files unreadable once prewhere is active.
  • Anchor: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp / createReader (schema_changed + mapper path, line ~856); src/Storages/ObjectStorage/StorageObjectStorage.cpp TODO block lines ~222-241
  • Trigger: Iceberg table has at least one column rename in its schema history; pre-rename Parquet files are still present; any query with a filterable WHERE/PREWHERE on the renamed column is executed against a table-engine (not table function).
  • Why defect: When getInitialSchemaByPath() returns a non-null schema (underlying_format_read_schema_id != current schema_id), initial_header is built with OLD ClickHouse column names (e.g. old_b). The prewhere_info passed alongside it was constructed against the CURRENT ClickHouse schema (b). In Reader::preparePrewhere(), extended_sample_block.getPositionByName("b") is called on a block that only contains old_b, throwing "Not found column".

Code evidence — StorageObjectStorage.cpp, the TODO predating this PR:

/// TODO: Known problems with datalake prewhere:
///  * If the iceberg table went through schema evolution, columns read from file
///    may need to be renamed or typecast before applying prewhere. [...]
///    We should probably change this and delete that additional code?

Code evidence — StorageObjectStorageSource.cpp, the schema-changed path:

return std::make_shared<FormatFilterInfo>(
    format_filter_info->filter_actions_dag,
    format_filter_info->context.lock(),
    mapper,
    format_filter_info->row_level_filter,
    format_filter_info->prewhere_info  // ← references CURRENT column names,
);                                     //   but initial_header has OLD names
  • Upstream status: This defect is NOT introduced by this Altinity PR. It originates from upstream PR enable prewhere for iceberg ClickHouse/ClickHouse#95476 (scanhex12/enable_prewhere_for_iceberg, merged Jan 2026) and is present verbatim in the current upstream master. History:
  • Fix direction (short): When schema_changed == true and a mapper is present, either (a) strip prewhere_info from the new FormatFilterInfo (safe fallback to WHERE for evolved files), or (b) remap the prewhere DAG column names through the mapper before handing them to the Parquet reader.
  • Regression test direction (short): Iceberg table, write Parquet files, rename a column, write more files, SELECT with WHERE on the renamed column — assert no exception and correct results from both old and new files.

Coverage summary:

  • Scope reviewed: 5 changed source files + 2 new test files; Parquet reader INPUT-node fix, Iceberg supportsPrewhere() dispatch chain (DataLakeConfiguration template + StorageIcebergConfiguration delegation), FormatFilterInfo schema-change path in createReader, StorageObjectStorage construction-time guard, integration and stateless tests.
  • Categories failed: schema-evolution × PREWHERE interaction (confirmed, upstream origin)
  • Categories passed: INPUT-node external_columns fix correctness; compile-time USE_AVRO flag semantics; base-class supportsPrewhere() default regression check; FormatFilterInfo shared-ptr lifetime and thread safety; error-message improvement in preparePrewhere; stateless test 04001 logic; StorageIcebergConfiguration::getImpl() init ordering; Paimon/DeltaLake/Hudi remain correctly excluded from prewhere.
  • Assumptions/limits: Schema-evolution path analysis is static; no Iceberg tables with column-rename history were runtime-tested; upstream tracking issue does not yet exist.

@svb-alt svb-alt requested a review from arthurpassos March 26, 2026 12:36
Copy link
Copy Markdown

@ianton-ru ianton-ru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants