Antalya 26.1 Backport of #98140, #99230, #99231 and #96545 - Parquet metadata cache (upstream impl) and arrow library version bump #1574
…t_metadata_cache_v2 Parquet metadata cache v2
Update arrow to version 23.0.0
Avoid possible crash in Parquet with metadata cache enabled
…rquet_format_check_to_metadata_cache Add parquet format check to metadata cache
@codex review the backports - compare the changes and conflicts with the original PRs if possible
Codex Review: Something went wrong. Try again later by commenting “@codex review”.
Ran an audit. Here's the result. Side note, here's a comment from AI:
Good job!
AI audit note: This review comment was generated by AI (claude-4.6-opus-high-thinking).
Audit for PR #1574: Antalya 26.1 backport of ClickHouse#98140, ClickHouse#99230, ClickHouse#99231, ClickHouse#96545 (Parquet metadata cache + Arrow 23)
Confirmed defects
Medium:
Low: Log message inside cache-miss loader says "got metadata from cache" (exists in upstream)
Low: Misleading argument comments in
Coverage summary
mkmkme left a comment
I can't fully cover all of those backports myself, but a brief look seems okay, and the AI review approved it, saying that it was backported accurately. So LGTM.
|
Here is an audit review after the latest changes. @arthurpassos, I assume these low-risk defects are not related?
AI audit note: This review comment was generated by AI (gpt-5.3-codex).
Audit update for PR #1574 (Parquet metadata cache backport + Arrow 23 bump):
Confirmed defects
Low: Cache-miss loader logs a cache-hit message
Impact: Misleading trace diagnostics during cache troubleshooting; no direct correctness impact, but it can hide real cache-behavior signals during incidents.
Anchor:
Trigger: Any cache miss path that executes the loader lambda.
Affected transition:
Why defect: The loader lambda runs when cache lookup misses, but the emitted log string says metadata was obtained "from cache", contradicting the actual transition and subsequent miss accounting.
Affected subsystem and blast radius: Parquet metadata cache observability (
Smallest logical reproduction steps: Enable trace logging, run two reads of a Parquet object with
Logical fault-injection mapping: Inject cold-cache state (empty cache) to force loader execution and expose incorrect hit/miss log semantics.
Fix direction (short): Change loader log text to "loaded metadata from file" (or equivalent) and keep hit/miss logs only in post-
Regression test direction (short): Add a trace-log assertion test on cold-cache read ensuring miss path never emits "from cache".
Code evidence:
auto load_fn_wrapper = [&]()
{
    auto metadata = load_fn();
    LOG_TRACE(log, "got metadata from cache {} | {}", key.file_path, key.etag);
    return std::make_shared<ParquetMetadataCacheCell>(std::move(metadata));
};
auto result = Base::getOrSet(key, load_fn_wrapper);
if (result.second)
    ProfileEvents::increment(ProfileEvents::ParquetMetadataCacheMisses);
Low: Argument comments are mislabeled in
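For the cache-miss logging defect described above, the suggested fix direction can be sketched roughly as follows. This is a minimal standalone illustration, not the ClickHouse code: `Metadata`, `SketchCache`, the `trace_log` vector (standing in for `LOG_TRACE` output) and the `hits`/`misses` counters (standing in for profile events) are all hypothetical names. The only point it makes is that the loader logs a load, while hit/miss reporting happens after `getOrSet` returns.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the cached Parquet metadata cell.
struct Metadata
{
    std::string payload;
};

// Hypothetical cache sketch; not the ClickHouse cache base class.
struct SketchCache
{
    std::unordered_map<std::string, std::shared_ptr<Metadata>> cells;
    std::vector<std::string> trace_log; // stands in for LOG_TRACE output
    std::size_t hits = 0;               // stands in for a cache-hit profile event
    std::size_t misses = 0;             // stands in for a cache-miss profile event

    // Returns {value, loaded}; `loaded` is true only when the loader ran.
    std::pair<std::shared_ptr<Metadata>, bool> getOrSet(
        const std::string & key,
        const std::function<std::shared_ptr<Metadata>()> & load)
    {
        auto it = cells.find(key);
        if (it != cells.end())
            return {it->second, false};
        auto value = load();
        cells.emplace(key, value);
        return {value, true};
    }

    std::shared_ptr<Metadata> get(const std::string & key, const std::function<Metadata()> & load_fn)
    {
        auto load_fn_wrapper = [&]()
        {
            auto metadata = load_fn();
            // The loader only runs on a miss, so it reports a load, not a hit.
            trace_log.push_back("loaded metadata from file " + key);
            return std::make_shared<Metadata>(std::move(metadata));
        };

        auto result = getOrSet(key, load_fn_wrapper);

        // Hit/miss reporting lives here, after the lookup has resolved.
        if (result.second)
        {
            ++misses;
        }
        else
        {
            ++hits;
            trace_log.push_back("got metadata from cache " + key);
        }
        return result.first;
    }
};
```

With this shape, a cold read followed by a warm read of the same key produces one "loaded metadata from file" line and one "got metadata from cache" line, which is also what a trace-log regression test along the lines suggested above could assert.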
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Backport of:
ClickHouse#98140 - Added a new SLRU cache for Parquet metadata to improve read performance by removing the need to re-download files just to read metadata.
ClickHouse#99230 - Add Parquet format check to metadata cache
ClickHouse#99231 - Avoid possible crash in Parquet with metadata cache enabled
ClickHouse#96545 - Bump Arrow version
It also removes the Antalya implementation of Parquet metadata caching.
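For readers unfamiliar with the SLRU (segmented LRU) policy mentioned for ClickHouse#98140, here is a generic sketch of the idea. It is an assumption-laden illustration, not the upstream implementation: `SlruSketch` and its members are hypothetical, it counts entries rather than bytes, and it assumes both capacities are at least 1. New keys enter a probationary segment, a second access promotes a key to a protected segment, and evictions come from the probationary tail.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical segmented-LRU sketch keyed and valued by strings.
class SlruSketch
{
public:
    // Assumption: both capacities are >= 1 and are counted in entries.
    SlruSketch(std::size_t probationary_capacity, std::size_t protected_capacity)
        : prob_cap(probationary_capacity), prot_cap(protected_capacity)
    {
    }

    void put(const std::string & key, const std::string & value)
    {
        auto it = index.find(key);
        if (it != index.end())
        {
            it->second.pos->second = value; // update in place, keep segment
            return;
        }
        probationary.emplace_front(key, value);
        index[key] = Entry{probationary.begin(), false};
        evictIfNeeded();
    }

    std::optional<std::string> get(const std::string & key)
    {
        auto it = index.find(key);
        if (it == index.end())
            return std::nullopt;

        Entry & entry = it->second;
        std::string value = entry.pos->second; // copy before reshuffling lists

        if (entry.is_protected)
        {
            // Refresh recency inside the protected segment.
            protected_.splice(protected_.begin(), protected_, entry.pos);
        }
        else
        {
            // Second access: promote from probationary to protected.
            protected_.splice(protected_.begin(), probationary, entry.pos);
            entry.is_protected = true;
            if (protected_.size() > prot_cap)
                demoteOldestProtected();
        }
        return value;
    }

private:
    using Node = std::pair<std::string, std::string>;
    struct Entry
    {
        std::list<Node>::iterator pos;
        bool is_protected = false;
    };

    void demoteOldestProtected()
    {
        auto last = std::prev(protected_.end());
        index.at(last->first).is_protected = false;
        // splice keeps iterators valid, so the index entry stays correct.
        probationary.splice(probationary.begin(), protected_, last);
        evictIfNeeded();
    }

    void evictIfNeeded()
    {
        while (probationary.size() > prob_cap)
        {
            index.erase(probationary.back().first);
            probationary.pop_back();
        }
    }

    std::size_t prob_cap;
    std::size_t prot_cap;
    std::list<Node> probationary; // front = most recently used
    std::list<Node> protected_;   // front = most recently used
    std::unordered_map<std::string, Entry> index;
};
```

The design intent is that a one-off scan over many files only churns the probationary segment, while metadata that has been read at least twice stays in the protected segment.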