Antalya 25.8: Fix use after free in rescheduleTasksFromReplica by ianton-ru · Pull Request #1561 · Altinity/ClickHouse

ianton-ru · 2026-03-20T12:10:25Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solved #1554
Fixed incorrect use of an iterator to a removed element.

Backport of #1493
Fix file identifier for distributed tasks
(logic was backported in #1237)

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

Regression jobs to run:

ianton-ru · 2026-03-20T12:11:00Z

@codex review

github-actions · 2026-03-20T12:11:22Z

Workflow [PR], commit [c6dda09]

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c71caec37e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-20T12:17:51Z

src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp

+            auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
+            auto file_replica_idx = getReplicaForFile(file_identifier);
+            unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));


Preserve identifier semantics when re-queueing lost-replica task

When a replica disappears mid-request, this branch re-queues the task using getAbsolutePath().value_or(getPath()) as the map key, but getPreQueuedFile later looks up queued tasks by send_over_whole_archive ? getPathOrPathToArchiveIfArchive() : getIdentifier(). For bucketed objects (getIdentifier() includes bucket suffix) or archive mode, the key no longer matches, so pre-queued tasks are skipped, and emplace can collapse multiple buckets with the same path into one entry.

Useful? React with 👍 / 👎.
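The key-collision concern raised in this review can be illustrated with a minimal standalone sketch. It assumes only that the task queue behaves like a `std::unordered_map` keyed by a string; the `Task` struct, its `bucket` field, and the `path#bucket` identifier format are hypothetical and are not taken from the ClickHouse source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an object-storage task; not the real ClickHouse type.
struct Task
{
    std::string path;
    int bucket;
};

// Hypothetical canonical identifier: path plus a bucket suffix.
std::string canonicalIdentifier(const Task & t)
{
    return t.path + "#" + std::to_string(t.bucket);
}

// Queues both tasks keyed by path only and returns how many entries survive.
std::size_t queueByPath(const Task & a, const Task & b)
{
    std::unordered_map<std::string, Task> queue;
    queue.emplace(a.path, a);
    queue.emplace(b.path, b); // emplace does nothing if the key already exists
    return queue.size();
}

// Queues both tasks keyed by the canonical identifier.
std::size_t queueByIdentifier(const Task & a, const Task & b)
{
    std::unordered_map<std::string, Task> queue;
    queue.emplace(canonicalIdentifier(a), a);
    queue.emplace(canonicalIdentifier(b), b);
    return queue.size();
}
```

For two tasks that share the path `data.parquet` but have buckets 0 and 1, `queueByPath` keeps one entry while `queueByIdentifier` keeps both, which is the collapse the review describes.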

ianton-ru · 2026-03-20T15:05:44Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50453ff019

chatgpt-codex-connector · 2026-03-20T15:10:31Z

src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp

+            auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
+            auto file_replica_idx = getReplicaForFile(file_identifier);
+            unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));


Requeue with canonical file identifier

When a replica is lost mid-getNextTask, this branch re-inserts the task using file->getAbsolutePath().value_or(file->getPath()), but the rest of the distributor now keys tasks via getFileIdentifier(...)/getIdentifier() (including bucket suffixes and archive normalization). In bucket-splitting mode, multiple buckets from the same path collapse to one unprocessed_files key here, and later getPreQueuedFile lookups by canonical identifier won't match this entry, so some tasks can be skipped or dropped after failover.

Useful? React with 👍 / 👎.

ianton-ru · 2026-03-20T15:21:35Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbe007e689

chatgpt-codex-connector · 2026-03-20T15:28:21Z

src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp

+        if (processed_file_list_ptr == replica_to_files_to_be_processed.end())
+        { // It is possible that replica was lost after check in the begining of the method
+            auto file_identifier = getFileIdentifier(file);
+            auto file_replica_idx = getReplicaForFile(file_identifier);
+            unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));


Stop returning task after requeueing lost replica work

When getNextTask detects that the current replica was removed, it re-enqueues the selected file into unprocessed_files/connection_to_files but still returns the same file at the end of the function. In the race where rescheduleTasksFromReplica() runs between task selection and this check, the same object becomes available to another replica while this caller can still process it, which can lead to duplicate processing/results if the caller proceeds with the returned task.

Useful? React with 👍 / 👎.

ianton-ru (inline reply):

Technically this can't happen.
getNextTask is called from processReadTaskRequest, which is called from processPacket, and rescheduleTasksFromReplica is called from processPacket as well. So rescheduleTasksFromReplica can't run for the same replica simultaneously with getNextTask.
The comment was written incorrectly; the code has been fixed.
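The later audit in this thread describes the lost-replica path as a "fail-closed throw vs UB". A minimal sketch of that idea follows, assuming a plain map from replica index to a list of files; the function name and the `ReplicaFiles` alias are hypothetical, and `std::logic_error` merely stands in for ClickHouse's LOGICAL_ERROR exception.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <stdexcept>
#include <string>
#include <unordered_map>

using ReplicaFiles = std::unordered_map<std::size_t, std::list<std::string>>;

// Records `file` as handed to `replica`. If the replica entry is gone,
// throws instead of dereferencing the end iterator (which would be UB).
void recordProcessedFile(ReplicaFiles & replica_files, std::size_t replica, const std::string & file)
{
    auto it = replica_files.find(replica);
    if (it == replica_files.end())
        throw std::logic_error("Replica " + std::to_string(replica) + " was lost");
    it->second.push_back(file);
}
```

The caller sees an exception for a missing replica rather than silently writing through an invalid iterator.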

ianton-ru · 2026-03-20T17:47:17Z

@codex review

chatgpt-codex-connector · 2026-03-20T17:52:03Z

Codex Review: Didn't find any major issues. Breezy!

mkmkme · 2026-03-23T11:46:17Z

Just for the record

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1561 (Fix use after free in rescheduleTasksFromReplica):

Confirmed defects

No confirmed defects in reviewed scope.

Coverage summary

Scope reviewed: StorageObjectStorageStableTaskDistributor (getNextTask, getPreQueuedFile, getMatchingFileFromIterator, getAnyUnprocessedFile, rescheduleTasksFromReplica), new getFileIdentifier, header declaration, and integration test rename in tests/integration/test_s3_cluster/test.py.
Categories failed: (none)
Categories passed: use-after-free / iterator invalidation after erase (addressed by std::move + iterate copy, and by not retaining find iterators across unlocked work); data race on replica_to_files_to_be_processed / getReplicaForFile (addressed by holding mutex around map reads and final push_back); consistent unprocessed_files keys via getFileIdentifier vs prior mixed getPath / getIdentifier / archive paths; lock ordering (sequential std::lock_guard, no nested same-mutex deadlock in reviewed paths); exception paths on LOGICAL_ERROR when replica dropped between phases (fail-closed throw vs UB).
Assumptions/limits: Findings are from static review of the published PR diff; local checkout is not guaranteed to match antalya-25.8 + this PR, so identifier helpers (e.g. getAbsolutePathFromObjectInfo vs getIdentifier) should be re-checked on the exact merge target if it has diverged.
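The "std::move + iterate copy" pattern named in the coverage summary can be sketched as follows. This is a simplified illustration, not the PR's code: the `Distributor` class, its members, and the use of plain strings for files are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified distributor; names echo the discussion only loosely.
class Distributor
{
public:
    void assign(std::size_t replica, std::string file)
    {
        std::lock_guard<std::mutex> lock(mutex);
        replica_to_files[replica].push_back(std::move(file));
    }

    // Moves the lost replica's files out of the map before erasing the entry,
    // so nothing refers to the erased element afterwards.
    std::size_t reschedule(std::size_t replica)
    {
        std::list<std::string> files;
        {
            std::lock_guard<std::mutex> lock(mutex);
            auto it = replica_to_files.find(replica);
            if (it == replica_to_files.end())
                return 0;
            files = std::move(it->second); // take ownership of the list
            replica_to_files.erase(it);    // `it` must not be used after this
        }

        for (auto & file : files) // iterate the local list, not the map entry
        {
            std::lock_guard<std::mutex> lock(mutex);
            unprocessed.push_back(std::move(file));
        }
        return files.size();
    }

    std::size_t unprocessedCount()
    {
        std::lock_guard<std::mutex> lock(mutex);
        return unprocessed.size();
    }

private:
    std::mutex mutex;
    std::unordered_map<std::size_t, std::list<std::string>> replica_to_files;
    std::vector<std::string> unprocessed;
};
```

Because the list is owned by a local variable once the map entry is erased, the loop never touches freed memory, and every map access happens under the mutex.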

alsugiliazova · 2026-03-26T15:12:08Z

PR #1561 CI Verification Report

CI Failures

1. `test_graceful_shutdown[1-5]` — Known Pre-existing Issue, OK to Merge

Job: Integration tests (amd_asan, flaky check)
Log: integration_run_flaky0_0.log

The CI flaky check ran test_s3_cluster/test.py (the file modified by this PR) with 5 parameterized iterations of test_graceful_shutdown:

| Param | Status | Duration |
|---|---|---|
| `[1-5]` | FAIL | 16.26s |
| `[2-5]` | OK | 48.38s |
| `[3-5]` | OK | 48.24s |
| `[4-5]` | OK | 49.06s |
| `[5-5]` | OK | 48.39s |

Error: assert errors == 0 → assert 1 == 0. The failing run completed in 16s vs ~48s for passing runs, indicating the error occurred during early query execution.

114 other tests in the flaky check passed.

Clarification: This PR fixes the use-after-free crash (the issue from #1554), which it does successfully. The remaining test_graceful_shutdown failure is a separate, known issue tracked in #1521 and is not yet fixed. The PR is confirmed OK to merge with this failure present.

2. `Start ClickHouse Server` — Infrastructure Failure

Job: Stateless tests (amd_asan, distributed plan, parallel, 1/2)

Server failed to start due to MinIO client errors:

mc: <ERROR> Unable to list folder. Requested path '/home/ubuntu/.../clickminio' not found

The shard 2/2 of the same stateless suite passed successfully.

Related to PR: No — MinIO infrastructure setup issue

3. Swarms Regression (x86 and aarch64) — Not PR-related

4. GrypeScan — CVE in Alpine Base Image

Related to PR: No — Base image vulnerability scan

Dropped Jobs

All 15 integration test shards were dropped:

Integration tests (amd_asan, old analyzer, 1-6/6)
Integration tests (amd_binary, 1-5/5)
Integration tests (arm_binary, distributed plan, 1-4/4)

Additionally dropped: sequential stateless suites, stress tests, AST fuzzers, BuzzHouse, install packages, compatibility checks.

test_graceful_shutdown was only validated via the flaky check, not in the full integration test shards.

Conclusion

Verdict: Ready to merge.

ianton-ru added 2 commits March 20, 2026 13:05

Fix unsynchronized access to replica_to_files_to_be_processed

78304b3

Fix use after free in rescheduleTasksFromReplica

c71caec

ianton-ru added antalya bugfix antalya-25.8 labels Mar 20, 2026

chatgpt-codex-connector bot reviewed Mar 20, 2026

Fix file identifier in rescheduleTasksFromReplica

bbe007e

ianton-ru force-pushed the bugfix/antalya-25.8/1554_fix_crash branch from 50453ff to bbe007e Compare March 20, 2026 15:21

chatgpt-codex-connector bot reviewed Mar 20, 2026

Fix logical error catch

c6dda09

mkmkme approved these changes Mar 23, 2026

ianton-ru changed the title ~~Fix use after free in rescheduleTasksFromReplica~~ Antalya 25.8: Fix use after free in rescheduleTasksFromReplica Mar 24, 2026

alsugiliazova mentioned this pull request Mar 26, 2026

Antalya 26.1 Backport of #97062 - Improve catalog show tables query #1552

Merged

25 tasks

alsugiliazova added the verified Verified by QA label Mar 26, 2026


4 participants
