Antalya 25.8: Fix use after free in rescheduleTasksFromReplica#1561

Open
ianton-ru wants to merge 4 commits into antalya-25.8 from
bugfix/antalya-25.8/1554_fix_crash

@ianton-ru ianton-ru commented Mar 20, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solves #1554.
Fixed incorrect use of an iterator that pointed to an already removed element.
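
For readers unfamiliar with this class of bug, the following is a minimal sketch of the pattern, not the actual ClickHouse code. The names `ReplicaFiles` and `rescheduleFrom` and the container types are assumptions made for illustration. Erasing a map entry invalidates iterators and references to it, so the sketch moves the value out first and iterates over the local copy.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-replica task lists; not the real ClickHouse types.
using ReplicaFiles = std::unordered_map<std::size_t, std::list<std::string>>;

// Moves every file owned by `replica` back into `unprocessed` and drops the replica.
// Returns the number of files that were requeued.
std::size_t rescheduleFrom(ReplicaFiles & replicas, std::size_t replica, std::vector<std::string> & unprocessed)
{
    auto it = replicas.find(replica);
    if (it == replicas.end())
        return 0;

    // Take ownership of the list BEFORE erasing the entry.
    // After erase(it), `it` and any reference into it->second would dangle.
    std::list<std::string> files = std::move(it->second);
    replicas.erase(it);

    for (auto & file : files)
        unprocessed.push_back(std::move(file));

    // The list still holds one (moved-from) node per file, so size() is the requeued count.
    return files.size();
}
```

Iterating `it->second` after `replicas.erase(it)` would read freed memory; with the move-out step the loop only touches the local `files` list.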

Backport of #1493
Fix file identifier for distributed tasks
(logic was backported in #1237)

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@ianton-ru
Author

@codex review

github-actions bot commented Mar 20, 2026

Workflow [PR], commit [c6dda09]

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c71caec37e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +74 to +76
auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
auto file_replica_idx = getReplicaForFile(file_identifier);
unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));

P2: Preserve identifier semantics when re-queueing lost-replica task

When a replica disappears mid-request, this branch re-queues the task using getAbsolutePath().value_or(getPath()) as the map key, but getPreQueuedFile later looks up queued tasks by send_over_whole_archive ? getPathOrPathToArchiveIfArchive() : getIdentifier(). For bucketed objects (getIdentifier() includes bucket suffix) or archive mode, the key no longer matches, so pre-queued tasks are skipped, and emplace can collapse multiple buckets with the same path into one entry.
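
The key collision described above can be sketched as follows. This is an illustrative model only: `FileTask`, `pathKey`, `identifier`, `countQueued`, and the `#` bucket suffix are all hypothetical and do not mirror the real object-info interface. It relies on the standard behavior that `emplace` on an `std::unordered_map` does not insert when the key already exists.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified object descriptor; the real object info has a richer interface.
struct FileTask
{
    std::string path;
    std::optional<std::size_t> bucket; // set when one object is split into several buckets

    // Path-only key: identical for every bucket of the same object.
    std::string pathKey() const { return path; }

    // Canonical key: path plus a bucket suffix (the suffix format here is made up).
    std::string identifier() const
    {
        return bucket ? path + "#" + std::to_string(*bucket) : path;
    }
};

// Queues all tasks keyed either by path or by canonical identifier
// and returns how many distinct entries survive.
std::size_t countQueued(const std::vector<FileTask> & tasks, bool use_identifier)
{
    std::unordered_map<std::string, FileTask> unprocessed;
    for (const auto & task : tasks)
        // emplace() does nothing when the key already exists, so colliding keys silently drop tasks.
        unprocessed.emplace(use_identifier ? task.identifier() : task.pathKey(), task);
    return unprocessed.size();
}
```

With two buckets of the same object, the path-only key keeps one entry while the canonical key keeps both.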

@ianton-ru
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50453ff019

Comment on lines +74 to +76
auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
auto file_replica_idx = getReplicaForFile(file_identifier);
unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));

P1: Requeue with canonical file identifier

When a replica is lost mid-getNextTask, this branch re-inserts the task using file->getAbsolutePath().value_or(file->getPath()), but the rest of the distributor now keys tasks via getFileIdentifier(...)/getIdentifier() (including bucket suffixes and archive normalization). In bucket-splitting mode, multiple buckets from the same path collapse to one unprocessed_files key here, and later getPreQueuedFile lookups by canonical identifier won't match this entry, so some tasks can be skipped or dropped after failover.
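
The lookup miss can be illustrated with a small sketch. `Queue`, `canonicalId`, and `takeQueued` are hypothetical names, and the `#` suffix format is an assumption; the point is only that an entry stored under the raw path is invisible to a lookup that uses the canonical identifier.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical queue mapping a key to the replica index chosen for that file.
using Queue = std::unordered_map<std::string, std::size_t>;

// Stand-in for a canonical-identifier helper: path plus bucket suffix (format made up).
std::string canonicalId(const std::string & path, std::size_t bucket)
{
    return path + "#" + std::to_string(bucket);
}

// Looks up and removes a queued task by canonical identifier, as a pre-queued lookup might.
std::optional<std::size_t> takeQueued(Queue & queue, const std::string & path, std::size_t bucket)
{
    auto it = queue.find(canonicalId(path, bucket));
    if (it == queue.end())
        return std::nullopt;
    std::size_t replica = it->second; // copy the value out before erasing
    queue.erase(it);
    return replica;
}
```

If the failover path inserts `"data.parquet"` while the lookup asks for `"data.parquet#0"`, `takeQueued` returns `std::nullopt` and the task stays stranded in the queue.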

@ianton-ru force-pushed the bugfix/antalya-25.8/1554_fix_crash branch from 50453ff to bbe007e on March 20, 2026 15:21
@ianton-ru
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbe007e689

Comment on lines +72 to +76
if (processed_file_list_ptr == replica_to_files_to_be_processed.end())
{ // It is possible that the replica was lost after the check at the beginning of the method
    auto file_identifier = getFileIdentifier(file);
    auto file_replica_idx = getReplicaForFile(file_identifier);
    unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));

P2: Stop returning task after requeueing lost replica work

When getNextTask detects that the current replica was removed, it re-enqueues the selected file into unprocessed_files/connection_to_files but still returns the same file at the end of the function. In the race where rescheduleTasksFromReplica() runs between task selection and this check, the same object becomes available to another replica while this caller can still process it, which can lead to duplicate processing/results if the caller proceeds with the returned task.

Author

Technically this can't happen.
getNextTask is called from processReadTaskRequest, which is called from processPacket, and so is rescheduleTasksFromReplica. So rescheduleTasksFromReplica can't run for the same replica simultaneously with getNextTask.
The comment was written incorrectly; the code is now fixed.

@ianton-ru
Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!

@mkmkme
Collaborator

mkmkme commented Mar 23, 2026

Just for the record


AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1561 (Fix use after free in rescheduleTasksFromReplica):

Confirmed defects

No confirmed defects in reviewed scope.

Coverage summary

  • Scope reviewed: StorageObjectStorageStableTaskDistributor (getNextTask, getPreQueuedFile, getMatchingFileFromIterator, getAnyUnprocessedFile, rescheduleTasksFromReplica), new getFileIdentifier, header declaration, and integration test rename in tests/integration/test_s3_cluster/test.py.
  • Categories failed: (none)
  • Categories passed: use-after-free / iterator invalidation after erase (addressed by std::move + iterate copy, and by not retaining find iterators across unlocked work); data race on replica_to_files_to_be_processed / getReplicaForFile (addressed by holding mutex around map reads and final push_back); consistent unprocessed_files keys via getFileIdentifier vs prior mixed getPath / getIdentifier / archive paths; lock ordering (sequential std::lock_guard, no nested same-mutex deadlock in reviewed paths); exception paths on LOGICAL_ERROR when replica dropped between phases (fail-closed throw vs UB).
  • Assumptions/limits: Findings are from static review of the published PR diff; local checkout is not guaranteed to match antalya-25.8 + this PR, so identifier helpers (e.g. getAbsolutePathFromObjectInfo vs getIdentifier) should be re-checked on the exact merge target if it has diverged.
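
Two of the passed categories above (holding the mutex around map reads and the final push_back, and failing closed when the replica was dropped between phases) can be sketched roughly as below. The `Distributor` class, its members, and the use of `std::logic_error` in place of ClickHouse's LOGICAL_ERROR exception are assumptions for illustration, not the code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical distributor; names and layout are illustrative only.
class Distributor
{
public:
    void addReplica(std::size_t replica)
    {
        std::lock_guard<std::mutex> lock(mutex);
        processed[replica]; // creates an empty task list if absent
    }

    void dropReplica(std::size_t replica)
    {
        std::lock_guard<std::mutex> lock(mutex);
        processed.erase(replica);
    }

    // Records that `replica` took `file`. The map is searched again under the lock
    // instead of reusing an iterator obtained earlier, because the entry may be gone by now.
    void recordTask(std::size_t replica, const std::string & file)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = processed.find(replica);
        if (it == processed.end())
            throw std::logic_error("replica was removed"); // fail closed instead of touching freed memory
        it->second.push_back(file);
    }

    std::size_t taskCount(std::size_t replica)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = processed.find(replica);
        return it == processed.end() ? 0 : it->second.size();
    }

private:
    std::mutex mutex;
    std::unordered_map<std::size_t, std::vector<std::string>> processed;
};
```

Each method takes the lock once and never calls another locking method, so there is no nested acquisition of the same mutex in this sketch.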

@ianton-ru changed the title from "Fix use after free in rescheduleTasksFromReplica" to "Antalya 25.8: Fix use after free in rescheduleTasksFromReplica" on Mar 24, 2026
@alsugiliazova
Member

PR #1561 CI Verification Report

CI Failures

1. test_graceful_shutdown[1-5] — Known Pre-existing Issue, OK to Merge

Job: Integration tests (amd_asan, flaky check)
Log: integration_run_flaky0_0.log

The CI flaky check ran test_s3_cluster/test.py (the file modified by this PR) with 5 parameterized iterations of test_graceful_shutdown:

Param   Status   Duration
[1-5]   FAIL     16.26s
[2-5]   OK       48.38s
[3-5]   OK       48.24s
[4-5]   OK       49.06s
[5-5]   OK       48.39s

Error: assert errors == 0, i.e. assert 1 == 0. The failing run completed in 16s vs ~48s for passing runs, which suggests the error occurred during early query execution.

114 other tests in the flaky check passed.

Clarification: This PR fixes the use-after-free crash (the issue from #1554), which it does successfully. The remaining test_graceful_shutdown failure is a separate, known issue tracked in #1521 and is not yet fixed. The PR is confirmed OK to merge with this failure present.

2. Start ClickHouse Server — Infrastructure Failure

Job: Stateless tests (amd_asan, distributed plan, parallel, 1/2)

Server failed to start due to MinIO client errors:

mc: <ERROR> Unable to list folder. Requested path '/home/ubuntu/.../clickminio' not found

The shard 2/2 of the same stateless suite passed successfully.

Related to PR: No — MinIO infrastructure setup issue

3. Swarms Regression (x86 and aarch64) — Not PR-related

4. GrypeScan — CVE in Alpine Base Image

Related to PR: No — Base image vulnerability scan

Dropped Jobs

All 15 integration test shards were dropped:

  • Integration tests (amd_asan, old analyzer, 1-6/6)
  • Integration tests (amd_binary, 1-5/5)
  • Integration tests (arm_binary, distributed plan, 1-4/4)

Additionally dropped: sequential stateless suites, stress tests, AST fuzzers, BuzzHouse, install packages, compatibility checks.

test_graceful_shutdown was only validated via the flaky check, not in the full integration test shards.

Conclusion

Verdict: Ready to merge.

@alsugiliazova added the "verified" (Verified by QA) label on Mar 26, 2026