Antalya 25.8: Fix use after free in rescheduleTasksFromReplica #1561
ianton-ru wants to merge 4 commits into antalya-25.8
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c71caec37e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
auto file_replica_idx = getReplicaForFile(file_identifier);
unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));
Preserve identifier semantics when re-queueing lost-replica task
When a replica disappears mid-request, this branch re-queues the task using getAbsolutePath().value_or(getPath()) as the map key, but getPreQueuedFile later looks up queued tasks by send_over_whole_archive ? getPathOrPathToArchiveIfArchive() : getIdentifier(). For bucketed objects (getIdentifier() includes bucket suffix) or archive mode, the key no longer matches, so pre-queued tasks are skipped, and emplace can collapse multiple buckets with the same path into one entry.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50453ff019
auto file_identifier = file->getAbsolutePath().value_or(file->getPath());
auto file_replica_idx = getReplicaForFile(file_identifier);
unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));
Requeue with canonical file identifier
When a replica is lost mid-getNextTask, this branch re-inserts the task using file->getAbsolutePath().value_or(file->getPath()), but the rest of the distributor now keys tasks via getFileIdentifier(...)/getIdentifier() (including bucket suffixes and archive normalization). In bucket-splitting mode, multiple buckets from the same path collapse to one unprocessed_files key here, and later getPreQueuedFile lookups by canonical identifier won't match this entry, so some tasks can be skipped or dropped after failover.
Force-pushed from 50453ff to bbe007e
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bbe007e689
if (processed_file_list_ptr == replica_to_files_to_be_processed.end())
{ // It is possible that the replica was lost after the check at the beginning of the method
    auto file_identifier = getFileIdentifier(file);
    auto file_replica_idx = getReplicaForFile(file_identifier);
    unprocessed_files.emplace(file_identifier, std::make_pair(file, file_replica_idx));
Stop returning task after requeueing lost replica work
When getNextTask detects that the current replica was removed, it re-enqueues the selected file into unprocessed_files/connection_to_files but still returns the same file at the end of the function. In the race where rescheduleTasksFromReplica() runs between task selection and this check, the same object becomes available to another replica while this caller can still process it, which can lead to duplicate processing/results if the caller proceeds with the returned task.
Technically it can't be.
getNextTask is called from processReadTaskRequest, which is called from processPacket, as is rescheduleTasksFromReplica. So rescheduleTasksFromReplica can't be called for the same replica simultaneously with getNextTask.
The comment was worded incorrectly; the code has been fixed.
@codex review

Codex Review: Didn't find any major issues. Breezy!
Just for the record, an AI audit note: this review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1561 (Fix use after free in rescheduleTasksFromReplica):

Confirmed defects: no confirmed defects in reviewed scope.

Coverage summary
PR #1561 CI Verification Report

CI Failures

1. test_graceful_shutdown — Flaky Check Failure
| Param | Status | Duration |
|---|---|---|
| [1-5] | FAIL | 16.26s |
| [2-5] | OK | 48.38s |
| [3-5] | OK | 48.24s |
| [4-5] | OK | 49.06s |
| [5-5] | OK | 48.39s |
Error: assert errors == 0 → assert 1 == 0. The failing run completed in 16s vs ~48s for passing runs, indicating the error occurred during early query execution.
114 other tests in the flaky check passed.
Clarification: This PR fixes the use-after-free crash (the issue from #1554), which it does successfully. The remaining test_graceful_shutdown failure is a separate, known issue tracked in #1521 and is not yet fixed. The PR is confirmed OK to merge with this failure present.
2. Start ClickHouse Server — Infrastructure Failure
Job: Stateless tests (amd_asan, distributed plan, parallel, 1/2)
Server failed to start due to MinIO client errors:
mc: <ERROR> Unable to list folder. Requested path '/home/ubuntu/.../clickminio' not found
The shard 2/2 of the same stateless suite passed successfully.
Related to PR: No — MinIO infrastructure setup issue
3. Swarms Regression (x86 and aarch64) — Not PR-related
4. GrypeScan — CVE in Alpine Base Image
Related to PR: No — Base image vulnerability scan
Dropped Jobs
All 15 integration test shards were dropped:
- Integration tests (amd_asan, old analyzer, 1-6/6)
- Integration tests (amd_binary, 1-5/5)
- Integration tests (arm_binary, distributed plan, 1-4/4)
Additionally dropped: sequential stateless suites, stress tests, AST fuzzers, BuzzHouse, install packages, compatibility checks.
test_graceful_shutdown was only validated via the flaky check, not in the full integration test shards.
Conclusion
Verdict: Ready to merge.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Solves #1554.
Fixed incorrect use of an iterator pointing to a removed element.
Backport of #1493, "Fix file identifier for distributed tasks"
(the logic was backported in #1237).
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: