fix(event-tracker): prevent duplicate chaos experiment triggers under concurrent reconciles#5409
Open
WHOIM1205 wants to merge 3 commits intolitmuschaos:masterfrom
Open
fix(event-tracker): prevent duplicate chaos experiment triggers under concurrent reconciles#5409WHOIM1205 wants to merge 3 commits intolitmuschaos:masterfrom
WHOIM1205 wants to merge 3 commits intolitmuschaos:masterfrom
Conversation
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Contributor
Author
|
hey @ispeakc0de This fixes a race in the EventTrackerPolicy reconciler that could trigger the same chaos experiment multiple times under concurrent reconciles by using optimistic concurrency and moving side effects outside the retry loop. |
Contributor
|
@WHOIM1205 could you fix the pipeline failures and see if the changes are still valid in the latest version? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix race condition causing duplicate chaos experiment triggers in EventTrackerPolicy controller
Summary
This PR fixes a critical race condition in the
EventTrackerPolicycontroller that could cause the same chaos experiment to be triggered multiple times under concurrent reconciles.The issue was caused by a combination of:
SendRequest) executed inside a retry loopIsTriggered=falsestateThe fix replaces the broken locking logic with a Kubernetes-idiomatic optimistic concurrency approach that guarantees exactly-once experiment triggering.
What was broken
Root issues
sync.Mutexwas instantiated insideReconcile(), so each reconcile had its own lockSendRequest()(experiment trigger) was executed before the CR status update was safely committedUpdate()conflict, the reconcile retried and re-triggered the experimentEventTrackerPolicy, all seeingIsTriggered=falseImpact
The fix
This PR introduces a two-phase, conflict-safe execution model.
Phase 1 — Atomically claim trigger intent
retry.RetryOnConflictEventTrackerPolicyIsTriggered = "true"before triggeringPhase 2 — Execute side effects
SendRequest()outside the retry loopNo mutexes. No shared memory. Fully Kubernetes-native.
Why this is safe
How to reproduce (before this fix)
EventTrackerPolicywithResult=ConditionPassedandIsTriggered=false