fix(event-tracker): prevent duplicate chaos experiment triggers under concurrent reconciles #5409

Open
WHOIM1205 wants to merge 3 commits into litmuschaos:master from WHOIM1205:fix/eventtracker-race-condition

@WHOIM1205
Contributor

Fix race condition causing duplicate chaos experiment triggers in EventTrackerPolicy controller

Summary

This PR fixes a critical race condition in the EventTrackerPolicy controller that could cause the same chaos experiment to be triggered multiple times under concurrent reconciles.

The issue was caused by a combination of:

  • A no-op local mutex created per reconcile
  • Side effects (SendRequest) executed inside a retry loop
  • Concurrent reconciles observing stale IsTriggered=false state

The fix replaces the broken locking logic with a Kubernetes-idiomatic optimistic concurrency approach that guarantees exactly-once experiment triggering.


What was broken

Root issues

  • sync.Mutex was instantiated inside Reconcile(), so each reconcile had its own lock
  • SendRequest() (experiment trigger) was executed before the CR status update was safely committed
  • On Update() conflict, the reconcile retried and re-triggered the experiment
  • Multiple reconciles could race on the same EventTrackerPolicy, all seeing IsTriggered=false
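
The first root issue can be illustrated with a minimal, standard-library-only sketch. It is not the controller's code: `policy`, `brokenReconcile`, and `runBrokenReconciles` are hypothetical names, an atomic counter stands in for `SendRequest()`, and a `sync.WaitGroup` barrier is used only to force the interleaving in which both reconciles read the flag before either writes it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// policy is a hypothetical stand-in for one EventTrackerPolicy status entry.
type policy struct {
	isTriggered atomic.Bool
}

// brokenReconcile mimics the buggy pattern: the mutex is a local variable,
// so every call locks its own private lock and excludes no other reconcile.
func brokenReconcile(p *policy, bothRead *sync.WaitGroup, triggers *atomic.Int32) {
	var mu sync.Mutex // a new lock per call: a no-op across reconciles
	mu.Lock()
	defer mu.Unlock()

	alreadyTriggered := p.isTriggered.Load()

	// Barrier: wait until both reconciles have read the flag. With a shared
	// mutex the second reconcile could never get here and this would deadlock;
	// with a per-call mutex both pass straight through.
	bothRead.Done()
	bothRead.Wait()

	if !alreadyTriggered {
		triggers.Add(1) // stands in for SendRequest()
		p.isTriggered.Store(true)
	}
}

// runBrokenReconciles runs two concurrent reconciles against one policy and
// returns how many times the experiment was "triggered".
func runBrokenReconciles() int32 {
	p := &policy{}
	var bothRead, done sync.WaitGroup
	var triggers atomic.Int32

	bothRead.Add(2)
	done.Add(2)
	for i := 0; i < 2; i++ {
		go func() {
			defer done.Done()
			brokenReconcile(p, &bothRead, &triggers)
		}()
	}
	done.Wait()
	return triggers.Load()
}

func main() {
	fmt.Println("triggers:", runBrokenReconciles()) // prints "triggers: 2"
}
```

Both goroutines hold "the lock" at the same time, both observe the stale `false`, and both trigger. Hoisting the mutex to a shared field would serialize reconciles within one process, but it would still not help across controller replicas.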

Impact

  • Duplicate chaos experiments running simultaneously
  • Multiple chaos-runner pods competing for the same targets
  • Unpredictable chaos results
  • Resource exhaustion in production clusters
  • Silent CI / GitOps pipeline corruption

The fix

This PR introduces a two-phase, conflict-safe execution model.

Phase 1 — Atomically claim trigger intent

  • Uses retry.RetryOnConflict
  • Re-reads the latest EventTrackerPolicy
  • Marks IsTriggered = "true" before triggering
  • Commits the update atomically

Phase 2 — Execute side effects

  • Triggers experiments after the update succeeds
  • Executes SendRequest() outside the retry loop
  • Guarantees each experiment is triggered exactly once

No mutexes. No shared memory. Fully Kubernetes-native.
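
The two phases can be sketched as follows. This is a simulation under stated assumptions, not the PR's diff: `store` is a hypothetical in-memory stand-in for the API server's resourceVersion check (its internal mutex models the server, not the controller), `retryOnConflict` is a simplified stand-in for client-go's `retry.RetryOnConflict`, and the `sendRequest` callback stands in for `SendRequest()`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict: resourceVersion is stale")

// object is a hypothetical, simplified EventTrackerPolicy status entry.
type object struct {
	resourceVersion int
	isTriggered     string
}

// store imitates the API server: writes carrying a stale version are rejected.
type store struct {
	mu  sync.Mutex // models server-side atomicity, not controller locking
	obj object
}

func (s *store) get() object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

func (s *store) update(o object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if o.resourceVersion != s.obj.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.obj = o
	return nil
}

// retryOnConflict re-runs fn only while it reports a conflict.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// reconcile claims the trigger (Phase 1), then runs the side effect (Phase 2).
func reconcile(s *store, sendRequest func()) error {
	claimed := false
	err := retryOnConflict(5, func() error {
		claimed = false
		latest := s.get() // re-read the latest state on every attempt
		if latest.isTriggered == "true" {
			return nil // another reconcile already claimed it
		}
		latest.isTriggered = "true"
		if err := s.update(latest); err != nil {
			return err
		}
		claimed = true
		return nil
	})
	if err != nil {
		return err
	}
	if claimed {
		sendRequest() // outside the retry loop, only for the winner
	}
	return nil
}

// runConcurrentReconciles races n reconciles and counts the triggers.
func runConcurrentReconciles(n int) int {
	s := &store{obj: object{isTriggered: "false"}}
	var mu sync.Mutex
	var wg sync.WaitGroup
	triggers := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = reconcile(s, func() { mu.Lock(); triggers++; mu.Unlock() })
		}()
	}
	wg.Wait()
	return triggers
}

func main() {
	fmt.Println("triggers:", runConcurrentReconciles(10)) // prints "triggers: 1"
}
```

Only the reconcile whose update is accepted goes on to trigger; every other reconcile either sees `"true"` on its first read or hits a conflict, re-reads, and backs off. One consequence of claiming first in this sketch: if the process stopped after the claim but before `sendRequest`, the entry would stay marked as triggered with no request sent.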


Why this is safe

  • Works correctly with:
    • Multiple controller replicas
    • Leader election failover
    • Controller restarts
  • Uses Kubernetes optimistic locking instead of in-process synchronization
  • Avoids side effects inside retry loops
  • Preserves existing behavior while eliminating duplicates
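
Why side effects inside a retry loop are a problem can be shown with a small, deterministic sketch. All names here are hypothetical: `flakyUpdate` imitates an `Update()` that conflicts a fixed number of times, `retry` is a simplified conflict-retry helper, and a counter stands in for `SendRequest()`.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict")

// flakyUpdate returns a function that reports a conflict on its first
// `failures` calls and succeeds afterwards.
func flakyUpdate(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errConflict
		}
		return nil
	}
}

// retry re-runs fn while it reports a conflict, up to `attempts` times.
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// triggerInsideRetry sends the request on every attempt, so each conflict
// produces one extra trigger.
func triggerInsideRetry(update func() error) int {
	sent := 0
	_ = retry(5, func() error {
		sent++ // stands in for SendRequest()
		return update()
	})
	return sent
}

// triggerAfterRetry sends the request once, and only after the update succeeded.
func triggerAfterRetry(update func() error) int {
	sent := 0
	if err := retry(5, update); err == nil {
		sent++ // stands in for SendRequest()
	}
	return sent
}

func main() {
	fmt.Println("inside retry:", triggerInsideRetry(flakyUpdate(2))) // prints "inside retry: 3"
	fmt.Println("after retry:", triggerAfterRetry(flakyUpdate(2)))   // prints "after retry: 1"
}
```

With two simulated conflicts, the first variant sends three requests while the second sends one, regardless of how many times the update had to be retried.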

How to reproduce (before this fix)

  1. Deploy the event-tracker controller
  2. Create an EventTrackerPolicy with Result=ConditionPassed and IsTriggered=false:

     apiVersion: eventtracker.litmuschaos.io/v1
     kind: EventTrackerPolicy
     metadata:
       name: test-policy
       namespace: litmus
     status:
       - resourceName: trigger-config
         experimentID: test-experiment-123
         result: ConditionPassed
         isTriggered: "false"

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205
Contributor Author

hey @ispeakc0de

This fixes a race in the EventTrackerPolicy reconciler that could trigger the same chaos experiment multiple times under concurrent reconciles. The fix uses optimistic concurrency and moves side effects outside the retry loop.

@Saranya-jena
Contributor

Saranya-jena commented Mar 17, 2026

@WHOIM1205 could you fix the pipeline failures and see if the changes are still valid in the latest version?
