
fix(graphql): prevent memory leak and deadlock in subscription resolvers#5397

Open
Sanchit2662 wants to merge 3 commits into litmuschaos:master from Sanchit2662:fix/subscription-memory-leak-deadlock

Conversation


@Sanchit2662 Sanchit2662 commented Jan 14, 2026

Summary

This PR fixes a critical concurrency issue in the ChaosCenter GraphQL subscription layer that could lead to unbounded memory growth and a process-wide deadlock under normal UI usage.

Specifically, GetInfraEvents subscriptions were leaking channels after client disconnects, and SendInfraEvent could block indefinitely while holding a shared mutex. Over time, this caused the GraphQL server to become unresponsive with no crash logs or clear error signals.

The fix ensures proper subscription cleanup, prevents blocking sends, and hardens related cleanup paths against concurrent map access.


Fix

1. Proper subscription cleanup on disconnect

Channels are now removed from the publisher slice when the subscription context is cancelled:

go func() {
    <-ctx.Done()
    data_store.Store.Mutex.Lock()
    channels := data_store.Store.InfraEventPublish[projectID]
    for i, ch := range channels {
        if ch == infraEvent {
            data_store.Store.InfraEventPublish[projectID] =
                append(channels[:i], channels[i+1:]...)
            break
        }
    }
    data_store.Store.Mutex.Unlock()
}()
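The removal logic above can be exercised in isolation. Below is a minimal, self-contained sketch of the same pattern; the helper name removeSubscriber and the chan int element type are illustrative, not taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// removeSubscriber returns subs with the first occurrence of target removed,
// mirroring the slice surgery done in the ctx.Done() cleanup goroutine.
func removeSubscriber(subs []chan int, target chan int) []chan int {
	for i, ch := range subs {
		if ch == target {
			return append(subs[:i], subs[i+1:]...)
		}
	}
	return subs
}

func main() {
	var mu sync.Mutex
	a, b := make(chan int), make(chan int)
	subs := []chan int{a, b}

	// Simulate the disconnect path: under the lock, drop the departing channel.
	mu.Lock()
	subs = removeSubscriber(subs, a)
	mu.Unlock()

	fmt.Println(len(subs)) // prints 1: only b remains
}
```

Note that append(channels[:i], channels[i+1:]...) mutates the slice's backing array, which is safe here only because every reader and writer of the publisher slice goes through the same shared mutex.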

2. Non-blocking event delivery to prevent deadlocks

Event publishing no longer blocks on slow or disconnected subscribers:

for _, observer := range r.InfraEventPublish[infra.ProjectID] {
    select {
    case observer <- &newEvent:
    default:
        // skip slow/dead subscriber
    }
}

This ensures one stalled subscription cannot block the entire system.
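The select/default construct is Go's standard non-blocking send: the send fires only if a receiver or buffer slot is ready, otherwise the default branch runs immediately. A standalone sketch of the pattern (trySend is an illustrative helper, not part of the PR):

```go
package main

import "fmt"

// trySend attempts a non-blocking send, returning false when the channel
// is full or has no ready receiver, instead of blocking the caller.
func trySend(ch chan string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan string, 1)
	fmt.Println(trySend(ch, "event-1")) // prints true: buffer has room
	fmt.Println(trySend(ch, "event-2")) // prints false: buffer full, sender does not block
}
```

If the subscriber channels are buffered, occasional slowness is absorbed by the buffer; events are only dropped for a subscriber whose buffer stays persistently full.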


3. Thread-safe cleanup in related subscriptions

Cleanup paths in GetPodLog, GetKubeObject, and GetKubeNamespace now properly guard map deletes with the shared mutex, preventing concurrent map access panics.
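The hazard here is Go's built-in map, which is not safe for concurrent mutation: an unguarded delete racing with another write kills the process with "fatal error: concurrent map writes". A minimal sketch of the guarded-delete pattern, using illustrative type and field names rather than the actual ChaosCenter data_store types:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a stand-in for a shared subscription registry: a map guarded
// by a mutex. Field names here are hypothetical.
type store struct {
	mu   sync.Mutex
	logs map[string]chan string
}

// cleanup deletes a subscription entry under the lock; without the lock,
// concurrent cleanups and registrations could corrupt the map.
func (s *store) cleanup(reqID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.logs, reqID)
}

func main() {
	s := &store{logs: map[string]chan string{"req-1": make(chan string)}}

	// Many goroutines may race to clean up; the mutex serializes them.
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.cleanup("req-1")
		}()
	}
	wg.Wait()
	fmt.Println(len(s.logs)) // prints 0
}
```

Deleting an already-absent key is a no-op in Go, so concurrent cleanups of the same request ID are harmless once serialized by the mutex.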


Impact

  • Memory leak eliminated: subscription channels are no longer leaked.
  • Deadlock prevented: event publishing cannot block while holding the mutex.
  • Improved resilience: slow or disconnected clients degrade gracefully.
  • Stability improved: prevents rare but severe production outages in ChaosCenter.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)

Checklist

- Add proper cleanup in GetInfraEvents to remove channels on disconnect
- Use non-blocking sends in SendInfraEvent to prevent mutex deadlock
- Add mutex protection to map deletes in GetPodLog, GetKubeObject, GetKubeNamespace

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>

Sanchit2662 commented Jan 15, 2026

Hi @PriteshKiri, @amityt, @SarthakJain26
I’ve updated the PR to address the issue and adjusted the implementation accordingly. This helps avoid a potential memory leak and deadlock in the GraphQL subscription flow by improving how subscriptions are cleaned up and how events are delivered.

Whenever you get a chance, I’d really appreciate a review. Thanks!


Copilot AI left a comment


Pull request overview

This PR addresses concurrency problems in the ChaosCenter GraphQL subscription layer, focusing on preventing blocked publishers and cleaning up subscription listeners to avoid leaked channels and map access hazards.

Changes:

  • Made SendInfraEvent publish using a non-blocking channel send to avoid indefinitely blocking while holding the shared mutex.
  • Added GetInfraEvents subscription cleanup to remove the subscriber channel on ctx.Done().
  • Wrapped several subscription cleanup delete(...) operations (ExperimentLog, KubeObjectData, KubeNamespaceData) with the shared mutex.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Files reviewed:

  • chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go: Switches infra event fan-out to non-blocking sends to prevent deadlocks.
  • chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go: Adds disconnect cleanup for infra event subscriptions and mutex-protects cleanup deletes for several subscription maps.


Comment on lines 1043 to 1052 (chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go):

r.Mutex.Lock()
if r.InfraEventPublish != nil {
    for _, observer := range r.InfraEventPublish[infra.ProjectID] {
        // Use non-blocking send to prevent deadlock if channel buffer is full
        select {
        case observer <- &newEvent:
        default:
            // Channel full or no receiver, skip to prevent blocking
        }
    }
}
Comment on lines 347 to +350:

logrus.Print("CLOSED LOG LISTENER: ", request.InfraID, request.PodName)
data_store.Store.Mutex.Lock()
delete(data_store.Store.ExperimentLog, reqID.String())
data_store.Store.Mutex.Unlock()

data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeObjectData, reqID.String())
data_store.Store.Mutex.Unlock()
}()
go r.chaosExperimentHandler.GetKubeObjData(reqID.String(), request, *data_store.Store)
Comment on lines 385 to +389
<-ctx.Done()
logrus.Println("Closed KubeNamespace Listener")
data_store.Store.Mutex.Lock()
delete(data_store.Store.KubeNamespaceData, reqID.String())
data_store.Store.Mutex.Unlock()
@SarthakJain26

@Sanchit2662 please check the comments from copilot

