fix(graphql): prevent memory leak and deadlock in subscription resolvers#5397
fix(graphql): prevent memory leak and deadlock in subscription resolvers#5397Sanchit2662 wants to merge 3 commits intolitmuschaos:masterfrom
Conversation
- Add proper cleanup in GetInfraEvents to remove channels on disconnect - Use non-blocking sends in SendInfraEvent to prevent mutex deadlock - Add mutex protection to map deletes in GetPodLog, GetKubeObject, GetKubeNamespace Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
|
Hi @PriteshKiri, @amityt , @SarthakJain26 Whenever you get a chance, I’d really appreciate a review. Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR addresses concurrency problems in the ChaosCenter GraphQL subscription layer, focusing on preventing blocked publishers and cleaning up subscription listeners to avoid leaked channels and map access hazards.
Changes:
- Made
SendInfraEventpublish using a non-blocking channel send to avoid indefinitely blocking while holding the shared mutex. - Added
GetInfraEventssubscription cleanup to remove the subscriber channel onctx.Done(). - Wrapped several subscription cleanup
delete(...)operations (ExperimentLog,KubeObjectData,KubeNamespaceData) with the shared mutex.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
chaoscenter/graphql/server/pkg/chaos_infrastructure/service.go |
Switches infra event fan-out to non-blocking sends to prevent deadlocks. |
chaoscenter/graphql/server/graph/chaos_infrastructure.resolvers.go |
Adds disconnect cleanup for infra event subscriptions and mutex-protects cleanup deletes for several subscription maps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| r.Mutex.Lock() | ||
| if r.InfraEventPublish != nil { | ||
| for _, observer := range r.InfraEventPublish[infra.ProjectID] { | ||
| observer <- &newEvent | ||
| // Use non-blocking send to prevent deadlock if channel buffer is full | ||
| select { | ||
| case observer <- &newEvent: | ||
| default: | ||
| // Channel full or no receiver, skip to prevent blocking | ||
| } | ||
| } |
| data_store.Store.InfraEventPublish[projectID] = append(channels[:i], channels[i+1:]...) | ||
| break | ||
| } | ||
| } |
| logrus.Print("CLOSED LOG LISTENER: ", request.InfraID, request.PodName) | ||
| data_store.Store.Mutex.Lock() | ||
| delete(data_store.Store.ExperimentLog, reqID.String()) | ||
| data_store.Store.Mutex.Unlock() |
| delete(data_store.Store.KubeObjectData, reqID.String()) | ||
| data_store.Store.Mutex.Unlock() | ||
| }() | ||
| go r.chaosExperimentHandler.GetKubeObjData(reqID.String(), request, *data_store.Store) |
| <-ctx.Done() | ||
| logrus.Println("Closed KubeNamespace Listener") | ||
| data_store.Store.Mutex.Lock() | ||
| delete(data_store.Store.KubeNamespaceData, reqID.String()) | ||
| data_store.Store.Mutex.Unlock() |
|
@Sanchit2662 please check the comments from copilot |
Summary
This PR fixes a critical concurrency issue in the ChaosCenter GraphQL subscription layer that could lead to unbounded memory growth and a process-wide deadlock under normal UI usage.
Specifically,
GetInfraEventssubscriptions were leaking channels after client disconnects, andSendInfraEventcould block indefinitely while holding a shared mutex. Over time, this caused the GraphQL server to become unresponsive with no crash logs or clear error signals.The fix ensures proper subscription cleanup, prevents blocking sends, and hardens related cleanup paths against concurrent map access.
Fix
1. Proper subscription cleanup on disconnect
Channels are now removed from the publisher slice when the subscription context is cancelled:
2. Non-blocking event delivery to prevent deadlocks
Event publishing no longer blocks on slow or disconnected subscribers:
This ensures one stalled subscription cannot block the entire system.
3. Thread-safe cleanup in related subscriptions
Cleanup paths in
GetPodLog,GetKubeObject, andGetKubeNamespacenow properly guard map deletes with the shared mutex, preventing concurrent map access panics.Impact
Types of changes
Checklist