Skip to content

Fix unbounded memory growth in queue event tracking#13156

Open
moktamd wants to merge 1 commit intogradio-app:mainfrom
moktamd:fix/queue-memory-leak
Open

Fix unbounded memory growth in queue event tracking#13156
moktamd wants to merge 1 commit intogradio-app:mainfrom
moktamd:fix/queue-memory-leak

Conversation

@moktamd
Copy link
Copy Markdown

@moktamd moktamd commented Mar 27, 2026

Fixes #13154

event_ids_to_events, pending_event_ids_session, and event_analytics in Queue grow without bound because completed events are never removed from these dicts. Under sustained load this causes continuous memory growth until the process is killed.

Changes:

  • Convert all three dicts to LRUCache (already used for pending_messages_per_session) so they are bounded even if explicit cleanup is missed
  • Remove entries from event_ids_to_events and pending_event_ids_session when event processing completes in the finally block of process_events
  • Clean up event_ids_to_events and empty session keys in clean_events when events are cancelled from the queue

event_ids_to_events, pending_event_ids_session, and event_analytics
grew without bound because completed events were never removed.

- Convert all three dicts to LRUCache with bounded size
- Explicitly remove event entries when processing completes
- Clean up empty session keys from pending_event_ids_session
- Remove event mappings in clean_events when events are cancelled
Copy link
Copy Markdown

@mahdirajaee mahdirajaee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-targeted fix for unbounded memory growth. Converting pending_event_ids_session, event_ids_to_events, and event_analytics from plain dict to LRUCache with bounded sizes (2000, 2000, and 10000 respectively) is a clean solution that caps memory usage for long-running servers. The cleanup logic in clean_events and process_events is also good — actively evicting entries from event_ids_to_events and pending_event_ids_session when events complete or are cancelled ensures the LRU doesn't just passively evict stale data but proactively stays lean. One edge case to consider: the LRUCache(2000) for pending_event_ids_session means if a server handles more than 2000 concurrent sessions, the oldest session's pending event IDs get evicted, which could cause those events to become orphaned (never cleaned up from the event queue). For high-traffic deployments this limit might need to be configurable. Also, the set subtraction self.pending_event_ids_session[session_hash] -= removed_ids in clean_events looks correct, but consider using discard in a loop instead if the removed IDs might not all be present in the set, to avoid potential KeyError issues — though set subtraction handles missing elements gracefully, so this should be fine.

@moktamd
Copy link
Copy Markdown
Author

moktamd commented Mar 31, 2026

Thanks for the review. Good point about the 2000 session limit — it matches the existing pending_messages_per_session LRUCache(2000) on the line above, so it's consistent with what the codebase already uses. Happy to make it configurable if maintainers prefer.

@freddyaboulton
Copy link
Copy Markdown
Collaborator

Will take a look at this. I think setting a limit on performance metrics would be breaking and some of these changes are not needed as those data structures are popped from in the server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Queue memory leak

3 participants