fix: repro memory leak and deadlock under high payloads and high concurrency #5269
base: main
Conversation
Signed-off-by: michaelfeil <me@michaelfeil.eu>
👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Introduces a complete OpenAI-like pipeline example for the Python bindings with integrated memory profiling. Includes documentation, a backend service with optional proxy mode and request handling, a frontend chat service with MockEngine wrapper, a load testing utility for high-concurrency scenarios, and a shared memory monitoring module for tracking allocations and performance metrics during high-throughput payload processing.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In @lib/bindings/python/examples/pipeline_openai/backend.py:
- Line 1: Update the copyright header comment at the top of the file by changing
"2025" to "2026" so the SPDX copyright line reads the current year; locate the
top-of-file comment string beginning with "SPDX-FileCopyrightText" and replace
the year token.
- Around lines 72-80: The bare "raise" in the finally block after awaiting
endpoint.serve_endpoint(RequestHandler(proxy_client).generate) will cause a
RuntimeError on normal exit. Remove it from the finally and either drop
re-raising entirely or re-raise only when an exception actually occurred,
keeping the cleanup (monitor.log_memory and monitor_task.cancel) in the
finally block so shutdown can complete cleanly.
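A minimal runnable sketch of the suggested structure; `serve()` and the monitor task are stand-ins for the real `endpoint.serve_endpoint(...)` call and monitor loop from backend.py:

```python
import asyncio

async def serve():
    """Stand-in for endpoint.serve_endpoint(RequestHandler(proxy_client).generate)."""
    await asyncio.sleep(0.01)

async def worker():
    monitor_task = asyncio.ensure_future(asyncio.sleep(3600))  # stand-in monitor loop
    try:
        await serve()
        # No bare `raise` below: exceptions from serve() propagate on their own,
        # and a normal return no longer hits "RuntimeError: No active exception to re-raise".
    finally:
        # Cleanup runs on both the success and the error path.
        monitor_task.cancel()

asyncio.run(worker())
```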
In @lib/bindings/python/examples/pipeline_openai/frontend.py:
- Line 1: Update the SPDX copyright header year from 2025 to 2026 in the file
that contains the header comment (the top-line string that begins with "#
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved."); replace "2025" with "2026" so the header reads "...
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
- Line 64: The HTTP status code used in the exception is incorrect: replace
HttpError(401, "Failed to contact pipeline after retries") with a 5xx code that
reflects an upstream service/connection failure (e.g., HttpError(502, "Failed to
contact pipeline after retries") or HttpError(503, ...)); update the HttpError
invocation in the same spot to use 502 (Bad Gateway) so the error semantically
indicates the pipeline was unreachable rather than an authentication issue.
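For illustration, a sketch of the corrected call; the `HttpError` import path follows the code-graph reference to dynamo.llm.exceptions further down and is otherwise an assumption:

```python
from dynamo.llm.exceptions import HttpError  # path assumed from the code graph below

# 502 Bad Gateway: the upstream pipeline was unreachable, not an auth failure.
raise HttpError(502, "Failed to contact pipeline after retries")
```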
In @lib/bindings/python/examples/pipeline_openai/load_test.py:
- Around line 1-5: Add the required SPDX header by inserting two comment lines
immediately after the existing shebang (#!/usr/bin/env python3) and before the
module docstring (the triple-quoted string): include an SPDX-FileCopyrightText
line with the correct year and owner and an SPDX-License-Identifier line with
the project license (e.g., "SPDX-FileCopyrightText: 2026 Your Name or Org" and
"SPDX-License-Identifier: Apache-2.0"), replacing placeholders with the real
values.
- Line 231: The variable start_time is assigned twice causing the first
assignment to be unused; remove the redundant initial assignment (or consolidate
to a single assignment) so only one start_time is set before it's used—locate
both assignments to start_time in load_test.py and keep the correct one (the
later timing capture) and delete the earlier shadowing assignment.
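For the SPDX item above, the resulting top of load_test.py would look roughly like this; the year, license identifier, and docstring are placeholders to be matched to the project:

```python
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Load testing utility for the pipeline_openai example."""
```

And a sketch of the duplicate `start_time` pattern flagged in the second item (surrounding names are illustrative, not copied from load_test.py):

```python
import time

start_time = time.time()   # redundant: overwritten below before ever being read
# ... request setup would happen here ...
start_time = time.time()   # keep only this assignment, which brackets the timed section
time.sleep(0.01)           # stand-in for the actual request
print(f"elapsed: {time.time() - start_time:.3f}s")
```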
In @lib/bindings/python/examples/pipeline_openai/memory_monitor.py:
- Around line 1-4: Add the required SPDX copyright header to the top of
memory_monitor.py by inserting a copyright notice and an SPDX-License-Identifier
line (e.g., "Copyright YEAR YourOrganization" and "SPDX-License-Identifier:
<license-id>") as the very first non-empty lines so the header precedes the
module docstring; ensure YEAR and the copyright holder are correct and the SPDX
identifier matches the project license.
In @lib/bindings/python/examples/pipeline_openai/README.md:
- Line 2: Fix the typo in the README sentence that currently reads "This OpenAI
Compatible pipeline is profiling memory allocations of dynamo itself in a
high-thoughput for large payloads." by replacing the misspelled word "thoughput"
with "throughput" so the sentence reads correctly.
🧹 Nitpick comments (6)
lib/bindings/python/examples/pipeline_openai/README.md (1)
5-10: Add language specifiers to fenced code blocks. Per markdown linting, fenced code blocks should have a language specified for syntax highlighting and accessibility.
📝 Suggested fix
````diff
 files:
-```
+```text
 frontend.py -> HTTP Frontend, implemented in Python with the Rust bindings
 backend.py -> proxy and backend implementation
 load_test.py -> Sends http requests to frontend
 memory_monitor.py -> Monitors pid memory usage
````

````diff
 Startup, best run from three different terminals in this order:
-```
+```bash
 python ./backend.py &
 python ./backend.py --proxy-mode &
 python ./frontend.py
````

````diff
 Then
-```
+```bash
 python load_test.py
````

Also applies to: 13-17, 20-22

lib/bindings/python/examples/pipeline_openai/backend.py (1)

41-43: Rename unused loop variable `chunk` to `_chunk`. The loop variable is not used within the loop body.

📝 Suggested fix

```diff
-        async for chunk in stream:
+        async for _chunk in stream:
             i += 1
             yield f"chunk{i}"
```

lib/bindings/python/examples/pipeline_openai/memory_monitor.py (1)
42-53: Clarify the intent of the property access and note a potential thread-safety concern.

Line 45: The bare `self.initial_memory` access is used to trigger lazy initialization of `_initial_memory`, but it reads as a useless expression. Consider making it explicit.

The `request_count += 1` increment is not thread-safe under concurrent access. While acceptable for a profiling tool where approximate counts suffice, be aware this may undercount under high concurrency.

📝 Suggested fix for clarity

```diff
     def increment_request(self):
         """Increment request counter and log if needed"""
         self.request_count += 1
-        self.initial_memory
+        _ = self.initial_memory  # Ensure initial memory is captured on first request
```
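If exact counts ever matter, a lock-guarded counter is a minimal fix; this is a sketch of the pattern, not the module's actual structure:

```python
import threading

class SafeCounter:
    """Sketch of a thread-safe request counter (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.request_count = 0

    def increment_request(self):
        # `+=` is a read-modify-write; the lock prevents lost updates.
        with self._lock:
            self.request_count += 1

counter = SafeCounter()
counter.increment_request()
print(counter.request_count)  # 1
```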
lib/bindings/python/examples/pipeline_openai/frontend.py (1)

66-66: Rename unused loop variable `output` to `_output`.

📝 Suggested fix

```diff
-        async for output in stream:
+        async for _output in stream:
```

lib/bindings/python/examples/pipeline_openai/load_test.py (2)
22-22: Move import to module level. The `deque` import should be at the top of the file with other imports, not inside `__init__`.

📝 Suggested fix

At the top of the file (after line 12):

```python
from collections import deque
```

Then remove the inline import from `__init__`:

```diff
 def __init__(self, base_url: str = "http://localhost:8000"):
     self.base_url = base_url
     self.session = None
-    # Use collections.deque with maxlen to prevent unbounded memory growth
-    from collections import deque
-
+    # Use deque with maxlen to prevent unbounded memory growth
     self.request_times = deque(maxlen=10000)  # Keep only last 10k request times
```
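Why `maxlen` matters here: once full, a bounded deque silently evicts the oldest entry on every append, so memory stays constant. A quick demonstration:

```python
from collections import deque

times = deque(maxlen=3)
for t in (0.1, 0.2, 0.3, 0.4):
    times.append(t)
print(list(times))  # [0.2, 0.3, 0.4] -- the oldest sample was evicted
```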
170-176: Explicit `del` statements don't meaningfully accelerate garbage collection. Python's reference counting handles cleanup when variables go out of scope. The `del task` and `del tasks` statements at the end of each loop iteration don't provide meaningful GC benefits since these references are immediately reassigned in the next iteration anyway.

You can simplify by removing these lines:

```diff
-        # Explicitly clean up task references
-        for task in tasks:
-            if not task.done():
-                task.cancel()
-            # Help GC by removing references
-            del task
-        del tasks
+        # Cancel any incomplete tasks (defensive)
+        for task in tasks:
+            if not task.done():
+                task.cancel()
```
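The claim about reference counting can be verified directly: rebinding a name drops the old object just as `del` would. A small runnable check using `weakref`:

```python
import weakref

class Payload:
    pass

obj = Payload()
ref = weakref.ref(obj)
obj = Payload()        # rebinding releases the old object, no `del` needed
print(ref() is None)   # True -- CPython freed it via refcounting immediately
```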
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- lib/bindings/python/examples/pipeline_openai/README.md
- lib/bindings/python/examples/pipeline_openai/backend.py
- lib/bindings/python/examples/pipeline_openai/frontend.py
- lib/bindings/python/examples/pipeline_openai/load_test.py
- lib/bindings/python/examples/pipeline_openai/memory_monitor.py
🧰 Additional context used
🧬 Code graph analysis (1)
lib/bindings/python/examples/pipeline_openai/frontend.py (5)
lib/bindings/python/src/dynamo/_core.pyi (3)
- HttpAsyncEngine (923-930)
- HttpService (904-910)
- DistributedRuntime (36-85)

lib/bindings/python/src/dynamo/llm/exceptions.py (1)
- HttpError (13-35)

lib/bindings/python/src/dynamo/runtime/__init__.py (1)
- dynamo_worker (24-51)

lib/bindings/python/examples/pipeline_openai/memory_monitor.py (4)
- create_monitor (67-71)
- setup_background_monitor (74-85)
- log_memory (30-40)
- increment_request (42-64)

lib/bindings/python/examples/pipeline_openai/backend.py (2)
- generate (32-48)
- worker (52-80)
🪛 GitHub Actions: Copyright Checks
lib/bindings/python/examples/pipeline_openai/memory_monitor.py
[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.
lib/bindings/python/examples/pipeline_openai/load_test.py
[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.
lib/bindings/python/examples/pipeline_openai/backend.py
[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.
lib/bindings/python/examples/pipeline_openai/frontend.py
[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.
🪛 LanguageTool
lib/bindings/python/examples/pipeline_openai/README.md
[grammar] ~2-~2: Ensure spelling is correct
Context: ... allocations of dynamo itself in a high-thoughput for large payloads. files: ``` fronten...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
lib/bindings/python/examples/pipeline_openai/README.md
5-5: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
13-13: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
20-20: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.10)
lib/bindings/python/examples/pipeline_openai/memory_monitor.py
45-45: Found useless expression. Either assign it to a variable or remove it.
(B018)
lib/bindings/python/examples/pipeline_openai/load_test.py
69-69: Do not catch blind exception: Exception
(BLE001)
129-129: Do not catch blind exception: Exception
(BLE001)
130-130: Use explicit conversion flag
Replace with conversion flag
(RUF010)
167-167: Do not catch blind exception: Exception
(BLE001)
243-243: Do not catch blind exception: Exception
(BLE001)
lib/bindings/python/examples/pipeline_openai/backend.py
41-41: Loop control variable chunk not used within loop body
Rename unused chunk to _chunk
(B007)
lib/bindings/python/examples/pipeline_openai/frontend.py
59-59: Do not catch blind exception: Exception
(BLE001)
66-66: Loop control variable output not used within loop body
Rename unused output to _output
(B007)
138-138: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/examples/pipeline_openai/load_test.py (1)
31-40: Consider setting connection limits to prevent resource exhaustion.
`limit=0` (unlimited) combined with `limit_per_host=2000` could exhaust file descriptors or memory under extreme load. For a load testing tool this may be intentional, but consider adding a reasonable upper bound (e.g., `limit=5000`) to prevent runaway resource consumption during extended tests.
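A sketch of a bounded connector, keeping the example's per-host parallelism while capping total sockets; the 5000/2000 figures follow the suggestion above, and the localhost URL assumes the frontend from this PR is running:

```python
import asyncio
import aiohttp

async def main():
    # limit bounds total open connections; limit_per_host keeps load-test parallelism
    connector = aiohttp.TCPConnector(limit=5000, limit_per_host=2000)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("http://localhost:8000/v1/models") as resp:
            print(resp.status)

asyncio.run(main())
```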
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil thanks for this! will take a look not only in terms of fixing but in adding similar to our test suite.
@biswapanda - fyi - one thing different from 0.6 is the new tcp request plane
@michaelfeil - one question - do you see this behavior when not using the python bindings as well?
@nnshah1 I've only ever used the python bindings - it's not possible to migrate the codebase (15k) to rust just for the sake of the actual memory leak. Just wondering about the repro on dynamo 0.8 / this branch. Specifically, I think this one should not have 2GB overhead of missing ram after stopping to serve traffic:

```python
async def generate(self, request, context):
    stream = await self.proxy_client.random(request)
    i = 0
    async for chunk in stream:
        i += 1
        yield f"chunk{i}"
```
In theory, similar issues on the http endpoint:
agreed - was just curious - we can test on our side to see if that makes a difference
-- starting a frontend via: https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/frontend/main.py? |
Signed-off-by: michaelfeil me@michaelfeil.eu
Overview:
Motivation: I noticed that the Frontend Pod is suspected of an unbounded memory growth issue. Here is a picture from ~200 prod pods.

I wrote a minimal repro to prove this is also happening in Dynamo upstream, not just on my private fork. Typically, the memory can grow by ~10-40GB/day, maybe after processing 200k-1M requests. The resolution was/is an un-graceful termination by k8s, dropping ALL streams and init.
Critical Observations RUN 1:
Critical Observations RUN 2:
Details:
To repro, I wrote the following minimal DAG:
I am then sending short or long context requests:
Run 1:
`./load_test.py --payload-size 10 --concurrency 48` running 30 token requests

Results:

```text
Progress: 10000/10000 requests, rate: 1863.8 req/s
Completed: 10000, Errors: 0
```
Run 1 - continued with larger requests
After small requests work fine, let's run some approx 200k context requests. Very typical for agentic workflows, assuming large kv-cache, but the payload is transmitted every turn.
Memory grew a bit! Let's re-run `20000`-sized requests against the same server again.

Let's increase to `./load_test.py --payload-size 200000 --concurrency 48`

Let's increase to `./load_test.py --payload-size 200000 --concurrency 96`

A lot of growth. And that's just after 50k requests. Let's move on and run the benchmark overnight. If it was just small buffers, it would disappear/saturate. For this, I increase to 10000k requests and run the bench again.
RUN 2 super small payloads but high concurrency -- deadlock
On my 0.6.0 branch, which I originally wrote this script for, my testing concurrency was maxed at 1000.
Somehow 1000 deadlocks on the current main branch, while it works just fine on 0.6.0.
Running the benchmark with concurrency 400, everything is snappy.
Running at concurrency 1000
After this, even `curl http://localhost:8000/v1/models` is broken. The python memory stats keep printing (so it's not a GIL deadlock, otherwise there would be no more async prints from the memory reporter). Feels like a tokio deadlock.
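That inference can be reproduced with a plain background thread: if it keeps printing while HTTP stops answering, the GIL is still being acquired and released, which points at a runtime (tokio) stall rather than a Python-level deadlock. A minimal sketch of the idea, not the memory reporter itself:

```python
import threading
import time

def heartbeat():
    # Keeps printing only if the interpreter can still schedule Python threads,
    # i.e. the GIL is not permanently held by anyone.
    while True:
        print("python heartbeat: interpreter alive")
        time.sleep(5)

threading.Thread(target=heartbeat, daemon=True).start()
# Observe a couple of beats, then exit (in the real repro this runs alongside the server).
time.sleep(12)
```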
Attempts to fix it:
Tried:
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Release Notes
Documentation
New Features