
Conversation

@michaelfeil (Contributor) commented Jan 8, 2026

Signed-off-by: michaelfeil me@michaelfeil.eu

Overview:

Motivation: I noticed that the Frontend pod is suspected of an unbounded memory growth issue. Here is a picture from ~200 prod pods.
I wrote a minimal repro to show this also happens in Dynamo upstream, not just on my private fork. Typically, memory grows by ~10-40 GB/day, roughly after processing 200k-1M requests. The resolution was/is an ungraceful termination by k8s, dropping ALL in-flight streams and forcing a re-init.
[image: memory usage across ~200 prod frontend pods]

Critical Observations RUN 1:

  • Sending long-context requests causes the memory leak; short-context messages do not show it.
  • The backend.py is extremely simple, yet in the proxy configuration it still holds on to ~1.4 GB of RAM after the requests are done. The non-proxy configuration is better off. Pods in the middle of the DAG seem to suffer the most.

Critical Observations RUN 2:

  • It's very easy to "stall"/"deadlock" the entire pipeline. This was an unexpected side finding. My fork is based on 0.6.0 and does not show this behavior.

Details:

To repro, I wrote the following minimal DAG:

frontend -> backend-proxy (proxies to the prefill/decode worker) -> backend (similar to a prefill worker)
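For orientation, the proxy stage's handler is essentially the snippet below (the generate body is quoted later in this thread; the surrounding class skeleton is assumed for completeness). It only forwards the downstream stream and never intentionally retains the payload:

```python
class RequestHandler:
    """Backend handler; in proxy mode it simply forwards to the next worker."""

    def __init__(self, proxy_client=None):
        self.proxy_client = proxy_client

    async def generate(self, request, context):
        # Forward the request to the downstream worker and re-yield each chunk.
        # Nothing here keeps an explicit reference to the (large) request body.
        stream = await self.proxy_client.random(request)
        i = 0
        async for chunk in stream:
            i += 1
            yield f"chunk{i}"
```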

I am then sending short or long context requests:

Run 1:

  1. Spawning the three components: frontend -> backend-proxy -> backend
  2. ./load_test.py --payload-size 10 --concurrency 48 running 30-token requests (a sketch of the request loop follows the results)
    Results:
    Progress: 10000/10000 requests, rate: 1863.8 req/s
    Completed: 10000, Errors: 0
Load test completed!
Total requests sent: 10000
Successful requests: 10100
Total errors: 0
Average response time: 0.025s
Total test time: 5.37s
[image: memory usage during the small-payload run]
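For reference, the request loop load_test.py drives looks roughly like the sketch below. This is a minimal illustration assuming aiohttp; the function, payload fields, and defaults are illustrative and not the actual script:

```python
import asyncio

import aiohttp


async def run_load(base_url: str, payload_size: int, concurrency: int, total: int = 10000) -> None:
    # Chat-completions body whose user message is ~payload_size characters.
    body = {
        "model": "mock-model",
        "messages": [{"role": "user", "content": "x" * payload_size}],
        "stream": True,
        "max_tokens": 30,
    }
    sem = asyncio.Semaphore(concurrency)  # bound the number of in-flight requests

    async def one_request(session: aiohttp.ClientSession) -> int:
        async with sem:
            async with session.post(f"{base_url}/v1/chat/completions", json=body) as resp:
                async for _ in resp.content.iter_any():  # drain the streamed chunks
                    pass
                return resp.status

    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(one_request(session) for _ in range(total)))
    errors = sum(1 for status in statuses if status != 200)
    print(f"Completed: {total - errors}, Errors: {errors}")


if __name__ == "__main__":
    asyncio.run(run_load("http://localhost:8000", payload_size=10, concurrency=48))
```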

Run 1 - continued with larger requests

Since small requests work fine, let's run some ~200k-character context requests. These are very typical for agentic workflows: even assuming a large kv-cache on the worker, the full payload is transmitted on every turn.
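As a rough back-of-the-envelope estimate (assuming roughly one byte per character of JSON): a 200,000-character message is about 200 KB on the wire, so 10,000 such requests push roughly 2 GB of request bodies through the frontend and proxy, while at concurrency 48 only ~10 MB of payloads are legitimately in flight at any moment. Whatever stays resident long after the run points at buffers that are not being returned.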

./load_test.py --payload-size 20000 --concurrency 48
[image: memory usage after the payload-size 20000 run]

Memory grew a bit! Let's re-run the 20000-sized requests against the same server.

[image: memory usage after re-running the payload-size 20000 run]

Let's increase to ./load_test.py --payload-size 200000 --concurrency 48

[image: memory usage after payload-size 200000, concurrency 48]

Let's increase to ./load_test.py --payload-size 200000 --concurrency 96

[image: memory usage after payload-size 200000, concurrency 96]
# RUNNING ./load_test.py --payload-size 20000 --concurrency 48 now 
[FRONTEND] [495.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [510.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [525.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [540.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
# RUNNING ./load_test.py --payload-size 200000 --concurrency 48 now 
[FRONTEND] [555.6s] Background check: Memory: RSS=960.87MB (+910.50MB), VMS=30957.11MB, Requests=31666, Last GC freed=0.00MB
[FRONTEND] [570.2s] After 35000 requests: Memory: RSS=980.10MB (+929.72MB), VMS=31056.72MB, Requests=35001, Last GC freed=0.00MB
[FRONTEND] [570.6s] Background check: Memory: RSS=1180.07MB (+1129.70MB), VMS=31132.97MB, Requests=35079, Last GC freed=0.00MB
[FRONTEND] [585.6s] Background check: Memory: RSS=1127.55MB (+1077.18MB), VMS=31037.60MB, Requests=38563, Last GC freed=0.00MB
[FRONTEND] [592.1s] After 40000 requests: Memory: RSS=1089.83MB (+1039.45MB), VMS=31037.63MB, Requests=40000, Last GC freed=0.00MB
[FRONTEND]   GC collected 33 objects, freed 0.00MB
[FRONTEND] [600.6s] Background check: Memory: RSS=1065.80MB (+1015.43MB), VMS=31129.16MB, Requests=40401, Last GC freed=0.00MB
[FRONTEND] [615.6s] Background check: Memory: RSS=1065.80MB (+1015.43MB), VMS=31121.15MB, Requests=40401, Last GC freed=0.00MB
# RUNNING ./load_test.py --payload-size 200000 --concurrency 96 now 
[FRONTEND] [630.6s] Background check: Memory: RSS=1654.90MB (+1604.52MB), VMS=31106.66MB, Requests=40930, Last GC freed=0.00MB
[FRONTEND] [645.6s] Background check: Memory: RSS=1861.16MB (+1810.78MB), VMS=31301.21MB, Requests=44029, Last GC freed=0.00MB
[FRONTEND] [650.3s] After 45000 requests: Memory: RSS=2192.25MB (+2141.88MB), VMS=31347.40MB, Requests=45000, Last GC freed=0.00MB
[FRONTEND] [660.6s] Background check: Memory: RSS=1986.45MB (+1936.07MB), VMS=31309.22MB, Requests=47192, Last GC freed=0.00MB
[FRONTEND] [673.6s] After 50000 requests: Memory: RSS=2333.06MB (+2282.69MB), VMS=31328.31MB, Requests=50000, Last GC freed=0.00MB
[FRONTEND] [675.6s] Background check: Memory: RSS=2015.35MB (+1964.97MB), VMS=31316.85MB, Requests=50446, Last GC freed=0.00MB

A lot of growth, and that's just after 50k requests. Let's move on and run the benchmark overnight: if this were just small buffers, the growth would saturate or disappear. For this, I increase to 10000k requests and run the bench again.

[image: memory usage over the overnight run]
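For context, the background reporter producing the [FRONTEND] lines above is conceptually just a psutil-based asyncio task along the lines of the sketch below (names, intervals, and the log format are assumptions, not the exact memory_monitor.py):

```python
import asyncio
import gc
import os

import psutil


class MemoryReporterSketch:
    """Minimal sketch: periodically log RSS/VMS deltas for this process."""

    def __init__(self, tag: str = "FRONTEND", interval_s: float = 15.0):
        self.tag = tag
        self.interval_s = interval_s
        self.proc = psutil.Process(os.getpid())
        self.request_count = 0
        self.initial_rss_mb = self.proc.memory_info().rss / 1024 / 1024

    def log_memory(self, prefix: str = "Background check") -> None:
        info = self.proc.memory_info()
        rss_mb = info.rss / 1024 / 1024
        vms_mb = info.vms / 1024 / 1024
        collected = gc.collect()  # force a collection so growth is not just uncollected garbage
        print(
            f"[{self.tag}] {prefix}: Memory: RSS={rss_mb:.2f}MB "
            f"(+{rss_mb - self.initial_rss_mb:.2f}MB), VMS={vms_mb:.2f}MB, "
            f"Requests={self.request_count}, GC collected {collected} objects"
        )

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self.interval_s)
            self.log_memory()
```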

RUN 2: super small payloads but high concurrency -- deadlock

On my 0.6.0 branch, which I originally wrote this script for, I maxed out my testing at concurrency 1000.
Somehow concurrency 1000 deadlocks on the current main branch, while it works just fine on 0.6.0.
Running the benchmark with concurrency 400, everything is snappy.

pipeline_openai  $ ./load_test.py --payload-size 2 --concurrency 400
✓ Backend is accessible at http://localhost:8000
Running initial test...
Starting load test: 100 requests with 10 concurrent
Request 0 completed in 0.006s
Average response time: 0.006s

Load test completed!
Total requests sent: 100
Successful requests: 100
Total errors: 0
Average response time: 0.006s

Running full load test...
Starting load test: 10000 requests with 400 concurrent
Request 0 completed in 0.092s
Average response time: 0.007s
....
Request 9000 completed in 0.093s
Average response time: 0.158s
Progress: 10000/10000 requests, rate: 1639.8 req/s
  Completed: 10000, Errors: 0

Load test completed!
Total requests sent: 10000
Successful requests: 10100
Total errors: 0
Average response time: 0.210s
Total test time: 6.10s

Running at concurrency 1000

./load_test.py --payload-size 2 --concurrency 1000
✓ Backend is accessible at http://localhost:8000
Running initial test...
Starting load test: 100 requests with 10 concurrent
Request 0 completed in 0.006s
Average response time: 0.006s

Load test completed!
Total requests sent: 100
Successful requests: 100
Total errors: 0
Average response time: 0.006s

Running full load test...
Starting load test: 10000 requests with 1000 concurrent
Request 0 completed in 0.299s
Average response time: 0.009s
Request 1000 completed in 0.133s
Average response time: 0.277s
Request 2000 completed in 0.118s
Average response time: 0.270s
Request 3000 completed in 0.155s
Average response time: 0.185s
Request 4000 completed in 0.095s
Average response time: 0.225s
Progress: 5000/10000 requests, rate: 1480.8 req/s
  Completed: 5000, Errors: 0
Request 5000 completed in 0.108s
Average response time: 0.272s
Request 6000 completed in 0.124s
Average response time: 0.243s
Request 7000 completed in 0.110s
Average response time: 0.231s # STALLS FOR 100s of seconds here, after 7k processed requests. Are we starving the runtime here??
Total error count: 100
Total error count: 200
Total error count: 300
Total error count: 400
Total error count: 500
Total error count: 600
Total error count: 700
Total error count: 800
Total error count: 900
Total error count: 1000
Total error count: 1100
Total error count: 1200
Total error count: 1300
Total error count: 1400
Total error count: 1500
Total error count: 1600
Total error count: 1700
Total error count: 1800

After this, even curl http://localhost:8000/v1/models is broken. The Python memory stats keep printing (so it is not a GIL deadlock, otherwise the async prints from the memory reporter would stop as well). Feels like a tokio deadlock.
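A cheap way to separate a Python-level stall from a runtime-level one (generic stdlib tooling, not part of the repro scripts; faulthandler.register is Unix-only):

```python
import asyncio
import faulthandler
import signal
import sys

# Register once at process start (e.g. at the top of frontend.py).
# `kill -USR1 <pid>` then dumps the Python stack of every thread to stderr,
# showing whether the loop thread is parked inside a native/runtime call.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)


async def dump_pending_tasks(interval_s: float = 60.0) -> None:
    # Run as a background task next to the memory reporter; if this keeps
    # printing while requests hang, the asyncio loop itself is alive and the
    # stall is below Python (e.g. in the tokio/request-plane layer).
    while True:
        await asyncio.sleep(interval_s)
        pending = [t for t in asyncio.all_tasks() if not t.done()]
        print(f"[DEBUG] {len(pending)} pending asyncio tasks")
```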

pipeline_openai  $ ./load_test.py --payload-size 2 --concurrency 1000
✗ Cannot connect to backend at http://localhost:8000: 

Attempts to fix it:

Tried:

  • memray, attached from application startup. It reports that a lot of Python string allocations plus NvChatCompletion objects are responsible for the memory usage.
  • Attaching all kinds of timeouts around the Python-facing engine.rs paths so that e.g. operations time out and queues get drained (sketched below).
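On the Python side, the timeout experiments amount to bounding every await in the proxy handler, roughly like the sketch below (illustrative only; the 30 s budget and the function shape are assumptions, and this is listed above as an attempt, not a fix):

```python
import asyncio

STREAM_IDLE_TIMEOUT_S = 30.0  # assumed budget, not a value taken from the PR


async def generate_with_timeouts(self, request, context):
    # Same shape as the proxy handler, but every await is bounded so a stuck
    # downstream stream cannot pin the request (and its buffers) forever.
    stream = await asyncio.wait_for(
        self.proxy_client.random(request), timeout=STREAM_IDLE_TIMEOUT_S
    )
    iterator = stream.__aiter__()
    i = 0
    while True:
        try:
            await asyncio.wait_for(iterator.__anext__(), timeout=STREAM_IDLE_TIMEOUT_S)
        except StopAsyncIteration:
            break
        i += 1
        yield f"chunk{i}"
```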

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive guide for Memory Profiling Pipeline example with detailed setup and startup instructions.
  • New Features

    • Added complete example pipeline demonstrating an OpenAI-like chat service with integrated memory monitoring.
    • Introduced load testing utility for evaluating performance and stability under high-concurrency, high-payload scenarios.
    • Added memory profiling and monitoring capabilities for tracking resource usage across pipeline components.


Signed-off-by: michaelfeil <me@michaelfeil.eu>
@michaelfeil michaelfeil requested review from a team as code owners January 8, 2026 11:33
copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented Jan 8, 2026

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Jan 8, 2026
coderabbitai bot (Contributor) commented Jan 8, 2026

Walkthrough

Introduces a complete OpenAI-like pipeline example for the Python bindings with integrated memory profiling. Includes documentation, a backend service with optional proxy mode and request handling, a frontend chat service with MockEngine wrapper, a load testing utility for high-concurrency scenarios, and a shared memory monitoring module for tracking allocations and performance metrics during high-throughput payload processing.

Changes

Cohort / File(s) and Summary:

  • Documentation - lib/bindings/python/examples/pipeline_openai/README.md: Describes the Memory Profiling Pipeline, documents the involved modules (frontend.py, backend.py, load_test.py, memory_monitor.py), and provides startup instructions.
  • Memory Monitoring Infrastructure - lib/bindings/python/examples/pipeline_openai/memory_monitor.py: Introduces the MemoryMonitor class and factory functions to track RSS/VMS memory, request counts, and periodic memory logging via background asyncio tasks; environment-driven activation via DYNAMO_MEMORY_PROFILE.
  • Pipeline Components - lib/bindings/python/examples/pipeline_openai/backend.py, lib/bindings/python/examples/pipeline_openai/frontend.py: The backend service defines RequestHandler with a generate method (streams proxied chunks or emits mock chunks) and a worker entry point with proxy mode support; the frontend service defines a MockEngine wrapper and a worker entry point that sets up HttpService with chat completions, integrates memory monitoring, and manages the service lifecycle.
  • Load Testing - lib/bindings/python/examples/pipeline_openai/load_test.py: Introduces LoadTestDebugger for high-concurrency HTTP load testing; manages session lifecycle, connectivity checks, request execution with streaming, latency tracking, and error logging, and orchestrates concurrent batches with result summaries.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop through pipelines, memory traced,
Backend and frontend interlaced,
Load-testing hops with concurrent cheer,
Monitoring every byte held dear!
OpenAI dreams in Python's embrace,
A profiled pipeline, memory's grace! ✨

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)

❌ Failed checks:
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.91%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): The PR description provides extensive technical details about memory issues, reproduction steps with concrete metrics, and code samples, but lacks structured organization against template requirements. Resolution: reorganize the description to clearly map findings to template sections: move the reproduction setup to 'Details', specify the files (frontend.py, backend.py, load_test.py, memory_monitor.py) under 'Where should the reviewer start', and clarify the GitHub issue number in 'Related Issues'.

✅ Passed checks:
  • Title check (✅ Passed): The title 'fix: repro memory leak and deadlock under high payloads and high concurrency' directly aligns with the changeset, which adds a complete Memory Profiling Pipeline example demonstrating memory monitoring and load testing under high-concurrency, high-payload scenarios.




coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @lib/bindings/python/examples/pipeline_openai/backend.py:
- Line 1: Update the copyright header comment at the top of the file by changing
"2025" to "2026" so the SPDX copyright line reads the current year; locate the
top-of-file comment string beginning with "SPDX-FileCopyrightText" and replace
the year token.
- Around line 72-80: The bare "raise" in the finally block after awaiting
endpoint.serve_endpoint(RequestHandler(proxy_client).generate) will cause
RuntimeError on normal exit; remove it from the finally and either remove
re-raising entirely or re-raise only when an exception occurred by changing the
structure to try: await endpoint.serve_endpoint(...) except Exception as e:
raise to propagate errors (or omit the except if you don't want propagation) and
keep the cleanup in finally (monitor.log_memory and monitor_task.cancel) so
shutdown can complete cleanly.

In @lib/bindings/python/examples/pipeline_openai/frontend.py:
- Line 1: Update the SPDX copyright header year from 2025 to 2026 in the file
that contains the header comment (the top-line string that begins with "#
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved."); replace "2025" with "2026" so the header reads "...
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
- Line 64: The HTTP status code used in the exception is incorrect: replace
HttpError(401, "Failed to contact pipeline after retries") with a 5xx code that
reflects an upstream service/connection failure (e.g., HttpError(502, "Failed to
contact pipeline after retries") or HttpError(503, ...)); update the HttpError
invocation in the same spot to use 502 (Bad Gateway) so the error semantically
indicates the pipeline was unreachable rather than an authentication issue.

In @lib/bindings/python/examples/pipeline_openai/load_test.py:
- Around line 1-5: Add the required SPDX header by inserting two comment lines
immediately after the existing shebang (#!/usr/bin/env python3) and before the
module docstring (the triple-quoted string): include an SPDX-FileCopyrightText
line with the correct year and owner and an SPDX-License-Identifier line with
the project license (e.g., "SPDX-FileCopyrightText: 2026 Your Name or Org" and
"SPDX-License-Identifier: Apache-2.0"), replacing placeholders with the real
values.
- Line 231: The variable start_time is assigned twice causing the first
assignment to be unused; remove the redundant initial assignment (or consolidate
to a single assignment) so only one start_time is set before it's used—locate
both assignments to start_time in load_test.py and keep the correct one (the
later timing capture) and delete the earlier shadowing assignment.

In @lib/bindings/python/examples/pipeline_openai/memory_monitor.py:
- Around line 1-4: Add the required SPDX copyright header to the top of
memory_monitor.py by inserting a copyright notice and an SPDX-License-Identifier
line (e.g., "Copyright YEAR YourOrganization" and "SPDX-License-Identifier:
<license-id>") as the very first non-empty lines so the header precedes the
module docstring; ensure YEAR and the copyright holder are correct and the SPDX
identifier matches the project license.

In @lib/bindings/python/examples/pipeline_openai/README.md:
- Line 2: Fix the typo in the README sentence that currently reads "This OpenAI
Compatible pipeline is profiling memory allocations of dynamo itself in a
high-thoughput for large payloads." by replacing the misspelled word "thoughput"
with "throughput" so the sentence reads correctly.
🧹 Nitpick comments (6)
lib/bindings/python/examples/pipeline_openai/README.md (1)

5-10: Add language specifiers to fenced code blocks.

Per markdown linting, fenced code blocks should have a language specified for syntax highlighting and accessibility.

📝 Suggested fix: add a language specifier to each fence (text for the file listing, bash for the two command blocks). The affected README content:

files:
frontend.py -> HTTP Frontend, implemented in Python with the Rust bindings
backend.py  -> proxy and backend implementation
load_test.py -> Sends http requests to frontend
memory_monitor.py -> Monitors pid memory usage

Startup, best run from three different terminals in this order:
python ./backend.py &
python ./backend.py --proxy-mode &
python ./frontend.py

Then:
python load_test.py


Also applies to: 13-17, 20-22

lib/bindings/python/examples/pipeline_openai/backend.py (1)

41-43: Rename unused loop variable chunk to _chunk.

The loop variable is not used within the loop body.

📝 Suggested fix

-            async for chunk in stream:
+            async for _chunk in stream:
                 i += 1
                 yield f"chunk{i}"
lib/bindings/python/examples/pipeline_openai/memory_monitor.py (1)

42-53: Clarify the intent of the property access and note potential thread-safety concern.

  1. Line 45: The bare self.initial_memory access is used to trigger lazy initialization of _initial_memory, but this reads as a useless expression. Consider making it explicit.

  2. The request_count += 1 increment is not thread-safe under concurrent access. While acceptable for a profiling tool where approximate counts suffice, be aware this may undercount under high concurrency.

📝 Suggested fix for clarity
     def increment_request(self):
         """Increment request counter and log if needed"""
         self.request_count += 1
-        self.initial_memory
+        _ = self.initial_memory  # Ensure initial memory is captured on first request
lib/bindings/python/examples/pipeline_openai/frontend.py (1)

66-66: Rename unused loop variable output to _output.

📝 Suggested fix
-            async for output in stream:
+            async for _output in stream:
lib/bindings/python/examples/pipeline_openai/load_test.py (2)

22-22: Move import to module level.

The deque import should be at the top of the file with other imports, not inside __init__.

📝 Suggested fix

At the top of the file (after line 12):

from collections import deque

Then remove line 22 from __init__:

     def __init__(self, base_url: str = "http://localhost:8000"):
         self.base_url = base_url
         self.session = None
-        # Use collections.deque with maxlen to prevent unbounded memory growth
-        from collections import deque
-
+        # Use deque with maxlen to prevent unbounded memory growth
         self.request_times = deque(maxlen=10000)  # Keep only last 10k request times

170-176: Explicit del statements don't meaningfully accelerate garbage collection.

Python's reference counting handles cleanup when variables go out of scope. The del task and del tasks statements at the end of each loop iteration don't provide meaningful GC benefits since these references are immediately reassigned in the next iteration anyway.

You can simplify by removing these lines:

-            # Explicitly clean up task references
-            for task in tasks:
-                if not task.done():
-                    task.cancel()
-                # Help GC by removing references
-                del task
-            del tasks
+            # Cancel any incomplete tasks (defensive)
+            for task in tasks:
+                if not task.done():
+                    task.cancel()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92748c9 and 17f1b0f.

📒 Files selected for processing (5)
  • lib/bindings/python/examples/pipeline_openai/README.md
  • lib/bindings/python/examples/pipeline_openai/backend.py
  • lib/bindings/python/examples/pipeline_openai/frontend.py
  • lib/bindings/python/examples/pipeline_openai/load_test.py
  • lib/bindings/python/examples/pipeline_openai/memory_monitor.py
🧰 Additional context used
🧬 Code graph analysis (1)
lib/bindings/python/examples/pipeline_openai/frontend.py (5)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • HttpAsyncEngine (923-930)
  • HttpService (904-910)
  • DistributedRuntime (36-85)
lib/bindings/python/src/dynamo/llm/exceptions.py (1)
  • HttpError (13-35)
lib/bindings/python/src/dynamo/runtime/__init__.py (1)
  • dynamo_worker (24-51)
lib/bindings/python/examples/pipeline_openai/memory_monitor.py (4)
  • create_monitor (67-71)
  • setup_background_monitor (74-85)
  • log_memory (30-40)
  • increment_request (42-64)
lib/bindings/python/examples/pipeline_openai/backend.py (2)
  • generate (32-48)
  • worker (52-80)
🪛 GitHub Actions: Copyright Checks
lib/bindings/python/examples/pipeline_openai/memory_monitor.py

[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.

lib/bindings/python/examples/pipeline_openai/load_test.py

[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.

lib/bindings/python/examples/pipeline_openai/backend.py

[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.

lib/bindings/python/examples/pipeline_openai/frontend.py

[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.

🪛 LanguageTool
lib/bindings/python/examples/pipeline_openai/README.md

[grammar] ~2-~2: Ensure spelling is correct
Context: ... allocations of dynamo itself in a high-thoughput for large payloads. files: ``` fronten...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
lib/bindings/python/examples/pipeline_openai/README.md

5-5: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


13-13: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


20-20: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
lib/bindings/python/examples/pipeline_openai/memory_monitor.py

45-45: Found useless expression. Either assign it to a variable or remove it.

(B018)

lib/bindings/python/examples/pipeline_openai/load_test.py

69-69: Do not catch blind exception: Exception

(BLE001)


129-129: Do not catch blind exception: Exception

(BLE001)


130-130: Use explicit conversion flag

Replace with conversion flag

(RUF010)


167-167: Do not catch blind exception: Exception

(BLE001)


243-243: Do not catch blind exception: Exception

(BLE001)

lib/bindings/python/examples/pipeline_openai/backend.py

41-41: Loop control variable chunk not used within loop body

Rename unused chunk to _chunk

(B007)

lib/bindings/python/examples/pipeline_openai/frontend.py

59-59: Do not catch blind exception: Exception

(BLE001)


66-66: Loop control variable output not used within loop body

Rename unused output to _output

(B007)


138-138: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/examples/pipeline_openai/load_test.py (1)

31-40: Consider setting connection limits to prevent resource exhaustion.

limit=0 (unlimited) combined with limit_per_host=2000 could exhaust file descriptors or memory under extreme load. For a load testing tool this may be intentional, but consider adding a reasonable upper bound (e.g., limit=5000) to prevent runaway resource consumption during extended tests.

Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil michaelfeil changed the title memory leak repro fix: repro memory leak and deadlock under high payloads and high concurrency Jan 8, 2026
@github-actions github-actions bot added the fix label Jan 8, 2026
@nnshah1 nnshah1 self-requested a review January 8, 2026 15:08
nnshah1 (Contributor) commented Jan 8, 2026

@michaelfeil thanks for this! Will take a look, not only in terms of fixing it but also in adding something similar to our test suite.

nnshah1 (Contributor) commented Jan 8, 2026

@biswapanda - fyi - one thing different from 0.6 is the new TCP request plane

nnshah1 (Contributor) commented Jan 8, 2026

@michaelfeil - one question - do you see this behavior when not using the python bindings as well?

michaelfeil (Contributor Author) commented:

@nnshah1 I have never run it without the Python bindings - it is not feasible to migrate the codebase (15k) to Rust just for the sake of this memory leak. I was just wondering about the repro on dynamo 0.8 / this branch.

Specifically, I think this handler should not leave ~2 GB of unreclaimed RAM after it stops serving traffic.

```python
async def generate(self, request, context):
    stream = await self.proxy_client.random(request)
    i = 0
    async for chunk in stream:
        i += 1
        yield f"chunk{i}"
```

michaelfeil (Contributor Author) commented Jan 8, 2026

In theory, request_plane = os.environ.get("DYN_REQUEST_PLANE", "tcp"); when I set export DYN_REQUEST_PLANE=nats, all I see is Error contacting pipeline: NATS request failed: no responders: no responders when starting the repro.

Similar issues with the http endpoint:

2026-01-08T16:09:39.546095Z  INFO dynamo_runtime::pipeline::network::ingress::http_endpoint: Starting shared HTTP/2 endpoint server on 10.42.1.221:8888 at path /v1/rpc/:endpoint
2026-01-08T16:09:39.546163Z ERROR dynamo_runtime::pipeline::network::manager: HTTP request plane server error: Address already in use (os error 98)

nnshah1 (Contributor) commented Jan 8, 2026

@nnshah1 Never not used the python bindings - not possible to migrate the codebase (15k) to rust for the sake of the actual memory leak. Just wondering on the repro on dynamo 0.8 / this branch.

Specifically, i think this one should not have 2GB overhead of missing ram after stopping to serve traffic.

async def generate(self, request, context):
            stream = await self.proxy_client.random(request)
            i = 0
            async for chunk in stream:
                i += 1
                yield f"chunk{i}"

agreed - was just curious - we can test on our side to see if that makes a difference


@biswapanda biswapanda self-assigned this Jan 10, 2026