
Conversation

@michaelfeil (Contributor) commented Jan 8, 2026

Signed-off-by: michaelfeil me@michaelfeil.eu

Overview:

Motivation: I noticed that the Frontend pod is suspected of an unbounded memory growth issue. Here is a picture from ~200 prod pods.
I wrote a minimal repro to show this also happens in Dynamo upstream, not just on my private fork. Typically, memory grows by ~10-40 GB/day, roughly after processing 200k-1M requests. The resolution was/is an ungraceful termination by k8s, dropping ALL in-flight streams and forcing a re-init.
[image: memory usage across ~200 prod frontend pods]

Critical Observations RUN 1:

  • Sending long-context requests causes the memory leak; short-context messages do not show it.
  • The backend.py is extremely simple, yet in the proxy configuration it still holds on to ~1.4 GB of RAM after the requests are done. The non-proxy configuration is better off. Pods in the middle of the DAG seem to suffer the most.

Critical Observations RUN 2:

  • It's very easy to "stall"/"deadlock" the entire pipeline. This was an unexpected side finding. My fork is based on 0.6.0 and does not show this behavior.

Details:

To repro, I wrote the following minimal DAG:

frontend -> backend-proxy (proxies to the prefill/decode worker) -> backend (similar to a prefill worker)
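For orientation, the proxy stage's handler is essentially the snippet below (the generate body is quoted later in this thread; the surrounding class skeleton is assumed for completeness). It only forwards the downstream stream and never intentionally retains the payload:

```python
class RequestHandler:
    """Backend handler; in proxy mode it simply forwards to the next worker."""

    def __init__(self, proxy_client=None):
        self.proxy_client = proxy_client

    async def generate(self, request, context):
        # Forward the request to the downstream worker and re-yield each chunk.
        # Nothing here keeps an explicit reference to the (large) request body.
        stream = await self.proxy_client.random(request)
        i = 0
        async for chunk in stream:
            i += 1
            yield f"chunk{i}"
```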

I am then sending short or long context requests:

Run 1:

  1. Spawning the three components: frontend -> backend-proxy -> backend
  2. ./load_test.py --payload-size 10 --concurrency 48 running 30-token requests (a sketch of the request loop follows the results)
    Results:
    Progress: 10000/10000 requests, rate: 1863.8 req/s
    Completed: 10000, Errors: 0
Load test completed!
Total requests sent: 10000
Successful requests: 10100
Total errors: 0
Average response time: 0.025s
Total test time: 5.37s
[image: memory usage during the small-payload run]
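For reference, the request loop load_test.py drives looks roughly like the sketch below. This is a minimal illustration assuming aiohttp; the function, payload fields, and defaults are illustrative and not the actual script:

```python
import asyncio

import aiohttp


async def run_load(base_url: str, payload_size: int, concurrency: int, total: int = 10000) -> None:
    # Chat-completions body whose user message is ~payload_size characters.
    body = {
        "model": "mock-model",
        "messages": [{"role": "user", "content": "x" * payload_size}],
        "stream": True,
        "max_tokens": 30,
    }
    sem = asyncio.Semaphore(concurrency)  # bound the number of in-flight requests

    async def one_request(session: aiohttp.ClientSession) -> int:
        async with sem:
            async with session.post(f"{base_url}/v1/chat/completions", json=body) as resp:
                async for _ in resp.content.iter_any():  # drain the streamed chunks
                    pass
                return resp.status

    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(one_request(session) for _ in range(total)))
    errors = sum(1 for status in statuses if status != 200)
    print(f"Completed: {total - errors}, Errors: {errors}")


if __name__ == "__main__":
    asyncio.run(run_load("http://localhost:8000", payload_size=10, concurrency=48))
```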

Run 1 - continued with larger requests

Since small requests work fine, let's run some ~200k-character context requests. These are very typical for agentic workflows: even assuming a large kv-cache on the worker, the full payload is transmitted on every turn.
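As a rough back-of-the-envelope estimate (assuming roughly one byte per character of JSON): a 200,000-character message is about 200 KB on the wire, so 10,000 such requests push roughly 2 GB of request bodies through the frontend and proxy, while at concurrency 48 only ~10 MB of payloads are legitimately in flight at any moment. Whatever stays resident long after the run points at buffers that are not being returned.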

./load_test.py --payload-size 20000 --concurrency 48
[image: memory usage after the payload-size 20000 run]

Memory grew a bit! Let's re-run the 20000-sized requests against the same server.

[image: memory usage after re-running the payload-size 20000 run]

Let's increase to ./load_test.py --payload-size 200000 --concurrency 48

[image: memory usage after payload-size 200000, concurrency 48]

Let's increase to ./load_test.py --payload-size 200000 --concurrency 96

[image: memory usage after payload-size 200000, concurrency 96]
# RUNNING ./load_test.py --payload-size 20000 --concurrency 48 now 
[FRONTEND] [495.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [510.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [525.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
[FRONTEND] [540.6s] Background check: Memory: RSS=233.79MB (+183.42MB), VMS=30935.37MB, Requests=30301, Last GC freed=0.00MB
# RUNNING ./load_test.py --payload-size 200000 --concurrency 48 now 
[FRONTEND] [555.6s] Background check: Memory: RSS=960.87MB (+910.50MB), VMS=30957.11MB, Requests=31666, Last GC freed=0.00MB
[FRONTEND] [570.2s] After 35000 requests: Memory: RSS=980.10MB (+929.72MB), VMS=31056.72MB, Requests=35001, Last GC freed=0.00MB
[FRONTEND] [570.6s] Background check: Memory: RSS=1180.07MB (+1129.70MB), VMS=31132.97MB, Requests=35079, Last GC freed=0.00MB
[FRONTEND] [585.6s] Background check: Memory: RSS=1127.55MB (+1077.18MB), VMS=31037.60MB, Requests=38563, Last GC freed=0.00MB
[FRONTEND] [592.1s] After 40000 requests: Memory: RSS=1089.83MB (+1039.45MB), VMS=31037.63MB, Requests=40000, Last GC freed=0.00MB
[FRONTEND]   GC collected 33 objects, freed 0.00MB
[FRONTEND] [600.6s] Background check: Memory: RSS=1065.80MB (+1015.43MB), VMS=31129.16MB, Requests=40401, Last GC freed=0.00MB
[FRONTEND] [615.6s] Background check: Memory: RSS=1065.80MB (+1015.43MB), VMS=31121.15MB, Requests=40401, Last GC freed=0.00MB
# RUNNING ./load_test.py --payload-size 200000 --concurrency 96 now 
[FRONTEND] [630.6s] Background check: Memory: RSS=1654.90MB (+1604.52MB), VMS=31106.66MB, Requests=40930, Last GC freed=0.00MB
[FRONTEND] [645.6s] Background check: Memory: RSS=1861.16MB (+1810.78MB), VMS=31301.21MB, Requests=44029, Last GC freed=0.00MB
[FRONTEND] [650.3s] After 45000 requests: Memory: RSS=2192.25MB (+2141.88MB), VMS=31347.40MB, Requests=45000, Last GC freed=0.00MB
[FRONTEND] [660.6s] Background check: Memory: RSS=1986.45MB (+1936.07MB), VMS=31309.22MB, Requests=47192, Last GC freed=0.00MB
[FRONTEND] [673.6s] After 50000 requests: Memory: RSS=2333.06MB (+2282.69MB), VMS=31328.31MB, Requests=50000, Last GC freed=0.00MB
[FRONTEND] [675.6s] Background check: Memory: RSS=2015.35MB (+1964.97MB), VMS=31316.85MB, Requests=50446, Last GC freed=0.00MB

A lot of growth, and that's just after 50k requests. Let's move on and run the benchmark overnight: if this were just small buffers, the growth would saturate or disappear. For this, I increase to 10000k requests and run the bench again.

[image: memory usage over the overnight run]
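For context, the background reporter producing the [FRONTEND] lines above is conceptually just a psutil-based asyncio task along the lines of the sketch below (names, intervals, and the log format are assumptions, not the exact memory_monitor.py):

```python
import asyncio
import gc
import os

import psutil


class MemoryReporterSketch:
    """Minimal sketch: periodically log RSS/VMS deltas for this process."""

    def __init__(self, tag: str = "FRONTEND", interval_s: float = 15.0):
        self.tag = tag
        self.interval_s = interval_s
        self.proc = psutil.Process(os.getpid())
        self.request_count = 0
        self.initial_rss_mb = self.proc.memory_info().rss / 1024 / 1024

    def log_memory(self, prefix: str = "Background check") -> None:
        info = self.proc.memory_info()
        rss_mb = info.rss / 1024 / 1024
        vms_mb = info.vms / 1024 / 1024
        collected = gc.collect()  # force a collection so growth is not just uncollected garbage
        print(
            f"[{self.tag}] {prefix}: Memory: RSS={rss_mb:.2f}MB "
            f"(+{rss_mb - self.initial_rss_mb:.2f}MB), VMS={vms_mb:.2f}MB, "
            f"Requests={self.request_count}, GC collected {collected} objects"
        )

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self.interval_s)
            self.log_memory()
```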

RUN 2: super small payloads but high concurrency -- deadlock

On my 0.6.0 branch, which I originally wrote this script for, I maxed out my testing at concurrency 1000.
Somehow concurrency 1000 deadlocks on the current main branch, while it works just fine on 0.6.0.
Running the benchmark with concurrency 400, everything is snappy.

pipeline_openai  $ ./load_test.py --payload-size 2 --concurrency 400
✓ Backend is accessible at http://localhost:8000
Running initial test...
Starting load test: 100 requests with 10 concurrent
Request 0 completed in 0.006s
Average response time: 0.006s

Load test completed!
Total requests sent: 100
Successful requests: 100
Total errors: 0
Average response time: 0.006s

Running full load test...
Starting load test: 10000 requests with 400 concurrent
Request 0 completed in 0.092s
Average response time: 0.007s
....
Request 9000 completed in 0.093s
Average response time: 0.158s
Progress: 10000/10000 requests, rate: 1639.8 req/s
  Completed: 10000, Errors: 0

Load test completed!
Total requests sent: 10000
Successful requests: 10100
Total errors: 0
Average response time: 0.210s
Total test time: 6.10s

Running at concurrency 1000

./load_test.py --payload-size 2 --concurrency 1000
✓ Backend is accessible at http://localhost:8000
Running initial test...
Starting load test: 100 requests with 10 concurrent
Request 0 completed in 0.006s
Average response time: 0.006s

Load test completed!
Total requests sent: 100
Successful requests: 100
Total errors: 0
Average response time: 0.006s

Running full load test...
Starting load test: 10000 requests with 1000 concurrent
Request 0 completed in 0.299s
Average response time: 0.009s
Request 1000 completed in 0.133s
Average response time: 0.277s
Request 2000 completed in 0.118s
Average response time: 0.270s
Request 3000 completed in 0.155s
Average response time: 0.185s
Request 4000 completed in 0.095s
Average response time: 0.225s
Progress: 5000/10000 requests, rate: 1480.8 req/s
  Completed: 5000, Errors: 0
Request 5000 completed in 0.108s
Average response time: 0.272s
Request 6000 completed in 0.124s
Average response time: 0.243s
Request 7000 completed in 0.110s
Average response time: 0.231s # STALLS FOR 100s of seconds here, after 7k processed requests. Are we starving the runtime here??
Total error count: 100
Total error count: 200
Total error count: 300
Total error count: 400
Total error count: 500
Total error count: 600
Total error count: 700
Total error count: 800
Total error count: 900
Total error count: 1000
Total error count: 1100
Total error count: 1200
Total error count: 1300
Total error count: 1400
Total error count: 1500
Total error count: 1600
Total error count: 1700
Total error count: 1800

After this, even curl http://localhost:8000/v1/models is broken. The Python memory stats keep printing (so it is not a GIL deadlock, otherwise the async prints from the memory reporter would stop as well). Feels like a tokio deadlock.
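A cheap way to separate a Python-level stall from a runtime-level one (generic stdlib tooling, not part of the repro scripts; faulthandler.register is Unix-only):

```python
import asyncio
import faulthandler
import signal
import sys

# Register once at process start (e.g. at the top of frontend.py).
# `kill -USR1 <pid>` then dumps the Python stack of every thread to stderr,
# showing whether the loop thread is parked inside a native/runtime call.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)


async def dump_pending_tasks(interval_s: float = 60.0) -> None:
    # Run as a background task next to the memory reporter; if this keeps
    # printing while requests hang, the asyncio loop itself is alive and the
    # stall is below Python (e.g. in the tokio/request-plane layer).
    while True:
        await asyncio.sleep(interval_s)
        pending = [t for t in asyncio.all_tasks() if not t.done()]
        print(f"[DEBUG] {len(pending)} pending asyncio tasks")
```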

pipeline_openai  $ ./load_test.py --payload-size 2 --concurrency 1000
✗ Cannot connect to backend at http://localhost:8000: 

Attempts to fix it:

Tried:

  • memray, attached from application startup. It reports that a lot of Python string allocations plus NvChatCompletion objects are responsible for the memory usage.
  • Attaching all kinds of timeouts around the Python-facing engine.rs paths so that e.g. operations time out and queues get drained (sketched below).
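On the Python side, the timeout experiments amount to bounding every await in the proxy handler, roughly like the sketch below (illustrative only; the 30 s budget and the function shape are assumptions, and this is listed above as an attempt, not a fix):

```python
import asyncio

STREAM_IDLE_TIMEOUT_S = 30.0  # assumed budget, not a value taken from the PR


async def generate_with_timeouts(self, request, context):
    # Same shape as the proxy handler, but every await is bounded so a stuck
    # downstream stream cannot pin the request (and its buffers) forever.
    stream = await asyncio.wait_for(
        self.proxy_client.random(request), timeout=STREAM_IDLE_TIMEOUT_S
    )
    iterator = stream.__aiter__()
    i = 0
    while True:
        try:
            await asyncio.wait_for(iterator.__anext__(), timeout=STREAM_IDLE_TIMEOUT_S)
        except StopAsyncIteration:
            break
        i += 1
        yield f"chunk{i}"
```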

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive guide for Memory Profiling Pipeline example with detailed setup and startup instructions.
  • New Features

    • Added complete example pipeline demonstrating an OpenAI-like chat service with integrated memory monitoring.
    • Introduced load testing utility for evaluating performance and stability under high-concurrency, high-payload scenarios.
    • Added memory profiling and monitoring capabilities for tracking resource usage across pipeline components.


Signed-off-by: michaelfeil <me@michaelfeil.eu>
@michaelfeil michaelfeil requested review from a team as code owners January 8, 2026 11:33
copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented Jan 8, 2026

👋 Hi michaelfeil! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Jan 8, 2026
coderabbitai bot (Contributor) commented Jan 8, 2026

Walkthrough

Introduces a complete OpenAI-like pipeline example for the Python bindings with integrated memory profiling. Includes documentation, a backend service with optional proxy mode and request handling, a frontend chat service with MockEngine wrapper, a load testing utility for high-concurrency scenarios, and a shared memory monitoring module for tracking allocations and performance metrics during high-throughput payload processing.

Changes

Cohort / File(s) and Summary:

  • Documentation - lib/bindings/python/examples/pipeline_openai/README.md: Describes the Memory Profiling Pipeline, documents the involved modules (frontend.py, backend.py, load_test.py, memory_monitor.py), and provides startup instructions.
  • Memory Monitoring Infrastructure - lib/bindings/python/examples/pipeline_openai/memory_monitor.py: Introduces the MemoryMonitor class and factory functions to track RSS/VMS memory, request counts, and periodic memory logging via background asyncio tasks; environment-driven activation via DYNAMO_MEMORY_PROFILE.
  • Pipeline Components - lib/bindings/python/examples/pipeline_openai/backend.py, lib/bindings/python/examples/pipeline_openai/frontend.py: The backend service defines RequestHandler with a generate method (streams proxied chunks or emits mock chunks) and a worker entry point with proxy mode support; the frontend service defines a MockEngine wrapper and a worker entry point that sets up HttpService with chat completions, integrates memory monitoring, and manages the service lifecycle.
  • Load Testing - lib/bindings/python/examples/pipeline_openai/load_test.py: Introduces LoadTestDebugger for high-concurrency HTTP load testing; manages session lifecycle, connectivity checks, request execution with streaming, latency tracking, and error logging, and orchestrates concurrent batches with result summaries.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop through pipelines, memory traced,
Backend and frontend interlaced,
Load-testing hops with concurrent cheer,
Monitoring every byte held dear!
OpenAI dreams in Python's embrace,
A profiled pipeline, memory's grace! ✨

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)

❌ Failed checks:
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.91%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): The PR description provides extensive technical details about memory issues, reproduction steps with concrete metrics, and code samples, but lacks structured organization against template requirements. Resolution: reorganize the description to clearly map findings to template sections: move the reproduction setup to 'Details', specify the files (frontend.py, backend.py, load_test.py, memory_monitor.py) under 'Where should the reviewer start', and clarify the GitHub issue number in 'Related Issues'.

✅ Passed checks:
  • Title check (✅ Passed): The title 'fix: repro memory leak and deadlock under high payloads and high concurrency' directly aligns with the changeset, which adds a complete Memory Profiling Pipeline example demonstrating memory monitoring and load testing under high-concurrency, high-payload scenarios.




coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @lib/bindings/python/examples/pipeline_openai/backend.py:
- Line 1: Update the copyright header comment at the top of the file by changing
"2025" to "2026" so the SPDX copyright line reads the current year; locate the
top-of-file comment string beginning with "SPDX-FileCopyrightText" and replace
the year token.
- Around line 72-80: The bare "raise" in the finally block after awaiting
endpoint.serve_endpoint(RequestHandler(proxy_client).generate) will cause
RuntimeError on normal exit; remove it from the finally and either remove
re-raising entirely or re-raise only when an exception occurred by changing the
structure to try: await endpoint.serve_endpoint(...) except Exception as e:
raise to propagate errors (or omit the except if you don't want propagation) and
keep the cleanup in finally (monitor.log_memory and monitor_task.cancel) so
shutdown can complete cleanly.

In @lib/bindings/python/examples/pipeline_openai/frontend.py:
- Line 1: Update the SPDX copyright header year from 2025 to 2026 in the file
that contains the header comment (the top-line string that begins with "#
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All
rights reserved."); replace "2025" with "2026" so the header reads "...
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
- Line 64: The HTTP status code used in the exception is incorrect: replace
HttpError(401, "Failed to contact pipeline after retries") with a 5xx code that
reflects an upstream service/connection failure (e.g., HttpError(502, "Failed to
contact pipeline after retries") or HttpError(503, ...)); update the HttpError
invocation in the same spot to use 502 (Bad Gateway) so the error semantically
indicates the pipeline was unreachable rather than an authentication issue.

In @lib/bindings/python/examples/pipeline_openai/load_test.py:
- Around line 1-5: Add the required SPDX header by inserting two comment lines
immediately after the existing shebang (#!/usr/bin/env python3) and before the
module docstring (the triple-quoted string): include an SPDX-FileCopyrightText
line with the correct year and owner and an SPDX-License-Identifier line with
the project license (e.g., "SPDX-FileCopyrightText: 2026 Your Name or Org" and
"SPDX-License-Identifier: Apache-2.0"), replacing placeholders with the real
values.
- Line 231: The variable start_time is assigned twice causing the first
assignment to be unused; remove the redundant initial assignment (or consolidate
to a single assignment) so only one start_time is set before it's used—locate
both assignments to start_time in load_test.py and keep the correct one (the
later timing capture) and delete the earlier shadowing assignment.

In @lib/bindings/python/examples/pipeline_openai/memory_monitor.py:
- Around line 1-4: Add the required SPDX copyright header to the top of
memory_monitor.py by inserting a copyright notice and an SPDX-License-Identifier
line (e.g., "Copyright YEAR YourOrganization" and "SPDX-License-Identifier:
<license-id>") as the very first non-empty lines so the header precedes the
module docstring; ensure YEAR and the copyright holder are correct and the SPDX
identifier matches the project license.

In @lib/bindings/python/examples/pipeline_openai/README.md:
- Line 2: Fix the typo in the README sentence that currently reads "This OpenAI
Compatible pipeline is profiling memory allocations of dynamo itself in a
high-thoughput for large payloads." by replacing the misspelled word "thoughput"
with "throughput" so the sentence reads correctly.
🧹 Nitpick comments (6)
lib/bindings/python/examples/pipeline_openai/README.md (1)

5-10: Add language specifiers to fenced code blocks.

Per markdown linting, fenced code blocks should have a language specified for syntax highlighting and accessibility.

📝 Suggested fix: add a language specifier to each fence (text for the file listing, bash for the two command blocks). The affected README content:

files:
frontend.py -> HTTP Frontend, implemented in Python with the Rust bindings
backend.py  -> proxy and backend implementation
load_test.py -> Sends http requests to frontend
memory_monitor.py -> Monitors pid memory usage

Startup, best run from three different terminals in this order:
python ./backend.py &
python ./backend.py --proxy-mode &
python ./frontend.py

Then:
python load_test.py


Also applies to: 13-17, 20-22

lib/bindings/python/examples/pipeline_openai/backend.py (1)

41-43: Rename unused loop variable chunk to _chunk.

The loop variable is not used within the loop body.

📝 Suggested fix

-            async for chunk in stream:
+            async for _chunk in stream:
                 i += 1
                 yield f"chunk{i}"
lib/bindings/python/examples/pipeline_openai/memory_monitor.py (1)

42-53: Clarify the intent of the property access and note potential thread-safety concern.

  1. Line 45: The bare self.initial_memory access is used to trigger lazy initialization of _initial_memory, but this reads as a useless expression. Consider making it explicit.

  2. The request_count += 1 increment is not thread-safe under concurrent access. While acceptable for a profiling tool where approximate counts suffice, be aware this may undercount under high concurrency.

📝 Suggested fix for clarity
     def increment_request(self):
         """Increment request counter and log if needed"""
         self.request_count += 1
-        self.initial_memory
+        _ = self.initial_memory  # Ensure initial memory is captured on first request
lib/bindings/python/examples/pipeline_openai/frontend.py (1)

66-66: Rename unused loop variable output to _output.

📝 Suggested fix
-            async for output in stream:
+            async for _output in stream:
lib/bindings/python/examples/pipeline_openai/load_test.py (2)

22-22: Move import to module level.

The deque import should be at the top of the file with other imports, not inside __init__.

📝 Suggested fix

At the top of the file (after line 12):

from collections import deque

Then remove line 22 from __init__:

     def __init__(self, base_url: str = "http://localhost:8000"):
         self.base_url = base_url
         self.session = None
-        # Use collections.deque with maxlen to prevent unbounded memory growth
-        from collections import deque
-
+        # Use deque with maxlen to prevent unbounded memory growth
         self.request_times = deque(maxlen=10000)  # Keep only last 10k request times

170-176: Explicit del statements don't meaningfully accelerate garbage collection.

Python's reference counting handles cleanup when variables go out of scope. The del task and del tasks statements at the end of each loop iteration don't provide meaningful GC benefits since these references are immediately reassigned in the next iteration anyway.

You can simplify by removing these lines:

-            # Explicitly clean up task references
-            for task in tasks:
-                if not task.done():
-                    task.cancel()
-                # Help GC by removing references
-                del task
-            del tasks
+            # Cancel any incomplete tasks (defensive)
+            for task in tasks:
+                if not task.done():
+                    task.cancel()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92748c9 and 17f1b0f.

📒 Files selected for processing (5)
  • lib/bindings/python/examples/pipeline_openai/README.md
  • lib/bindings/python/examples/pipeline_openai/backend.py
  • lib/bindings/python/examples/pipeline_openai/frontend.py
  • lib/bindings/python/examples/pipeline_openai/load_test.py
  • lib/bindings/python/examples/pipeline_openai/memory_monitor.py
🧰 Additional context used
🧬 Code graph analysis (1)
lib/bindings/python/examples/pipeline_openai/frontend.py (5)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • HttpAsyncEngine (923-930)
  • HttpService (904-910)
  • DistributedRuntime (36-85)
lib/bindings/python/src/dynamo/llm/exceptions.py (1)
  • HttpError (13-35)
lib/bindings/python/src/dynamo/runtime/__init__.py (1)
  • dynamo_worker (24-51)
lib/bindings/python/examples/pipeline_openai/memory_monitor.py (4)
  • create_monitor (67-71)
  • setup_background_monitor (74-85)
  • log_memory (30-40)
  • increment_request (42-64)
lib/bindings/python/examples/pipeline_openai/backend.py (2)
  • generate (32-48)
  • worker (52-80)
🪛 GitHub Actions: Copyright Checks
lib/bindings/python/examples/pipeline_openai/memory_monitor.py

[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.

lib/bindings/python/examples/pipeline_openai/load_test.py

[error] 1-1: Copyright header missing or invalid. File lacks SPDX header as required.

lib/bindings/python/examples/pipeline_openai/backend.py

[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.

lib/bindings/python/examples/pipeline_openai/frontend.py

[error] 1-1: Incorrect date in header: 2025. Update header to reflect current year.

🪛 LanguageTool
lib/bindings/python/examples/pipeline_openai/README.md

[grammar] ~2-~2: Ensure spelling is correct
Context: ... allocations of dynamo itself in a high-thoughput for large payloads. files: ``` fronten...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
lib/bindings/python/examples/pipeline_openai/README.md

5-5: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


13-13: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


20-20: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
lib/bindings/python/examples/pipeline_openai/memory_monitor.py

45-45: Found useless expression. Either assign it to a variable or remove it.

(B018)

lib/bindings/python/examples/pipeline_openai/load_test.py

69-69: Do not catch blind exception: Exception

(BLE001)


129-129: Do not catch blind exception: Exception

(BLE001)


130-130: Use explicit conversion flag

Replace with conversion flag

(RUF010)


167-167: Do not catch blind exception: Exception

(BLE001)


243-243: Do not catch blind exception: Exception

(BLE001)

lib/bindings/python/examples/pipeline_openai/backend.py

41-41: Loop control variable chunk not used within loop body

Rename unused chunk to _chunk

(B007)

lib/bindings/python/examples/pipeline_openai/frontend.py

59-59: Do not catch blind exception: Exception

(BLE001)


66-66: Loop control variable output not used within loop body

Rename unused output to _output

(B007)


138-138: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
lib/bindings/python/examples/pipeline_openai/load_test.py (1)

31-40: Consider setting connection limits to prevent resource exhaustion.

limit=0 (unlimited) combined with limit_per_host=2000 could exhaust file descriptors or memory under extreme load. For a load testing tool this may be intentional, but consider adding a reasonable upper bound (e.g., limit=5000) to prevent runaway resource consumption during extended tests.

Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil michaelfeil changed the title memory leak repro fix: repro memory leak and deadlock under high payloads and high concurrency Jan 8, 2026
@github-actions github-actions bot added the fix label Jan 8, 2026
@nnshah1 nnshah1 self-requested a review January 8, 2026 15:08
nnshah1 (Contributor) commented Jan 8, 2026

@michaelfeil thanks for this! Will take a look, not only in terms of fixing it but also in adding something similar to our test suite.

nnshah1 (Contributor) commented Jan 8, 2026

@biswapanda - fyi - one thing different from 0.6 is the new TCP request plane

nnshah1 (Contributor) commented Jan 8, 2026

@michaelfeil - one question - do you see this behavior when not using the python bindings as well?

michaelfeil (Contributor Author) commented:

@nnshah1 I have never run it without the Python bindings - it is not feasible to migrate the codebase (15k) to Rust just for the sake of this memory leak. I was just wondering about the repro on dynamo 0.8 / this branch.

Specifically, I think this handler should not leave ~2 GB of unreclaimed RAM after it stops serving traffic.

```python
async def generate(self, request, context):
    stream = await self.proxy_client.random(request)
    i = 0
    async for chunk in stream:
        i += 1
        yield f"chunk{i}"
```

michaelfeil (Contributor Author) commented Jan 8, 2026

In theory, request_plane = os.environ.get("DYN_REQUEST_PLANE", "tcp"); when I set export DYN_REQUEST_PLANE=nats, all I see is Error contacting pipeline: NATS request failed: no responders: no responders when starting the repro.

Similar issues with the http endpoint:

2026-01-08T16:09:39.546095Z  INFO dynamo_runtime::pipeline::network::ingress::http_endpoint: Starting shared HTTP/2 endpoint server on 10.42.1.221:8888 at path /v1/rpc/:endpoint
2026-01-08T16:09:39.546163Z ERROR dynamo_runtime::pipeline::network::manager: HTTP request plane server error: Address already in use (os error 98)

nnshah1 (Contributor) commented Jan 8, 2026

@nnshah1 Never not used the python bindings - not possible to migrate the codebase (15k) to rust for the sake of the actual memory leak. Just wondering on the repro on dynamo 0.8 / this branch.

Specifically, i think this one should not have 2GB overhead of missing ram after stopping to serve traffic.

async def generate(self, request, context):
            stream = await self.proxy_client.random(request)
            i = 0
            async for chunk in stream:
                i += 1
                yield f"chunk{i}"

agreed - was just curious - we can test on our side to see if that makes a difference


@biswapanda biswapanda self-assigned this Jan 10, 2026