
Conversation


@galletas1712 galletas1712 commented Jan 8, 2026

GPU Memory Service provides out-of-process GPU memory management, enabling zero-copy sharing and data survival across process crashes.

This is one of a few PRs for the GPU Memory Service: it allows weights to be managed externally from the inference engine in a resilient manner. There are a few downstream benefits for the user:

  1. It allows users running DeepSeek-R1 on GB200s to run a "shadow engine": essentially another vLLM/SGLang process on the same set of GPUs that can take over immediately when the current inference engine fails (due to transient software/network failures).
  2. For autoscaling via checkpoint/restore, it allows weights to be sent independently of the rest of the inference engine state via ModelExpress, significantly speeding up cold starts.

Package structure:

  • lib/gpu_memory_service/common/ - Shared types and protocol
  • lib/gpu_memory_service/server/ - Allocation server daemon
  • lib/gpu_memory_service/client/ - Client library with PyTorch integration
  • components/src/dynamo/gpu_memory_service/ - Dynamo CLI wrapper

Key features:

  • CUDA VMM-based memory allocation with VA-stable sleep/wake
  • Connection-based RW/RO locking for multi-process coordination
  • PyTorch tensor utilities for materialization and registration
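For orientation, the sketch below shows the general CUDA VMM allocate/map pattern these features build on, written against the cuda-python driver bindings. It is illustrative only, not code from this PR: error checking on each CUresult, context setup, and the split between server (physical allocation, export) and client (VA reservation, mapping) are all simplified.

```python
from cuda import cuda  # cuda-python driver bindings

def vmm_alloc_and_map(device: int, size: int) -> int:
    """Sketch: allocate physical memory, reserve a stable VA, map, enable access.

    Assumes cuInit() has been called and a context is current; CUresult
    error checks are omitted for brevity.
    """
    prop = cuda.CUmemAllocationProp()
    prop.type = cuda.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
    prop.location.type = cuda.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
    prop.location.id = device
    # POSIX FD handles let the allocation be shared across processes.
    prop.requestedHandleTypes = (
        cuda.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR
    )

    # Round the requested size up to the allocation granularity.
    _, gran = cuda.cuMemGetAllocationGranularity(
        prop, cuda.CUmemAllocationGranularity_flags.CU_MEM_ALLOC_GRANULARITY_MINIMUM
    )
    aligned = (size + gran - 1) // gran * gran

    # Physical allocation, VA reservation, mapping.
    _, handle = cuda.cuMemCreate(aligned, prop, 0)
    _, va = cuda.cuMemAddressReserve(aligned, 0, 0, 0)
    cuda.cuMemMap(va, aligned, 0, handle, 0)

    # Grant the device read/write access to the mapped range.
    access = cuda.CUmemAccessDesc()
    access.location.type = cuda.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
    access.location.id = device
    access.flags = cuda.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE
    cuda.cuMemSetAccess(va, aligned, [access], 1)
    return int(va)
```

Because the physical handle can be exported as a POSIX FD (cuMemExportToShareableHandle) and re-imported in another process, mapped data can outlive the process that created it, which is what enables the crash-resilience and shadow-engine scenarios above.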

Summary by CodeRabbit

  • New Features

    • Added GPU Memory Service—a new memory management system supporting efficient GPU memory allocation and mapping via RPC communication.
    • Integrated PyTorch support enabling seamless tensor operations with the GPU Memory Service.
    • Added server and client components for distributed memory management.
  • Chores

    • Updated development container, Docker images, and CI configuration to support GPU Memory Service builds and deployment.


@galletas1712 galletas1712 requested review from a team as code owners January 8, 2026 19:01

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Jan 8, 2026

Walkthrough

This PR introduces a comprehensive GPU Memory Service (GMS) system featuring server-side memory allocation management with CUDA VMM support, an asynchronous RPC protocol with state machine-driven locking, and client libraries for PyTorch integration. It includes protocol definitions, CUDA utilities, metadata storage, and container build infrastructure.

Changes

• Development & Configuration (.devcontainer/post-create.sh, .dockerignore, .gitignore, README.md)
  Adjusts dev container setup to install gpu_memory_service before the primary package. Updates ignore patterns for GMS build artifacts. Unignores lib/gpu_memory_service/server/ for version control. Updates installation documentation.

• GMS Core Package (lib/gpu_memory_service/__init__.py, lib/gpu_memory_service/pyproject.toml, lib/gpu_memory_service/setup.py)
  Establishes main package exports (GMSClientMemoryManager, StaleMemoryLayoutError, lock/memory functions). Defines build configuration with C++ extension support for the PyTorch allocator shim.

• GMS Protocol & Types (lib/gpu_memory_service/common/*)
  Defines the wire protocol with length-prefixed framing and FD passing. Introduces RPC message types (enums, request/response structs) using msgspec. Establishes server state enums (EMPTY, RW, COMMITTED, RO) and state validation.

• GMS Client RPC & Memory Manager (lib/gpu_memory_service/client/rpc.py, lib/gpu_memory_service/client/memory_manager.py, lib/gpu_memory_service/client/cuda_vmm_utils.py)
  Implements GMSRPCClient for Unix-domain socket communication with lock acquisition. Provides GMSClientMemoryManager supporting RW/RO modes, VA reservation, memory mapping, sleep/wake cycles, and metadata operations. Wraps the CUDA driver API for VA/memory operations.

• GMS Client PyTorch Integration (lib/gpu_memory_service/client/torch/*)
  Adds PyTorch-specific functions (allocator lifecycle, module tensor registration/materialization). Implements the C++ CUDAPluggableAllocator shim via Python/C bindings. Provides tensor metadata serialization and GMS-backed tensor reconstruction.

• GMS Server Components (lib/gpu_memory_service/server/*)
  Implements GMSRPCServer with async request handling. Introduces GlobalLockFSM for explicit state transitions and operation permission checks. Provides GMSServerMemoryManager for CUDA physical allocation and export. Includes RequestHandler for RPC method dispatch and metadata management.

• Dynamo GMS Wrapper (components/src/dynamo/gpu_memory_service/__init__.py, components/src/dynamo/gpu_memory_service/__main__.py, components/src/dynamo/gpu_memory_service/args.py, components/src/dynamo/gpu_memory_service/server.py)
  Re-exports GMS client APIs for Dynamo compatibility. Adds a CLI entry point and argument parser for device/socket configuration. Implements background server bootstrap with signal handling and graceful shutdown.

• Container Build Integration (container/build.sh, container/Dockerfile*, .github/filters.yaml)
  Introduces an ENABLE_GPU_MEMORY_SERVICE flag for conditional wheel building and installation across multiple container targets (base, VLLM, SGLANG, TRT-LLM). Adds a GMS package build step and wheelhouse integration. Updates CI filtering.
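To make the locking model concrete, here is a hedged sketch of the state machine implied by the server state enums above (EMPTY, RW, COMMITTED, RO). The actual GlobalLockFSM in lib/gpu_memory_service/server/locking.py derives state from live connection objects and validates transitions; this table only illustrates the shape of the protocol.

```python
from enum import Enum

class ServerState(str, Enum):
    EMPTY = "empty"          # no committed data, no lock holders
    RW = "rw"                # a single writer holds the lock
    COMMITTED = "committed"  # writer committed; data is ready
    RO = "ro"                # one or more readers attached to committed data

# Illustrative transition table: event -> {from_state: to_state}.
TRANSITIONS = {
    "acquire_rw":         {ServerState.EMPTY: ServerState.RW},
    "commit":             {ServerState.RW: ServerState.COMMITTED},
    "rw_abort":           {ServerState.RW: ServerState.EMPTY},
    "acquire_ro":         {ServerState.COMMITTED: ServerState.RO,
                           ServerState.RO: ServerState.RO},  # extra readers
    "last_ro_disconnect": {ServerState.RO: ServerState.COMMITTED},
}
```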

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 A service for memories, swift and fair,
With CUDA maps and sockets we share,
RPC locks keep order through day,
State machines guide the GMS way—
Tensors now fly on reserved VA! 🚀

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

• Docstring Coverage: ⚠️ Warning. Docstring coverage is 65.83%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
• Description check: ❓ Inconclusive. The PR description provides comprehensive details about the GPU Memory Service feature, package structure, key features, and real-world use cases, but it does not follow the specified template structure with sections for Overview, Details, Where should the reviewer start, and Related Issues. Resolution: restructure the description to match the template, adding explicit 'Overview', 'Details', 'Where should the reviewer start', and 'Related Issues' sections for clarity and consistency with repository standards.

✅ Passed checks (1 passed)

• Title check: ✅ Passed. The title 'feat: GPU Memory Service' accurately summarizes the main change: adding a GPU Memory Service feature for out-of-process memory management. It is concise, clear, and directly reflects the primary purpose of the changeset.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @components/src/dynamo/gpu_memory_service/args.py:
- Around lines 54-58: The default socket path currently uses /tmp. Change the
socket_path initialization (the block that sets socket_path when it is None,
using args.device) to prefer XDG_RUNTIME_DIR when present: build socket_dir =
os.environ.get("XDG_RUNTIME_DIR") or "/tmp" (or better, os.path.join("/run/user",
str(os.getuid())) when XDG_RUNTIME_DIR is missing), then set socket_path =
os.path.join(socket_dir, f"gpu_memory_service_{args.device}.sock"). Keep the
existing placeholder expansion for explicit socket_path inputs
(socket_path.format(device=args.device)), ensure os is imported at the top, and
document in a comment that /tmp is a fallback for single-user development only.
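
A sketch of what that suggestion could look like in code (hypothetical; the real args.py block and names may differ):

```python
import os

def resolve_socket_path(device: int, socket_path: str | None = None) -> str:
    """Resolve the GMS socket path, preferring XDG_RUNTIME_DIR over /tmp."""
    if socket_path is not None:
        # Keep the existing placeholder expansion for explicit inputs.
        return socket_path.format(device=device)
    # /tmp is a fallback for single-user development only.
    socket_dir = os.environ.get("XDG_RUNTIME_DIR") or "/tmp"
    return os.path.join(socket_dir, f"gpu_memory_service_{device}.sock")
```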
🧹 Nitpick comments (20)
lib/gpu_memory_service/pyproject.toml (1)

37-41: Consider pinning pytest-asyncio version for test stability.

The pytest-asyncio package has had breaking changes between versions (especially around mode configuration). Consider pinning to a specific version range like pytest-asyncio>=0.23,<1.0 to ensure consistent test behavior.

lib/gpu_memory_service/setup.py (1)

78-87: Use raise ... from None to suppress exception chaining.

When re-raising a different exception within an except block, use from None to avoid confusing exception chains in the traceback.

♻️ Proposed fix
         # Check if PyTorch is available (needed for _tensor_from_pointer)
         try:
-            import torch  # noqa: F401
+            import torch
         except ImportError:
             if self.extensions:
-                raise RuntimeError(
+                raise RuntimeError(
                     "PyTorch not found. Cannot build CUDA extensions. "
                     "Install PyTorch first, or use --no-build-isolation with pip wheel."
-                )
+                ) from None
             return
components/src/dynamo/gpu_memory_service/server.py (2)

84-93: Consider using logging.exception for better diagnostics.

The error handler captures startup failures but loses the stack trace. Using logging.exception instead of logging.error will include the traceback.

♻️ Suggested improvement
         except Exception as e:
-            logger.error(f"GMSRPCServer error: {e}")
+            logger.exception(f"GMSRPCServer error: {e}")
             self._error = e
             self._started.set()  # Unblock waiter even on error

95-111: Fragile coupling to internal server state.

The stop() method directly manipulates private attributes (_running, _shutdown, _condition) of GMSRPCServer. This creates tight coupling that can break if the server's internals change.

Consider adding a proper stop() or shutdown() method to GMSRPCServer that encapsulates this logic, then call it here. If that's not feasible in this PR, document this coupling with a comment explaining the dependency.

     def stop(self) -> None:
         """Stop the allocation server."""
         if self._server is not None:
             logger.info(f"Stopping GMSRPCServer on device {self.device}")
-            # Signal the server to stop - the loop in _run_server will exit
+            # NOTE: Directly accessing server internals; coordinate with GMSRPCServer changes.
+            # TODO: Replace with self._server.stop() when available.
             self._server._running = False
             self._server._shutdown = True
lib/gpu_memory_service/server/memory_manager.py (2)

207-224: Consider propagating cuMemRelease failures for free().

The free() method logs a warning if cuMemRelease fails but still returns True and removes the allocation from tracking. This could mask resource leaks if the CUDA release genuinely fails.

♻️ Optional: Propagate cuMemRelease failure
     def free(self, allocation_id: str) -> bool:
         """Release physical memory for a single allocation.
 
         Args:
             allocation_id: ID of allocation to free
 
         Returns:
-            True if allocation existed and was freed, False otherwise
+            True if allocation existed and was freed successfully, False otherwise
         """
         info = self._allocations.pop(allocation_id, None)
         if info is None:
             return False
 
         (result,) = cuda.cuMemRelease(info.handle)
         if result != cuda.CUresult.CUDA_SUCCESS:
             logger.warning(f"cuMemRelease failed for {allocation_id}: {result}")
+            # Re-add to tracking since release failed
+            self._allocations[allocation_id] = info
+            return False
         logger.debug(f"Freed allocation: {allocation_id}")
         return True

226-244: Same concern applies to clear_all() - failures are silently continued.

When cuMemRelease fails in clear_all(), the allocation tracking is still cleared. This is acceptable for cleanup scenarios (RW abort, shutdown) where we want best-effort cleanup, but consider returning a tuple of (cleared_count, failed_count) for observability.
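
One possible shape for that, sketched as a drop-in variant of the method (names taken from the surrounding code; a sketch, not a definitive implementation):

```python
def clear_all(self) -> tuple[int, int]:
    """Best-effort release of all allocations (sketch).

    Returns:
        (cleared_count, failed_count) so callers can log partial failures.
    """
    cleared, failed = 0, 0
    for allocation_id, info in list(self._allocations.items()):
        (result,) = cuda.cuMemRelease(info.handle)
        if result != cuda.CUresult.CUDA_SUCCESS:
            logger.warning(f"cuMemRelease failed for {allocation_id}: {result}")
            failed += 1
        else:
            cleared += 1
    self._allocations.clear()
    return cleared, failed
```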

lib/gpu_memory_service/client/memory_manager.py (3)

475-503: Consider moving the json import to module level.

The json import inside _snapshot_metadata_specs is executed on every call. While not a performance issue for occasional sleep operations, moving it to module level is more conventional.

♻️ Move import to module level

Add at the top of the file with other imports:

import json

Then remove the inline import at line 486.


650-655: __exit__ should return None rather than an explicit False.

The return type annotation is implicit but returning False explicitly is slightly unusual. Returning None (or just omitting the return) is more idiomatic for context managers that don't suppress exceptions.

♻️ Idiomatic __exit__ pattern
     def __exit__(self, exc_type, exc_val, exc_tb) -> None:
         self.close()
-        return False

789-810: Prefix unused dev_ptr variable with underscore.

The static analysis correctly identifies that dev_ptr is never used. This is intentional since the validation is checking if the call succeeds, not the returned value.

♻️ Fix unused variable warning
-        result, dev_ptr = cuda.cuPointerGetAttribute(
+        result, _dev_ptr = cuda.cuPointerGetAttribute(
             cuda.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_POINTER, va
         )
lib/gpu_memory_service/server/handler.py (1)

150-161: Use exception chaining for better debugging context.

When re-raising AllocationNotFoundError as ValueError, preserving the original exception chain aids debugging.

♻️ Add exception chaining
     def handle_get_allocation(self, req: GetAllocationRequest) -> GetAllocationResponse:
         """Get allocation info."""
         try:
             info = self._memory_manager.get_allocation(req.allocation_id)
             return GetAllocationResponse(
                 allocation_id=info.allocation_id,
                 size=info.size,
                 aligned_size=info.aligned_size,
                 tag=info.tag,
             )
         except AllocationNotFoundError:
-            raise ValueError(f"Unknown allocation: {req.allocation_id}")
+            raise ValueError(f"Unknown allocation: {req.allocation_id}") from None
lib/gpu_memory_service/client/torch/lifecycle.py (1)

192-227: free_cb accesses private attributes and silently swallows exceptions.

The free_cb callback directly accesses manager._mappings and manager._allocation_id_to_va (private attributes), and swallows all exceptions during cleanup. While this works, consider:

  1. Adding a public method to GMSClientMemoryManager for freeing by VA
  2. At minimum, logging the exception instead of completely ignoring it
♻️ Log exceptions in free_cb
         try:
             manager._client.free(mapping.allocation_id)
-        except Exception:
-            pass  # Ignore errors during cleanup
+        except Exception as e:
+            logger.debug(f"[GPU Memory Service] Error during free cleanup: {e}")

Also, the stream parameter is unused because the callback signature is dictated by PyTorch's CUDAPluggableAllocator interface - this is expected and not a bug.
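
For the first point, a hypothetical public method on GMSClientMemoryManager could encapsulate the lookup and bookkeeping (a sketch mirroring the private attributes referenced above, not the PR's API):

```python
def free_by_va(self, va: int) -> bool:
    """Free the allocation mapped at the given base VA, if any (sketch)."""
    mapping = self._mappings.pop(va, None)
    if mapping is None:
        return False
    self._allocation_id_to_va.pop(mapping.allocation_id, None)
    self._client.free(mapping.allocation_id)
    return True
```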

lib/gpu_memory_service/server/rpc.py (2)

110-130: Use logger.exception for error logging to capture stack traces.

When logging exceptions, logger.exception automatically includes the traceback, which is valuable for debugging production issues.

♻️ Use logger.exception for better debugging
         except ConnectionResetError:
             logger.debug(f"Connection reset: {session_id}")
         except asyncio.CancelledError:
             raise
         except Exception as e:
-            logger.error(f"Connection error: {session_id}, {e}")
+            logger.exception(f"Connection error: {session_id}")

Apply similar changes to lines 141-142, 300-301, and 317-318.


263-284: Remove unused session_id parameter or use it for logging.

The session_id parameter in _cleanup_connection is never used. Either remove it or add it to log messages for better traceability.

♻️ Use session_id in cleanup logging
     async def _cleanup_connection(
         self, conn: Optional[Connection], session_id: str
     ) -> None:
         """Clean up after connection closes via state machine transition."""
         if conn is None:
             return
+        logger.debug(f"Cleaning up connection: {session_id}")
lib/gpu_memory_service/client/rpc.py (2)

118-163: Use exception chaining in _connect for better error context.

When wrapping exceptions during connection, preserve the original exception chain for debugging.

♻️ Add exception chaining
         try:
             self._socket.connect(self.socket_path)
         except FileNotFoundError:
-            raise ConnectionError(f"Server not running at {self.socket_path}")
+            raise ConnectionError(f"Server not running at {self.socket_path}") from None
         except Exception as e:
-            raise ConnectionError(f"Failed to connect: {e}")
+            raise ConnectionError(f"Failed to connect: {e}") from e

282-290: Prefix unused response variable with underscore.

The export() method doesn't use the response object, only the fd. This should be prefixed to satisfy linters and clarify intent.

♻️ Fix unused variable
     def export(self, allocation_id: str) -> int:
         """Export allocation as POSIX FD.

         Caller is responsible for closing the FD when done.
         """
-        response, fd = self._send_recv(ExportRequest(allocation_id=allocation_id))
+        _response, fd = self._send_recv(ExportRequest(allocation_id=allocation_id))
         if fd < 0:
             raise RuntimeError("No FD received from server")
         return fd
lib/gpu_memory_service/server/locking.py (1)

129-133: Inconsistent frozenset syntax.

Line 130 uses frozenset[ServerState]({ServerState.RW}) with an explicit type parameter, while other transitions (lines 123, 137, 144, etc.) use frozenset({...}) without the type parameter. This inconsistency doesn't affect runtime but reduces readability.

Suggested fix
-    from_states=frozenset[ServerState]({ServerState.RW}),
+    from_states=frozenset({ServerState.RW}),
lib/gpu_memory_service/client/torch/tensor.py (2)

67-101: Consider adding strict=True to zip for defensive validation.

The function correctly validates that len(shape) == len(stride) before the zip. Adding strict=True to the zip call would provide defense-in-depth against future refactoring that might remove or bypass the length check.

Suggested fix
-    for d, st in zip(shape, stride):
+    for d, st in zip(shape, stride, strict=True):

562-577: Accessing private _mappings attribute breaks encapsulation.

The code accesses manager._mappings directly to find which allocation contains a tensor's pointer. This couples the function to the internal implementation of GMSClientMemoryManager. Consider adding a public method to the manager like find_allocation_for_pointer(ptr: int) -> Optional[Tuple[int, str]] to encapsulate this lookup.

# Suggested addition to GMSClientMemoryManager:
def find_allocation_for_pointer(self, ptr: int) -> Optional[Tuple[int, str]]:
    """Find the allocation containing the given pointer.
    
    Returns:
        Tuple of (base_va, allocation_id) or None if not found.
    """
    for va, mapping in self._mappings.items():
        if va <= ptr < va + mapping.aligned_size:
            return (va, mapping.allocation_id)
    return None
lib/gpu_memory_service/common/protocol.py (2)

480-482: Fix implicit Optional type annotation.

The recv_buffer parameter defaults to None but the type hint is bytearray. Use explicit Optional type.

Suggested fix
 async def recv_message(
-    reader, recv_buffer: bytearray = None, raw_sock=None
+    reader, recv_buffer: Optional[bytearray] = None, raw_sock=None
 ) -> Tuple[Any, int, bytearray]:

590-592: Same implicit Optional issue as async version.

Suggested fix
 def recv_message_sync(
-    sock, recv_buffer: bytearray = None
+    sock, recv_buffer: Optional[bytearray] = None
 ) -> Tuple[Any, int, bytearray]:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfb95df and e4ecf18.

📒 Files selected for processing (29)
  • .devcontainer/post-create.sh
  • .dockerignore
  • .gitignore
  • components/src/dynamo/gpu_memory_service/__init__.py
  • components/src/dynamo/gpu_memory_service/__main__.py
  • components/src/dynamo/gpu_memory_service/args.py
  • components/src/dynamo/gpu_memory_service/server.py
  • lib/gpu_memory_service/__init__.py
  • lib/gpu_memory_service/client/__init__.py
  • lib/gpu_memory_service/client/memory_manager.py
  • lib/gpu_memory_service/client/rpc.py
  • lib/gpu_memory_service/client/torch/__init__.py
  • lib/gpu_memory_service/client/torch/extensions/__init__.py
  • lib/gpu_memory_service/client/torch/extensions/allocator.cpp
  • lib/gpu_memory_service/client/torch/extensions/tensor_from_pointer.cpp
  • lib/gpu_memory_service/client/torch/lifecycle.py
  • lib/gpu_memory_service/client/torch/tensor.py
  • lib/gpu_memory_service/common/__init__.py
  • lib/gpu_memory_service/common/cuda_vmm_utils.py
  • lib/gpu_memory_service/common/protocol.py
  • lib/gpu_memory_service/common/types.py
  • lib/gpu_memory_service/pyproject.toml
  • lib/gpu_memory_service/server/__init__.py
  • lib/gpu_memory_service/server/handler.py
  • lib/gpu_memory_service/server/locking.py
  • lib/gpu_memory_service/server/memory_manager.py
  • lib/gpu_memory_service/server/metadata_store.py
  • lib/gpu_memory_service/server/rpc.py
  • lib/gpu_memory_service/setup.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T02:01:01.238Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2453
File: deploy/dynamo_check.py:739-741
Timestamp: 2025-08-15T02:01:01.238Z
Learning: In the dynamo project, missing cargo should be treated as a fatal error in dynamo_check.py because developers need cargo to build the Rust components. The tool is designed to validate the complete development environment, not just import functionality.

Applied to files:

  • .devcontainer/post-create.sh
🧬 Code graph analysis (17)
.devcontainer/post-create.sh (1)
lib/runtime/src/compute/pool.rs (1)
  • install (271-284)
lib/gpu_memory_service/server/metadata_store.py (2)
lib/runtime/src/storage/kv.rs (1)
  • value (99-101)
lib/tokens/src/radix.rs (1)
  • prefix (23-29)
lib/gpu_memory_service/server/handler.py (5)
lib/gpu_memory_service/common/protocol.py (19)
  • AllocateRequest (175-183)
  • AllocateResponse (187-192)
  • ClearAllResponse (269-272)
  • FreeRequest (245-248)
  • FreeResponse (252-255)
  • GetAllocationRequest (210-213)
  • GetAllocationResponse (217-223)
  • GetAllocationStateResponse (164-168)
  • GetLockStateResponse (145-153)
  • ListAllocationsRequest (227-233)
  • ListAllocationsResponse (237-241)
  • MetadataDeleteRequest (317-320)
  • MetadataDeleteResponse (324-325)
  • MetadataGetRequest (302-305)
  • MetadataGetResponse (309-313)
  • MetadataListRequest (329-332)
  • MetadataListResponse (336-337)
  • MetadataPutRequest (287-293)
  • MetadataPutResponse (297-298)
lib/gpu_memory_service/common/types.py (2)
  • derive_state (150-158)
  • is_ready (145-147)
lib/gpu_memory_service/server/memory_manager.py (12)
  • AllocationNotFoundError (54-57)
  • GMSServerMemoryManager (60-274)
  • device (105-107)
  • granularity (110-112)
  • clear_all (226-244)
  • allocation_count (115-117)
  • total_bytes (120-122)
  • allocate (128-173)
  • export_fd (175-205)
  • get_allocation (246-261)
  • list_allocations (263-274)
  • free (207-224)
lib/gpu_memory_service/server/metadata_store.py (6)
  • GMSMetadataStore (31-67)
  • clear (64-67)
  • put (44-50)
  • get (52-53)
  • delete (55-57)
  • list_keys (59-62)
lib/gpu_memory_service/server/rpc.py (3)
  • granularity (97-98)
  • state (92-94)
  • is_ready (100-102)
lib/gpu_memory_service/server/rpc.py (3)
lib/gpu_memory_service/common/protocol.py (15)
  • AllocateRequest (175-183)
  • ClearAllRequest (259-265)
  • CommitRequest (117-124)
  • CommitResponse (128-131)
  • ExportRequest (196-203)
  • FreeRequest (245-248)
  • GetAllocationRequest (210-213)
  • HandshakeRequest (84-96)
  • HandshakeResponse (100-110)
  • MetadataDeleteRequest (317-320)
  • MetadataGetRequest (302-305)
  • MetadataListRequest (329-332)
  • MetadataPutRequest (287-293)
  • recv_message (480-558)
  • send_message (423-477)
lib/gpu_memory_service/common/types.py (4)
  • ConnectionMode (37-42)
  • Operation (55-82)
  • ServerState (14-34)
  • StateEvent (45-52)
lib/gpu_memory_service/server/locking.py (12)
  • Connection (50-81)
  • GlobalLockFSM (177-391)
  • state (208-216)
  • committed (234-236)
  • rw_conn (219-221)
  • close (75-81)
  • transition (297-341)
  • can_acquire_rw (373-383)
  • can_acquire_ro (385-391)
  • ro_conns (224-226)
  • check_operation (345-369)
  • ro_count (229-231)
components/src/dynamo/gpu_memory_service/__init__.py (3)
lib/gpu_memory_service/client/memory_manager.py (2)
  • GMSClientMemoryManager (80-836)
  • StaleWeightsError (40-52)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
  • get_allocator (235-241)
  • get_or_create_allocator (49-164)
lib/gpu_memory_service/client/torch/tensor.py (2)
  • materialize_module_from_gms (645-866)
  • register_module_tensors (506-595)
lib/gpu_memory_service/client/memory_manager.py (4)
lib/gpu_memory_service/common/cuda_vmm_utils.py (2)
  • check_cuda_result (13-29)
  • get_allocation_granularity (43-64)
lib/gpu_memory_service/server/memory_manager.py (7)
  • device (105-107)
  • granularity (110-112)
  • list_allocations (263-274)
  • allocate (128-173)
  • get_allocation (246-261)
  • clear_all (226-244)
  • total_bytes (120-122)
lib/gpu_memory_service/server/handler.py (1)
  • granularity (64-66)
lib/gpu_memory_service/server/rpc.py (1)
  • granularity (97-98)
components/src/dynamo/gpu_memory_service/__main__.py (1)
components/src/dynamo/gpu_memory_service/server.py (1)
  • main (162-165)
lib/gpu_memory_service/server/__init__.py (6)
lib/gpu_memory_service/common/types.py (3)
  • ConnectionMode (37-42)
  • ServerState (14-34)
  • StateSnapshot (135-147)
lib/gpu_memory_service/server/handler.py (2)
  • RequestHandler (41-235)
  • metadata_store (69-71)
lib/gpu_memory_service/server/locking.py (2)
  • Connection (50-81)
  • GlobalLockFSM (177-391)
lib/gpu_memory_service/server/memory_manager.py (3)
  • AllocationInfo (34-51)
  • AllocationNotFoundError (54-57)
  • GMSServerMemoryManager (60-274)
lib/gpu_memory_service/server/metadata_store.py (1)
  • GMSMetadataStore (31-67)
lib/gpu_memory_service/server/rpc.py (1)
  • GMSRPCServer (57-447)
lib/gpu_memory_service/client/torch/extensions/allocator.cpp (4)
lib/gpu_memory_service/server/memory_manager.py (2)
  • device (105-107)
  • granularity (110-112)
lib/gpu_memory_service/server/handler.py (1)
  • granularity (64-66)
lib/gpu_memory_service/server/rpc.py (1)
  • granularity (97-98)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
  • malloc_cb (192-207)
  • free_cb (209-227)
lib/gpu_memory_service/common/cuda_vmm_utils.py (3)
lib/gpu_memory_service/server/memory_manager.py (2)
  • device (105-107)
  • granularity (110-112)
lib/gpu_memory_service/server/handler.py (1)
  • granularity (64-66)
lib/gpu_memory_service/server/rpc.py (1)
  • granularity (97-98)
lib/gpu_memory_service/client/torch/__init__.py (2)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
  • get_allocator (235-241)
  • get_or_create_allocator (49-164)
lib/gpu_memory_service/client/torch/tensor.py (2)
  • materialize_module_from_gms (645-866)
  • register_module_tensors (506-595)
lib/gpu_memory_service/client/rpc.py (1)
lib/gpu_memory_service/common/protocol.py (21)
  • AllocateRequest (175-183)
  • ClearAllRequest (259-265)
  • CommitRequest (117-124)
  • CommitResponse (128-131)
  • ExportRequest (196-203)
  • FreeRequest (245-248)
  • FreeResponse (252-255)
  • GetAllocationRequest (210-213)
  • GetAllocationStateRequest (157-160)
  • GetAllocationStateResponse (164-168)
  • GetLockStateRequest (138-141)
  • GetLockStateResponse (145-153)
  • HandshakeRequest (84-96)
  • HandshakeResponse (100-110)
  • ListAllocationsRequest (227-233)
  • MetadataDeleteRequest (317-320)
  • MetadataGetRequest (302-305)
  • MetadataListRequest (329-332)
  • MetadataPutRequest (287-293)
  • recv_message_sync (590-653)
  • send_message_sync (564-587)
lib/gpu_memory_service/__init__.py (2)
lib/gpu_memory_service/client/memory_manager.py (2)
  • GMSClientMemoryManager (80-836)
  • StaleWeightsError (40-52)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
  • get_allocator (235-241)
  • get_or_create_allocator (49-164)
lib/gpu_memory_service/client/__init__.py (2)
lib/gpu_memory_service/client/memory_manager.py (2)
  • GMSClientMemoryManager (80-836)
  • StaleWeightsError (40-52)
lib/gpu_memory_service/client/rpc.py (1)
  • GMSRPCClient (71-405)
lib/gpu_memory_service/client/torch/lifecycle.py (1)
lib/gpu_memory_service/client/memory_manager.py (3)
  • GMSClientMemoryManager (80-836)
  • mode (146-148)
  • allocate_to_va (213-299)
lib/gpu_memory_service/common/protocol.py (4)
lib/gpu_memory_service/server/locking.py (2)
  • committed (234-236)
  • state (208-216)
lib/gpu_memory_service/server/rpc.py (2)
  • state (92-94)
  • is_ready (100-102)
lib/gpu_memory_service/common/types.py (1)
  • is_ready (145-147)
lib/gpu_memory_service/server/memory_manager.py (2)
  • allocation_count (115-117)
  • total_bytes (120-122)
lib/gpu_memory_service/common/types.py (3)
lib/gpu_memory_service/server/locking.py (3)
  • state (208-216)
  • ro_count (229-231)
  • committed (234-236)
lib/gpu_memory_service/server/rpc.py (2)
  • state (92-94)
  • is_ready (100-102)
lib/gpu_memory_service/client/rpc.py (2)
  • committed (165-167)
  • is_ready (222-224)
🪛 Ruff (0.14.10)
lib/gpu_memory_service/client/torch/extensions/__init__.py

15-15: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


16-16: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


21-21: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


24-24: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

lib/gpu_memory_service/server/handler.py

161-161: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


161-161: Avoid specifying long messages outside the exception class

(TRY003)

lib/gpu_memory_service/common/__init__.py

51-94: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

lib/gpu_memory_service/server/rpc.py

126-126: Do not catch blind exception: Exception

(BLE001)


127-127: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


141-141: Do not catch blind exception: Exception

(BLE001)


142-142: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


210-210: Consider moving this statement to an else block

(TRY300)


226-226: Consider moving this statement to an else block

(TRY300)


257-257: Consider moving this statement to an else block

(TRY300)


264-264: Unused method argument: session_id

(ARG002)


300-300: Do not catch blind exception: Exception

(BLE001)


301-301: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


317-317: Do not catch blind exception: Exception

(BLE001)


318-318: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


388-388: Avoid specifying long messages outside the exception class

(TRY003)

components/src/dynamo/gpu_memory_service/__init__.py

32-45: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

lib/gpu_memory_service/client/memory_manager.py

130-132: Avoid specifying long messages outside the exception class

(TRY003)


245-245: Avoid specifying long messages outside the exception class

(TRY003)


254-257: Avoid specifying long messages outside the exception class

(TRY003)


262-264: Avoid specifying long messages outside the exception class

(TRY003)


362-362: Avoid specifying long messages outside the exception class

(TRY003)


383-383: Avoid specifying long messages outside the exception class

(TRY003)


418-418: Avoid specifying long messages outside the exception class

(TRY003)


420-422: Avoid specifying long messages outside the exception class

(TRY003)


426-428: Avoid specifying long messages outside the exception class

(TRY003)


446-446: Avoid specifying long messages outside the exception class

(TRY003)


451-451: Avoid specifying long messages outside the exception class

(TRY003)


502-502: Do not catch blind exception: Exception

(BLE001)


522-522: Avoid specifying long messages outside the exception class

(TRY003)


553-553: Do not catch blind exception: Exception

(BLE001)


578-578: Avoid specifying long messages outside the exception class

(TRY003)


584-587: Avoid specifying long messages outside the exception class

(TRY003)


591-594: Avoid specifying long messages outside the exception class

(TRY003)


608-611: Abstract raise to an inner function

(TRY301)


608-611: Avoid specifying long messages outside the exception class

(TRY003)


613-616: Abstract raise to an inner function

(TRY301)


613-616: Avoid specifying long messages outside the exception class

(TRY003)


618-621: Abstract raise to an inner function

(TRY301)


618-621: Avoid specifying long messages outside the exception class

(TRY003)


624-624: Do not catch blind exception: Exception

(BLE001)


662-662: Avoid specifying long messages outside the exception class

(TRY003)


663-663: Avoid specifying long messages outside the exception class

(TRY003)


716-716: Do not catch blind exception: Exception

(BLE001)


742-742: Avoid specifying long messages outside the exception class

(TRY003)


746-746: Avoid specifying long messages outside the exception class

(TRY003)


757-759: Avoid specifying long messages outside the exception class

(TRY003)


763-766: Avoid specifying long messages outside the exception class

(TRY003)


793-793: Unpacked variable dev_ptr is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


833-833: Do not catch blind exception: Exception

(BLE001)

lib/gpu_memory_service/server/__init__.py

29-41: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

components/src/dynamo/gpu_memory_service/server.py

62-62: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Do not catch blind exception: Exception

(BLE001)


88-88: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

lib/gpu_memory_service/setup.py

71-75: Avoid specifying long messages outside the exception class

(TRY003)


80-80: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


83-86: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


83-86: Avoid specifying long messages outside the exception class

(TRY003)


133-133: Consider [*torch_include, cuda_include] instead of concatenation

Replace with [*torch_include, cuda_include]

(RUF005)

lib/gpu_memory_service/common/cuda_vmm_utils.py

29-29: Avoid specifying long messages outside the exception class

(TRY003)

lib/gpu_memory_service/client/torch/__init__.py

22-29: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

lib/gpu_memory_service/client/rpc.py

124-124: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


124-124: Avoid specifying long messages outside the exception class

(TRY003)


125-125: Do not catch blind exception: Exception

(BLE001)


126-126: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


126-126: Avoid specifying long messages outside the exception class

(TRY003)


142-142: Avoid specifying long messages outside the exception class

(TRY003)


147-147: Avoid specifying long messages outside the exception class

(TRY003)


152-152: Avoid specifying long messages outside the exception class

(TRY003)


194-194: Avoid specifying long messages outside the exception class

(TRY003)


202-202: Prefer TypeError exception for invalid type

(TRY004)


202-202: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Prefer TypeError exception for invalid type

(TRY004)


212-212: Avoid specifying long messages outside the exception class

(TRY003)


219-219: Prefer TypeError exception for invalid type

(TRY004)


219-219: Avoid specifying long messages outside the exception class

(TRY003)


237-237: Avoid specifying long messages outside the exception class

(TRY003)


271-271: Avoid specifying long messages outside the exception class

(TRY003)


275-275: Prefer TypeError exception for invalid type

(TRY004)


275-275: Avoid specifying long messages outside the exception class

(TRY003)


287-287: Unpacked variable response is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


289-289: Avoid specifying long messages outside the exception class

(TRY003)


296-296: Prefer TypeError exception for invalid type

(TRY004)


296-296: Avoid specifying long messages outside the exception class

(TRY003)


309-309: Prefer TypeError exception for invalid type

(TRY004)


309-309: Avoid specifying long messages outside the exception class

(TRY003)


315-315: Avoid specifying long messages outside the exception class

(TRY003)


325-325: Avoid specifying long messages outside the exception class

(TRY003)


329-329: Prefer TypeError exception for invalid type

(TRY004)


329-329: Avoid specifying long messages outside the exception class

(TRY003)


341-341: Avoid specifying long messages outside the exception class

(TRY003)


351-351: Prefer TypeError exception for invalid type

(TRY004)


351-351: Avoid specifying long messages outside the exception class

(TRY003)


361-361: Prefer TypeError exception for invalid type

(TRY004)


361-361: Avoid specifying long messages outside the exception class

(TRY003)


369-369: Avoid specifying long messages outside the exception class

(TRY003)


372-372: Prefer TypeError exception for invalid type

(TRY004)


372-372: Avoid specifying long messages outside the exception class

(TRY003)


379-379: Prefer TypeError exception for invalid type

(TRY004)


379-379: Avoid specifying long messages outside the exception class

(TRY003)


389-390: try-except-pass detected, consider logging the exception

(S110)


389-389: Do not catch blind exception: Exception

(BLE001)

lib/gpu_memory_service/server/locking.py

80-81: try-except-pass detected, consider logging the exception

(S110)


80-80: Do not catch blind exception: Exception

(BLE001)


261-261: Avoid specifying long messages outside the exception class

(TRY003)


316-319: Avoid specifying long messages outside the exception class

(TRY003)


327-330: Avoid specifying long messages outside the exception class

(TRY003)


360-362: Avoid specifying long messages outside the exception class

(TRY003)


366-369: Avoid specifying long messages outside the exception class

(TRY003)

lib/gpu_memory_service/server/memory_manager.py

28-28: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)


197-197: Avoid specifying long messages outside the exception class

(TRY003)


260-260: Avoid specifying long messages outside the exception class

(TRY003)

lib/gpu_memory_service/__init__.py

38-45: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

components/src/dynamo/gpu_memory_service/args.py

55-55: Probable insecure usage of temporary file or directory: "/tmp/gpu_memory_service_"

(S108)

lib/gpu_memory_service/client/__init__.py

21-25: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

lib/gpu_memory_service/client/torch/lifecycle.py

109-112: Avoid specifying long messages outside the exception class

(TRY003)


121-124: Avoid specifying long messages outside the exception class

(TRY003)


193-193: Unused function argument: stream

(ARG001)


226-227: try-except-pass detected, consider logging the exception

(S110)


226-226: Do not catch blind exception: Exception

(BLE001)

lib/gpu_memory_service/common/protocol.py

387-387: Avoid specifying long messages outside the exception class

(TRY003)


405-405: Avoid specifying long messages outside the exception class

(TRY003)


410-410: Avoid specifying long messages outside the exception class

(TRY003)


448-448: Avoid specifying long messages outside the exception class

(TRY003)


481-481: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


523-523: Unpacked variable flags is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


537-537: Avoid specifying long messages outside the exception class

(TRY003)


542-542: Avoid specifying long messages outside the exception class

(TRY003)


551-551: Avoid specifying long messages outside the exception class

(TRY003)


591-591: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


623-623: Unpacked variable flags is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


637-637: Avoid specifying long messages outside the exception class

(TRY003)


646-646: Avoid specifying long messages outside the exception class

(TRY003)

lib/gpu_memory_service/client/torch/tensor.py

90-90: Avoid specifying long messages outside the exception class

(TRY003)


93-93: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


160-160: Avoid specifying long messages outside the exception class

(TRY003)


162-162: Prefer TypeError exception for invalid type

(TRY004)


162-162: Avoid specifying long messages outside the exception class

(TRY003)


257-257: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


330-331: try-except-continue detected, consider logging the exception

(S112)


330-330: Do not catch blind exception: Exception

(BLE001)


408-409: try-except-pass detected, consider logging the exception

(S110)


408-408: Do not catch blind exception: Exception

(BLE001)


411-413: Avoid specifying long messages outside the exception class

(TRY003)


452-452: Avoid specifying long messages outside the exception class

(TRY003)


457-457: Avoid specifying long messages outside the exception class

(TRY003)


590-590: Do not catch blind exception: Exception

(BLE001)


624-627: Avoid specifying long messages outside the exception class

(TRY003)


734-738: Avoid specifying long messages outside the exception class

(TRY003)


758-762: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (54)
lib/gpu_memory_service/common/cuda_vmm_utils.py (3)

32-40: LGTM!

The CUDA initialization and error handling utilities are well-structured. The redundant if check on line 39 before calling check_cuda_result is harmless since check_cuda_result already handles the success case by returning early.


43-64: LGTM!

The granularity retrieval correctly configures the allocation properties for POSIX FD sharing and uses the minimum granularity flag. This aligns with the C++ implementation in allocator.cpp.


67-77: LGTM!

Standard ceiling-division alignment formula. Correct implementation.

lib/gpu_memory_service/setup.py (1)

97-146: LGTM!

The extension module creation logic correctly handles optional PyTorch availability for _tensor_from_pointer while ensuring _allocator_ext is always built. The C++17, O3 optimization, and fPIC flags are appropriate for CUDA extensions.

lib/gpu_memory_service/client/torch/extensions/allocator.cpp (5)

106-172: LGTM - Robust error handling in my_malloc.

The function correctly:

  1. Reserves VA before registering the allocation
  2. Cleans up VA reservation if callback is not set
  3. On callback failure, cleans up any handle/mapping that may have been created, then frees the VA
  4. Properly acquires/releases GIL around Python callback

222-247: LGTM - Correct reference counting in py_init_module.

The INCREF/DECREF ordering is correct: increment new references before decrementing old ones to avoid premature deallocation if the same object is passed.


251-322: LGTM - Proper cleanup on failure in py_import_and_map.

The function correctly releases resources in reverse order on error:

  • On cuMemMap failure: releases handle
  • On cuMemSetAccess failure: unmaps, then releases handle

326-355: Memory management in py_get_all_allocations is correct.

The function properly handles PyList_Append semantics by decrementing the item reference after append (since PyList_Append increments it), and cleans up properly on error paths.


58-62: Thread safety advisory: g_allocations is unprotected.

The global g_allocations map is accessed without synchronization in my_malloc() (before acquiring the GIL) and my_free() (without acquiring the GIL). While PyTorch's CUDAPluggableAllocator with a single MemPool per process typically ensures serialized allocator calls, and Python callback functions execute with the GIL held, the C++ allocator entry/exit functions (my_malloc/my_free) themselves do not hold the GIL during critical accesses. If this extension is ever used in a multi-threaded C++ context where multiple threads call the allocator directly, consider adding a mutex to protect g_allocations.

lib/gpu_memory_service/common/types.py (3)

14-35: LGTM - Well-documented state machine.

The ServerState enum with the ASCII state diagram in the docstring clearly documents the valid transitions. Using (str, Enum) enables clean serialization.


85-131: LGTM - Clear permission model.

The separation of RW_REQUIRED_OPS and STATE_ALLOWED_OPS provides a clear two-level permission check: first verify the operation is allowed in the current state, then verify the connection has the required mode.


134-158: LGTM - Consistent is_ready semantics.

The StateSnapshot.is_ready property correctly matches the server-side implementation in lib/gpu_memory_service/server/rpc.py:99-101: ready means committed with no active writer, while readers may still be connected.
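
Sketched, the shared semantics are roughly the following (attribute names echo the code-graph listing; a sketch, not the exact code):

```python
@property
def is_ready(self) -> bool:
    # Ready means a writer has committed and no writer is currently active;
    # attached readers do not affect readiness.
    return self.committed and self.state != ServerState.RW
```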

.gitignore (1)

59-60: LGTM - Correct negation pattern.

The !lib/gpu_memory_service/server/ negation ensures the GPU Memory Service server directory is tracked despite the broader server/ ignore rule.

.dockerignore (1)

46-49: LGTM - Appropriate Docker context exclusions.

Excluding build/, *.egg-info/, and compiled *.so files prevents stale build artifacts from inflating the Docker context and ensures fresh builds within containers.

components/src/dynamo/gpu_memory_service/__main__.py (1)

1-7: LGTM! Clean entry point implementation.

The module correctly imports and invokes the server's main function, following standard Python CLI patterns.

.devcontainer/post-create.sh (1)

69-78: LGTM! Proper installation sequence for CUDA extensions.

The installation correctly:

  • Uninstalls any existing package to avoid conflicts
  • Uses dynamic site-packages detection for portability
  • Applies --no-build-isolation so the build can access PyTorch from the environment (required for CUDA extension compilation)
lib/gpu_memory_service/client/__init__.py (1)

15-18: The import path is correct. StaleWeightsError is actually defined in memory_manager.py (line 40), not in rpc.py as the original concern suggested. The __init__.py file correctly imports it from memory_manager, and there is no import issue.

Likely an incorrect or invalid review comment.

lib/gpu_memory_service/__init__.py (1)

1-45: LGTM!

The package initialization is well-structured with clear documentation of the package layout and a clean re-export of the primary client API. The logical grouping in __all__ (Client, then Lifecycle) is more maintainable than alphabetical sorting for an API surface.

lib/gpu_memory_service/client/torch/extensions/__init__.py (1)

14-26: LGTM!

The guarded import pattern is appropriate for optional C++ extensions that may not be built yet. Setting the placeholders to None on ImportError allows dependent code to check availability before use.

lib/gpu_memory_service/server/metadata_store.py (1)

24-67: LGTM!

Clean implementation of an in-memory metadata store. The frozen dataclass for entries and explicit non-thread-safe documentation (relying on external RW/RO lock semantics) are appropriate design choices.

components/src/dynamo/gpu_memory_service/__init__.py (1)

1-45: LGTM!

The Dynamo component wrapper provides a clean re-export surface with logical grouping. The docstring clearly documents the purpose and relationship to the core gpu_memory_service package.

lib/gpu_memory_service/client/torch/__init__.py (1)

1-29: LGTM!

Clean PyTorch integration surface with logical grouping of lifecycle and tensor operation exports.

lib/gpu_memory_service/client/torch/extensions/tensor_from_pointer.cpp (2)

23-44: LGTM!

The implementation correctly uses torch::from_blob with a no-op deleter for non-owning tensor views. The optional strides handling is clean, and the cast chain for the pointer is appropriate for CUDA memory addresses.


63-80: Clear API documentation.

The pybind11 module registration with inline docstrings provides clear Python-side documentation for both exported functions.

lib/gpu_memory_service/server/memory_manager.py (2)

1-57: Well-structured module with proper CUDA VMM abstraction.

The module header, dataclass definition, and exception class are clean and well-documented. The design decision to have no VA mapping on the server side (avoiding CUDA context) is sound for crash resilience.


60-103: LGTM on class design and initialization.

The lazy CUDA initialization pattern and clear documentation about thread-safety requirements are appropriate.

lib/gpu_memory_service/server/__init__.py (1)

1-41: Clean package re-export structure.

The module appropriately aggregates and exposes the public API surface. The docstring provides a good architectural overview.

lib/gpu_memory_service/client/memory_manager.py (2)

40-52: Well-documented exception for stale weights detection.

The docstring clearly explains when this exception is raised and what it doesn't detect (content-only changes), which is helpful for users.


825-836: Best-effort cleanup in _unmap_all is appropriate.

The broad exception handling here is acceptable since this is cleanup code that should proceed even if individual unmaps fail. The warnings provide observability.

lib/gpu_memory_service/server/handler.py (2)

41-72: Clean handler design with proper separation of concerns.

The RequestHandler correctly separates business logic from connection management. The initialization and property accessors are straightforward.


73-93: Lifecycle callbacks are well-structured.

The on_rw_abort and on_shutdown callbacks properly clean up both allocations and metadata. The conditional check in on_shutdown avoids unnecessary work.

lib/gpu_memory_service/client/torch/lifecycle.py (1)

49-165: Solid singleton lifecycle management with clear mode transitions.

The get_or_create_allocator function properly enforces single-manager-per-process semantics with clear error messages for incompatible mode requests. The auto mode handling is sensible.

lib/gpu_memory_service/server/rpc.py (2)

57-88: Well-architected async RPC server with explicit state machine.

The server initialization is clean with proper separation between request handling (delegated to RequestHandler) and state management (delegated to GlobalLockFSM).


406-439: Server lifecycle management is robust.

The start() properly cleans up stale socket files, and stop() correctly handles shutdown by:

  1. Signaling shutdown to waiters
  2. Closing the server
  3. Closing active connections (bypassing FSM - appropriate for shutdown)
  4. Cleaning up resources
lib/gpu_memory_service/client/rpc.py (3)

71-116: Clear client design with lock-as-connection semantics.

The docstring and class design clearly communicate that the socket connection represents the lock, which is a key architectural decision.


228-262: Clever commit verification via RO reconnect.

The commit() method's error recovery strategy - verifying via a quick RO connection when the socket closes - is a robust way to confirm the commit succeeded despite connection errors. This handles the race condition where the server closes the socket as part of commit.
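
In outline, the recovery path looks something like this (hedged pseudocode using names from this PR; constructor arguments and request fields are assumptions):

```python
def commit(self) -> None:
    """Commit, tolerating the server closing the RW socket mid-reply (sketch)."""
    try:
        self._send_recv(CommitRequest())  # request fields elided
    except (ConnectionError, OSError):
        # The server may tear down the RW connection as part of a successful
        # commit; confirm via a short-lived read-only probe before failing.
        probe = GMSRPCClient(self.socket_path)  # hypothetical ctor signature
        try:
            if not probe.committed:
                raise
        finally:
            probe.close()
```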


402-405: Destructor warning for leaked connections is helpful for debugging.

The __del__ warning helps identify connection leaks during development. This is a good practice for resources that should be explicitly closed.

lib/gpu_memory_service/server/locking.py (5)

49-82: Connection class implementation looks solid.

The design choice to use eq=False with explicit __hash__ based on session_id is appropriate for use in sets while maintaining object identity for equality. The raw_socket property correctly extracts the underlying socket for FD passing.

One minor observation: the close() method silently swallows all exceptions (lines 80-81). While this is often acceptable for cleanup operations, consider logging at DEBUG level for troubleshooting connection issues.


177-241: Well-designed state machine with derived state.

The design choice to derive state from the actual connection objects (_rw_conn, _ro_conns, _committed) rather than storing it separately is robust - it ensures state consistency and eliminates potential drift between stored state and actual connections. The transition history log is useful for debugging.


245-261: Condition evaluation is correct but uses string-based dispatch.

The _has_remaining_readers logic correctly handles the edge case where conn might not be in _ro_conns (defensive programming). The string-based condition dispatch in _check_condition is functional but could be replaced with a method reference in the Transition dataclass for type safety. However, this is a minor consideration for a small, controlled set of conditions.


297-341: Transition method correctly validates state changes.

The post-transition validation (lines 326-330) that checks if the resulting state matches the expected to_state is a good defensive measure to catch implementation bugs. The transition logging with session context aids debugging.


373-395: Lock acquisition predicates correctly implement fairness.

can_acquire_rw() correctly requires both no RW holder and no RO holders. can_acquire_ro() includes a waiting_writers == 0 check which prevents writer starvation - readers won't jump ahead of waiting writers. This is a good fairness design.
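
In sketch form, the predicates amount to the following (attribute names echo the code-graph listing above; _waiting_writers is an assumed counter):

```python
def can_acquire_rw(self) -> bool:
    # A writer needs exclusive access: no current writer and no readers.
    return self._rw_conn is None and not self._ro_conns

def can_acquire_ro(self) -> bool:
    # Readers need committed data, no active writer, and no writer waiting;
    # the waiting-writers check is what prevents writer starvation.
    return (
        self._committed
        and self._rw_conn is None
        and self._waiting_writers == 0
    )
```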

lib/gpu_memory_service/client/torch/tensor.py (7)

36-65: Clean metadata dataclass design.

The separation of TensorMeta (serialization format with string dtype) and ParsedTensorMeta (runtime format with torch.dtype) provides a clean boundary between storage and usage. The frozen dataclasses ensure immutability.


152-164: dtype parsing is robust with proper validation.

The getattr(torch, s) approach is flexible, and the subsequent isinstance(dt, torch.dtype) check correctly guards against non-dtype attributes being returned.


187-200: Legacy format fallback is well-handled.

The graceful degradation from span_bytes → nbytes → computed value ensures backward compatibility. The element size computation via torch.tensor([], dtype=dtype).element_size() works but creates a temporary tensor; consider using torch.finfo(dtype).bits // 8 for floating types or torch.iinfo(dtype).bits // 8 for integer types if performance becomes a concern in hot paths.


272-351: Module tree extraction is comprehensive but complex.

The function correctly captures parameters, buffers, and arbitrary tensor attributes (like _k_scale, _v_scale). The attribute filtering logic (lines 308-327) is intricate but handles the edge cases well - skipping dunder attributes while allowing private tensor attributes that are common in quantization scenarios.

The broad except Exception at line 330 is appropriate here since dir() can return attributes that raise when accessed (descriptors, lazy properties), and we want to skip those gracefully.


384-416: Module path resolution handles container types correctly.

The resolve_module_attr function properly handles ModuleList, Sequential, and ModuleDict by checking for __getitem__ and handling both integer indices and string keys. The fallback to __getitem__ after hasattr check is the right approach for these container types.
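
A sketch of the approach described (a minimal reimplementation, not the PR's code):

import torch.nn as nn

def resolve_module_attr(root: nn.Module, path: str):
    # Walk "layers.3.mlp.gate" one segment at a time, indexing into
    # ModuleList/Sequential (int keys) and ModuleDict (str keys).
    obj = root
    for part in path.split("."):
        if hasattr(obj, part):
            obj = getattr(obj, part)
        elif hasattr(obj, "__getitem__"):
            key = int(part) if part.isdigit() else part
            obj = obj[key]
        else:
            raise AttributeError(f"cannot resolve {part!r} in {path!r}")
    return obj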


726-747: Parameter materialization handles meta tensors correctly.

The logic to check for shape/dtype mismatches (lines 733-738) and the needs_replace condition for meta tensors or device mismatch (line 739) is correct. The requires_grad handling for non-floating-point types (lines 742-743) properly prevents gradients on integer tensors.

One minor note: lines 732 and 756 use assert for runtime validation. While unlikely to fail in practice (the condition is checked in the if clause), assert statements can be stripped with -O flag. Consider removing these redundant assertions.
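
For example, an assert on a condition already guaranteed by the enclosing if can simply be dropped, or, where a runtime guarantee is still wanted, replaced with an explicit raise that survives python -O (names illustrative):

# before
assert new_param.shape == meta.shape

# after: still enforced under `python -O`
if new_param.shape != meta.shape:
    raise RuntimeError(
        f"shape mismatch: {tuple(new_param.shape)} != {tuple(meta.shape)}"
    )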


804-849: Meta tensor initialization with default values is pragmatic.

The heuristic to fill "scale" tensors with 1.0 and others with 0.0 is a reasonable default that works for common quantization patterns. The implementation correctly iterates over both parameters and buffers.

Consider documenting this behavior in the function docstring so users understand that meta tensors not in GMS metadata will be initialized with default values.
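
A sketch of the heuristic (function name and replacement mechanics are assumptions; the PR's implementation differs):

import torch
import torch.nn as nn

def materialize_defaults(module: nn.Module, device: str = "cuda") -> None:
    # Heuristic: "scale" tensors get the multiplicative identity 1.0,
    # everything else 0.0, matching common quantization conventions.
    for submodule in module.modules():
        tensors = list(submodule.named_parameters(recurse=False)) + list(
            submodule.named_buffers(recurse=False)
        )
        for name, t in tensors:
            if not t.is_meta:
                continue
            fill = 1.0 if "scale" in name else 0.0
            real = torch.empty(t.shape, dtype=t.dtype, device=device)
            real.fill_(fill)
            if name in submodule._parameters:
                setattr(submodule, name, nn.Parameter(real, requires_grad=False))
            else:
                setattr(submodule, name, real)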

lib/gpu_memory_service/common/protocol.py (5)

46-78: Message type codes are well-organized.

The message type codes are logically grouped (handshake 0x01-0x04, state 0x05-0x08, allocation 0x10-0x1A, metadata 0x30-0x37, error 0xFF). This organization leaves room for future additions in each category.
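
The grouping, sketched as an IntEnum (member names here are invented; only the code ranges come from the review):

import enum

class MsgType(enum.IntEnum):
    # Handshake: 0x01-0x04
    HELLO = 0x01
    HELLO_ACK = 0x02
    # State: 0x05-0x08
    STATE_QUERY = 0x05
    # Allocation: 0x10-0x1A
    ALLOC = 0x10
    # Metadata: 0x30-0x37
    META_PUT = 0x30
    # Error
    ERROR = 0xFF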


83-111: Handshake protocol design is clean.

The lock_type Literal type clearly defines the valid options, and the rw_or_ro mode for opportunistic fallback is a practical feature. The granted_lock_type in the response allows callers to know which lock was actually acquired.
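
A sketch of the shape (lock_type and granted_lock_type come from the review; everything else is an assumption):

from dataclasses import dataclass
from typing import Literal, Optional

LockType = Literal["rw", "ro", "rw_or_ro"]

@dataclass
class HandshakeRequest:
    lock_type: LockType  # "rw_or_ro": take RW if free, else fall back to RO

@dataclass
class HandshakeResponse:
    granted_lock_type: Optional[Literal["rw", "ro"]] = None  # what was actually acquired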


380-418: Serialization is straightforward and correct.

The encode_message function correctly handles dataclass conversion via asdict, and decode_message properly reconstructs the message instance using **payload unpacking. The message type lookup dictionaries provide clean bidirectional mapping.
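
The mechanism, reusing the HandshakeRequest/HandshakeResponse sketch above (codes and the json encoding are illustrative; the PR's wire format may differ):

from dataclasses import asdict
import json

MSG_TYPE_TO_CLS = {0x01: HandshakeRequest, 0x02: HandshakeResponse}
CLS_TO_MSG_TYPE = {cls: code for code, cls in MSG_TYPE_TO_CLS.items()}

def encode_message(msg) -> tuple[int, bytes]:
    # asdict handles nested dataclasses recursively.
    return CLS_TO_MSG_TYPE[type(msg)], json.dumps(asdict(msg)).encode()

def decode_message(msg_type: int, payload: bytes):
    cls = MSG_TYPE_TO_CLS[msg_type]
    return cls(**json.loads(payload))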


454-471: FD passing with socket duplication is complex but necessary.

The approach of duplicating the transport's file descriptor, creating a temporary socket for sendmsg, then detaching is a workaround for asyncio's lack of native ancillary data support. The exception handling correctly cleans up the duplicated fd on failure.

This is fragile code that depends on Python's socket semantics. Consider adding a brief comment explaining why this dance is necessary (asyncio streams don't expose sendmsg, transport socket may have issues with sendmsg in async context).
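
For reference, the dance looks roughly like this (a minimal sketch of the pattern, not the PR's helper):

import array
import os
import socket

def send_fd(transport, payload: bytes, fd: int) -> None:
    # asyncio streams do not expose sendmsg, so borrow the underlying
    # fd: dup it first so our temporary socket object cannot close the
    # one the event loop owns.
    raw = transport.get_extra_info("socket")
    dup_fd = os.dup(raw.fileno())
    try:
        tmp = socket.socket(fileno=dup_fd)
    except Exception:
        os.close(dup_fd)  # only reachable if socket() itself fails
        raise
    try:
        tmp.sendmsg(
            [payload],
            [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
        )
    finally:
        os.close(tmp.detach())  # close the dup; the original fd lives on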


621-631: Sync version correctly uses consistent socket operations.

Unlike the async version, the sync recv_message_sync uses the same socket for both recvmsg and subsequent recv calls (line 644), avoiding any buffer inconsistency issues. The unused flags variable (line 623) is a minor static analysis warning that can be silenced by prefixing with underscore.
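
A sketch of that single-socket pattern (a generic reimplementation, not the PR's function):

import array
import socket

def recv_message_sync(sock: socket.socket, length: int):
    # One recvmsg call pulls the first chunk plus any SCM_RIGHTS fds;
    # the remainder of the frame is read from the *same* socket.
    data, ancdata, _flags, _addr = sock.recvmsg(length, socket.CMSG_LEN(4))
    fds = []
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            a = array.array("i")
            a.frombytes(cdata[: len(cdata) - len(cdata) % a.itemsize])
            fds.extend(a)
    buf = bytearray(data)
    while len(buf) < length:
        chunk = sock.recv(length - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return bytes(buf), fds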

@galletas1712 galletas1712 force-pushed the schwinns/gms-core-library branch from e4ecf18 to 5012c21 on January 12, 2026 15:02
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 12, 2026
@galletas1712 galletas1712 requested review from a team as code owners January 12, 2026 15:07
@github-actions github-actions bot added backend::vllm Relates to the vllm backend build backend::sglang Relates to the sglang backend and removed build labels Jan 12, 2026
@galletas1712 galletas1712 force-pushed the schwinns/gms-core-library branch from 2076c66 to 499bc45 on January 12, 2026 15:54
@github-actions github-actions bot added the build label Jan 12, 2026
@grahamking
Contributor

@galletas1712 We can't really take a 6k line PR, no-one can review that. Is there a few-hundred line piece inside this that is useful by itself? We could maybe start with that, and if that works build on it.

@galletas1712 galletas1712 removed backend::vllm Relates to the vllm backend backend::sglang Relates to the sglang backend labels Jan 13, 2026
Contributor

@mohammedabdulwahhab mohammedabdulwahhab left a comment


Is it possible to run this on bare metal, or does it require a devcontainer?

If it is possible, we might want to add some notes here in case anything different needs to be done:
https://github.com/ai-dynamo/dynamo/blob/main/README.md#developing-locally

If it's not possible, let's check that the developing locally workflow is not broken

@github-actions github-actions bot removed the build label Jan 14, 2026
@galletas1712 galletas1712 force-pushed the schwinns/gms-core-library branch 6 times, most recently from ccaf7e2 to caf59a2 on January 14, 2026 18:59
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test build and removed ci Issues/PRs that reference CI build/test build labels Jan 23, 2026
@galletas1712
Contributor Author

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai bot commented Jan 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions github-actions bot removed ci Issues/PRs that reference CI build/test build labels Jan 23, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@components/src/dynamo/gpu_memory_service/__init__.py`:
- Around line 6-9: Update the module docstring in the dynamo.gpu_memory_service
package to fix the typo: replace the phrase "gpu_memory package" with
"gpu_memory_service" so the description correctly names the wrapped package;
edit the top-of-file docstring in __init__.py that documents the CLI entry point
and re-exports to use the exact package name "gpu_memory_service".
- Around line 20-22: The unconditional import of the optional compiled extension
gpu_memory_service.client.torch.extensions._allocator_ext can raise ImportError
in CPU-only/minimal installs; wrap the import in a try/except ImportError (or
Exception) inside __init__.py, assign a safe fallback (e.g., _allocator_ext =
None) when the import fails, and optionally set/export a boolean flag (e.g.,
_has_allocator_ext) so callers can check before using _allocator_ext; update any
re-exports to use the fallback symbol to avoid hard crashes.
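
A sketch of the guarded import this fix describes (symbol names taken from the fix text above):

try:
    from gpu_memory_service.client.torch.extensions import _allocator_ext
except ImportError:
    # CPU-only or minimal installs may lack the compiled extension.
    _allocator_ext = None

_has_allocator_ext = _allocator_ext is not None
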
♻️ Duplicate comments (1)
components/src/dynamo/gpu_memory_service/args.py (1)

52-58: Prefer a per-user runtime dir over /tmp for the default socket path.

This was already flagged earlier; keeping it here to track the same concern.
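
One way to default it (a sketch; the function name and layout are assumptions, not the PR's args.py):

import os
from pathlib import Path

def default_socket_path() -> Path:
    # XDG_RUNTIME_DIR is a per-user tmpfs with mode 0700; fall back to
    # a uid-scoped directory under /tmp when it is unset.
    runtime = os.environ.get("XDG_RUNTIME_DIR")
    base = Path(runtime) if runtime else Path(f"/tmp/gms-{os.getuid()}")
    base.mkdir(mode=0o700, parents=True, exist_ok=True)
    return base / "gpu_memory_service.sock"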

🧹 Nitpick comments (1)
components/src/dynamo/gpu_memory_service/__init__.py (1)

29-41: (Optional) Sort __all__ to satisfy Ruff RUF022.

Ruff flags this list as unsorted. If you want to keep the grouping, consider suppressing the rule; otherwise, sort it.

♻️ One possible ordering
 __all__ = [
-    # Core
-    "GMSClientMemoryManager",
-    "StaleMemoryLayoutError",
-    # GMS client memory manager
-    "get_or_create_gms_client_memory_manager",
-    "get_gms_client_memory_manager",
-    # Tensor utilities
-    "register_module_tensors",
-    "materialize_module_from_gms",
-    # Extensions
     "_allocator_ext",
+    "GMSClientMemoryManager",
+    "StaleMemoryLayoutError",
+    "get_gms_client_memory_manager",
+    "get_or_create_gms_client_memory_manager",
+    "materialize_module_from_gms",
+    "register_module_tensors",
 ]

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com>
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test build labels Jan 23, 2026
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com>
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test build and removed ci Issues/PRs that reference CI build/test build labels Jan 23, 2026
@galletas1712 galletas1712 enabled auto-merge (squash) January 23, 2026 08:51
@galletas1712 galletas1712 merged commit 30c6228 into main Jan 23, 2026
32 of 33 checks passed
@galletas1712 galletas1712 deleted the schwinns/gms-core-library branch January 23, 2026 09:50
@grahamking
Copy link
Contributor

> @galletas1712 We can't really take a 6k line PR, no-one can review that. Is there a few-hundred line piece inside this that is useful by itself? We could maybe start with that, and if that works build on it.

🤷

davilu-nvidia pushed a commit that referenced this pull request Jan 24, 2026
Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com>
Signed-off-by: davilu <davilu@nvidia.com>

Labels

backend::sglang (Relates to the sglang backend), backend::trtllm (Relates to the trtllm backend), backend::vllm (Relates to the vllm backend), build, ci (Issues/PRs that reference CI build/test), documentation (Improvements or additions to documentation), feat, size/XXL
