feat: GPU Memory Service #5286
Conversation
Walkthrough
This PR introduces a comprehensive GPU Memory Service (GMS) system featuring server-side memory allocation management with CUDA VMM support, an asynchronous RPC protocol with state machine-driven locking, and client libraries for PyTorch integration. It includes protocol definitions, CUDA utilities, metadata storage, and container build infrastructure.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @components/src/dynamo/gpu_memory_service/args.py:
- Around lines 54-58: The default socket path currently uses /tmp. In the socket_path initialization (the block that sets socket_path when it is None and uses args.device), prefer XDG_RUNTIME_DIR when present: build socket_dir from os.environ.get("XDG_RUNTIME_DIR"), falling back to os.path.join("/run/user", str(os.getuid())) (or /tmp as a last resort), then set socket_path = os.path.join(socket_dir, f"gpu_memory_service_{args.device}.sock"). Keep the existing placeholder expansion for explicit socket_path inputs (socket_path.format(device=args.device)), make sure os is imported at the top, and document in a comment that /tmp is a fallback for single-user development only. A sketch of this logic follows this comment.
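A possible shape for that default, assuming the `args.device` and `socket_path` names from the comment above; the helper name and exact integration are illustrative, not the actual implementation:
```python
import os
from typing import Optional

def default_socket_path(device: int, socket_path: Optional[str] = None) -> str:
    if socket_path is not None:
        # Keep the existing placeholder expansion for explicit paths.
        return socket_path.format(device=device)
    # Prefer the per-user runtime dir; /tmp is a fallback for single-user development only.
    socket_dir = os.environ.get("XDG_RUNTIME_DIR") or os.path.join(
        "/run/user", str(os.getuid())
    )
    if not os.path.isdir(socket_dir):
        socket_dir = "/tmp"
    return os.path.join(socket_dir, f"gpu_memory_service_{device}.sock")
```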
🧹 Nitpick comments (20)
lib/gpu_memory_service/pyproject.toml (1)
37-41: Consider pinning `pytest-asyncio` version for test stability.
The `pytest-asyncio` package has had breaking changes between versions (especially around mode configuration). Consider pinning to a specific version range like `pytest-asyncio>=0.23,<1.0` to ensure consistent test behavior.
lib/gpu_memory_service/setup.py (1)
78-87: Use `raise ... from None` to suppress exception chaining.
When re-raising a different exception within an `except` block, use `from None` to avoid confusing exception chains in the traceback.
♻️ Proposed fix
```diff
 # Check if PyTorch is available (needed for _tensor_from_pointer)
 try:
-    import torch  # noqa: F401
+    import torch
 except ImportError:
     if self.extensions:
-        raise RuntimeError(
+        raise RuntimeError(
             "PyTorch not found. Cannot build CUDA extensions. "
             "Install PyTorch first, or use --no-build-isolation with pip wheel."
-        )
+        ) from None
     return
```
components/src/dynamo/gpu_memory_service/server.py (2)
84-93: Consider using `logging.exception` for better diagnostics.
The error handler captures startup failures but loses the stack trace. Using `logging.exception` instead of `logging.error` will include the traceback.
♻️ Suggested improvement
```diff
 except Exception as e:
-    logger.error(f"GMSRPCServer error: {e}")
+    logger.exception(f"GMSRPCServer error: {e}")
     self._error = e
     self._started.set()  # Unblock waiter even on error
```
95-111: Fragile coupling to internal server state.
The `stop()` method directly manipulates private attributes (`_running`, `_shutdown`, `_condition`) of `GMSRPCServer`. This creates tight coupling that can break if the server's internals change.
Consider adding a proper `stop()` or `shutdown()` method to `GMSRPCServer` that encapsulates this logic, then call it here (see the sketch after this comment). If that's not feasible in this PR, document this coupling with a comment explaining the dependency.
```diff
 def stop(self) -> None:
     """Stop the allocation server."""
     if self._server is not None:
         logger.info(f"Stopping GMSRPCServer on device {self.device}")
-        # Signal the server to stop - the loop in _run_server will exit
+        # NOTE: Directly accessing server internals; coordinate with GMSRPCServer changes.
+        # TODO: Replace with self._server.stop() when available.
         self._server._running = False
         self._server._shutdown = True
```
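One possible shape for such a method, sketched here as a reduced standalone class so it runs on its own; the attribute names (`_running`, `_shutdown`, `_condition`) come from this review comment and the real GMSRPCServer may differ:
```python
import asyncio

class GMSRPCServer:  # reduced stand-in for the real server class
    def __init__(self) -> None:
        self._running = True
        self._shutdown = False
        self._condition = asyncio.Condition()

    async def stop(self) -> None:
        """Signal the serve loop to exit and wake any tasks waiting on the condition."""
        self._running = False
        self._shutdown = True
        async with self._condition:
            self._condition.notify_all()
```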
lib/gpu_memory_service/server/memory_manager.py (2)
207-224: Consider propagating cuMemRelease failures from free().
The `free()` method logs a warning if `cuMemRelease` fails but still returns `True` and removes the allocation from tracking. This could mask resource leaks if the CUDA release genuinely fails.
♻️ Optional: Propagate cuMemRelease failure
```diff
 def free(self, allocation_id: str) -> bool:
     """Release physical memory for a single allocation.

     Args:
         allocation_id: ID of allocation to free

     Returns:
-        True if allocation existed and was freed, False otherwise
+        True if allocation existed and was freed successfully, False otherwise
     """
     info = self._allocations.pop(allocation_id, None)
     if info is None:
         return False

     (result,) = cuda.cuMemRelease(info.handle)
     if result != cuda.CUresult.CUDA_SUCCESS:
         logger.warning(f"cuMemRelease failed for {allocation_id}: {result}")
+        # Re-add to tracking since release failed
+        self._allocations[allocation_id] = info
+        return False

     logger.debug(f"Freed allocation: {allocation_id}")
     return True
```
226-244: Same concern applies to clear_all() - failures are silently continued.
When `cuMemRelease` fails in `clear_all()`, the allocation tracking is still cleared. This is acceptable for cleanup scenarios (RW abort, shutdown) where we want best-effort cleanup, but consider returning a tuple of (cleared_count, failed_count) for observability, as sketched below.
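A possible shape for that return value, shown as a method-body sketch that reuses the module-level `cuda` and `logger` names from the reviewed file; illustrative only, not the actual implementation:
```python
def clear_all(self) -> tuple[int, int]:
    """Best-effort release of every tracked allocation; returns (cleared, failed)."""
    cleared, failed = 0, 0
    for allocation_id, info in list(self._allocations.items()):
        (result,) = cuda.cuMemRelease(info.handle)
        if result != cuda.CUresult.CUDA_SUCCESS:
            failed += 1
            logger.warning(f"cuMemRelease failed for {allocation_id}: {result}")
        else:
            cleared += 1
        # Tracking is dropped either way: this is cleanup (RW abort / shutdown).
        del self._allocations[allocation_id]
    return cleared, failed
```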
lib/gpu_memory_service/client/memory_manager.py (3)
475-503: Consider moving the json import to module level.
The `json` import inside `_snapshot_metadata_specs` is executed on every call. While not a performance issue for occasional sleep operations, moving it to module level is more conventional.
♻️ Move import to module level
Add at the top of the file with other imports:
```python
import json
```
Then remove the inline import at line 486.
650-655: `__exit__` should return `None` or `False`, not `bool`.
The return type annotation is implicit, but returning `False` explicitly is slightly unusual. Returning `None` (or just omitting the return) is more idiomatic for context managers that don't suppress exceptions.
♻️ Idiomatic __exit__ pattern
```diff
 def __exit__(self, exc_type, exc_val, exc_tb) -> None:
     self.close()
-    return False
```
789-810: Prefix unused `dev_ptr` variable with underscore.
The static analysis correctly identifies that `dev_ptr` is never used. This is intentional since the validation is checking whether the call succeeds, not the returned value.
♻️ Fix unused variable warning
```diff
-result, dev_ptr = cuda.cuPointerGetAttribute(
+result, _dev_ptr = cuda.cuPointerGetAttribute(
     cuda.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_POINTER, va
 )
```
lib/gpu_memory_service/server/handler.py (1)
150-161: Use exception chaining for better debugging context.
When re-raising `AllocationNotFoundError` as `ValueError`, preserving the original exception chain aids debugging.
♻️ Add exception chaining
```diff
 def handle_get_allocation(self, req: GetAllocationRequest) -> GetAllocationResponse:
     """Get allocation info."""
     try:
         info = self._memory_manager.get_allocation(req.allocation_id)
         return GetAllocationResponse(
             allocation_id=info.allocation_id,
             size=info.size,
             aligned_size=info.aligned_size,
             tag=info.tag,
         )
     except AllocationNotFoundError:
-        raise ValueError(f"Unknown allocation: {req.allocation_id}")
+        raise ValueError(f"Unknown allocation: {req.allocation_id}") from None
```
lib/gpu_memory_service/client/torch/lifecycle.py (1)
192-227: free_cb accesses private attributes and silently swallows exceptions.
The `free_cb` callback directly accesses `manager._mappings` and `manager._allocation_id_to_va` (private attributes), and swallows all exceptions during cleanup. While this works, consider:
- Adding a public method to GMSClientMemoryManager for freeing by VA (a sketch follows this comment)
- At minimum, logging the exception instead of completely ignoring it
♻️ Log exceptions in free_cb
```diff
 try:
     manager._client.free(mapping.allocation_id)
-except Exception:
-    pass  # Ignore errors during cleanup
+except Exception as e:
+    logger.debug(f"[GPU Memory Service] Error during free cleanup: {e}")
```
Also, the `stream` parameter is unused because the callback signature is dictated by PyTorch's CUDAPluggableAllocator interface - this is expected and not a bug.
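One possible public method on GMSClientMemoryManager that would let free_cb avoid reaching into `_mappings`; the method itself is hypothetical and the attribute names are taken from this review, not the actual implementation:
```python
def free_by_va(self, va: int) -> bool:
    """Free the allocation mapped at the given base VA; returns False if unknown."""
    mapping = self._mappings.pop(va, None)
    if mapping is None:
        return False
    self._allocation_id_to_va.pop(mapping.allocation_id, None)
    try:
        self._client.free(mapping.allocation_id)
    except Exception as e:
        logger.debug(f"[GPU Memory Service] Error during free cleanup: {e}")
    return True
```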
lib/gpu_memory_service/server/rpc.py (2)
110-130: Use `logger.exception` for error logging to capture stack traces.
When logging exceptions, `logger.exception` automatically includes the traceback, which is valuable for debugging production issues.
♻️ Use logger.exception for better debugging
```diff
 except ConnectionResetError:
     logger.debug(f"Connection reset: {session_id}")
 except asyncio.CancelledError:
     raise
 except Exception as e:
-    logger.error(f"Connection error: {session_id}, {e}")
+    logger.exception(f"Connection error: {session_id}")
```
Apply similar changes to lines 141-142, 300-301, and 317-318.
263-284: Remove unused `session_id` parameter or use it for logging.
The `session_id` parameter in `_cleanup_connection` is never used. Either remove it or add it to log messages for better traceability.
♻️ Use session_id in cleanup logging
```diff
 async def _cleanup_connection(
     self, conn: Optional[Connection], session_id: str
 ) -> None:
     """Clean up after connection closes via state machine transition."""
     if conn is None:
         return
+    logger.debug(f"Cleaning up connection: {session_id}")
```
lib/gpu_memory_service/client/rpc.py (2)
118-163: Use exception chaining in _connect for better error context.
When wrapping exceptions during connection, preserve the original exception chain for debugging.
♻️ Add exception chaining
```diff
 try:
     self._socket.connect(self.socket_path)
 except FileNotFoundError:
-    raise ConnectionError(f"Server not running at {self.socket_path}")
+    raise ConnectionError(f"Server not running at {self.socket_path}") from None
 except Exception as e:
-    raise ConnectionError(f"Failed to connect: {e}")
+    raise ConnectionError(f"Failed to connect: {e}") from e
```
282-290: Prefix unused `response` variable with underscore.
The `export()` method doesn't use the response object, only the fd. This should be prefixed to satisfy linters and clarify intent.
♻️ Fix unused variable
```diff
 def export(self, allocation_id: str) -> int:
     """Export allocation as POSIX FD.

     Caller is responsible for closing the FD when done.
     """
-    response, fd = self._send_recv(ExportRequest(allocation_id=allocation_id))
+    _response, fd = self._send_recv(ExportRequest(allocation_id=allocation_id))
     if fd < 0:
         raise RuntimeError("No FD received from server")
     return fd
```
lib/gpu_memory_service/server/locking.py (1)
129-133: Inconsistent frozenset syntax.
Line 130 uses `frozenset[ServerState]({ServerState.RW})` with an explicit type parameter, while other transitions (lines 123, 137, 144, etc.) use `frozenset({...})` without the type parameter. This inconsistency doesn't affect runtime but reduces readability.
Suggested fix
```diff
-from_states=frozenset[ServerState]({ServerState.RW}),
+from_states=frozenset({ServerState.RW}),
```
lib/gpu_memory_service/client/torch/tensor.py (2)
67-101: Consider adding `strict=True` to zip for defensive validation.
The function correctly validates that `len(shape) == len(stride)` before the zip. Adding `strict=True` to the zip call would provide defense-in-depth against future refactoring that might remove or bypass the length check.
Suggested fix
```diff
-for d, st in zip(shape, stride):
+for d, st in zip(shape, stride, strict=True):
```
562-577: Accessing private `_mappings` attribute breaks encapsulation.
The code accesses `manager._mappings` directly to find which allocation contains a tensor's pointer. This couples the function to the internal implementation of `GMSClientMemoryManager`. Consider adding a public method to the manager like `find_allocation_for_pointer(ptr: int) -> Optional[Tuple[int, str]]` to encapsulate this lookup.
```python
# Suggested addition to GMSClientMemoryManager:
def find_allocation_for_pointer(self, ptr: int) -> Optional[Tuple[int, str]]:
    """Find the allocation containing the given pointer.

    Returns:
        Tuple of (base_va, allocation_id) or None if not found.
    """
    for va, mapping in self._mappings.items():
        if va <= ptr < va + mapping.aligned_size:
            return (va, mapping.allocation_id)
    return None
```
lib/gpu_memory_service/common/protocol.py (2)
480-482: Fix implicit Optional type annotation.
The `recv_buffer` parameter defaults to `None` but the type hint is `bytearray`. Use an explicit Optional type.
Suggested fix
```diff
 async def recv_message(
-    reader, recv_buffer: bytearray = None, raw_sock=None
+    reader, recv_buffer: Optional[bytearray] = None, raw_sock=None
 ) -> Tuple[Any, int, bytearray]:
```
590-592: Same implicit Optional issue as the async version.
Suggested fix
```diff
 def recv_message_sync(
-    sock, recv_buffer: bytearray = None
+    sock, recv_buffer: Optional[bytearray] = None
 ) -> Tuple[Any, int, bytearray]:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- .devcontainer/post-create.sh
- .dockerignore
- .gitignore
- components/src/dynamo/gpu_memory_service/__init__.py
- components/src/dynamo/gpu_memory_service/__main__.py
- components/src/dynamo/gpu_memory_service/args.py
- components/src/dynamo/gpu_memory_service/server.py
- lib/gpu_memory_service/__init__.py
- lib/gpu_memory_service/client/__init__.py
- lib/gpu_memory_service/client/memory_manager.py
- lib/gpu_memory_service/client/rpc.py
- lib/gpu_memory_service/client/torch/__init__.py
- lib/gpu_memory_service/client/torch/extensions/__init__.py
- lib/gpu_memory_service/client/torch/extensions/allocator.cpp
- lib/gpu_memory_service/client/torch/extensions/tensor_from_pointer.cpp
- lib/gpu_memory_service/client/torch/lifecycle.py
- lib/gpu_memory_service/client/torch/tensor.py
- lib/gpu_memory_service/common/__init__.py
- lib/gpu_memory_service/common/cuda_vmm_utils.py
- lib/gpu_memory_service/common/protocol.py
- lib/gpu_memory_service/common/types.py
- lib/gpu_memory_service/pyproject.toml
- lib/gpu_memory_service/server/__init__.py
- lib/gpu_memory_service/server/handler.py
- lib/gpu_memory_service/server/locking.py
- lib/gpu_memory_service/server/memory_manager.py
- lib/gpu_memory_service/server/metadata_store.py
- lib/gpu_memory_service/server/rpc.py
- lib/gpu_memory_service/setup.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T02:01:01.238Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2453
File: deploy/dynamo_check.py:739-741
Timestamp: 2025-08-15T02:01:01.238Z
Learning: In the dynamo project, missing cargo should be treated as a fatal error in dynamo_check.py because developers need cargo to build the Rust components. The tool is designed to validate the complete development environment, not just import functionality.
Applied to files:
.devcontainer/post-create.sh
🧬 Code graph analysis (17)
.devcontainer/post-create.sh (1)
lib/runtime/src/compute/pool.rs (1)
install(271-284)
lib/gpu_memory_service/server/metadata_store.py (2)
lib/runtime/src/storage/kv.rs (1)
value(99-101)
lib/tokens/src/radix.rs (1)
prefix(23-29)
lib/gpu_memory_service/server/handler.py (5)
lib/gpu_memory_service/common/protocol.py (19)
AllocateRequest(175-183), AllocateResponse(187-192), ClearAllResponse(269-272), FreeRequest(245-248), FreeResponse(252-255), GetAllocationRequest(210-213), GetAllocationResponse(217-223), GetAllocationStateResponse(164-168), GetLockStateResponse(145-153), ListAllocationsRequest(227-233), ListAllocationsResponse(237-241), MetadataDeleteRequest(317-320), MetadataDeleteResponse(324-325), MetadataGetRequest(302-305), MetadataGetResponse(309-313), MetadataListRequest(329-332), MetadataListResponse(336-337), MetadataPutRequest(287-293), MetadataPutResponse(297-298)
lib/gpu_memory_service/common/types.py (2)
derive_state(150-158), is_ready(145-147)
lib/gpu_memory_service/server/memory_manager.py (12)
AllocationNotFoundError(54-57), GMSServerMemoryManager(60-274), device(105-107), granularity(110-112), clear_all(226-244), allocation_count(115-117), total_bytes(120-122), allocate(128-173), export_fd(175-205), get_allocation(246-261), list_allocations(263-274), free(207-224)
lib/gpu_memory_service/server/metadata_store.py (6)
GMSMetadataStore(31-67), clear(64-67), put(44-50), get(52-53), delete(55-57), list_keys(59-62)
lib/gpu_memory_service/server/rpc.py (3)
granularity(97-98), state(92-94), is_ready(100-102)
lib/gpu_memory_service/server/rpc.py (3)
lib/gpu_memory_service/common/protocol.py (15)
AllocateRequest(175-183), ClearAllRequest(259-265), CommitRequest(117-124), CommitResponse(128-131), ExportRequest(196-203), FreeRequest(245-248), GetAllocationRequest(210-213), HandshakeRequest(84-96), HandshakeResponse(100-110), MetadataDeleteRequest(317-320), MetadataGetRequest(302-305), MetadataListRequest(329-332), MetadataPutRequest(287-293), recv_message(480-558), send_message(423-477)
lib/gpu_memory_service/common/types.py (4)
ConnectionMode(37-42), Operation(55-82), ServerState(14-34), StateEvent(45-52)
lib/gpu_memory_service/server/locking.py (12)
Connection(50-81), GlobalLockFSM(177-391), state(208-216), committed(234-236), rw_conn(219-221), close(75-81), transition(297-341), can_acquire_rw(373-383), can_acquire_ro(385-391), ro_conns(224-226), check_operation(345-369), ro_count(229-231)
components/src/dynamo/gpu_memory_service/__init__.py (3)
lib/gpu_memory_service/client/memory_manager.py (2)
GMSClientMemoryManager(80-836), StaleWeightsError(40-52)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
get_allocator(235-241), get_or_create_allocator(49-164)
lib/gpu_memory_service/client/torch/tensor.py (2)
materialize_module_from_gms(645-866), register_module_tensors(506-595)
lib/gpu_memory_service/client/memory_manager.py (4)
lib/gpu_memory_service/common/cuda_vmm_utils.py (2)
check_cuda_result(13-29), get_allocation_granularity(43-64)
lib/gpu_memory_service/server/memory_manager.py (7)
device(105-107), granularity(110-112), list_allocations(263-274), allocate(128-173), get_allocation(246-261), clear_all(226-244), total_bytes(120-122)
lib/gpu_memory_service/server/handler.py (1)
granularity(64-66)
lib/gpu_memory_service/server/rpc.py (1)
granularity(97-98)
components/src/dynamo/gpu_memory_service/__main__.py (1)
components/src/dynamo/gpu_memory_service/server.py (1)
main(162-165)
lib/gpu_memory_service/server/__init__.py (6)
lib/gpu_memory_service/common/types.py (3)
ConnectionMode(37-42), ServerState(14-34), StateSnapshot(135-147)
lib/gpu_memory_service/server/handler.py (2)
RequestHandler(41-235), metadata_store(69-71)
lib/gpu_memory_service/server/locking.py (2)
Connection(50-81), GlobalLockFSM(177-391)
lib/gpu_memory_service/server/memory_manager.py (3)
AllocationInfo(34-51), AllocationNotFoundError(54-57), GMSServerMemoryManager(60-274)
lib/gpu_memory_service/server/metadata_store.py (1)
GMSMetadataStore(31-67)
lib/gpu_memory_service/server/rpc.py (1)
GMSRPCServer(57-447)
lib/gpu_memory_service/client/torch/extensions/allocator.cpp (4)
lib/gpu_memory_service/server/memory_manager.py (2)
device(105-107), granularity(110-112)
lib/gpu_memory_service/server/handler.py (1)
granularity(64-66)
lib/gpu_memory_service/server/rpc.py (1)
granularity(97-98)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
malloc_cb(192-207), free_cb(209-227)
lib/gpu_memory_service/common/cuda_vmm_utils.py (3)
lib/gpu_memory_service/server/memory_manager.py (2)
device(105-107), granularity(110-112)
lib/gpu_memory_service/server/handler.py (1)
granularity(64-66)
lib/gpu_memory_service/server/rpc.py (1)
granularity(97-98)
lib/gpu_memory_service/client/torch/__init__.py (2)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
get_allocator(235-241), get_or_create_allocator(49-164)
lib/gpu_memory_service/client/torch/tensor.py (2)
materialize_module_from_gms(645-866), register_module_tensors(506-595)
lib/gpu_memory_service/client/rpc.py (1)
lib/gpu_memory_service/common/protocol.py (21)
AllocateRequest(175-183), ClearAllRequest(259-265), CommitRequest(117-124), CommitResponse(128-131), ExportRequest(196-203), FreeRequest(245-248), FreeResponse(252-255), GetAllocationRequest(210-213), GetAllocationStateRequest(157-160), GetAllocationStateResponse(164-168), GetLockStateRequest(138-141), GetLockStateResponse(145-153), HandshakeRequest(84-96), HandshakeResponse(100-110), ListAllocationsRequest(227-233), MetadataDeleteRequest(317-320), MetadataGetRequest(302-305), MetadataListRequest(329-332), MetadataPutRequest(287-293), recv_message_sync(590-653), send_message_sync(564-587)
lib/gpu_memory_service/__init__.py (2)
lib/gpu_memory_service/client/memory_manager.py (2)
GMSClientMemoryManager(80-836), StaleWeightsError(40-52)
lib/gpu_memory_service/client/torch/lifecycle.py (2)
get_allocator(235-241), get_or_create_allocator(49-164)
lib/gpu_memory_service/client/__init__.py (2)
lib/gpu_memory_service/client/memory_manager.py (2)
GMSClientMemoryManager(80-836), StaleWeightsError(40-52)
lib/gpu_memory_service/client/rpc.py (1)
GMSRPCClient(71-405)
lib/gpu_memory_service/client/torch/lifecycle.py (1)
lib/gpu_memory_service/client/memory_manager.py (3)
GMSClientMemoryManager(80-836), mode(146-148), allocate_to_va(213-299)
lib/gpu_memory_service/common/protocol.py (4)
lib/gpu_memory_service/server/locking.py (2)
committed(234-236), state(208-216)
lib/gpu_memory_service/server/rpc.py (2)
state(92-94), is_ready(100-102)
lib/gpu_memory_service/common/types.py (1)
is_ready(145-147)
lib/gpu_memory_service/server/memory_manager.py (2)
allocation_count(115-117), total_bytes(120-122)
lib/gpu_memory_service/common/types.py (3)
lib/gpu_memory_service/server/locking.py (3)
state(208-216), ro_count(229-231), committed(234-236)
lib/gpu_memory_service/server/rpc.py (2)
state(92-94), is_ready(100-102)
lib/gpu_memory_service/client/rpc.py (2)
committed(165-167), is_ready(222-224)
🪛 Ruff (0.14.10)
lib/gpu_memory_service/client/torch/extensions/__init__.py
15-15: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
16-16: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
21-21: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
24-24: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
lib/gpu_memory_service/server/handler.py
161-161: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
161-161: Avoid specifying long messages outside the exception class
(TRY003)
lib/gpu_memory_service/common/__init__.py
51-94: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
lib/gpu_memory_service/server/rpc.py
126-126: Do not catch blind exception: Exception
(BLE001)
127-127: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
141-141: Do not catch blind exception: Exception
(BLE001)
142-142: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
210-210: Consider moving this statement to an else block
(TRY300)
226-226: Consider moving this statement to an else block
(TRY300)
257-257: Consider moving this statement to an else block
(TRY300)
264-264: Unused method argument: session_id
(ARG002)
300-300: Do not catch blind exception: Exception
(BLE001)
301-301: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
317-317: Do not catch blind exception: Exception
(BLE001)
318-318: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
388-388: Avoid specifying long messages outside the exception class
(TRY003)
components/src/dynamo/gpu_memory_service/__init__.py
32-45: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
lib/gpu_memory_service/client/memory_manager.py
130-132: Avoid specifying long messages outside the exception class
(TRY003)
245-245: Avoid specifying long messages outside the exception class
(TRY003)
254-257: Avoid specifying long messages outside the exception class
(TRY003)
262-264: Avoid specifying long messages outside the exception class
(TRY003)
362-362: Avoid specifying long messages outside the exception class
(TRY003)
383-383: Avoid specifying long messages outside the exception class
(TRY003)
418-418: Avoid specifying long messages outside the exception class
(TRY003)
420-422: Avoid specifying long messages outside the exception class
(TRY003)
426-428: Avoid specifying long messages outside the exception class
(TRY003)
446-446: Avoid specifying long messages outside the exception class
(TRY003)
451-451: Avoid specifying long messages outside the exception class
(TRY003)
502-502: Do not catch blind exception: Exception
(BLE001)
522-522: Avoid specifying long messages outside the exception class
(TRY003)
553-553: Do not catch blind exception: Exception
(BLE001)
578-578: Avoid specifying long messages outside the exception class
(TRY003)
584-587: Avoid specifying long messages outside the exception class
(TRY003)
591-594: Avoid specifying long messages outside the exception class
(TRY003)
608-611: Abstract raise to an inner function
(TRY301)
608-611: Avoid specifying long messages outside the exception class
(TRY003)
613-616: Abstract raise to an inner function
(TRY301)
613-616: Avoid specifying long messages outside the exception class
(TRY003)
618-621: Abstract raise to an inner function
(TRY301)
618-621: Avoid specifying long messages outside the exception class
(TRY003)
624-624: Do not catch blind exception: Exception
(BLE001)
662-662: Avoid specifying long messages outside the exception class
(TRY003)
663-663: Avoid specifying long messages outside the exception class
(TRY003)
716-716: Do not catch blind exception: Exception
(BLE001)
742-742: Avoid specifying long messages outside the exception class
(TRY003)
746-746: Avoid specifying long messages outside the exception class
(TRY003)
757-759: Avoid specifying long messages outside the exception class
(TRY003)
763-766: Avoid specifying long messages outside the exception class
(TRY003)
793-793: Unpacked variable dev_ptr is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
833-833: Do not catch blind exception: Exception
(BLE001)
lib/gpu_memory_service/server/__init__.py
29-41: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
components/src/dynamo/gpu_memory_service/server.py
62-62: Avoid specifying long messages outside the exception class
(TRY003)
87-87: Do not catch blind exception: Exception
(BLE001)
88-88: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
lib/gpu_memory_service/setup.py
71-75: Avoid specifying long messages outside the exception class
(TRY003)
80-80: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
83-86: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
83-86: Avoid specifying long messages outside the exception class
(TRY003)
133-133: Consider [*torch_include, cuda_include] instead of concatenation
Replace with [*torch_include, cuda_include]
(RUF005)
lib/gpu_memory_service/common/cuda_vmm_utils.py
29-29: Avoid specifying long messages outside the exception class
(TRY003)
lib/gpu_memory_service/client/torch/__init__.py
22-29: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
lib/gpu_memory_service/client/rpc.py
124-124: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
124-124: Avoid specifying long messages outside the exception class
(TRY003)
125-125: Do not catch blind exception: Exception
(BLE001)
126-126: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
126-126: Avoid specifying long messages outside the exception class
(TRY003)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
147-147: Avoid specifying long messages outside the exception class
(TRY003)
152-152: Avoid specifying long messages outside the exception class
(TRY003)
194-194: Avoid specifying long messages outside the exception class
(TRY003)
202-202: Prefer TypeError exception for invalid type
(TRY004)
202-202: Avoid specifying long messages outside the exception class
(TRY003)
212-212: Prefer TypeError exception for invalid type
(TRY004)
212-212: Avoid specifying long messages outside the exception class
(TRY003)
219-219: Prefer TypeError exception for invalid type
(TRY004)
219-219: Avoid specifying long messages outside the exception class
(TRY003)
237-237: Avoid specifying long messages outside the exception class
(TRY003)
271-271: Avoid specifying long messages outside the exception class
(TRY003)
275-275: Prefer TypeError exception for invalid type
(TRY004)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
287-287: Unpacked variable response is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
289-289: Avoid specifying long messages outside the exception class
(TRY003)
296-296: Prefer TypeError exception for invalid type
(TRY004)
296-296: Avoid specifying long messages outside the exception class
(TRY003)
309-309: Prefer TypeError exception for invalid type
(TRY004)
309-309: Avoid specifying long messages outside the exception class
(TRY003)
315-315: Avoid specifying long messages outside the exception class
(TRY003)
325-325: Avoid specifying long messages outside the exception class
(TRY003)
329-329: Prefer TypeError exception for invalid type
(TRY004)
329-329: Avoid specifying long messages outside the exception class
(TRY003)
341-341: Avoid specifying long messages outside the exception class
(TRY003)
351-351: Prefer TypeError exception for invalid type
(TRY004)
351-351: Avoid specifying long messages outside the exception class
(TRY003)
361-361: Prefer TypeError exception for invalid type
(TRY004)
361-361: Avoid specifying long messages outside the exception class
(TRY003)
369-369: Avoid specifying long messages outside the exception class
(TRY003)
372-372: Prefer TypeError exception for invalid type
(TRY004)
372-372: Avoid specifying long messages outside the exception class
(TRY003)
379-379: Prefer TypeError exception for invalid type
(TRY004)
379-379: Avoid specifying long messages outside the exception class
(TRY003)
389-390: try-except-pass detected, consider logging the exception
(S110)
389-389: Do not catch blind exception: Exception
(BLE001)
lib/gpu_memory_service/server/locking.py
80-81: try-except-pass detected, consider logging the exception
(S110)
80-80: Do not catch blind exception: Exception
(BLE001)
261-261: Avoid specifying long messages outside the exception class
(TRY003)
316-319: Avoid specifying long messages outside the exception class
(TRY003)
327-330: Avoid specifying long messages outside the exception class
(TRY003)
360-362: Avoid specifying long messages outside the exception class
(TRY003)
366-369: Avoid specifying long messages outside the exception class
(TRY003)
lib/gpu_memory_service/server/memory_manager.py
28-28: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
197-197: Avoid specifying long messages outside the exception class
(TRY003)
260-260: Avoid specifying long messages outside the exception class
(TRY003)
lib/gpu_memory_service/__init__.py
38-45: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
components/src/dynamo/gpu_memory_service/args.py
55-55: Probable insecure usage of temporary file or directory: "/tmp/gpu_memory_service_"
(S108)
lib/gpu_memory_service/client/__init__.py
21-25: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
lib/gpu_memory_service/client/torch/lifecycle.py
109-112: Avoid specifying long messages outside the exception class
(TRY003)
121-124: Avoid specifying long messages outside the exception class
(TRY003)
193-193: Unused function argument: stream
(ARG001)
226-227: try-except-pass detected, consider logging the exception
(S110)
226-226: Do not catch blind exception: Exception
(BLE001)
lib/gpu_memory_service/common/protocol.py
387-387: Avoid specifying long messages outside the exception class
(TRY003)
405-405: Avoid specifying long messages outside the exception class
(TRY003)
410-410: Avoid specifying long messages outside the exception class
(TRY003)
448-448: Avoid specifying long messages outside the exception class
(TRY003)
481-481: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
523-523: Unpacked variable flags is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
537-537: Avoid specifying long messages outside the exception class
(TRY003)
542-542: Avoid specifying long messages outside the exception class
(TRY003)
551-551: Avoid specifying long messages outside the exception class
(TRY003)
591-591: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
623-623: Unpacked variable flags is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
637-637: Avoid specifying long messages outside the exception class
(TRY003)
646-646: Avoid specifying long messages outside the exception class
(TRY003)
lib/gpu_memory_service/client/torch/tensor.py
90-90: Avoid specifying long messages outside the exception class
(TRY003)
93-93: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
160-160: Avoid specifying long messages outside the exception class
(TRY003)
162-162: Prefer TypeError exception for invalid type
(TRY004)
162-162: Avoid specifying long messages outside the exception class
(TRY003)
257-257: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
330-331: try-except-continue detected, consider logging the exception
(S112)
330-330: Do not catch blind exception: Exception
(BLE001)
408-409: try-except-pass detected, consider logging the exception
(S110)
408-408: Do not catch blind exception: Exception
(BLE001)
411-413: Avoid specifying long messages outside the exception class
(TRY003)
452-452: Avoid specifying long messages outside the exception class
(TRY003)
457-457: Avoid specifying long messages outside the exception class
(TRY003)
590-590: Do not catch blind exception: Exception
(BLE001)
624-627: Avoid specifying long messages outside the exception class
(TRY003)
734-738: Avoid specifying long messages outside the exception class
(TRY003)
758-762: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (54)
lib/gpu_memory_service/common/cuda_vmm_utils.py (3)
32-40: LGTM! The CUDA initialization and error handling utilities are well-structured. The redundant `if` check on line 39 before calling `check_cuda_result` is harmless since `check_cuda_result` already handles the success case by returning early.
43-64: LGTM! The granularity retrieval correctly configures the allocation properties for POSIX FD sharing and uses the minimum granularity flag. This aligns with the C++ implementation in allocator.cpp.
67-77: LGTM! Standard ceiling-division alignment formula. Correct implementation.
lib/gpu_memory_service/setup.py (1)
97-146: LGTM! The extension module creation logic correctly handles optional PyTorch availability for `_tensor_from_pointer` while ensuring `_allocator_ext` is always built. The C++17, -O3 optimization, and -fPIC flags are appropriate for CUDA extensions.
lib/gpu_memory_service/client/torch/extensions/allocator.cpp (5)
106-172: LGTM - Robust error handling in `my_malloc`. The function correctly:
- Reserves VA before registering the allocation
- Cleans up VA reservation if callback is not set
- On callback failure, cleans up any handle/mapping that may have been created, then frees the VA
- Properly acquires/releases GIL around Python callback
222-247: LGTM - Correct reference counting in `py_init_module`. The INCREF/DECREF ordering is correct: increment new references before decrementing old ones to avoid premature deallocation if the same object is passed.
251-322: LGTM - Proper cleanup on failure in `py_import_and_map`. The function correctly releases resources in reverse order on error:
- On `cuMemMap` failure: releases handle
- On `cuMemSetAccess` failure: unmaps, then releases handle
326-355: Memory management in `py_get_all_allocations` is correct. The function properly handles `PyList_Append` semantics by decrementing the item reference after append (since `PyList_Append` increments it), and cleans up properly on error paths.
58-62: Thread safety advisory: `g_allocations` is unprotected. The global `g_allocations` map is accessed without synchronization in `my_malloc()` (before acquiring the GIL) and `my_free()` (without acquiring the GIL). While PyTorch's CUDAPluggableAllocator with a single MemPool per process typically ensures serialized allocator calls, and Python callback functions execute with the GIL held, the C++ allocator entry/exit functions (my_malloc/my_free) themselves do not hold the GIL during critical accesses. If this extension is ever used in a multi-threaded C++ context where multiple threads call the allocator directly, consider adding a mutex to protect `g_allocations`.
lib/gpu_memory_service/common/types.py (3)
14-35: LGTM - Well-documented state machine. The `ServerState` enum with the ASCII state diagram in the docstring clearly documents the valid transitions. Using `(str, Enum)` enables clean serialization.
85-131: LGTM - Clear permission model. The separation of `RW_REQUIRED_OPS` and `STATE_ALLOWED_OPS` provides a clear two-level permission check: first verify the operation is allowed in the current state, then verify the connection has the required mode.
134-158: LGTM - Consistent `is_ready` semantics. The `StateSnapshot.is_ready` property correctly matches the server-side implementation in lib/gpu_memory_service/server/rpc.py:99-101: ready means committed with no active writer, while readers may still be connected.
.gitignore (1)
59-60: LGTM - Correct negation pattern. The `!lib/gpu_memory_service/server/` negation ensures the GPU Memory Service server directory is tracked despite the broader `server/` ignore rule.
.dockerignore (1)
46-49: LGTM - Appropriate Docker context exclusions. Excluding `build/`, `*.egg-info/`, and compiled `*.so` files prevents stale build artifacts from inflating the Docker context and ensures fresh builds within containers.
components/src/dynamo/gpu_memory_service/__main__.py (1)
1-7: LGTM! Clean entry point implementation. The module correctly imports and invokes the server's main function, following standard Python CLI patterns.
.devcontainer/post-create.sh (1)
69-78: LGTM! Proper installation sequence for CUDA extensions. The installation correctly:
- Uninstalls any existing package to avoid conflicts
- Uses dynamic site-packages detection for portability
- Applies `--no-build-isolation` so the build can access PyTorch from the environment (required for CUDA extension compilation)
lib/gpu_memory_service/client/__init__.py (1)
15-18: The import path is correct. `StaleWeightsError` is actually defined in memory_manager.py (line 40), not in rpc.py as the original concern suggested. The __init__.py file correctly imports it from memory_manager, and there is no import issue. Likely an incorrect or invalid review comment.
lib/gpu_memory_service/__init__.py (1)
1-45: LGTM! The package initialization is well-structured with clear documentation of the package layout and a clean re-export of the primary client API. The logical grouping in `__all__` (Client, then Lifecycle) is more maintainable than alphabetical sorting for an API surface.
lib/gpu_memory_service/client/torch/extensions/__init__.py (1)
14-26: LGTM! The guarded import pattern is appropriate for optional C++ extensions that may not be built yet. Setting the placeholders to `None` on `ImportError` allows dependent code to check availability before use.
lib/gpu_memory_service/server/metadata_store.py (1)
24-67: LGTM! Clean implementation of an in-memory metadata store. The frozen dataclass for entries and explicit non-thread-safe documentation (relying on external RW/RO lock semantics) are appropriate design choices.
components/src/dynamo/gpu_memory_service/__init__.py (1)
1-45: LGTM! The Dynamo component wrapper provides a clean re-export surface with logical grouping. The docstring clearly documents the purpose and relationship to the core `gpu_memory_service` package.
lib/gpu_memory_service/client/torch/__init__.py (1)
1-29: LGTM! Clean PyTorch integration surface with logical grouping of lifecycle and tensor operation exports.
lib/gpu_memory_service/client/torch/extensions/tensor_from_pointer.cpp (2)
23-44: LGTM! The implementation correctly uses `torch::from_blob` with a no-op deleter for non-owning tensor views. The optional strides handling is clean, and the cast chain for the pointer is appropriate for CUDA memory addresses.
63-80: Clear API documentation. The pybind11 module registration with inline docstrings provides clear Python-side documentation for both exported functions.
lib/gpu_memory_service/server/memory_manager.py (2)
1-57: Well-structured module with proper CUDA VMM abstraction. The module header, dataclass definition, and exception class are clean and well-documented. The design decision to have no VA mapping on the server side (avoiding CUDA context) is sound for crash resilience.
60-103: LGTM on class design and initialization. The lazy CUDA initialization pattern and clear documentation about thread-safety requirements are appropriate.
lib/gpu_memory_service/server/__init__.py (1)
1-41: Clean package re-export structure. The module appropriately aggregates and exposes the public API surface. The docstring provides a good architectural overview.
lib/gpu_memory_service/client/memory_manager.py (2)
40-52: Well-documented exception for stale weights detection. The docstring clearly explains when this exception is raised and what it doesn't detect (content-only changes), which is helpful for users.
825-836: Best-effort cleanup in _unmap_all is appropriate. The broad exception handling here is acceptable since this is cleanup code that should proceed even if individual unmaps fail. The warnings provide observability.
lib/gpu_memory_service/server/handler.py (2)
41-72: Clean handler design with proper separation of concerns. The RequestHandler correctly separates business logic from connection management. The initialization and property accessors are straightforward.
73-93: Lifecycle callbacks are well-structured. The `on_rw_abort` and `on_shutdown` callbacks properly clean up both allocations and metadata. The conditional check in `on_shutdown` avoids unnecessary work.
lib/gpu_memory_service/client/torch/lifecycle.py (1)
49-165: Solid singleton lifecycle management with clear mode transitions. The `get_or_create_allocator` function properly enforces single-manager-per-process semantics with clear error messages for incompatible mode requests. The auto mode handling is sensible.
lib/gpu_memory_service/server/rpc.py (2)
57-88: Well-architected async RPC server with explicit state machine. The server initialization is clean with proper separation between request handling (delegated to RequestHandler) and state management (delegated to GlobalLockFSM).
406-439: Server lifecycle management is robust. The `start()` method properly cleans up stale socket files, and `stop()` correctly handles shutdown by:
- Signaling shutdown to waiters
- Closing the server
- Closing active connections (bypassing FSM - appropriate for shutdown)
- Cleaning up resources
lib/gpu_memory_service/client/rpc.py (3)
71-116: Clear client design with lock-as-connection semantics. The docstring and class design clearly communicate that the socket connection represents the lock, which is a key architectural decision.
228-262: Clever commit verification via RO reconnect. The commit() method's error recovery strategy - verifying via a quick RO connection when the socket closes - is a robust way to confirm the commit succeeded despite connection errors. This handles the race condition where the server closes the socket as part of commit.
402-405: Destructor warning for leaked connections is helpful for debugging. The `__del__` warning helps identify connection leaks during development. This is a good practice for resources that should be explicitly closed.
lib/gpu_memory_service/server/locking.py (5)
49-82: Connection class implementation looks solid. The design choice to use `eq=False` with an explicit `__hash__` based on `session_id` is appropriate for use in sets while maintaining object identity for equality. The `raw_socket` property correctly extracts the underlying socket for FD passing.
One minor observation: the `close()` method silently swallows all exceptions (lines 80-81). While this is often acceptable for cleanup operations, consider logging at DEBUG level for troubleshooting connection issues.
177-241: Well-designed state machine with derived state. The design choice to derive `state` from the actual connection objects (`_rw_conn`, `_ro_conns`, `_committed`) rather than storing it separately is robust - it ensures state consistency and eliminates potential drift between stored state and actual connections. The transition history log is useful for debugging.
245-261: Condition evaluation is correct but uses string-based dispatch. The `_has_remaining_readers` logic correctly handles the edge case where `conn` might not be in `_ro_conns` (defensive programming). The string-based condition dispatch in `_check_condition` is functional but could be replaced with a method reference in the `Transition` dataclass for type safety. However, this is a minor consideration for a small, controlled set of conditions.
297-341: Transition method correctly validates state changes. The post-transition validation (lines 326-330) that checks if the resulting state matches the expected `to_state` is a good defensive measure to catch implementation bugs. The transition logging with session context aids debugging.
373-395: Lock acquisition predicates correctly implement fairness.
`can_acquire_rw()` correctly requires both no RW holder and no RO holders. `can_acquire_ro()` includes a `waiting_writers == 0` check which prevents writer starvation - readers won't jump ahead of waiting writers. This is a good fairness design (a reduced sketch follows).
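A reduced sketch of the two predicates as described above; the attribute names (_rw_conn, _ro_conns, _waiting_writers) are illustrative stand-ins, not the exact FSM fields:
```python
def can_acquire_rw(self) -> bool:
    # A writer needs exclusive access: no current writer and no readers attached.
    return self._rw_conn is None and not self._ro_conns

def can_acquire_ro(self) -> bool:
    # Readers must not jump ahead of waiting writers (starvation avoidance),
    # and cannot attach while a writer holds the lock.
    return self._rw_conn is None and self._waiting_writers == 0
```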
lib/gpu_memory_service/client/torch/tensor.py (7)
36-65: Clean metadata dataclass design. The separation of `TensorMeta` (serialization format with string dtype) and `ParsedTensorMeta` (runtime format with `torch.dtype`) provides a clean boundary between storage and usage. The frozen dataclasses ensure immutability.
152-164: dtype parsing is robust with proper validation. The `getattr(torch, s)` approach is flexible, and the subsequent `isinstance(dt, torch.dtype)` check correctly guards against non-dtype attributes being returned.
187-200: Legacy format fallback is well-handled. The graceful degradation from `span_bytes` → `nbytes` → computed value ensures backward compatibility. The element size computation via `torch.tensor([], dtype=dtype).element_size()` works but creates a temporary tensor; consider using `torch.finfo(dtype).bits // 8` for floating types or `torch.iinfo(dtype).bits // 8` for integer types if performance becomes a concern in hot paths (see the sketch below).
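A small helper along the lines the comment suggests, avoiding the temporary-tensor allocation; the helper name is illustrative:
```python
import torch

def dtype_element_size(dtype: torch.dtype) -> int:
    """Bytes per element for a torch dtype without allocating a tensor."""
    if dtype.is_complex:
        # finfo reports the width of one component, so double it.
        return 2 * (torch.finfo(dtype).bits // 8)
    if dtype.is_floating_point:
        return torch.finfo(dtype).bits // 8
    if dtype == torch.bool:
        return 1
    return torch.iinfo(dtype).bits // 8
```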
272-351: Module tree extraction is comprehensive but complex. The function correctly captures parameters, buffers, and arbitrary tensor attributes (like `_k_scale`, `_v_scale`). The attribute filtering logic (lines 308-327) is intricate but handles the edge cases well - skipping dunder attributes while allowing private tensor attributes that are common in quantization scenarios.
The broad `except Exception` at line 330 is appropriate here since `dir()` can return attributes that raise when accessed (descriptors, lazy properties), and we want to skip those gracefully.
384-416: Module path resolution handles container types correctly. The `resolve_module_attr` function properly handles `ModuleList`, `Sequential`, and `ModuleDict` by checking for `__getitem__` and handling both integer indices and string keys. The fallback to `__getitem__` after the `hasattr` check is the right approach for these container types.
726-747: Parameter materialization handles meta tensors correctly. The logic to check for shape/dtype mismatches (lines 733-738) and the `needs_replace` condition for meta tensors or device mismatch (line 739) is correct. The `requires_grad` handling for non-floating-point types (lines 742-743) properly prevents gradients on integer tensors.
One minor note: lines 732 and 756 use `assert` for runtime validation. While unlikely to fail in practice (the condition is checked in the `if` clause), `assert` statements can be stripped with the -O flag. Consider removing these redundant assertions.
804-849: Meta tensor initialization with default values is pragmatic. The heuristic to fill "scale" tensors with 1.0 and others with 0.0 is a reasonable default that works for common quantization patterns. The implementation correctly iterates over both parameters and buffers.
Consider documenting this behavior in the function docstring so users understand that meta tensors not in GMS metadata will be initialized with default values.
lib/gpu_memory_service/common/protocol.py (5)
46-78: Message type codes are well-organized. The message type codes are logically grouped (handshake 0x01-0x04, state 0x05-0x08, allocation 0x10-0x1A, metadata 0x30-0x37, error 0xFF). This organization leaves room for future additions in each category.
83-111: Handshake protocol design is clean. The `lock_type` Literal type clearly defines the valid options, and the `rw_or_ro` mode for opportunistic fallback is a practical feature. The `granted_lock_type` in the response allows callers to know which lock was actually acquired.
380-418: Serialization is straightforward and correct. The `encode_message` function correctly handles dataclass conversion via `asdict`, and `decode_message` properly reconstructs the message instance using `**payload` unpacking. The message type lookup dictionaries provide clean bidirectional mapping.
454-471: FD passing with socket duplication is complex but necessary. The approach of duplicating the transport's file descriptor, creating a temporary socket for `sendmsg`, then detaching is a workaround for asyncio's lack of native ancillary data support. The exception handling correctly cleans up the duplicated fd on failure.
This is fragile code that depends on Python's socket semantics. Consider adding a brief comment explaining why this dance is necessary (asyncio streams don't expose sendmsg, and the transport socket may have issues with sendmsg in an async context). The standalone sketch below shows the underlying SCM_RIGHTS pattern the workaround is reproducing.
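For reference, a standalone sketch of the plain blocking-socket version of this pattern (sendmsg/recvmsg with SCM_RIGHTS over an AF_UNIX socket); the async path in protocol.py has to duplicate the transport fd precisely because asyncio streams expose no sendmsg:
```python
import array
import socket

def send_with_fd(sock: socket.socket, payload: bytes, fd: int) -> None:
    """Send payload plus one file descriptor as ancillary data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([payload], ancillary)

def recv_with_fd(sock: socket.socket, maxlen: int = 65536) -> tuple[bytes, int]:
    """Receive payload and an optional file descriptor; fd is -1 if none arrived."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(maxlen, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Truncate to a whole number of ints before unpacking.
            fds.frombytes(cdata[: len(cdata) - (len(cdata) % fds.itemsize)])
    return data, (fds[0] if fds else -1)
```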
621-631: Sync version correctly uses consistent socket operations. Unlike the async version, the sync `recv_message_sync` uses the same socket for both `recvmsg` and subsequent `recv` calls (line 644), avoiding any buffer inconsistency issues. The unused `flags` variable (line 623) is a minor static analysis warning that can be silenced by prefixing with underscore.
@galletas1712 We can't really take a 6k line PR, no-one can review that. Is there a few-hundred line piece inside this that is useful by itself? We could maybe start with that, and if that works build on it.
Is it possible to run this on bare metal, or does it require a devcontainer?
If it is possible, we might want to add some notes here in case anything different needs to be done:
https://github.com/ai-dynamo/dynamo/blob/main/README.md#developing-locally
If it's not possible, let's check that the developing locally workflow is not broken
@CodeRabbit review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@components/src/dynamo/gpu_memory_service/__init__.py`:
- Around line 6-9: Update the module docstring in the dynamo.gpu_memory_service
package to fix the typo: replace the phrase "gpu_memory package" with
"gpu_memory_service" so the description correctly names the wrapped package;
edit the top-of-file docstring in __init__.py that documents the CLI entry point
and re-exports to use the exact package name "gpu_memory_service".
- Around lines 20-22: The unconditional import of the optional compiled extension gpu_memory_service.client.torch.extensions._allocator_ext can raise ImportError in CPU-only or minimal installs. Wrap the import in a try/except ImportError (or Exception) inside __init__.py, assign a safe fallback (e.g., _allocator_ext = None) when the import fails, and optionally set/export a boolean flag (e.g., _has_allocator_ext) so callers can check before using _allocator_ext; update any re-exports to use the fallback symbol to avoid hard crashes. A sketch of this pattern follows.
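A minimal sketch of the guarded import the comment asks for; the module path comes from this review, while the _has_allocator_ext flag is an illustrative name rather than an existing symbol:
```python
try:
    from gpu_memory_service.client.torch.extensions import _allocator_ext
except ImportError:  # CPU-only or minimal installs without the compiled extension
    _allocator_ext = None

# Callers can check this flag instead of importing the extension directly.
_has_allocator_ext = _allocator_ext is not None
```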
♻️ Duplicate comments (1)
components/src/dynamo/gpu_memory_service/args.py (1)
52-58: Prefer a per-user runtime dir over /tmp for the default socket path. This was already flagged earlier; keeping it here to track the same concern.
🧹 Nitpick comments (1)
components/src/dynamo/gpu_memory_service/__init__.py (1)
29-41: (Optional) Sort `__all__` to satisfy Ruff RUF022. Ruff flags this list as unsorted. If you want to keep the grouping, consider suppressing the rule; otherwise, sort it.
♻️ One possible ordering
```diff
 __all__ = [
-    # Core
-    "GMSClientMemoryManager",
-    "StaleMemoryLayoutError",
-    # GMS client memory manager
-    "get_or_create_gms_client_memory_manager",
-    "get_gms_client_memory_manager",
-    # Tensor utilities
-    "register_module_tensors",
-    "materialize_module_from_gms",
-    # Extensions
     "_allocator_ext",
+    "GMSClientMemoryManager",
+    "StaleMemoryLayoutError",
+    "get_gms_client_memory_manager",
+    "get_or_create_gms_client_memory_manager",
+    "materialize_module_from_gms",
+    "register_module_tensors",
 ]
```
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com>
🤷
Signed-off-by: Schwinn Saereesitthipitak <17022745+galletas1712@users.noreply.github.com> Signed-off-by: davilu <davilu@nvidia.com>
GPU Memory Service provides out-of-process GPU memory management, enabling zero-copy sharing and data survival across process crashes.
This is one of a few PRs for the GPU Memory Service - it allows weights to be managed externally from the inference engine in a resilient manner. There are a few downstream benefits for the user:
Package structure:
Key features:
Summary by CodeRabbit
New Features
Chores