[NVBUG: 5472822] Skip memory monitoring if not available #374
Conversation
Signed-off-by: Chenjie Luo <[email protected]>
Walkthrough

Updated launch_memory_monitor to perform a pre-check for NVML memory info on GPU 0. On pre-check failure, it logs an error and returns None. On success, it proceeds to create, start, and register the GPUMemoryMonitor as before. The function's return type is updated to GPUMemoryMonitor | None.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Utils as launch_memory_monitor
    participant NVML as NVML API
    participant Monitor as GPUMemoryMonitor
    Caller->>Utils: launch_memory_monitor(interval)
    Utils->>NVML: get_memory_info(gpu=0)
    alt NVML info OK
        Utils->>Monitor: create(interval)
        Utils->>Monitor: start()
        Utils->>Caller: return Monitor instance
    else NVML info fails
        Utils->>Caller: print error, return None
    end
```
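For readers of this thread, the pre-check described above amounts to roughly the following sketch. It is an approximation, not the verbatim diff: the pre-check body matches the hunk quoted later in this review, while GPUMemoryMonitor's constructor signature, the default interval, and the atexit registration are assumptions.

```python
# Rough sketch of launch_memory_monitor after this PR (not the exact diff).
# GPUMemoryMonitor is the class defined earlier in memory_monitor.py; its
# constructor signature, the default interval, and the atexit hook are assumptions.
import atexit

from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo


def launch_memory_monitor(interval: float = 1.0) -> "GPUMemoryMonitor | None":
    """Start GPU memory monitoring, or return None if NVML memory info is unavailable."""
    try:
        # Pre-check: fail fast if NVML cannot report memory info for GPU 0.
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:  # the PR deliberately skips monitoring on any failure here
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None

    monitor = GPUMemoryMonitor(interval)  # assumed constructor signature
    monitor.start()
    atexit.register(monitor.stop)  # "register" per the walkthrough; exact hook assumed
    return monitor
```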
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/utils/memory_monitor.py (2)
91-99: Guard per-GPU sampling against NVML NotSupported to avoid thread crash.

If some GPUs don't support memory info, the monitoring thread will raise and stop. Handle pynvml errors per device and continue.
Apply this refinement:
```diff
 def _monitor_loop(self):
     while self.is_running:
         for i in range(self.device_count):
-            handle = nvmlDeviceGetHandleByIndex(i)
-            gpu_memory = nvmlDeviceGetMemoryInfo(handle)
-            used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)  # Convert to GB
-            self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)
+            try:
+                handle = nvmlDeviceGetHandleByIndex(i)
+                gpu_memory = nvmlDeviceGetMemoryInfo(handle)
+            except NVMLError_NotSupported:
+                continue  # skip GPUs without memory info support
+            except NVMLError:
+                continue  # optionally log and continue
+            used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)
+            self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)
```

Note: Requires importing NVMLError and NVMLError_NotSupported as mentioned above.
121-131: Join the thread before printing to avoid concurrent access to peak_memory.

Minor race: printing while the monitor thread may still update the dict. Join first, then print.

```diff
 def stop(self):
     self.is_running = False
-    # Print peak memory usage
-    print("########")
-    for gpu_idx, peak_mem in self.peak_memory.items():
-        print(
-            f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU"
-        )
-    print("########")
     if self.monitor_thread:
         self.monitor_thread.join()
+    # Print peak memory usage
+    print("########")
+    for gpu_idx, peak_mem in self.peak_memory.items():
+        print(f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU")
+    print("########")
     nvmlShutdown()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/utils/memory_monitor.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/memory_monitor.py (1)
134-142: Update Returns docstring; union type supported.

Replace the Returns section in launch_memory_monitor with:

```diff
-    Returns:
-        GPUMemoryMonitor: The monitor instance that was launched
+    Returns:
+        GPUMemoryMonitor | None: The monitor instance if monitoring is available; otherwise None.
```

Python ≥3.10 already permits the | union syntax.
```python
    try:
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None
```
Pre-check calls NVML without initialization; narrow exception handling and balance init/shutdown.
Calling nvmlDeviceGetHandleByIndex / nvmlDeviceGetMemoryInfo before nvmlInit will raise Uninitialized and wrongly disable monitoring. Also, catching Exception is too broad. Initialize NVML for the probe, catch pynvml errors specifically, and ensure shutdown to balance the pre-check init.
Apply this diff to harden the pre-check:
```diff
-    try:
-        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
-    except Exception as e:
-        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
-        return None
+    try:
+        nvmlInit()
+        count = nvmlDeviceGetCount()
+        if count == 0:
+            print("No NVIDIA GPUs detected. Skipping GPU memory monitor.")
+            return None
+        handle = nvmlDeviceGetHandleByIndex(0)
+        nvmlDeviceGetMemoryInfo(handle)
+    except NVMLError_NotSupported as e:
+        print(f"NVML memory info not supported on this platform: {e}. Skipping GPU memory monitor.")
+        return None
+    except NVMLError as e:
+        print(f"NVML not available (init/count/handle/memoryinfo failed): {e}. Skipping GPU memory monitor.")
+        return None
+    finally:
+        try:
+            nvmlShutdown()
+        except NVMLError:
+            pass
```
Add the required exception imports near the existing NVML imports:
from pynvml import NVMLError, NVMLError_NotSupported
Alternatively, group with the existing import block.
🤖 Prompt for AI Agents
In modelopt/torch/utils/memory_monitor.py around lines 143 to 147, the pre-check
calls nvmlDeviceGetHandleByIndex/nvmlDeviceGetMemoryInfo without calling
nvmlInit, catches Exception too broadly, and does not balance nvmlInit with
nvmlShutdown; initialize NVML before probing, wrap the probe in a try/except
that catches pynvml-specific errors (NVMLError and NVMLError_NotSupported)
instead of Exception, call nvmlShutdown in a finally block so NVML is always
shut down after the probe, and add the required imports from pynvml (NVMLError,
NVMLError_NotSupported) near the existing NVML imports.
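To complement the suggestion above, the skip path could also be pinned down by a test. The sketch below is hypothetical (test name and import path are mine), and assumes launch_memory_monitor and the NVML functions it probes are module-level attributes of modelopt.torch.utils.memory_monitor:

```python
# Hypothetical test sketch (not part of the PR). Assumes the probe's failure
# path returns None, as described in the review above.
from unittest import mock

from modelopt.torch.utils import memory_monitor


def test_launch_memory_monitor_skips_when_probe_fails():
    # Force the GPU-0 memory-info probe to fail, as on an unsupported platform.
    with mock.patch.object(
        memory_monitor, "nvmlDeviceGetMemoryInfo", side_effect=RuntimeError("not supported")
    ):
        assert memory_monitor.launch_memory_monitor() is None
```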
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #374   +/-   ##
=======================================
  Coverage   73.46%   73.46%
=======================================
  Files         172      172
  Lines       17640    17640
=======================================
+ Hits        12959    12960       +1
+ Misses       4681     4680       -1
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Bug fix
Overview:

On some platforms, the memory monitor hits the following issue:

We should skip memory monitoring when it is not available.
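Since launch_memory_monitor can now return None, callers are expected to tolerate the skip. A minimal usage sketch, assuming the function is importable from modelopt.torch.utils (the exact public import path is an assumption):

```python
# Minimal usage sketch, not taken from the PR. Assumes launch_memory_monitor
# returns None when NVML memory info is unavailable, as this PR describes.
from modelopt.torch.utils import launch_memory_monitor

monitor = launch_memory_monitor()
if monitor is None:
    print("GPU memory monitoring unavailable; continuing without it.")

# ... run the quantization / export workload here ...

if monitor is not None:
    monitor.stop()  # prints peak memory per GPU and shuts down NVML
```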
Testing
Test on the target platform:
```sh
scripts/huggingface_example.sh --model "Qwen/Qwen2.5-Coder-0.5B-Instruct" --quant nvfp4 --tp 1 --export_fmt hf
```
Summary by CodeRabbit