Conversation

cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Sep 25, 2025

What does this PR do?

Bug fix

Overview: Skip GPU memory monitoring when NVML memory info is not available on the platform.

On some platforms, the memory monitor hits the following error:

  File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/memory_monitor.py", line 95, in _monitor_loop                                              
    gpu_memory = nvmlDeviceGetMemoryInfo(handle)                                                                                                                
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 3522, in nvmlDeviceGetMemoryInfo                                                               
    _nvmlCheckReturn(ret)                                                                                                                                       
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 1059, in _nvmlCheckReturn                                                                      
    raise NVMLError(ret)                                                                                                                                        
pynvml.NVMLError_NotSupported: Not Supported

We should skip memory monitoring if it's not available.
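
The change adds a one-time pre-check in launch_memory_monitor (modelopt/torch/utils/memory_monitor.py). A minimal sketch of the idea, with the parameter name illustrative and the existing create/start logic abbreviated (see the reviewed hunk below for the exact added lines):

from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def launch_memory_monitor(monitor_interval: float = 1.0) -> "GPUMemoryMonitor | None":
    """Launch the GPU memory monitor, or skip it when NVML memory info is unavailable."""
    # Probe GPU 0 once; platforms without support raise NVMLError_NotSupported here.
    try:
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None
    # Unchanged path: create and start the GPUMemoryMonitor defined earlier in the module.
    monitor = GPUMemoryMonitor(monitor_interval)
    monitor.start()
    return monitor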

Testing

Tested on the target platform:
scripts/huggingface_example.sh --model "Qwen/Qwen2.5-Coder-0.5B-Instruct" --quant nvfp4 --tp 1 --export_fmt hf

Summary by CodeRabbit

  • Bug Fixes
    • Prevents crashes during startup when GPU memory info cannot be retrieved (e.g., missing or misconfigured NVML).
    • Adds a preliminary check before enabling GPU memory monitoring; if unavailable, monitoring is skipped and a clear error is logged.
    • No behavior change for users with working GPU monitoring.
    • Improves robustness across diverse environments, ensuring smoother operation on systems without accessible GPU metrics.

@cjluo-nv cjluo-nv requested a review from a team as a code owner September 25, 2025 17:20
@cjluo-nv cjluo-nv requested a review from meenchen September 25, 2025 17:20

coderabbitai bot commented Sep 25, 2025

Walkthrough

Updated launch_memory_monitor to perform a pre-check for NVML memory info on GPU 0. On pre-check failure, it logs an error and returns None. On success, it proceeds to create, start, and register the GPUMemoryMonitor as before. The function’s return type is updated to GPUMemoryMonitor | None.
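
Since launch_memory_monitor can now return None, callers that keep the returned handle should guard for it; a hypothetical usage sketch (import path inferred from the file location, default interval assumed):

from modelopt.torch.utils.memory_monitor import launch_memory_monitor

monitor = launch_memory_monitor()
if monitor is None:
    # NVML memory info is unavailable on this platform; continue without GPU memory stats.
    pass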

Changes

Cohort / File(s): Memory monitor initialization and API (modelopt/torch/utils/memory_monitor.py)
Summary: Added an NVML memory info pre-check for GPU 0; on failure, print an error and return None. On success, create/start/register the monitor as before. Updated the function signature to return GPUMemoryMonitor | None.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Utils as launch_memory_monitor
  participant NVML as NVML API
  participant Monitor as GPUMemoryMonitor

  Caller->>Utils: launch_memory_monitor(interval)
  Utils->>NVML: get_memory_info(gpu=0)
  alt NVML info OK
    Utils->>Monitor: create(interval)
    Utils->>Monitor: start()
    Utils->>Caller: return Monitor instance
  else NVML info fails
    Utils->>Caller: print error, return None
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A nibble of bytes, a cautious hop,
I sniff the GPU—should I stop?
If NVML’s hush says “none,” I flee;
Else, I monitor merrily.
Thump-thump logs, ears up, I run—
Safe checks first, then metrics fun! 🐇💾

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the primary change (skipping the memory monitor when NVML is unavailable) and directly references the associated bug, making it concise, specific, and fully aligned with the PR’s intent.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

@cjluo-nv cjluo-nv requested a review from Edwardf0t1 September 25, 2025 17:20

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
modelopt/torch/utils/memory_monitor.py (2)

91-99: Guard per‑GPU sampling against NVML NotSupported to avoid thread crash.

If some GPUs don’t support memory info, the monitoring thread will raise and stop. Handle pynvml errors per device and continue.

Apply this refinement:

     def _monitor_loop(self):
         while self.is_running:
             for i in range(self.device_count):
-                handle = nvmlDeviceGetHandleByIndex(i)
-                gpu_memory = nvmlDeviceGetMemoryInfo(handle)
-                used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)  # Convert to GB
-                self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)
+                try:
+                    handle = nvmlDeviceGetHandleByIndex(i)
+                    gpu_memory = nvmlDeviceGetMemoryInfo(handle)
+                except NVMLError_NotSupported:
+                    continue  # skip GPUs without memory info support
+                except NVMLError:
+                    continue  # optionally log and continue
+                used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)
+                self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)

Note: Requires importing NVMLError and NVMLError_NotSupported as mentioned above.


121-131: Join the thread before printing to avoid concurrent access to peak_memory.

Minor race: printing while the monitor thread may still update the dict. Join first, then print.

     def stop(self):
         self.is_running = False
-        # Print peak memory usage
-        print("########")
-        for gpu_idx, peak_mem in self.peak_memory.items():
-            print(
-                f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU"
-            )
-        print("########")
         if self.monitor_thread:
             self.monitor_thread.join()
+        # Print peak memory usage
+        print("########")
+        for gpu_idx, peak_mem in self.peak_memory.items():
+            print(f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU")
+        print("########")
         nvmlShutdown()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0178562 and f3c402c.

📒 Files selected for processing (1)
  • modelopt/torch/utils/memory_monitor.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/memory_monitor.py (1)

134-142: Update Returns docstring; union type supported
Replace the Returns section in launch_memory_monitor with:

-    Returns:
-        GPUMemoryMonitor: The monitor instance that was launched
+    Returns:
+        GPUMemoryMonitor | None: The monitor instance if monitoring is available; otherwise None.

Python ≥3.10 already permits the | union syntax.

Comment on lines +143 to +147
    try:
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None

⚠️ Potential issue | 🔴 Critical

Pre-check calls NVML without initialization; narrow exception handling and balance init/shutdown.

Calling nvmlDeviceGetHandleByIndex / nvmlDeviceGetMemoryInfo before nvmlInit will raise Uninitialized and wrongly disable monitoring. Also, catching Exception is too broad. Initialize NVML for the probe, catch pynvml errors specifically, and ensure shutdown to balance the pre-check init.

Apply this diff to harden the pre-check:

-    try:
-        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
-    except Exception as e:
-        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
-        return None
+    try:
+        nvmlInit()
+        count = nvmlDeviceGetCount()
+        if count == 0:
+            print("No NVIDIA GPUs detected. Skipping GPU memory monitor.")
+            return None
+        handle = nvmlDeviceGetHandleByIndex(0)
+        nvmlDeviceGetMemoryInfo(handle)
+    except NVMLError_NotSupported as e:
+        print(f"NVML memory info not supported on this platform: {e}. Skipping GPU memory monitor.")
+        return None
+    except NVMLError as e:
+        print(f"NVML not available (init/count/handle/memoryinfo failed): {e}. Skipping GPU memory monitor.")
+        return None
+    finally:
+        try:
+            nvmlShutdown()
+        except NVMLError:
+            pass

Add the required exception imports near the existing NVML imports:

from pynvml import NVMLError, NVMLError_NotSupported

Alternatively, group with the existing import block.
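
For illustration, a grouped import could look like the following; the exact existing import list is not shown in this review, so the helpers below are assumed from the snippets above:

from pynvml import (
    NVMLError,
    NVMLError_NotSupported,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
    nvmlShutdown,
)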

🤖 Prompt for AI Agents
In modelopt/torch/utils/memory_monitor.py around lines 143 to 147, the pre-check
calls nvmlDeviceGetHandleByIndex/nvmlDeviceGetMemoryInfo without calling
nvmlInit, catches Exception too broadly, and does not balance nvmlInit with
nvmlShutdown; initialize NVML before probing, wrap the probe in a try/except
that catches pynvml-specific errors (NVMLError and NVMLError_NotSupported)
instead of Exception, call nvmlShutdown in a finally block so NVML is always
shut down after the probe, and add the required imports from pynvml (NVMLError,
NVMLError_NotSupported) near the existing NVML imports.


codecov bot commented Sep 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (0178562) to head (f3c402c).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #374   +/-   ##
=======================================
  Coverage   73.46%   73.46%           
=======================================
  Files         172      172           
  Lines       17640    17640           
=======================================
+ Hits        12959    12960    +1     
+ Misses       4681     4680    -1     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 merged commit b15a2b8 into main Sep 26, 2025
27 checks passed
@kevalmorabia97 kevalmorabia97 deleted the chenjiel/memory_monitor_fix branch September 26, 2025 03:54
kevalmorabia97 pushed a commit that referenced this pull request Sep 26, 2025
kevalmorabia97 pushed a commit that referenced this pull request Sep 26, 2025
yeyu-nvidia pushed a commit that referenced this pull request Oct 1, 2025