Conversation

cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Sep 25, 2025

What does this PR do?

Bug fix

Overview: Skip GPU memory monitoring when NVML memory info is not available on the platform.

On some platforms, the memory monitor hits the following error:

  File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/memory_monitor.py", line 95, in _monitor_loop                                              
    gpu_memory = nvmlDeviceGetMemoryInfo(handle)                                                                                                                
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 3522, in nvmlDeviceGetMemoryInfo                                                               
    _nvmlCheckReturn(ret)                                                                                                                                       
  File "/usr/local/lib/python3.12/dist-packages/pynvml.py", line 1059, in _nvmlCheckReturn                                                                      
    raise NVMLError(ret)                                                                                                                                        
pynvml.NVMLError_NotSupported: Not Supported

We should skip memory monitoring if it's not available.
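
The change adds a one-time pre-check in launch_memory_monitor (modelopt/torch/utils/memory_monitor.py). A minimal sketch of the idea, with the parameter name illustrative and the existing create/start logic abbreviated (see the reviewed hunk below for the exact added lines):

from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def launch_memory_monitor(monitor_interval: float = 1.0) -> "GPUMemoryMonitor | None":
    """Launch the GPU memory monitor, or skip it when NVML memory info is unavailable."""
    # Probe GPU 0 once; platforms without support raise NVMLError_NotSupported here.
    try:
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None
    # Unchanged path: create and start the GPUMemoryMonitor defined earlier in the module.
    monitor = GPUMemoryMonitor(monitor_interval)
    monitor.start()
    return monitor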

Testing

Tested on the target platform:
scripts/huggingface_example.sh --model "Qwen/Qwen2.5-Coder-0.5B-Instruct" --quant nvfp4 --tp 1 --export_fmt hf

Summary by CodeRabbit

  • Bug Fixes
    • Prevents crashes during startup when GPU memory info cannot be retrieved (e.g., missing or misconfigured NVML).
    • Adds a preliminary check before enabling GPU memory monitoring; if unavailable, monitoring is skipped and a clear error is logged.
    • No behavior change for users with working GPU monitoring.
    • Improves robustness across diverse environments, ensuring smoother operation on systems without accessible GPU metrics.

@cjluo-nv cjluo-nv requested a review from a team as a code owner September 25, 2025 17:20
@cjluo-nv cjluo-nv requested a review from meenchen September 25, 2025 17:20

coderabbitai bot commented Sep 25, 2025

Walkthrough

Updated launch_memory_monitor to perform a pre-check for NVML memory info on GPU 0. On pre-check failure, it logs an error and returns None. On success, it proceeds to create, start, and register the GPUMemoryMonitor as before. The function’s return type is updated to GPUMemoryMonitor | None.
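
Since launch_memory_monitor can now return None, callers that keep the returned handle should guard for it; a hypothetical usage sketch (import path inferred from the file location, default interval assumed):

from modelopt.torch.utils.memory_monitor import launch_memory_monitor

monitor = launch_memory_monitor()
if monitor is None:
    # NVML memory info is unavailable on this platform; continue without GPU memory stats.
    pass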

Changes

Cohort / File(s): Memory monitor initialization and API (modelopt/torch/utils/memory_monitor.py)
Summary: Added an NVML memory info pre-check for GPU 0; on failure, print an error and return None. On success, create/start/register the monitor as before. Updated the function signature to return GPUMemoryMonitor | None.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Utils as launch_memory_monitor
  participant NVML as NVML API
  participant Monitor as GPUMemoryMonitor

  Caller->>Utils: launch_memory_monitor(interval)
  Utils->>NVML: get_memory_info(gpu=0)
  alt NVML info OK
    Utils->>Monitor: create(interval)
    Utils->>Monitor: start()
    Utils->>Caller: return Monitor instance
  else NVML info fails
    Utils->>Caller: print error, return None
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A nibble of bytes, a cautious hop,
I sniff the GPU—should I stop?
If NVML’s hush says “none,” I flee;
Else, I monitor merrily.
Thump-thump logs, ears up, I run—
Safe checks first, then metrics fun! 🐇💾

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the primary change (skipping the memory monitor when NVML is unavailable) and directly references the associated bug, making it concise, specific, and fully aligned with the PR’s intent.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

@cjluo-nv cjluo-nv requested a review from Edwardf0t1 September 25, 2025 17:20

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
modelopt/torch/utils/memory_monitor.py (2)

91-99: Guard per‑GPU sampling against NVML NotSupported to avoid thread crash.

If some GPUs don’t support memory info, the monitoring thread will raise and stop. Handle pynvml errors per device and continue.

Apply this refinement:

     def _monitor_loop(self):
         while self.is_running:
             for i in range(self.device_count):
-                handle = nvmlDeviceGetHandleByIndex(i)
-                gpu_memory = nvmlDeviceGetMemoryInfo(handle)
-                used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)  # Convert to GB
-                self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)
+                try:
+                    handle = nvmlDeviceGetHandleByIndex(i)
+                    gpu_memory = nvmlDeviceGetMemoryInfo(handle)
+                except NVMLError_NotSupported:
+                    continue  # skip GPUs without memory info support
+                except NVMLError:
+                    continue  # optionally log and continue
+                used_memory_gb = gpu_memory.used / (1024 * 1024 * 1024)
+                self.peak_memory[i] = max(self.peak_memory.get(i, 0), used_memory_gb)

Note: Requires importing NVMLError and NVMLError_NotSupported as mentioned above.


121-131: Join the thread before printing to avoid concurrent access to peak_memory.

Minor race: printing while the monitor thread may still update the dict. Join first, then print.

     def stop(self):
         self.is_running = False
-        # Print peak memory usage
-        print("########")
-        for gpu_idx, peak_mem in self.peak_memory.items():
-            print(
-                f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU"
-            )
-        print("########")
         if self.monitor_thread:
             self.monitor_thread.join()
+        # Print peak memory usage
+        print("########")
+        for gpu_idx, peak_mem in self.peak_memory.items():
+            print(f"GPU {gpu_idx}: Peak memory usage = {peak_mem:.2f} GB for all processes on the GPU")
+        print("########")
         nvmlShutdown()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0178562 and f3c402c.

📒 Files selected for processing (1)
  • modelopt/torch/utils/memory_monitor.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/memory_monitor.py (1)

134-142: Update Returns docstring; union type supported
Replace the Returns section in launch_memory_monitor with:

-    Returns:
-        GPUMemoryMonitor: The monitor instance that was launched
+    Returns:
+        GPUMemoryMonitor | None: The monitor instance if monitoring is available; otherwise None.

Python ≥3.10 already permits the | union syntax.

Comment on lines +143 to +147
    try:
        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
    except Exception as e:
        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
        return None

⚠️ Potential issue | 🔴 Critical

Pre-check calls NVML without initialization; narrow exception handling and balance init/shutdown.

Calling nvmlDeviceGetHandleByIndex / nvmlDeviceGetMemoryInfo before nvmlInit will raise Uninitialized and wrongly disable monitoring. Also, catching Exception is too broad. Initialize NVML for the probe, catch pynvml errors specifically, and ensure shutdown to balance the pre-check init.

Apply this diff to harden the pre-check:

-    try:
-        nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
-    except Exception as e:
-        print(f"Failed to get GPU memory info: {e}. Stopping GPU memory monitor.")
-        return None
+    try:
+        nvmlInit()
+        count = nvmlDeviceGetCount()
+        if count == 0:
+            print("No NVIDIA GPUs detected. Skipping GPU memory monitor.")
+            return None
+        handle = nvmlDeviceGetHandleByIndex(0)
+        nvmlDeviceGetMemoryInfo(handle)
+    except NVMLError_NotSupported as e:
+        print(f"NVML memory info not supported on this platform: {e}. Skipping GPU memory monitor.")
+        return None
+    except NVMLError as e:
+        print(f"NVML not available (init/count/handle/memoryinfo failed): {e}. Skipping GPU memory monitor.")
+        return None
+    finally:
+        try:
+            nvmlShutdown()
+        except NVMLError:
+            pass

Add the required exception imports near the existing NVML imports:

from pynvml import NVMLError, NVMLError_NotSupported

Alternatively, group with the existing import block.
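
For illustration, a grouped import could look like the following; the exact existing import list is not shown in this review, so the helpers below are assumed from the snippets above:

from pynvml import (
    NVMLError,
    NVMLError_NotSupported,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
    nvmlShutdown,
)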

🤖 Prompt for AI Agents
In modelopt/torch/utils/memory_monitor.py around lines 143 to 147, the pre-check
calls nvmlDeviceGetHandleByIndex/nvmlDeviceGetMemoryInfo without calling
nvmlInit, catches Exception too broadly, and does not balance nvmlInit with
nvmlShutdown; initialize NVML before probing, wrap the probe in a try/except
that catches pynvml-specific errors (NVMLError and NVMLError_NotSupported)
instead of Exception, call nvmlShutdown in a finally block so NVML is always
shut down after the probe, and add the required imports from pynvml (NVMLError,
NVMLError_NotSupported) near the existing NVML imports.


codecov bot commented Sep 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (0178562) to head (f3c402c).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #374   +/-   ##
=======================================
  Coverage   73.46%   73.46%           
=======================================
  Files         172      172           
  Lines       17640    17640           
=======================================
+ Hits        12959    12960    +1     
+ Misses       4681     4680    -1     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 merged commit b15a2b8 into main Sep 26, 2025
27 checks passed
@kevalmorabia97 kevalmorabia97 deleted the chenjiel/memory_monitor_fix branch September 26, 2025 03:54
kevalmorabia97 pushed a commit that referenced this pull request Sep 26, 2025
kevalmorabia97 pushed a commit that referenced this pull request Sep 26, 2025
yeyu-nvidia pushed a commit that referenced this pull request Oct 1, 2025