CUDA FutureWarning from torch.cuda.reset_max_memory_allocated in perf_tracker.py::Tracer #361

@DNXie

Description

πŸ› Describe the bug

1. Repeated CUDA FutureWarning

  • Problem: During training, repeated warnings appear:
[0] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[0]   warnings.warn(
[0] 🔧 Train rank 0: Step 0, loss=0
[1] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[1]   warnings.warn(
[1] 🔧 Train rank 1: Step 0, loss=1000
[1] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[1]   warnings.warn(
[0] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[0]   warnings.warn(
  • Context: This warning is triggered in perf_tracker.py::Tracer when tracking CUDA memory usage. The warning appears multiple times per step and across ranks.

2. TL;DR of Tracer Behavior

  • self.start() signals CUDA to begin memory tracking.
  • self.stop() ends memory tracking.
  • If nested calls occur (e.g., foo(bar())), the inner call should NOT reset the memory stats of the outer call.

3. Required Fix

  • Goal: Replace the deprecated torch.cuda.reset_max_memory_allocated call in Tracer with torch.cuda.reset_peak_memory_stats (the replacement the warning itself names) so the FutureWarning is no longer triggered, and ensure that memory stats are not reset in nested calls unless intended. See the sketch below.
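
A possible shape for the fix, as a minimal sketch (this is not the actual perf_tracker.py implementation; the class layout and the _depth counter are illustrative assumptions): switch to torch.cuda.reset_peak_memory_stats and guard the reset with a nesting counter, so only the outermost start() resets the counters.

```python
import torch

class Tracer:
    """Sketch of a nesting-aware memory tracer (illustrative, assumed API)."""

    _depth = 0  # shared nesting depth across start()/stop() pairs

    def start(self) -> None:
        if torch.cuda.is_available():
            if Tracer._depth == 0:
                # Replacement for the deprecated reset_max_memory_allocated;
                # emits no FutureWarning. Resetting only at depth 0 keeps a
                # nested start() from clobbering the outer tracer's peak.
                torch.cuda.reset_peak_memory_stats()
            Tracer._depth += 1

    def stop(self) -> int:
        peak_bytes = 0
        if torch.cuda.is_available():
            # Peak allocated bytes since the outermost reset.
            peak_bytes = torch.cuda.max_memory_allocated()
            Tracer._depth = max(0, Tracer._depth - 1)
        return peak_bytes
```

One caveat with this guard: a nested tracer reports the peak since the outermost start rather than its own window. If nested calls need their own peaks, the inner tracer could instead snapshot torch.cuda.memory_allocated() at start and report deltas.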

Additional Suggestions (good candidates for separate PRs)

  • Enable per-step memory tracking: Instead of only tracking from start to stop, allow memory tracking for each training step.
  • Record reserved memory: Use the memory summary API (presumably torch.cuda.memory_summary()) to capture all relevant memory stats, including reserved memory. Be cautious about excessive logging (e.g., avoid spamming wandb with too many graphs per step); see the sketch after this list.
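
Both suggestions could look roughly like the sketch below (the helper name and metric keys are hypothetical): log a small, fixed set of scalars once per step instead of dumping the full torch.cuda.memory_summary() string, which keeps the number of wandb graphs per step bounded while still capturing reserved memory.

```python
import torch

def per_step_memory_metrics() -> dict[str, float]:
    """Hypothetical helper: a fixed handful of per-step memory scalars (GiB).

    Logging this dict once per step covers reserved memory without the
    log spam that dumping torch.cuda.memory_summary() every step causes.
    """
    gib = 1024 ** 3
    metrics = {
        "memory/allocated_gib": torch.cuda.memory_allocated() / gib,
        "memory/peak_allocated_gib": torch.cuda.max_memory_allocated() / gib,
        "memory/reserved_gib": torch.cuda.memory_reserved() / gib,
        "memory/peak_reserved_gib": torch.cuda.max_memory_reserved() / gib,
    }
    # Reset peaks so the next step measures its own window (only safe if
    # this does not race with an active Tracer measurement).
    torch.cuda.reset_peak_memory_stats()
    return metrics
```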

cc @felipemello1 @allenwang28

Labels: bug (Something isn't working)