Description
Describe the bug
1. Repeated CUDA FutureWarning
- Problem: During training, repeated warnings appear:

```
[0] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[0]   warnings.warn(
[0] Train rank 0: Step 0, loss=0
[1] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[1]   warnings.warn(
[1] Train rank 1: Step 0, loss=1000
[1] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[1]   warnings.warn(
[0] /home/felipemello/.conda/envs/forge/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
[0]   warnings.warn(
```
- Context: This warning is triggered in `perf_tracker.py::Tracer` when tracking CUDA memory usage. The warning appears multiple times per step and across ranks.
2. TLDR of Tracer Behavior
- `self.start()` signals CUDA to begin memory tracking.
- `self.stop()` ends memory tracking.
- If nested calls occur (e.g., `foo(bar())`), the inner call should NOT reset the memory stats of the outer call.
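The nesting rule above could be enforced with a depth counter so that only the outermost `start()` resets CUDA peak stats. A minimal sketch — the class name mirrors `Tracer`, but the counter and structure are illustrative, not the actual `perf_tracker.py` implementation:

```python
import torch


class Tracer:
    """Sketch: only the outermost start() resets CUDA peak-memory
    stats, so a nested tracer (e.g. foo(bar())) cannot clobber the
    stats the outer tracer is accumulating."""

    _depth = 0  # process-wide nesting counter (illustrative)

    def start(self):
        if Tracer._depth == 0 and torch.cuda.is_available():
            # Non-deprecated API; does not emit the FutureWarning.
            torch.cuda.reset_peak_memory_stats()
        Tracer._depth += 1

    def stop(self):
        Tracer._depth -= 1
        if torch.cuda.is_available():
            # Peak bytes allocated since the outermost reset.
            return torch.cuda.max_memory_allocated()
        return 0
```

With this shape, `inner.start()` inside an active outer tracer is a no-op with respect to peak stats, which is the behavior the TLDR asks for.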
3. Required Fix
- Goal: Update the API usage in `Tracer` to avoid triggering the FutureWarning. Ensure that memory stats are not reset in nested calls unless intended.
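The warning itself points at the fix: `torch.cuda.reset_max_memory_allocated()` is deprecated and now just forwards to `torch.cuda.reset_peak_memory_stats()`. Calling the new API directly silences the FutureWarning (the helper name below is illustrative):

```python
import torch


def reset_cuda_peak_stats():
    # Deprecated (emits the FutureWarning seen in the logs):
    #   torch.cuda.reset_max_memory_allocated()
    # Current API -- same effect, resets ALL peak memory stats:
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
```

Note that, as the warning says, this resets /all/ peak stats for the device, which is exactly why nested callers must coordinate rather than each resetting independently.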
Additional Suggestions (good to be separate PRs)
- Enable per-step memory tracking: Instead of only tracking from start to stop, allow memory tracking for each training step.
- Register reserved memory: Use the `.summary()` API to capture all relevant memory stats, including reserved memory. Be cautious about excessive logging (e.g., avoid spamming wandb with too many graphs per step).
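One way to get reserved memory without per-step log spam: instead of logging the full `torch.cuda.memory_summary()` text, pick a small fixed set of scalars from `torch.cuda.memory_stats()`. A sketch — the function name and metric keys are illustrative, only the `torch.cuda` calls and stat names are real:

```python
import torch


def step_memory_metrics(prefix="memory"):
    """Collect a small, fixed set of per-step memory scalars
    (peak allocated and peak reserved), suitable for logging as
    two graphs rather than dumping the whole memory summary."""
    if not torch.cuda.is_available():
        return {}
    stats = torch.cuda.memory_stats()
    gib = 2**30
    return {
        f"{prefix}/peak_allocated_gib": stats["allocated_bytes.all.peak"] / gib,
        f"{prefix}/peak_reserved_gib": stats["reserved_bytes.all.peak"] / gib,
    }
```

A fixed dict like this keeps the wandb dashboard bounded regardless of how many steps are logged.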