Commit 2e7fbc3
Fix checkpoint loading: Move RNG states to CPU before restoration
Root cause: torch.load() with map_location='cuda:N' moves every tensor in
the checkpoint to CUDA, including the saved RNG states, but
torch.set_rng_state() requires a CPU ByteTensor.
Changes:
- Move torch_random_state to CPU before calling set_rng_state()
- Move each cuda_random_state element to CPU (set_rng_state_all applies
  each state to its own device)
- Add try-except fallback: if RNG restoration fails, log a warning and
  continue, so training can resume without losing model/optimizer state
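
The changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper name `restore_rng_states` and the flat checkpoint dict layout are assumptions, and only the keys `torch_random_state` and `cuda_random_state` come from the commit message.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def restore_rng_states(checkpoint: dict) -> bool:
    """Restore torch RNG states from a checkpoint dict (hypothetical helper).

    Returns True if restoration succeeded, False if it was skipped after an error.
    """
    try:
        torch_state = checkpoint.get("torch_random_state")
        if torch_state is not None:
            # torch.set_rng_state() expects a CPU ByteTensor, so undo any
            # device move performed by map_location during torch.load().
            torch.set_rng_state(torch_state.cpu())

        cuda_states = checkpoint.get("cuda_random_state")
        if cuda_states is not None and torch.cuda.is_available():
            # One state per device; each one is passed in as a CPU tensor.
            torch.cuda.set_rng_state_all([s.cpu() for s in cuda_states])
        return True
    except Exception as exc:
        # Fallback: losing RNG state only affects reproducibility, so warn
        # and let the caller continue with model/optimizer state intact.
        logger.warning("Could not restore RNG state, continuing: %s", exc)
        return False
```

Returning a boolean rather than raising lets the caller decide whether a non-reproducible resume is acceptable.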
Benefits:
- Resuming from a checkpoint loaded with map_location='cuda:N' no longer
  fails during RNG restoration
- Graceful fallback if RNG restoration fails for any reason
- Preserves deterministic training when possible
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

1 parent 66f0e4c commit 2e7fbc3
1 file changed, 19 additions and 4 deletions