Open
Labels
autograd, thunderfx (for things that could be applicable to the dynamo+thunder frontend)
Description
🐛 Bug
test_networks.py::test_checkpointing_thunderfx fails due to a gradient mismatch between eager PyTorch and thunderfx.
To Reproduce
Steps to reproduce the behavior:
- Run test_networks.py::test_checkpointing_thunderfx
- See the gradient mismatch, e.g.:
> assert_close(grads_res, grads_ref, atol=1e-3, rtol=1e-3)
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 62 / 20480 (0.3%)
E Greatest absolute difference: 9818.546875 at index (1, 7) (up to 0.001 allowed)
E Greatest relative difference: 22.85102081298828 at index (1, 6) (up to 0.001 allowed)
E
E The failure occurred for item [1]
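To help read the numbers in the message above, here is a minimal pure-Python sketch of the elementwise check that `torch.testing.assert_close` documents for finite values: `|actual - expected| <= atol + rtol * |expected|`. The helper names `is_close` and `count_mismatches` and the sample values are hypothetical and are not taken from the failing test; NaN/inf handling, dtype and device checks are omitted.

```python
def is_close(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Elementwise closeness: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def count_mismatches(actual, expected, atol=1e-3, rtol=1e-3):
    """Return (number of mismatched elements, greatest absolute difference)."""
    mismatched = 0
    max_abs = 0.0
    for a, e in zip(actual, expected):
        max_abs = max(max_abs, abs(a - e))
        if not is_close(a, e, atol, rtol):
            mismatched += 1
    return mismatched, max_abs


# Hypothetical gradients: one element is far off, the rest are within tolerance.
ref = [1.0, 2.0, 400.0, -3.0]
res = [1.0005, 2.0, 10000.0, -3.0]

mismatched, max_abs = count_mismatches(res, ref)
print(f"Mismatched elements: {mismatched} / {len(ref)}")
print(f"Greatest absolute difference: {max_abs}")
```

Under this reading, the reported 62 / 20480 mismatched elements (about 0.3%) together with an absolute difference near 9818 means that a small number of gradient entries are far outside tolerance, rather than many entries showing small numerical drift.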
Expected behavior
The test passes: gradients computed through thunderfx match those from eager PyTorch within the test's tolerances (atol=1e-3, rtol=1e-3).
Environment
pjnl-20250926
Additional context
- It seems that the test case has been failing since mid-August (b: 8/13, gb: 8/21).
- The PyTorch checkpointing function itself seems stable, judging from the file's commit history: https://github.com/pytorch/pytorch/commits/viable/strict/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py