test_networks::test_checkpointing_thunderfx fails on (G)B200 due to grads mismatch #2543

@crcrpar

🐛 Bug

test_networks.py::test_checkpointing_thunderfx fails on (G)B200 because the gradients computed through thunderfx do not match those from eager PyTorch.

To Reproduce

Steps to reproduce the behavior:

  1. Run test_networks.py::test_checkpointing_thunderfx
  2. Observe the gradient mismatch, for example:
>       assert_close(grads_res, grads_ref, atol=1e-3, rtol=1e-3)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 62 / 20480 (0.3%)
E       Greatest absolute difference: 9818.546875 at index (1, 7) (up to 0.001 allowed)
E       Greatest relative difference: 22.85102081298828 at index (1, 6) (up to 0.001 allowed)
E
E       The failure occurred for item [1]
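
To help read the numbers in the message above, here is a minimal pure-Python sketch of the closeness criterion documented for `torch.testing.assert_close`, where an element counts as mismatched when `|actual - expected| > atol + rtol * |expected|`. The helper name `summarize_mismatch` and the sample values are hypothetical and are not taken from the test. Tracking the greatest differences over mismatched elements only is an assumption about how the message is aggregated.

```python
def summarize_mismatch(actual, expected, atol=1e-3, rtol=1e-3):
    """Hypothetical helper: apply the closeness rule to two flat lists.

    Returns (mismatched_count, greatest_abs_diff, greatest_rel_diff).
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # An element is "close" when the absolute difference stays within
        # atol + rtol * |expected|.
        if abs_diff <= atol + rtol * abs(e):
            continue
        mismatched += 1
        max_abs = max(max_abs, abs_diff)
        if e != 0:
            # Relative difference is measured against the reference value.
            max_rel = max(max_rel, abs_diff / abs(e))
    return mismatched, max_abs, max_rel


# Made-up values: one element is far off, the rest are within tolerance.
ref = [1.0, 2.0, 400.0, 0.5]
res = [1.0, 2.0005, 9000.0, 0.5]

mismatched, max_abs, max_rel = summarize_mismatch(res, ref)
print(f"Mismatched elements: {mismatched} / {len(ref)}")
print(f"Greatest absolute difference: {max_abs}")
print(f"Greatest relative difference: {max_rel}")
```

Read this way, the reported failure (62 of 20480 elements, with an absolute difference near 9818) shows a small number of gradient entries that are wrong by a large margin. Most entries are within tolerance.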

Expected behavior

The gradients computed through thunderfx match those from eager PyTorch within the test tolerances (atol=1e-3, rtol=1e-3), and the test passes.

Environment

pjnl-20250926

Additional context

Metadata

Labels

autograd, thunderfx (for things that could be applicable to the dynamo+thunder frontend)
