Gradient checkpointing and DDP do not work together #20395

@rubenweitzman

Bug description

I am launching a script that trains a model with Fabric. Training works well without DDP and with gradient checkpointing, and also with DDP and without gradient checkpointing. However, when I enable both DDP and gradient checkpointing (activated through Hugging Face's gradient_checkpointing_enable() function), I get this error:

[rank0]:   File "/home/.../v2/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: RuntimeError: expect_autograd_hooks_ INTERNAL ASSERT FAILED at "../torch/csrc/distributed/c10d/reducer.cpp":1591, please report a bug to PyTorch. 

The scripts were launched with:

fabric = Fabric(
    accelerator="gpu",
    loggers=loggers,
    precision=opt.precision,
    strategy=DDPStrategy(process_group_backend="nccl", find_unused_parameters=False, static_graph=True),
)

When I launch with strategy=DDPStrategy(process_group_backend="nccl", find_unused_parameters=True, static_graph=False) instead, I get a different error:

[rank0]: Parameter at index 560 with name reader.decoder.transformer.h.11.mlp.c_proj.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
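For readers unfamiliar with this message, here is a toy sketch of the bookkeeping it describes: each parameter may be marked ready once per iteration, and a second gradient hook firing for the same parameter is rejected. This is only an illustration in plain Python, not PyTorch's reducer. `ToyReducer`, `grad_hook`, and `backward_pass` are hypothetical names, and the parameter list is shortened for the example.

```python
class ToyReducer:
    """Hypothetical stand-in for per-iteration ready-marking (not PyTorch code)."""

    def __init__(self, param_names):
        self.param_names = list(param_names)
        self.ready = set()

    def start_iteration(self):
        # Forget which parameters were marked ready in the previous iteration.
        self.ready.clear()

    def grad_hook(self, index):
        # Called once per gradient hook firing for the parameter at `index`.
        name = self.param_names[index]
        if index in self.ready:
            raise RuntimeError(
                f"Parameter at index {index} with name {name} "
                "has been marked as ready twice."
            )
        self.ready.add(index)


def backward_pass(reducer, hook_firings):
    """Fire one hook per entry; return None on success, else the error text."""
    reducer.start_iteration()
    try:
        for index in hook_firings:
            reducer.grad_hook(index)
    except RuntimeError as err:
        return str(err)
    return None


# Shortened, illustrative parameter list (assumption, not the real model).
params = ["h.11.mlp.c_proj.weight", "h.11.mlp.c_proj.bias"]
reducer = ToyReducer(params)

# Every hook fires exactly once: the iteration completes.
ok = backward_pass(reducer, [1, 0])

# The hook for index 1 fires a second time in the same iteration,
# for example because a second backward sweep touched that parameter.
err = backward_pass(reducer, [1, 0, 1])

print(ok)
print(err)
```

In this sketch the first pass returns `None` and the second returns the error text for index 1, which mirrors the shape of the message above without claiming anything about its actual cause in this training script.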

Thanks in advance for your help.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

Metadata

Labels

bug (Something isn't working), repro needed (The issue is missing a reproducible example), ver: 2.4.x
