
DeepSpeed strategy doesn't set num_checkpoints when using activation partitioning #20329

@Gforky

Bug description

When training with DeepSpeed under the ZeRO Stage 3 strategy, enabling activation partitioning together with contiguous_checkpointing can trigger an "index out of range" error on contiguous_data_buffers. The cause is that the activation-partitioning configuration is created without passing the num_checkpoints parameter, so DeepSpeed falls back to the global variable num_layers with its default value of False, which leads to an empty contiguous_data_buffers being created.
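The failure mode can be illustrated with a simplified, hypothetical model of the buffer bookkeeping. The names `configure` and `partition_activations` loosely mirror DeepSpeed's `checkpointing.py`, but this is a sketch of the mechanism, not the real implementation:

```python
# Simplified, hypothetical model of DeepSpeed's activation-partitioning
# buffer bookkeeping -- not the actual checkpointing.py code.

def configure(num_checkpoints=None):
    """Buffers are only pre-allocated when num_checkpoints is provided;
    otherwise the list stays empty, mirroring the reported bug."""
    if num_checkpoints is None:
        return []  # empty contiguous_data_buffers -> IndexError later
    # one pre-allocated buffer per checkpointed layer (placeholder sizes)
    return [[0.0] * 4 for _ in range(num_checkpoints)]

def partition_activations(buffers, layer_index):
    # Indexing into an empty buffer list reproduces the reported failure.
    return buffers[layer_index]

# Without num_checkpoints the buffer list is empty and indexing fails:
try:
    partition_activations(configure(), 0)
    raised = False
except IndexError:
    raised = True

# When num_checkpoints is passed through, the lookup succeeds:
buf = partition_activations(configure(num_checkpoints=2), 0)
```

Under this model, forwarding num_checkpoints when building the activation-partitioning configuration would make the buffer list non-empty and avoid the IndexError.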

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 557, in forward
[rank0]:     inputs = partition_activations(args, CPU_CHECKPOINT, CONTIGUOUS_CHECKPOINTING)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 421, in partition_activations
[rank0]:     contiguous_data_buffers[i][data_offsets[i]].data[range(
[rank0]: IndexError: list index out of range

Environment

Current environment
#- PyTorch Lightning Version: 2.4.0
#- PyTorch Version: 2.4.1
#- Python version: 3.10.6
#- OS: Ubuntu-22.04
#- CUDA version: 12.1
#- GPU models and configuration: A100
#- How you installed Lightning (`conda`, `pip`, source): pip install

More info

No response

cc @lantiga
