
Conversation

@sywangyi (Contributor) commented on Jan 6, 2026:

Fix an issue, caused by double initialization of the optimizer, when running:

```
accelerate launch --num-processes 4 nd_parallel.py --dp-shard-size 2 --tp-size 2
```

Signed-off-by: Wang, Yi A <[email protected]>
@sywangyi (Contributor, Author) commented on Jan 6, 2026:

Crash log:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 172, in <module>
[rank0]:     train(args)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 152, in train
[rank0]:     accelerator.save_state(args.save_dir + f"/checkpoint-{step}")
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/accelerator.py", line 3618, in save_state
[rank0]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/utils/fsdp_utils.py", line 256, in save_fsdp_optimizer
[rank0]:     optim_state = get_optimizer_state_dict(model, optimizer, options=sd_options)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 1106, in get_optimizer_state_dict
[rank0]:     optim_state_dict = _get_optim_state_dict(model, optimizers, info)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 785, in _get_optim_state_dict
[rank0]:     _init_optim_state(optim)
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 625, in _init_optim_state
[rank0]:     param.grad = torch.zeros_like(param)
[rank0]:     ^^^^^^^^^^
[rank0]: RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'. Please ensure that the gradient and the tensor have the same dtype
```
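For context, the PyTorch constraint the last frame trips over: a tensor's `.grad` must have the same dtype as the tensor itself. A minimal standalone illustration of that error, independent of accelerate (hypothetical names, not the PR's code):

```python
import torch

# PyTorch refuses a .grad whose dtype differs from the tensor's own dtype.
param = torch.zeros(4, dtype=torch.float32)  # fp32 parameter
grad = torch.zeros(4, dtype=torch.bfloat16)  # bf16 gradient

# RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16'
# to a tensor with dtype 'float'.
param.grad = grad
```

In the ND-parallel run, the stale duplicate optimizer presumably still holds parameter references from before the model was parallelized, which is how the dtypes end up diverging.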

@sywangyi (Contributor, Author) commented on Jan 6, 2026:

And a second crash:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 172, in <module>
[rank0]:     train(args)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 152, in train
[rank0]:     accelerator.save_state(args.save_dir + f"/checkpoint-{step}")
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/accelerator.py", line 3625, in save_state
[rank0]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank0]:                                                            ~~~~~~~~~~~~^^^
[rank0]: IndexError: list index out of range
```

Both are caused by the optimizer being initialized twice when TP is enabled.
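A hypothetical minimal sketch of the bookkeeping mismatch (assumed repro, not the example's actual code): `save_state()` pairs `self._optimizers[i]` with `self._models[i]`, so initializing and preparing the optimizer a second time leaves the `_optimizers` list one entry longer than `_models`.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)

model = accelerator.prepare(model)           # one model registered

opt_a = torch.optim.AdamW(model.parameters())
opt_a = accelerator.prepare(opt_a)           # first optimizer registered

# The bug: a TP code path initializes the optimizer a second time,
# registering a second, distinct optimizer against the same single model.
opt_b = torch.optim.AdamW(model.parameters())
opt_b = accelerator.prepare(opt_b)

print(len(accelerator._models), len(accelerator._optimizers))  # 1 2
# accelerator.save_state(...) would now index self._models[1] -> IndexError
```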

@SunMarc (Member) left a comment:


Thanks! Left a comment.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: Wang, Yi A <[email protected]>
@SunMarc (Member) left a comment:


Thanks, just a nit.

Signed-off-by: Wang, Yi A <[email protected]>
@SunMarc (Member) left a comment:


Thanks!

@sywangyi (Contributor, Author) commented on Jan 9, 2026:

The test failure in CI has nothing to do with this PR.

@SunMarc merged commit 2d388f1 into huggingface:main on Jan 9, 2026 (20 of 25 checks passed).
