
Conversation

@sywangyi (Contributor) commented on Jan 6, 2026:

Fix an issue, caused by double initialization of the optimizer, when running:

```
accelerate launch --num-processes 4 nd_parallel.py --dp-shard-size 2 --tp-size 2
```

Signed-off-by: Wang, Yi A <[email protected]>
@sywangyi (Contributor, Author) commented on Jan 6, 2026:

Crash log:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 172, in <module>
[rank0]:     train(args)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 152, in train
[rank0]:     accelerator.save_state(args.save_dir + f"/checkpoint-{step}")
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/accelerator.py", line 3618, in save_state
[rank0]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/utils/fsdp_utils.py", line 256, in save_fsdp_optimizer
[rank0]:     optim_state = get_optimizer_state_dict(model, optimizer, options=sd_options)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 1106, in get_optimizer_state_dict
[rank0]:     optim_state_dict = _get_optim_state_dict(model, optimizers, info)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 785, in _get_optim_state_dict
[rank0]:     _init_optim_state(optim)
[rank0]:   File "/mnt/disk0/wangyi/miniforge3/envs/transformers/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict.py", line 625, in _init_optim_state
[rank0]:     param.grad = torch.zeros_like(param)
[rank0]:     ^^^^^^^^^^
[rank0]: RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'. Please ensure that the gradient and the tensor have the same dtype
```
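For context, the PyTorch constraint the last frame trips over: a tensor's `.grad` must have the same dtype as the tensor itself. A minimal standalone illustration of that error, independent of accelerate (hypothetical names, not the PR's code):

```python
import torch

# PyTorch refuses a .grad whose dtype differs from the tensor's own dtype.
param = torch.zeros(4, dtype=torch.float32)  # fp32 parameter
grad = torch.zeros(4, dtype=torch.bfloat16)  # bf16 gradient

# RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16'
# to a tensor with dtype 'float'.
param.grad = grad
```

In the ND-parallel run, the stale duplicate optimizer presumably still holds parameter references from before the model was parallelized, which is how the dtypes end up diverging.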

@sywangyi (Contributor, Author) commented on Jan 6, 2026:

And a second crash:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 172, in <module>
[rank0]:     train(args)
[rank0]:   File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel.py", line 152, in train
[rank0]:     accelerator.save_state(args.save_dir + f"/checkpoint-{step}")
[rank0]:   File "/mnt/disk3/wangyi/accelerate/src/accelerate/accelerator.py", line 3625, in save_state
[rank0]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank0]:                                                            ~~~~~~~~~~~~^^^
[rank0]: IndexError: list index out of range
```

Both are caused by the optimizer being initialized twice when TP is enabled.
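A hypothetical minimal sketch of the bookkeeping mismatch (assumed repro, not the example's actual code): `save_state()` pairs `self._optimizers[i]` with `self._models[i]`, so initializing and preparing the optimizer a second time leaves the `_optimizers` list one entry longer than `_models`.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)

model = accelerator.prepare(model)           # one model registered

opt_a = torch.optim.AdamW(model.parameters())
opt_a = accelerator.prepare(opt_a)           # first optimizer registered

# The bug: a TP code path initializes the optimizer a second time,
# registering a second, distinct optimizer against the same single model.
opt_b = torch.optim.AdamW(model.parameters())
opt_b = accelerator.prepare(opt_b)

print(len(accelerator._models), len(accelerator._optimizers))  # 1 2
# accelerator.save_state(...) would now index self._models[1] -> IndexError
```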

@SunMarc (Member) left a comment:


Thanks! Left a comment.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: Wang, Yi A <[email protected]>
@SunMarc (Member) left a comment:


Thanks, just a nit.

Signed-off-by: Wang, Yi A <[email protected]>
@SunMarc (Member) left a comment:


Thanks!

@sywangyi (Contributor, Author) commented on Jan 9, 2026:

The test failure in CI has nothing to do with this PR.

@SunMarc merged commit 2d388f1 into huggingface:main on Jan 9, 2026 (20 of 25 checks passed).
