
Conversation

@vaibhav-research
Contributor

What does this PR do?

This PR improves the correctness and robustness of model checkpoint saving by removing a redundant parallelism_config branch from Trainer.save_model().

The removed logic attempted to special-case checkpoint saving whenever accelerator.parallelism_config was present. In practice, this routing was unnecessary and error-prone, as checkpointing behavior for supported parallelism strategies (FSDP, DeepSpeed, tensor parallelism) is already handled by their respective, dedicated code paths.

As discussed in the linked issue, the presence of this branch could intercept the save flow prematurely, leading to incorrect save semantics—particularly in configurations involving FSDP—by invoking a generic _save() path instead of the required backend-specific logic.
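
For context, a minimal sketch of the routing problem (this is not the actual Trainer code; pick_save_path and the returned labels are illustrative only):

# Minimal sketch of the decision this PR fixes, not the real implementation.
# "trainer" only needs the attributes referenced below.
def pick_save_path(trainer):
    # The removed branch: any run with accelerator.parallelism_config set fell
    # straight through to the generic _save(), even when FSDP or DeepSpeed
    # needed to gather sharded state dicts first.
    # if getattr(trainer.accelerator, "parallelism_config", None) is not None:
    #     return "generic _save()"

    # With the branch gone, the dedicated backends are consulted as before.
    if getattr(trainer, "is_fsdp_enabled", False):
        return "FSDP path (gather sharded state dict, then save)"
    if getattr(trainer, "is_deepspeed_enabled", False):
        return "DeepSpeed checkpoint path"
    return "generic _save()"

With the commented-out branch active, an FSDP run that also sets parallelism_config would have landed on the generic path, which is the failure mode reported in the linked issue.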

Fixes #43125

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Review comment on the following lines of Trainer.save_model():

        self._save(output_dir)
    # If we drop to here, we're in 1D parallelism, so all ranks need to go to `save_pretrained`
    elif (tp_size := getattr(self.model, "_tp_size", 0)) is not None and tp_size > 1:
        self._save(output_dir)

Member

remove this also

Contributor Author

I’ve removed the TP branch and pushed the update.

I didn’t drop it earlier because I wasn’t fully sure whether _tp_size > 1 still required all ranks to go through _save() for correct shard handling, and I wanted to avoid removing something that might still be relied on implicitly.
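
For anyone wanting to exercise that assumption directly, here is a rough sanity check to launch with torchrun. It is only a sketch: the model id and output directory are placeholders, and it assumes from_pretrained(..., tp_plan="auto") sets up the distributed process group as described in the tensor-parallel docs.

# Launch with e.g. `torchrun --nproc-per-node=2 tp_save_check.py`.
# Model id and output directory below are placeholders.
import torch.distributed as dist
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", tp_plan="auto")

# Mirrors the removed branch's assumption: every rank participates in saving so
# that sharded (DTensor) weights can be written back as full tensors.
model.save_pretrained("tp_roundtrip_check")
dist.barrier()
if dist.get_rank() == 0:
    print("Saved; reload with from_pretrained('tp_roundtrip_check') and compare outputs.")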

@SunMarc (Member) left a comment

Thanks, just a nit

@vaibhav-research
Contributor Author

Thanks, just a nit

Thanks for the review @SunMarc, updated the PR.

@SunMarc enabled auto-merge (squash) January 16, 2026 15:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vaibhav-research
Contributor Author

The CI is failing on the following tests, which seem unrelated to this change:

( python3 -m pytest -m 'not generate' -n 8  -p no:warning -o junit_family=xunit1 --junitxml=test-results/junit.xml --reruns 5 --reruns-delay 2 --only-rerun '(OSError|Timeout|ConnectionError|FileNotFoundError|PIL.UnidentifiedImageError|HTTPError|AssertionError: Tensor-likes are not close!|TypeError: expected str, bytes or os.PathLike object, not NoneType|TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType|Converting from Tiktoken failed|KeyError: <class |TypeError: not a string)' --max-worker-restart=0 -vvv -rsfE --make-reports=tests_torch $(cat splitted_tests.txt) | tee tests_output.txt)

SKIPPED [1] tests/generation/test_utils.py:2063: Model is an encoder-decoder
SKIPPED [1] tests/models/t5gemma2/test_modeling_t5gemma2.py:1003: SiglipVisionModel (vision backbone) does not support standalone training
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:175: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:180: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:184: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [3] tests/test_modeling_common.py:3437: VitDetModel does not support grouped_mm
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32 - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_08_fp32_pad_left_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_09_fp32_pad_left - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_10_fp32_pad_left_no_attn_mask_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_11_fp32_pad_left_no_attn_mask - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_12_fp32_pad_right_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_13_fp32_pad_right - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_14_fp32_pad_right_no_attn_mask_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_15_fp32_pad_right_no_attn_mask - RuntimeError: expected data_ptr to be aligned to 16 bytes
= 9 failed, 2647 passed, 3963 skipped, 6 xfailed, 26 warnings in 178.34s (0:02:58) =

Exited with code exit status 1

#!/bin/bash -eo pipefail
python3 .circleci/parse_test_outputs.py --file tests_output.txt --fail

   9 failed because RuntimeError -> expected data_ptr to be aligned to 16 bytes
Number of failed tests: 9

Exited with code exit status 1

I just verified a few things locally:

(transformer) vaibhavpandey@Vaibhavs-MacBook-Pro transformers % python - <<'PY'
import torch
print("torch:", torch.__version__)
print("mps available:", torch.backends.mps.is_available())
print("cuda available:", torch.cuda.is_available())
PY
torch: 2.9.1
mps available: True
cuda available: False
(transformer) vaibhavpandey@Vaibhavs-MacBook-Pro transformers % CUDA_VISIBLE_DEVICES="" python -m pytest -q \
  tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32

tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32 PASSED [100%]

=================================================== 1 passed in 2.71s ===================================================

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43314&sha=df6f55


Development

Successfully merging this pull request may close these issues.

Saving bug using FSDP 2.0 + parallelism_config set (while works fine without providing parallelism_config in TrainingArguments)
