
Conversation

@vaibhav-research
Contributor

What does this PR do?

This PR improves the correctness and robustness of model checkpoint saving by removing a redundant parallelism_config branch from Trainer.save_model().

The removed logic attempted to special-case checkpoint saving whenever accelerator.parallelism_config was present. In practice, this routing was unnecessary and error-prone, as checkpointing behavior for supported parallelism strategies (FSDP, DeepSpeed, tensor parallelism) is already handled by their respective, dedicated code paths.

As discussed in the linked issue, the presence of this branch could intercept the save flow prematurely, leading to incorrect save semantics—particularly in configurations involving FSDP—by invoking a generic _save() path instead of the required backend-specific logic.
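
For context, a minimal sketch of the routing problem (this is not the actual Trainer code; pick_save_path and the returned labels are illustrative only):

# Minimal sketch of the decision this PR fixes, not the real implementation.
# "trainer" only needs the attributes referenced below.
def pick_save_path(trainer):
    # The removed branch: any run with accelerator.parallelism_config set fell
    # straight through to the generic _save(), even when FSDP or DeepSpeed
    # needed to gather sharded state dicts first.
    # if getattr(trainer.accelerator, "parallelism_config", None) is not None:
    #     return "generic _save()"

    # With the branch gone, the dedicated backends are consulted as before.
    if getattr(trainer, "is_fsdp_enabled", False):
        return "FSDP path (gather sharded state dict, then save)"
    if getattr(trainer, "is_deepspeed_enabled", False):
        return "DeepSpeed checkpoint path"
    return "generic _save()"

With the commented-out branch active, an FSDP run that also sets parallelism_config would have landed on the generic path, which is the failure mode reported in the linked issue.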

Fixes #43125

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Review comment on the following lines of Trainer.save_model():

        self._save(output_dir)
    # If we drop to here, we're in 1D parallelism, so all ranks need to go to `save_pretrained`
    elif (tp_size := getattr(self.model, "_tp_size", 0)) is not None and tp_size > 1:
        self._save(output_dir)

Member

remove this also

Contributor Author

I’ve removed the TP branch and pushed the update.

I didn’t drop it earlier because I wasn’t fully sure whether _tp_size > 1 still required all ranks to go through _save() for correct shard handling, and I wanted to avoid removing something that might still be relied on implicitly.
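
For anyone wanting to exercise that assumption directly, here is a rough sanity check to launch with torchrun. It is only a sketch: the model id and output directory are placeholders, and it assumes from_pretrained(..., tp_plan="auto") sets up the distributed process group as described in the tensor-parallel docs.

# Launch with e.g. `torchrun --nproc-per-node=2 tp_save_check.py`.
# Model id and output directory below are placeholders.
import torch.distributed as dist
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", tp_plan="auto")

# Mirrors the removed branch's assumption: every rank participates in saving so
# that sharded (DTensor) weights can be written back as full tensors.
model.save_pretrained("tp_roundtrip_check")
dist.barrier()
if dist.get_rank() == 0:
    print("Saved; reload with from_pretrained('tp_roundtrip_check') and compare outputs.")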

@SunMarc (Member) left a comment

Thanks, just a nit

@vaibhav-research
Contributor Author

Thanks, just a nit

Thanks for the review @SunMarc, updated the PR.

@SunMarc enabled auto-merge (squash) January 16, 2026 15:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vaibhav-research
Contributor Author

The CI is failing on the following tests, which seem unrelated to this change:

( python3 -m pytest -m 'not generate' -n 8  -p no:warning -o junit_family=xunit1 --junitxml=test-results/junit.xml --reruns 5 --reruns-delay 2 --only-rerun '(OSError|Timeout|ConnectionError|FileNotFoundError|PIL.UnidentifiedImageError|HTTPError|AssertionError: Tensor-likes are not close!|TypeError: expected str, bytes or os.PathLike object, not NoneType|TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType|Converting from Tiktoken failed|KeyError: <class |TypeError: not a string)' --max-worker-restart=0 -vvv -rsfE --make-reports=tests_torch $(cat splitted_tests.txt) | tee tests_output.txt)

SKIPPED [1] tests/generation/test_utils.py:2063: Model is an encoder-decoder
SKIPPED [1] tests/models/t5gemma2/test_modeling_t5gemma2.py:1003: SiglipVisionModel (vision backbone) does not support standalone training
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:175: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:180: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [1] tests/models/vitdet/test_modeling_vitdet.py:184: Does not work on the tiny model as we keep hitting edge cases.
SKIPPED [3] tests/test_modeling_common.py:3437: VitDetModel does not support grouped_mm
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32 - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_08_fp32_pad_left_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_09_fp32_pad_left - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_10_fp32_pad_left_no_attn_mask_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_11_fp32_pad_left_no_attn_mask - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_12_fp32_pad_right_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_13_fp32_pad_right - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_14_fp32_pad_right_no_attn_mask_sdpa_kernels - RuntimeError: expected data_ptr to be aligned to 16 bytes
FAILED tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_sdpa_inference_15_fp32_pad_right_no_attn_mask - RuntimeError: expected data_ptr to be aligned to 16 bytes
= 9 failed, 2647 passed, 3963 skipped, 6 xfailed, 26 warnings in 178.34s (0:02:58) =

Exited with code exit status 1

#!/bin/bash -eo pipefail
python3 .circleci/parse_test_outputs.py --file tests_output.txt --fail

   9 failed because RuntimeError -> expected data_ptr to be aligned to 16 bytes
Number of failed tests: 9

Exited with code exit status 1

I just verified a few things locally:

(transformer) vaibhavpandey@Vaibhavs-MacBook-Pro transformers % python - <<'PY'
import torch
print("torch:", torch.__version__)
print("mps available:", torch.backends.mps.is_available())
print("cuda available:", torch.cuda.is_available())
PY
torch: 2.9.1
mps available: True
cuda available: False
(transformer) vaibhavpandey@Vaibhavs-MacBook-Pro transformers % CUDA_VISIBLE_DEVICES="" python -m pytest -q \
  tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32

tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_eager_matches_batched_and_grouped_inference_1_fp32 PASSED [100%]

=================================================== 1 passed in 2.71s ===================================================

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43314&sha=df6f55


Development

Successfully merging this pull request may close these issues.

Saving bug using FSDP 2.0 + parallelism_config set (while works fine without providing parallelism_config in TrainingArguments)
