chore(deps): update dependency accelerate to v1.12.0 #469
konflux-internal-p02[bot] wants to merge 1 commit into rhoai-3.4-ea.1 from
Signed-off-by: konflux-internal-p02 <170854209+konflux-internal-p02[bot]@users.noreply.github.com>
This PR contains the following updates:
accelerate: ==1.0.1 -> ==1.12.0
Release Notes
huggingface/accelerate (accelerate)
v1.12.0: Deepspeed Ulysses/ALST (Compare Source)
Deepspeed Ulysses/ALST integration
Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper https://arxiv.org/abs/2506.13996 or this deepspeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.
To enable DeepSpeed Ulysses, you first need to create a ParallelismConfig and set the sp-related args (see the sketch after the loss snippet below). Then, you need to make sure to compute the correct loss as described in our docs:
...
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)
Thanks @S1ro1 for starting this work and @stas00 for finishing it. Also thanks @kashif for adding docs and reviewing/testing this PR!
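The ParallelismConfig setup itself was elided above; below is a minimal sketch of what it might look like. The import path, the sp_size argument name, and the sp_group/sp_world_size accessors are assumptions based on these release notes rather than verbatim accelerate API, so check the docs for the exact spelling.

from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig  # import path is an assumption

# Assumed argument name: sequence-parallel degree of 4 for DeepSpeed Ulysses/ALST.
pc = ParallelismConfig(sp_size=4)
accelerator = Accelerator(parallelism_config=pc)

# Hypothetical accessors used by the loss snippet above; consult the docs
# for how to obtain the sequence-parallel group and world size.
sp_group = pc.sp_group
sp_world_size = pc.sp_size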
This feature will also be available in the HF Trainer thanks to this PR from @stas00: huggingface/transformers#41832
Minor changes
cpu_ram_efficient_loading by @SunMarc in #3816
New Contributors
Full Changelog: huggingface/accelerate@v1.11.0...v1.12.0
v1.11.0: TE MXFP8, FP16/BF16 with MPS, Python 3.10 (Compare Source)
TE MXFP8 support
We've added support for MXFP8 in our TransformerEngine integration. To use it, you need to set use_mxfp8_block_scaling in fp8_config. See the NVIDIA docs here: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling
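A hedged Python sketch of that configuration follows; the TERecipeKwargs class and the exact keyword are assumptions inferred from the note above (the documented path is the fp8_config section of the accelerate config file), so verify against the accelerate FP8 docs.

from accelerate import Accelerator
from accelerate.utils import TERecipeKwargs  # assumed kwargs handler for the TransformerEngine backend

# Assumption: the new flag is exposed directly on the TE recipe kwargs.
fp8_handler = TERecipeKwargs(use_mxfp8_block_scaling=True)
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_handler])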
FP16/BF16 Training for MPS devices
BF16 and FP16 support for MPS devices is finally here. You can now pass mixed_precision="fp16" or "bf16" when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).
FSDP updates
The following PRs add support for ignored_params and no_sync() for FSDPv2, respectively:
Mixed precision can now be passed as a dtype string from the accelerate CLI flag or fsdp_config in the accelerate config file:
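In Python this might look like the sketch below; passing a plain dtype string to mixed_precision_policy is an assumption drawn from the note above (the documented routes are the CLI flag and fsdp_config), so verify against the FSDP plugin docs.

from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

# Assumption: mixed_precision_policy accepts a dtype string such as "bf16".
fsdp_plugin = FullyShardedDataParallelPlugin(mixed_precision_policy="bf16")
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)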
Nd-parallel updates
Some minor updates concerning nd-parallelism.
Bump to Python 3.10
We've dropped support for python 3.9 as it reached EOL in October.
Lots of minor fixes:
cpu and offloaded to meta by @Qubitium in #3796
within Accelerator.autocast() instead of __enter__() and __exit__() for more elegant style, by @EquationWalker in #3767
SWANLAB_MODE by @SunMarc in #3808
New Contributors
Full Changelog: huggingface/accelerate@v1.10.1...v1.11.0
v1.10.1: Patchfix (Compare Source)
Full Changelog: huggingface/accelerate@v1.10.0...v1.10.1
v1.10.0: N-D Parallelism (Compare Source)
N-D Parallelism
Training large models across multiple GPUs can be complex, especially when combining different parallelism strategies (e.g. TP, CP, DP). To simplify this process, we've collaborated with Axolotl to introduce an easy-to-use integration that allows you to apply any combination of parallelism strategies directly in your training script. Just pass a ParallelismConfig specifying the size of each parallelism type; it's that simple. Learn more about how it works in our latest blogpost.
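A minimal sketch of the idea; the individual size arguments (dp_shard_size, tp_size, cp_size) and the import path are assumptions, so check the blogpost and docs for the exact names.

from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig  # import path is an assumption

# Assumed argument names: 2-way data-parallel sharding x 2-way tensor parallel
# x 2-way context parallel, i.e. 8 GPUs in total.
pc = ParallelismConfig(dp_shard_size=2, tp_size=2, cp_size=2)
accelerator = Accelerator(parallelism_config=pc)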
ParallelismConfig from PartialState by @SunMarc in #3720
FSDP improvements
We've fixed the ignored modules attribute. With this, it is now possible to train PEFT models whose MoE layers contain q_proj and v_proj parameters. This is especially important for fine-tuning the gpt-oss model.
Minor improvements
New Contributors
Full Changelog: huggingface/accelerate@v1.9.0...v1.10.0
v1.9.0: Trackio support, Model loading speedup, Minor distributed improvements (Compare Source)
Trackio tracker support
We've added support for trackio, a lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.
Main features are:
space_id.
To use it with accelerate, you need to set log_with and initialize the trackers. Thanks @pcuenca for the integration!
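A minimal sketch of that setup; the project name is a placeholder, and forwarding space_id through init_kwargs is an assumption.

from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
# Hypothetical project name; space_id (mentioned above) could presumably be forwarded via init_kwargs.
accelerator.init_trackers("my-training-run")
accelerator.log({"train_loss": 0.42}, step=1)
accelerator.end_training()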
Model loading speedup when relying on set_module_tensor_to_device
Setting a tensor while clearing the cache is very slow, so we added a clear_device option to disable it. Another small optimization is using non_blocking everywhere and syncing just before returning control to the user. This makes the loading slightly faster.
FSDP, DeepSpeed, FP8 minor improvements
Accelerator() configuring by @pstjohn in #3677
🚨🚨🚨 Breaking changes 🚨🚨🚨
find_executable_batch_size() will no longer halve the batch size after every OOM. Instead, we will multiply the batch size by 0.9. This should help users not waste GPU capacity.
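For context, typical usage looks roughly like this; only the decay factor described above changed, not the calling pattern.

from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # On CUDA OOM the decorator frees memory and retries with a smaller
    # batch size (now batch_size * 0.9 instead of halving it).
    print(f"Trying batch_size={batch_size}")

train()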
What's Changed
New Contributors
Full Changelog: huggingface/accelerate@v1.8.1...v1.9.0
v1.8.1: Patchfix (Compare Source)
Full Changelog: huggingface/accelerate@v1.8.0...v1.8.1
v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation (Compare Source)
FSDPv2 refactor + FP8 support
We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!
Faster Distributed Training on Intel CPUs
We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.
Regional Compilation for DeepSpeed
We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!
ipex.optimize deprecation
ipex.optimize is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we'll continue to rely on IPEX for now.
Better XPU Support
We've greatly expanded and stabilized support for Intel XPUs:
Trackers
We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution ! We also deferred all tracker initializations to prevent premature setup of distributed environments.
What's Changed
accelerator().load_state() by @luiz0992 in #3540
dtype_byte_size by @SunMarc in #3625
New Contributors
Full Changelog: huggingface/accelerate@v1.7.0...v1.8.0
v1.7.0: Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA (Compare Source)
Regional compilation
Instead of compiling the entire model at once, regional compilation targets repeated blocks (such as decoder layers) first. This allows the compiler to cache and reuse optimized code for subsequent blocks, significantly reducing the cold start compilation time typically seen during the first inference. Thanks @IlyasMoutawwakil for the feature ! You can view the full benchmark here, and check out our updated compilation guide for more details!
To enable this feature, set use_regional_compilation=True in the TorchDynamoPlugin configuration.
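For example, a minimal sketch, assuming the plugin is passed to Accelerator via dynamo_plugin:

from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Compile repeated blocks (e.g. decoder layers) individually instead of the full model.
dynamo_plugin = TorchDynamoPlugin(backend="inductor", use_regional_compilation=True)
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)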
Layerwise casting hook
We've introduced a new hook that enables per-layer upcasting and downcasting (e.g., for Linear layers) during inference. This allows users to run models with separate storage and compute dtypes, resulting in memory savings. The concept was first implemented in diffusers, where downcasting models to FP8 proved effective without major quality degradation. Contributed by @sayakpaul in #3427.
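To illustrate the idea only (this is not the accelerate API from #3427, just a conceptual sketch using plain PyTorch hooks, with bf16 storage and fp32 compute so it runs anywhere):

import torch
import torch.nn as nn

def attach_casting_hooks(model, storage_dtype=torch.bfloat16, compute_dtype=torch.float32):
    # Keep Linear weights in a cheap storage dtype and upcast them to the
    # compute dtype only for the duration of each forward call.
    def upcast(module, args):
        module.to(compute_dtype)
        return None  # leave the inputs untouched

    def downcast(module, args, output):
        module.to(storage_dtype)
        return None

    for submodule in model.modules():
        if isinstance(submodule, nn.Linear):
            submodule.to(storage_dtype)
            submodule.register_forward_pre_hook(upcast)
            submodule.register_forward_hook(downcast)

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
attach_casting_hooks(model)
print(model(torch.randn(2, 16)).dtype)  # compute runs in float32, weights rest in bfloat16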
Better FSDP2 support
This release includes numerous new features and bug fixes. Notably, we've added support for FULL_STATE_DICT, a widely used option in FSDP, now enabling .save_pretrained() in transformers to work with FSDP2-wrapped models. QLoRA training is now supported as well, but more testing is needed. We have also resolved a backend issue related to parameter offloading to CPU. Additionally, a significant memory spike that occurred when cpu_ram_efficient_loading=True was enabled has been fixed. Several other minor improvements and fixes are also included; see the What's Changed section for full details.
FULL_STATE_DICT have been enabled by @S1ro1 in #3527
cpu_ram_efficient_loading=True by @S1ro1 in #3482
Better HPU support:
We have added documentation for Intel Gaudi hardware!
The support has been available since v1.5.0 through this PR.
Torch.compile breaking change for the dynamic argument
We've updated the logic for setting self.dynamic to explicitly preserve None rather than defaulting to False when the USE_DYNAMIC environment variable is unset. This change aligns the behavior with the PyTorch documentation for torch.compile. Thanks to @yafshar for contributing this improvement in #3567.
What's Changed
low_precision_training guide by @sadra-barikbin in #3488
torch.distributed.checkpoint.state_dict.set_model_state_dict in load_checkpoint_in_model by @ringohoffman in #3432
weights_only=True by @bzhong-solink in #3497
cpu_ram_efficient_loading=True by @S1ro1 in #3482
accelerator.prepare + IPEX for 2+ nn.Models and/or optim.Optimizers by @mariusarvinte in #3517
set_epoch does not take effect. by @hongjx175 in #3556
_cast_and_contiguous by @dlvp in #3559
cpu_ram_efficient_loading by @SumanthRH in #3307
synchronize call for xpu in _gpu_gather by @faaany in #3563
New Contributors
Full Changelog: huggingface/accelerate@v1.6.0...v1.7.0
v1.6.0: FSDPv2, DeepSpeed TP and XCCL backend support (Compare Source)
FSDPv2 support
This release introduces the support for FSDPv2 thanks to @S1ro1.
If you are using Python code, you need to set fsdp_version=2 in FullyShardedDataParallelPlugin (a sketch follows below). If you want to convert a YAML config that contains the FSDPv1 config to an FSDPv2 one, use our conversion tool:
To learn more about the difference between FSDPv1 and FSDPv2, read the following documentation.
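For the Python route mentioned above, a minimal sketch (FSDPv2 needs a distributed launch via accelerate launch, so treat this as illustrative):

from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(fsdp_version=2)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
# model = accelerator.prepare(model)  # wraps the model with FSDPv2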
DeepSpeed TP support
We have added initial support for DeepSpeed + TP. Not many changes were required as the DeepSpeed API was already compatible. We only needed to make sure that the dataloader was compatible with TP and that we were able to save the TP weights. Thanks @inkcherry for the work! #3390.
To use TP with DeepSpeed, you need to update the setting in the DeepSpeed config file by including the tensor_parallel key:
More details in this DeepSpeed PR.
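The elided config snippet likely resembles the following Python dict; the nested autotp_size field is an assumption, so refer to the linked DeepSpeed PR for the exact schema.

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},
    # Assumption: the TP degree lives under a "tensor_parallel" key as the note says;
    # the sub-field name below is a guess.
    "tensor_parallel": {"autotp_size": 2},
}
ds_plugin = DeepSpeedPlugin(hf_ds_config=ds_config)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)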
Support for XCCL distributed backend
We've added support for XCCL, an Intel distributed backend that can be used with XPU devices. More details in this torch PR. Thanks @dvrogozh for the integration.
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about these updates again.
To execute skipped test pipelines write comment /ok-to-test.
Documentation
Find out how to configure dependency updates in MintMaker documentation or see all available configuration options in Renovate documentation.