DeepSpeed Ulysses/ALST integration #3817
Conversation
SunMarc left a comment:
Thanks a lot, this is already looking quite nice! Left some minor comments. Please ping me when you have finished the integration!
Ulysses/ALST integration with HF Accelerate:
- Allow `UlyssesSPAttentionHF.register_with_transformers` to receive a `model` object as an argument, to match HF Accelerate's workflow
- Fix existing Ulysses tests to test z2 instead of z1
- Improve documentation
- Add a defensive check

The HF Accelerate PR that depends on this PR is huggingface/accelerate#3817.
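For context, here is a minimal sketch of the call shape the first item enables; the import path and every argument other than `model` are assumptions for illustration and may differ from the actual DeepSpeed signature:

```python
# Hypothetical sketch: registering Ulysses SP attention with an already
# instantiated HF model, as this change allows. Everything except the
# `model` argument is an assumption, not the confirmed DeepSpeed API.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="sdpa")

mpu = UlyssesSPAttentionHF.register_with_transformers(
    model=model,                      # pass the model object directly (new in this change)
    core_attn_implementation="sdpa",  # core attention backend to wrap (assumed name)
    sequence_parallel_size=2,         # Ulysses SP degree (assumed name)
    max_length=1024,                  # maximum sequence length (assumed name)
    micro_batch_size=1,               # per-rank micro batch size (assumed name)
)
```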
So we just have a few conversations above to complete, and otherwise I'm waiting for DeepSpeed to make a new release so that we can anchor on it here. Otherwise it's ready for your complete review, but don't merge just yet until we get the new DS version in. And then we can discuss the HF Trainer integration. Should we somehow mark this API as experimental, to let users use it for a bit and possibly adjust things? If so, please give me an example to follow.
There are some e2e examples in the example/torch_native_parallelism folder but we are not running them in the CI.
Let's try to integrate it into HF Trainer before merging this PR. Once it is tightly coupled to Trainer, even if the API is marked as experimental, we will most likely try to limit breaking changes. For experimental features, we just note it in the docs, like for big model inference (we probably need to remove the warning for that feature).
SunMarc left a comment:
Thanks for adding the docs and the tests, this looks really nice. Just some minor nits
Taking this out of code comments so that it doesn't disappear with 'Resolve conversation'
Those features belong to
Can we merge this PR, @stas00?
Thank you for further improvements/fixes, Kashif. Yes, let's merge it. Thank you.
Merging this PR seems to have invalidated the original CP implementation; see `src/accelerate/accelerator.py`, line 1639 at d1c96ba. However, `parallelism_config.sp_backend` can only be set to "deepspeed".
@egangu so for |
You're right. But what I mean is that the original CP cannot be enabled in the current version, no matter how the user sets the
Thanks for the report @egangu, let me test and fix.
This is the completion of the work started by @S1ro1 in #3782 to integrate ALST/Ulysses long-sequence training into HF Accelerate. Paper: https://arxiv.org/abs/2506.13996. This is Matej's original code with lots of additional work on top, plus docs and tests from me.
Here is the corresponding HF Trainer integration PR: huggingface/transformers#41832
If you want to try it out, please first install DeepSpeed from `deepspeed@master`, as DeepSpeed needed some tweaks to make this integration work. To use this feature a user needs:
- a `ParallelismConfig` (a minimal sketch follows this list)
- `shift_labels` and an aggregation of loss across ranks
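A hedged sketch of the setup, assuming the `sp_size`/`sp_backend` field names discussed in this thread; treat all names here as illustrative rather than the confirmed final API:

```python
# Hedged sketch, not the confirmed final API: enable DeepSpeed Ulysses SP
# through Accelerate's ParallelismConfig. The sp_size/sp_backend field
# names are assumptions based on the discussion in this PR.
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

pc = ParallelismConfig(
    sp_size=4,               # sequence-parallel degree: each sequence is sharded 4 ways (assumed name)
    sp_backend="deepspeed",  # route SP through DeepSpeed Ulysses (assumed name)
)
accelerator = Accelerator(parallelism_config=pc)
```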
Quality validation

I wrote 3 accelerate-based scripts (attached at the end of the OP):
The loss checks out with very small variations due to the precision loss in aggregation.
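For illustration, a hedged sketch of the kind of token-weighted cross-rank loss aggregation being validated here; the helper name and the `sp_group` process-group handle are assumptions, not code from this PR:

```python
# Hedged illustration: average the per-shard losses across the Ulysses SP
# group, weighted by each rank's number of target tokens, so the result
# matches the unsharded single-rank loss up to floating-point error.
import torch
import torch.distributed as dist

def aggregate_sp_loss(loss: torch.Tensor, num_tokens: int, sp_group) -> torch.Tensor:
    # pack (loss * tokens, tokens) so a single all_reduce yields both sums
    packed = torch.stack([
        loss.detach() * num_tokens,
        torch.tensor(float(num_tokens), device=loss.device, dtype=loss.dtype),
    ])
    dist.all_reduce(packed, op=dist.ReduceOp.SUM, group=sp_group)
    return packed[0] / packed[1]  # token-weighted mean over the SP group
```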
TODO
These are really needed for the HF Trainer PR huggingface/transformers#41832, but since it anchors on Accelerate, let's make the dependency here instead.
Scripts used to perform the quality validation
The scripts and config files are:
You run the `.sh` files.

cc: @SunMarc