
Commit c7e59dd

Authored by stas00, S1ro1, sfc-gh-sbekman, SunMarc, and kashif
Deepspeed Ulysses/ALST integration (#3817)
* Feat: initial impl
* improve
* s/flavour/backend/
* style + ver
* better check
* check
* docs + example
* add tests
* add tests
* cleanup
* cleanup
* Apply suggestions from code review (Co-authored-by: Marc Sun)
* add experimental notice
* style
* new deepspeed version
* additional checks + tests
* more docs
* more docs
* working now
* style
* update docs
* more robust config parsing
* fix
* Apply suggestions from code review (Co-authored-by: Marc Sun)
* check backend, integrate ulysses API improvement
* style
* fix default to match the doc
* Apply suggestions from code review (Co-authored-by: Marc Sun)
* fix
* deepspeed=0.18.2 is out
* Apply suggestions from code review (Co-authored-by: Kashif Rasul)
* s/cp/sp
* fixes
* Apply suggestions from code review (Co-authored-by: Marc Sun)
* Update src/accelerate/parallelism_config.py (Co-authored-by: Marc Sun)
* suggestion
* Update docs/source/concept_guides/sequence_parallelism.md (Co-authored-by: Marc Sun)
* Update sequence_parallelism.md
* fix
* fix
* fix
* Apply suggestion from @kashif (repeated 11 times)

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
1 parent e8f241f commit c7e59dd

File tree

19 files changed: +943 −35 lines changed

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -94,6 +94,8 @@
     title: FSDP1 vs FSDP2
   - local: concept_guides/context_parallelism
     title: Context parallelism
+  - local: concept_guides/sequence_parallelism
+    title: Sequence parallelism
   - local: concept_guides/low_precision_training
     title: Low precision training methods
   - local: concept_guides/training_tpu

docs/source/concept_guides/context_parallelism.md

Lines changed: 4 additions & 2 deletions
@@ -17,6 +17,8 @@ rendered properly in your Markdown viewer.

 This guide will cover basics of using context parallelism in 🤗`accelerate`, for the more curious readers, we will also cover some technicalities in the later sections.

+See also the very related [Guide to Sequence Parallelism](./sequence_parallelism.md).
+
 ## Why context parallelism?

 With the advent of large language models, and recently reasoning models, the sequence length has been growing rapidly. This, combined with quadratic memory complexity of attention, has led to a need for more efficient ways to train models with long sequences.
@@ -176,8 +178,8 @@ You can directly see this issue in the profiler output in the image below:

 ## Why only FSDP2?

-We only support context parallelism with `FSDP2`, as we create a joint mesh of `context_parallel_size` and `dp_shard_size` to
-utilize its full potential.
+We only support context parallelism with `FSDP2`, as we create a joint mesh of `context_parallel_size` and `dp_shard_size` to
+utilize its full potential.
 How it works is: we shard the model across the joint mesh of size `cp_size*dp_shard_size`, which maximizes the memory savings.
 This is a "free lunch" of sorts, as `FSDP` communication is fully overlapped with the computation of attention, as shown in the images below.
docs/source/concept_guides/sequence_parallelism.md

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Sequence parallelism in 🤗`accelerate`

This guide will cover the basics of using sequence parallelism in 🤗`accelerate`.

See also the closely related [Context Parallelism](./context_parallelism.md) guide.

## Why sequence parallelism?

With the advent of large language models, and recently reasoning models, the sequence length has been growing rapidly. This, combined with the quadratic memory complexity of attention, has led to a need for more efficient ways to train models with long sequences.
With a sequence length of 128k, the memory requirement of the attention matrix is `128k * 128k * 2 bytes * num_heads = ~32 GB * num_heads` for `bf16` precision, given a vanilla attention implementation. Granted, with the use of `flash attention` or `SDPA`, which do not materialize these attention weights, this decreases drastically, but the growth in memory requirements is still considerable.
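
To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch (plain Python, no distributed setup) that estimates the attention-matrix memory for a given sequence length; the numbers and the 2-bytes-per-element `bf16` assumption come straight from the paragraph above.

```python
# Back-of-the-envelope estimate of the attention score matrix size for vanilla
# attention, per layer, if the full (seq_len x seq_len) matrix were materialized.
def attn_matrix_gib(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> float:
    """bytes_per_elem=2 corresponds to bf16."""
    return seq_len * seq_len * bytes_per_elem * num_heads / 2**30

# 128k tokens, a single head: ~32 GiB, matching the estimate in the text above.
print(f"{attn_matrix_gib(128 * 1024, num_heads=1):.1f} GiB per head")
# e.g. 32 heads: ~1 TiB if the scores were ever materialized in full.
print(f"{attn_matrix_gib(128 * 1024, num_heads=32):.0f} GiB for 32 heads")
```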

Ulysses Sequence Parallelism allows us to shard the inputs to the attention computation along the sequence dimension and compute the attention normally, but using only a slice of attention heads on each GPU. With this, we can train models with long sequences and, with a few more tools, scale to 15M+ sequence length. To see how to augment Ulysses SP with TiledMLP, Liger-Kernel, activation checkpoint offload to CPU and a few other tricks, please refer to the paper: [Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences](https://arxiv.org/abs/2506.13996).
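
The core trick is an all-to-all exchange that trades sequence sharding for head sharding around the attention call. The snippet below is only an illustrative, single-process simulation of that re-partitioning (the real exchange is a distributed all-to-all handled by DeepSpeed); the shapes and the `sp=4` degree are made up for the example.

```python
import torch

# Illustrative single-process simulation of the Ulysses all-to-all:
# before attention, each of the `sp` ranks holds all heads for 1/sp of the sequence;
# after the exchange, each rank holds 1/sp of the heads for the full sequence.
sp, batch, seq, heads, head_dim = 4, 1, 16, 8, 2  # toy sizes

x = torch.randn(batch, seq, heads, head_dim)

# what each rank holds before attention: full heads, a slice of the sequence
seq_shards = list(x.chunk(sp, dim=1))  # sp tensors of [batch, seq/sp, heads, head_dim]

# the all-to-all re-partitions those shards so that each rank ends up with
# the whole sequence but only heads/sp attention heads
head_shards = [
    torch.cat([shard.chunk(sp, dim=2)[rank] for shard in seq_shards], dim=1)
    for rank in range(sp)
]

print(seq_shards[0].shape)   # torch.Size([1, 4, 8, 2])  -> sequence-sharded
print(head_shards[0].shape)  # torch.Size([1, 16, 2, 2]) -> head-sharded, full sequence
```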

## How is Ulysses SP different from FSDP CP?

The document [Context Parallelism](./context_parallelism.md) describes another technology, Context Parallelism, which also slices on the sequence dimension but uses Ring Attention instead of slicing on the head dimension.

The following articles go into a very detailed explanation of the differences between the two technologies:
- https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/
- https://huggingface.co/blog/exploding-gradients/ulysses-ring-attention

A quick summary, adapted from one of the articles:
- Ulysses SP has a relatively low communication overhead, but is limited by the number of attention heads and thus has certain requirements for the network topology (the number of attention heads has to be divisible by the number of participating GPUs for a single replica; a quick check is sketched right after this list). Its all-to-all communication is sensitive to latency, and it requires DeepSpeed.
- FSDP CP's Ring-Attention P2P ring communication has no such divisibility requirement, but it has a higher communication volume.
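
A minimal sketch of the divisibility constraint mentioned above, using hypothetical numbers (a 32-head model and a few candidate SP degrees); this is not an official Accelerate or DeepSpeed check, just the arithmetic:

```python
# The Ulysses head-parallelism constraint: num_attention_heads must be divisible
# by the sequence-parallel degree, since each rank processes heads // sp_size heads.
def check_ulysses_sp(num_attention_heads: int, sp_size: int) -> None:
    if num_attention_heads % sp_size != 0:
        raise ValueError(
            f"sp_size={sp_size} does not divide num_attention_heads={num_attention_heads}"
        )
    print(f"sp_size={sp_size}: OK, {num_attention_heads // sp_size} heads per rank")

check_ulysses_sp(num_attention_heads=32, sp_size=4)   # OK, 8 heads per rank
check_ulysses_sp(num_attention_heads=32, sp_size=8)   # OK, 4 heads per rank
# check_ulysses_sp(num_attention_heads=32, sp_size=6)  # would raise ValueError
```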

Finally, it should be possible to combine SP + CP, as explained in the paper [USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](https://arxiv.org/abs/2405.07719), to support an even longer sequence length, albeit this is not yet integrated into 🤗`accelerate`.


## Supported sequence parallelism backends

Currently the only sequence parallelism backend is `deepspeed`, which comes from the modernized Ulysses SP that is part of the [Arctic Long Sequence Training technology](https://arxiv.org/abs/2506.13996). There is also a [tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/) should you want to integrate it into your own code directly.

## How to use sequence parallelism?

```diff
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

+# Example: 4 GPUs with sp_size=4, dp_shard_size=1
+# Ensure: dp_replicate_size × dp_shard_size × sp_size = 1 × 1 × 4 = 4 GPUs
parallelism_config = ParallelismConfig(
+    sp_backend="deepspeed",
+    sp_size=4,
+    dp_shard_size=1,  # Explicit: no data parallelism
+    sp_handler=DeepSpeedSequenceParallelConfig(
+        sp_seq_length_is_variable=True,
+        sp_attn_implementation="sdpa",
+    ),
+)

accelerator = Accelerator(
    ...,
    parallelism_config=parallelism_config,
)
```

As with any other feature in 🤗`accelerate`, you can also enable sequence parallelism by passing the corresponding flags to `accelerate launch`. In this case, it's no different:

```bash
accelerate launch --parallelism-config-sp-size 8 ...
```

> [!Tip]
> You can also set `sp_size` and the other configuration in the `accelerate config` command, which will save them in your `accelerate` configuration file, so you don't have to pass them every time you launch your script.

> [!Tip]
> Sequence parallelism combines with data parallelism. It doesn't require additional GPUs.
> So if you have 8 GPUs you can do: `--parallelism-config-dp-shard-size 8 --parallelism-config-sp-size 8`. Or you can use the `ParallelismConfig` class to set them programmatically.
>
> **Important**: You must ensure `dp_replicate_size × dp_shard_size × sp_size = num_processes`. For example, with 8 GPUs and `sp_size=8`, you need `dp_shard_size=1` (since 1 × 1 × 8 = 8). With 4 GPUs and `sp_size=2`, you could use `dp_shard_size=2` (since 1 × 2 × 2 = 4) for 2D parallelism.
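
The product rule in the note above is easy to sanity-check up front. A minimal sketch (a hypothetical helper, not part of the Accelerate API):

```python
# Hypothetical pre-flight check for the rule
# dp_replicate_size * dp_shard_size * sp_size == num_processes.
def check_mesh(num_processes: int, dp_replicate_size: int = 1,
               dp_shard_size: int = 1, sp_size: int = 1) -> None:
    total = dp_replicate_size * dp_shard_size * sp_size
    if total != num_processes:
        raise ValueError(
            f"{dp_replicate_size} x {dp_shard_size} x {sp_size} = {total}, "
            f"but {num_processes} processes were launched"
        )

check_mesh(num_processes=8, sp_size=8)                   # 1 x 1 x 8 = 8, OK
check_mesh(num_processes=4, dp_shard_size=2, sp_size=2)  # 1 x 2 x 2 = 4, OK
```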


## ALST/Ulysses SP backend configuration

ALST/UlyssesSP implements sequence parallelism using attention head parallelism, as explained in [this paper](https://arxiv.org/abs/2506.13996). For simplicity, we reuse the concept and setup of sequence parallelism, which, from the user's perspective, is the same: multiple GPUs are used to process a single batch.

To give a sense of what ALST made possible: it allowed us to train in bf16 with 500K tokens on a single H100 GPU, 3.7M on a single node, and 15M on Llama-8B using just four nodes. This feature of HF Accelerate enables only 1 of the 3 ALST components, so the achievable sequence length will be smaller. You'd want TiledMLP, activation checkpoint offload to CPU, and a few other things enabled to get the full power of ALST. For details, please refer to [this tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/).

To configure the `deepspeed` backend:

```python
# Example: 4 GPUs with sp_size=4, dp_shard_size=1
# Ensure: dp_replicate_size × dp_shard_size × sp_size = 1 × 1 × 4 = 4 GPUs
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_shard_size=1,  # Explicit: no data parallelism
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length=256,
        sp_seq_length_is_variable=True,
        sp_attn_implementation="sdpa",
    ),
)
accelerator = Accelerator(
    ...,
    parallelism_config=parallelism_config,
)
```

- `sp_backend`: set to `deepspeed` here.
- `sp_size` is the degree of sequence parallelism - in the above example it's 4, therefore 4 GPUs will be used to process a single batch (while doing DP=4 over the same GPUs).
- `sp_seq_length` and `sp_seq_length_is_variable` are used to deal with sequence lengths. If `sp_seq_length_is_variable=True`, the backend will work with a sequence length that may change between batches, in which case `sp_seq_length` can be set to anything divisible by the sequence parallel degree, or not set at all; on every `forward`, the sequence variables will be derived from the input. If `False`, then `sp_seq_length` needs to match the batch's sequence length dimension, which then has to be padded to always be the same (a padding sketch follows this list). The default is `True`.
- `sp_attn_implementation` is one of `sdpa`, `flash_attention_2` or `flash_attention_3`. This sequence parallel implementation uses `position_ids` instead of `attention_mask`; therefore, `eager` can't work here until it supports working with `position_ids`. Also, please note that `sdpa` doesn't handle multiple samples packed into one sequence correctly; it will attend to the whole sample as one. If the samples aren't packed, `sdpa` will work correctly. Therefore, Flash Attention is the ideal choice as it always works.
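
The fixed-length case mentioned above (`sp_seq_length_is_variable=False`) requires every batch to arrive with exactly `sp_seq_length` tokens. Below is a minimal, hypothetical collation sketch that pads a batch to a fixed length; the pad token id and the length are placeholders, and this is not the adapter's internal logic.

```python
import torch

def pad_batch_to_fixed_length(input_ids: torch.Tensor, labels: torch.Tensor,
                              sp_seq_length: int, pad_token_id: int = 0):
    """Right-pad a [batch, seq] pair to exactly sp_seq_length tokens."""
    batch, seq = input_ids.shape
    if seq > sp_seq_length:
        raise ValueError(f"batch has {seq} tokens, longer than sp_seq_length={sp_seq_length}")
    pad = sp_seq_length - seq
    input_ids = torch.nn.functional.pad(input_ids, (0, pad), value=pad_token_id)
    labels = torch.nn.functional.pad(labels, (0, pad), value=-100)  # padded positions ignored in the loss
    return input_ids, labels

ids = torch.randint(5, 100, (2, 200))
ids, lbl = pad_batch_to_fixed_length(ids, ids.clone(), sp_seq_length=256)
print(ids.shape, lbl.shape)  # torch.Size([2, 256]) torch.Size([2, 256])
```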

Instead of setting these values in the `DeepSpeedSequenceParallelConfig` object, you can also use environment variables to accomplish the same, with a small example after the list. They correspond, in order, to the options described above:
- `PARALLELISM_CONFIG_SP_BACKEND`
- `PARALLELISM_CONFIG_SP_SEQ_LENGTH`
- `PARALLELISM_CONFIG_SP_SEQ_LENGTH_IS_VARIABLE`
- `PARALLELISM_CONFIG_SP_ATTN_IMPLEMENTATION`
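
A minimal sketch of the environment-variable route; note that these variables have to be visible to the processes that build the parallelism configuration (typically exported in the shell before `accelerate launch`), and the values below are only examples:

```python
import os

# Example values only; in practice you would export these in the shell that runs
# `accelerate launch`, so that every worker process sees them at startup.
os.environ["PARALLELISM_CONFIG_SP_BACKEND"] = "deepspeed"
os.environ["PARALLELISM_CONFIG_SP_SEQ_LENGTH"] = "256"
os.environ["PARALLELISM_CONFIG_SP_SEQ_LENGTH_IS_VARIABLE"] = "true"
os.environ["PARALLELISM_CONFIG_SP_ATTN_IMPLEMENTATION"] = "sdpa"
```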

If not passed in code, `sp_size` can be set via the `--parallelism_config_sp_size` CLI argument, and the same goes for the other arguments. You can also use the accelerate config file style of configuration, e.g., for 2 GPUs:

```yaml
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: path/to/ds_config.json
machine_rank: 0
num_machines: 1
num_processes: 2
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1 # Must satisfy: 1 × 1 × 2 = 2 num_processes
  parallelism_config_sp_size: 2
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: sdpa
```

As mentioned earlier, Ulysses sequence parallelism is normally overlaid with data parallelism - the same ranks are used for feeding unique data streams and also perform Ulysses Sequence Parallelism. But you could also create replicas like so:

```python
# Example: 4 GPUs with 2D parallelism (SP=2, DP=2)
# Ensure: dp_replicate_size × dp_shard_size × sp_size = 2 × 1 × 2 = 4 GPUs
parallelism_config = ParallelismConfig(
    dp_replicate_size=2,
    dp_shard_size=1,  # Explicit: no sharding within replicas
    sp_size=2,
    sp_backend="deepspeed",
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```
Here we use 4 GPUs, with 2 sequence parallelism replicas. DeepSpeed ZeRO is what drives the data parallelism here.
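
To visualize what the 2×2 layout above means, here is a small, purely illustrative sketch of how 4 global ranks could be grouped into 2 data-parallel replicas of SP degree 2; the actual grouping is handled by Accelerate's device mesh, and the ordering below is just one plausible convention:

```python
# Purely illustrative grouping of 4 ranks into a (dp_replicate=2, sp=2) grid.
dp_replicate_size, sp_size = 2, 2
world_size = dp_replicate_size * sp_size

for rank in range(world_size):
    dp_group = rank // sp_size  # which data-parallel replica this rank belongs to
    sp_rank = rank % sp_size    # this rank's position inside its SP group
    print(f"global rank {rank}: replica {dp_group}, sp rank {sp_rank}")
# global rank 0: replica 0, sp rank 0
# global rank 1: replica 0, sp rank 1
# global rank 2: replica 1, sp rank 0
# global rank 3: replica 1, sp rank 1
```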

Please note that a lot of magic is hidden inside [UlyssesSPDataLoaderAdapter](https://github.com/deepspeedai/DeepSpeed/blob/64c0052fa08438b4ecf4cae30af15091a92d2108/deepspeed/runtime/sequence_parallel/ulysses_sp.py#L442). It's used behind the scenes, wrapping your original DataLoader object, but you should be aware of it should you run into any problems. It also automatically injects the correct `shift_labels` into the batch dictionary before the batch gets sharded across the participating ranks.
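
For readers who haven't met `shift_labels` before: it is the labels tensor already shifted by one position so that token *t* is labelled with token *t+1*, which lets each sequence shard compute its loss locally without needing the neighbouring shard's first token. A minimal illustration (not the adapter's actual code):

```python
import torch

# Minimal illustration of what an injected `shift_labels` looks like:
# each position is labelled with the *next* token, and the final position,
# which has no next token, is masked with -100 so the loss ignores it.
labels = torch.tensor([[10, 11, 12, 13]])
shift_labels = torch.full_like(labels, -100)
shift_labels[:, :-1] = labels[:, 1:]
print(shift_labels)  # tensor([[ 11,  12,  13, -100]])
```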

Now the only remaining piece needed to start using ALST/UlyssesSP is to aggregate the loss across the ranks using a differentiable `all_gather`, to get the grads right. The following code does that, while also excluding any tokens masked out with `-100`, to get the correct average:

```python
import torch
import torch.distributed.nn  # provides the differentiable all_gather used below

sp_size = parallelism_config.sp_size if parallelism_config is not None else 1
if sp_size > 1:
    sp_group = accelerator.torch_device_mesh["sp"].get_group()
    sp_world_size = parallelism_config.sp_size

# Normal training loop
for iter, batch in enumerate(dl):
    optimizer.zero_grad()

    batch = move_to_device(batch, model.device)
    outputs = model(**batch)

    # only if not using liger-kernel
    shift_labels = batch["shift_labels"]
    loss = unwrapped_model.loss_function(
        logits=outputs.logits,
        labels=None,
        shift_labels=shift_labels,
        vocab_size=unwrapped_model.config.vocab_size,
    )

    if sp_size > 1:
        # differentiable weighted per-shard-loss aggregation across ranks
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        # special handling for SFT, whose prompt tokens aren't used in the loss computation
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(
            good_tokens, group=sp_group
        )
        # Skip ranks with zero valid tokens to avoid NaN contamination (NaN * 0 = NaN)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)

    # accelerator.print only prints on the main process
    accelerator.print(f"{iter}: {loss=}")
    accelerator.log(dict(train_loss=loss, step=iter))

    accelerator.backward(loss)
    optimizer.step()
```

If you use [Liger Kernel](https://github.com/linkedin/Liger-Kernel), it already knows how to handle `shift_labels`, so you don't need to go through the manual loss calculation; just calling `model(**batch)` will already get the `loss` calculated, and done in a very memory-efficient way. If you didn't know about Liger-Kernel, it's highly recommended, especially for long sequence lengths, since it liberates a lot of working GPU memory that can then be used for handling longer sequences. For example, it performs a fused logit-loss computation, never materializing the full logits tensor in memory.
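
If you want to try it, here is a minimal sketch of swapping in Liger's patched model class; the `AutoLigerKernelForCausalLM` entry point and its import path are taken from the Liger-Kernel README, the checkpoint name is a placeholder, and everything else in the training loop above stays the same:

```python
# Sketch only: load the model through Liger-Kernel's patched Auto class so that
# calling model(**batch) computes the loss with Liger's fused, memory-efficient kernels.
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder checkpoint
# prepare with accelerator as usual; in the loop, outputs = model(**batch) and use outputs.loss
```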

If you want to see what HF Accelerate does behind the scenes, please read [this full integration tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/).

For an example of an Accelerate training loop with ALST/UlyssesSP enabled, see [examples/alst_ulysses_sequence_parallelism](https://github.com/huggingface/accelerate/blob/main/examples/alst_ulysses_sequence_parallelism).

> [!Warning]
> This API is quite new and still in its experimental stage. While we strive to provide a stable API, some small parts of the public API may change in the future.

Since this is a DeepSpeed backend, the usual DeepSpeed configuration applies, so you can also combine sequence parallelism with optimizer state and/or weight offloading to free up more GPU memory and enable an even longer sequence length. This technology has been tested to work with DeepSpeed ZeRO stages 2 and 3.
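
As a concrete illustration of combining SP with offloading, the sketch below extends the kind of ZeRO-3 DeepSpeed config used in the example (see `sp-alst.ds-config.json` further down) with the standard `offload_optimizer`/`offload_param` entries, written out from Python; treat the exact values as placeholders for your own setup.

```python
import json

# Standard DeepSpeed ZeRO-3 offload knobs added on top of the example config;
# exact values are placeholders for your own setup.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": 1,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "seq_parallel_communication_data_type": "bf16",
}

with open("ds_config_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```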
examples/alst_ulysses_sequence_parallelism/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Deepspeed's ALST/Ulysses sequence parallelism

This is an example of the use of Ulysses Sequence Parallelism, which uses attention head parallelism and is part of the Arctic Long Sequence Training project at [ArcticTraining](https://github.com/snowflakedb/ArcticTraining). [This paper](https://arxiv.org/abs/2506.13996) goes into the details of this protocol.

For the nuances of usage, please refer to the main HF Accelerate guide on [Sequence Parallelism](https://huggingface.co/docs/accelerate/en/concept_guides/sequence_parallelism).

You need at least `2` GPUs to enable ALST/Ulysses sequence parallelism.

To run the example with `4` GPUs:

```bash
bash ./sp-alst.sh
```

Change `4` to the desired sequence parallelism degree in these 2 files:
```
sp-alst.accelerate-config.yml:num_processes: 4
sp-alst.py: sp_size=4,
```
examples/alst_ulysses_sequence_parallelism/sp-alst.accelerate-config.yml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: sp-alst.ds-config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
examples/alst_ulysses_sequence_parallelism/sp-alst.ds-config.json

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3
    },
    "gradient_accumulation_steps": 1,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "seq_parallel_communication_data_type": "bf16"
}
