ALST/UlyssesSP: more intuitive API wrt variable seqlen (#7656)
As I was integrating ALST/Ulysses SP into HF Accelerate/Trainer, I noticed that the initial `UlyssesSPAttentionHF.register_with_transformers` API was somewhat inflexible and confusing with respect to variable sequence lengths.

This PR deprecates the misleading `max_length` arg name, replaces it with `seq_length`, and makes the latter optional when `seq_length_is_variable` is `True`.

Updated tests and docs.
Signed-off-by: Stas Bekman <[email protected]>
It also creates the NCCL process groups, encapsulated by the `mpu` object it returns.
For the `model_name_or_path` argument you can also pass an already existing HF Transformers `model` object.
`UlyssesSPAttentionHF.register_with_transformers` has to be called before `from_pretrained` is called.
If `seq_length_is_variable` is `True` (which is also the default), `UlyssesSPAttentionHF` will recalculate the shapes on each `forward` from the incoming batch's shapes, in which case you don't need to set `seq_length` and can simply skip it, as in the sketch below.
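A minimal sketch of the variable-length path. The import path, model name, and `sequence_parallel_size` value are illustrative assumptions, and any additional keyword arguments the real API requires are omitted here:

```python
# Sketch only: import path and model name are assumptions, not part of this doc.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

model_name_or_path = "meta-llama/Meta-Llama-3-8B"

# seq_length_is_variable=True is the default: seq_length can be omitted,
# since shapes are re-derived from each incoming batch on `forward`.
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path=model_name_or_path,
    sequence_parallel_size=2,
    seq_length_is_variable=True,
)

# as noted above, from_pretrained must come after register_with_transformers
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```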
If, however, all your batches have an identical sequence length, you can save a few microseconds per run by using the `seq_length_is_variable=False` code path, which pre-measures all shapes once and re-uses them on every run.
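The fixed-length variant, under the same assumptions as the sketch above; `seq_length=8192` is an illustrative value:

```python
# Sketch only: values are illustrative.
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path=model_name_or_path,
    sequence_parallel_size=2,
    seq_length=8192,               # must be divisible by sequence_parallel_size
    seq_length_is_variable=False,  # shapes pre-measured once and re-used
)
```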
If you pass `seq_length`, remember that it has to be divisible by `sequence_parallel_size`. The same divisibility requirement applies to every batch's sequence length, even when you use `seq_length_is_variable=True`.
This takes an existing DataLoader object and returns a new one that shards each batch along the sequence dimension and synchronizes all GPUs of the replica so that each rank receives only its corresponding sequence shard.
It also takes care of replacing `labels` with `shift_labels` in the batch by pre-shifting the labels, which is crucial for correct loss calculation when using Ulysses sequence parallelism.
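A usage sketch, continuing from the `mpu` object returned above. The adapter class name `UlyssesSPDataLoaderAdapter`, its import path, its keyword arguments, and the `mpu` accessor names are assumptions and may differ from the actual API; `train_dataloader` and `device` are assumed to already exist:

```python
# Sketch only: class name, import path, kwargs, and mpu accessors are assumptions.
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPDataLoaderAdapter

sp_dataloader = UlyssesSPDataLoaderAdapter(
    train_dataloader,  # the existing DataLoader
    sp_rank=mpu.get_sequence_parallel_rank(),
    sp_group=mpu.get_sequence_parallel_group(),
    sp_world_size=mpu.get_sequence_parallel_world_size(),
    device=device,
)

for batch in sp_dataloader:
    # each rank receives only its sequence shard, with `labels` already
    # replaced by the pre-shifted `shift_labels`
    ...
```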