I would like to finetune llama2 on long-sequence data (sequence length >= 32K).
I am following the sequence-parallel example below:
https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/deepspeed4science/megatron_long_seq_support/pretrain_gpt_30B_seq_parallel.sh
Unfortunately, the lm loss becomes NaN when I enable rotary positional embeddings.
When I disable rotary positional embeddings, the loss is fine, even though all other parameters/arguments are unchanged.
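
To make the delta concrete, here is a rough sketch of the only difference between the failing and working runs. The flag names follow the standard Megatron-DeepSpeed argument set used by the linked example; the values are placeholders, not necessarily my exact configuration:

```bash
# Long-sequence settings, in the spirit of the linked example script (placeholder values).
SEQ_LEN=32768
LONG_SEQ_ARGS="--seq-length ${SEQ_LEN} \
    --max-position-embeddings ${SEQ_LEN} \
    --ds-sequence-parallel-size 4 \
    --bf16"

# Failing run: rotary positional embeddings enabled -> lm loss goes to NaN.
ROPE_ARGS="--use-rotary-position-embeddings"

# Working run: ROPE_ARGS left empty, everything else identical -> loss is normal.
# ROPE_ARGS=""
```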