I would like to finetune llama2 on long-sequence data (sequence length >= 32K).
I am following the sequence-parallel example below:
https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/deepspeed4science/megatron_long_seq_support/pretrain_gpt_30B_seq_parallel.sh
Unfortunately, the lm loss becomes NaN when I enable rotary positional embeddings.
When I disable rotary positional embeddings, the loss is fine, even though all other parameters/arguments are unchanged.
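
To make the delta concrete, here is a rough sketch of the only difference between the failing and working runs. The flag names follow the standard Megatron-DeepSpeed argument set used by the linked example; the values are placeholders, not necessarily my exact configuration:

```bash
# Long-sequence settings, in the spirit of the linked example script (placeholder values).
SEQ_LEN=32768
LONG_SEQ_ARGS="--seq-length ${SEQ_LEN} \
    --max-position-embeddings ${SEQ_LEN} \
    --ds-sequence-parallel-size 4 \
    --bf16"

# Failing run: rotary positional embeddings enabled -> lm loss goes to NaN.
ROPE_ARGS="--use-rotary-position-embeddings"

# Working run: ROPE_ARGS left empty, everything else identical -> loss is normal.
# ROPE_ARGS=""
```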