-
Nope, context parallelism works with sample packing and SFT. We use ring-flash-attn (and flash attention) to make this work.
@djsaunde, does SP work with pretraining?
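For reference, the setup described in the reply above might look roughly like this in an axolotl YAML config. This is a hedged sketch: `sample_packing` and `flash_attention` are documented axolotl options, while `context_parallel_size` is assumed from recent axolotl releases, so verify the exact key names against the version you are running.

```yaml
# Hypothetical axolotl config fragment for context parallelism with
# sample packing (verify key names against your axolotl version).
context_parallel_size: 2   # assumed key; splits the sequence across 2 ranks
sample_packing: true       # pack multiple samples into each sequence
flash_attention: true      # ring-flash-attn is used under the hood for CP
```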
-
Hi team, I saw that axolotl recently started supporting context parallelism via accelerate. The accelerate docs (https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/context_parallelism.md) say:
"Context parallelism works only with SDPA and only with no mask or causal mask. We can't properly detect this for you, so it's your responsibility to ensure that you are using SDPA with no mask or causal mask. If you use any other attention implementation, it will raise an error."
Does this mean axolotl's CP can only be used with SDPA, and only for pre-training-style SFT, i.e. with no mask on the prompt beyond the causal mask? Thanks.
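To make the quoted restriction concrete, the attention pattern accelerate's context parallelism supports is SDPA with `is_causal=True` (or no mask at all), rather than an explicit `attn_mask` tensor. A minimal sketch of that supported call, using plain PyTorch with toy tensors:

```python
import torch
import torch.nn.functional as F

# Toy tensors with shape (batch, heads, seq_len, head_dim)
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# Causal masking via the is_causal flag -- the pattern the accelerate
# docs say context parallelism supports. Passing an explicit attn_mask
# tensor instead would fall outside that supported path.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # same shape as q: (1, 2, 8, 16)
```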