How should pre-tokenized data be configured? #2646
https://docs.axolotl.ai/docs/dataset-formats/tokenized.html
{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
I need to organize my data in the format shown above, but I'd like some additional clarification.
In labels, I don't know which tokens should be set to -100 and which should keep the actual token ID.
Replies: 1 comment 9 replies
Are you sure you want to bring your own pre-tokenized dataset in this case? It may be simpler to use our chat_template processing and let us handle the masking: https://docs.axolotl.ai/docs/dataset-formats/#conversation-dataset
To answer your question: it depends on your use case, but typically you would set -100 for every token that is not part of an assistant response, and keep the actual token ID for the assistant tokens.
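For illustration, here is a minimal sketch of that masking rule in plain Python. It assumes you have already tokenized each turn yourself and know its role. The `build_example` helper and the token IDs are hypothetical and not part of Axolotl; -100 is the label value the loss ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def build_example(turns):
    """Build one pre-tokenized row from (role, token_ids) pairs.

    Hypothetical helper: assumes each turn is already tokenized,
    including any template/special tokens that belong to it.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Train on assistant tokens: label equals the token ID.
            labels.extend(token_ids)
        else:
            # Mask everything else (system, user, ...) out of the loss.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Made-up token IDs, for illustration only.
turns = [
    ("user", [271, 299]),
    ("assistant", [99, 2]),
]
example = build_example(turns)
print(example)
```

Whether the end-of-turn token after an assistant reply should be trained on depends on your template; in this sketch it is simply treated as part of the assistant turn.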
See example 5; you need to set eot_tokens: https://docs.axolotl.ai/docs/dataset-formats/conversation.html#examples