How should pre-tokenized data be configured? #2646

Foreist · 2025-05-07T14:27:00Z

https://docs.axolotl.ai/docs/dataset-formats/tokenized.html

{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}

I need to structure my data as shown above, but I'd like some clarification.
For example:

<bos><start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn><eos>

In labels, I don't know which tokens should be set to -100 and which should keep the actual token ID.
In the example above, should the tokens from “<start_of_turn>model” to “” keep their actual token IDs rather than -100?
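For reference, a minimal sketch of a sanity check for one pre-tokenized record like the JSON line above. It assumes the usual causal-LM convention that every label is either -100 (ignored by the loss) or the same ID as the input token at that position; `check_record` is a hypothetical helper written for this thread, not part of axolotl.

```python
import json

IGNORE_INDEX = -100  # label value that the loss ignores


def check_record(record):
    """Return a list of problems found in one pre-tokenized record."""
    problems = []
    for key in ("input_ids", "attention_mask", "labels"):
        if key not in record:
            problems.append(f"missing field: {key}")
    if problems:
        return problems

    n = len(record["input_ids"])
    if len(record["attention_mask"]) != n or len(record["labels"]) != n:
        problems.append("fields differ in length")
        return problems

    # Assumption: a label is either masked (-100) or a copy of the input ID.
    for i, (tok, lab) in enumerate(zip(record["input_ids"], record["labels"])):
        if lab != IGNORE_INDEX and lab != tok:
            problems.append(f"position {i}: label {lab} is neither -100 nor input id {tok}")
    return problems


line = '{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}'
print(check_record(json.loads(line)))  # prints []
```

An empty list means the record has the three fields, equal lengths, and labels that are either masked or copied from `input_ids`.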

NanoCode012 · 2025-05-07T14:32:00Z

Maintainer

Are you sure you want to bring your own dataset in such a case? Maybe use our chat_template processing and let us handle the masking for simplicity? https://docs.axolotl.ai/docs/dataset-formats/#conversation-dataset

To answer your question, it really depends on your use case, but in general you would set the label to -100 for every token that is not part of an assistant response.
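If you do build the labels yourself, the rule above can be sketched roughly like this. It is only an illustration: the token IDs are made up, `build_labels` is a hypothetical helper, and whether turn markers such as the end-of-turn token count as part of the assistant segment is a choice you make when you split the conversation into segments.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def build_labels(segments):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    Tokens in "assistant" segments keep their ID as the label;
    tokens in every other segment are masked with -100.
    """
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


# Made-up token IDs standing in for a two-round conversation.
segments = [
    ("user", [10, 11, 12]),
    ("assistant", [20, 21]),
    ("user", [13]),
    ("assistant", [22, 23, 24]),
]

input_ids, labels = build_labels(segments)
attention_mask = [1] * len(input_ids)
print(labels)  # prints [-100, -100, -100, 20, 21, -100, 22, 23, 24]
```

The three lists can then be written out as one JSON record per line in the format shown in the question.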

9 replies

Foreist May 8, 2025
Author

But something isn't right.
When I run preprocess --debug with gemma3, the <end_of_turn> at the end is masked.

Gemma who?<end_of_turn>
<eos>

-> preprocess --debug

Gemma who? -> tokenizing
<end_of_turn> -> -100
<eos> -> tokenizing
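A quick way to check for this kind of issue on any tokenized example is to pair each token with its label and list the special tokens that ended up masked. This is a rough sketch: the token split and IDs below are invented to mirror the observation above, and `masked_tokens` is a hypothetical helper, not the output format of preprocess --debug.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def masked_tokens(tokens, labels, watch):
    """Return the watched tokens that carry a -100 label somewhere."""
    return sorted(
        {tok for tok, lab in zip(tokens, labels) if lab == IGNORE_INDEX and tok in watch}
    )


# Invented token strings and label IDs for the last assistant turn.
tokens = ["Gemma", " who", "?", "<end_of_turn>", "<eos>"]
labels = [501, 502, 503, IGNORE_INDEX, 1]

print(masked_tokens(tokens, labels, {"<end_of_turn>", "<eos>"}))  # prints ['<end_of_turn>']
```

If the end-of-turn token shows up in this list, the model is not being trained to emit it at the end of its replies.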

NanoCode012 May 8, 2025
Maintainer

Sorry, I'm not clear on what you mean. Is the output above from using our chat_template masking?

If this is from your own pre-tokenized data, we don't do anything to it.

Foreist May 8, 2025
Author

Yeah, I used axolotl's chat_template masking.

NanoCode012 May 8, 2025
Maintainer

See example 5, you need to set eot_tokens https://docs.axolotl.ai/docs/dataset-formats/conversation.html#examples

Answer selected by Foreist

Foreist May 8, 2025
Author

Sorry, my bad. Thanks.
