Replies: 1 comment
-
Could you clarify what you mean by "without pre-tokenizing"? The dataset in your first code block is not tokenized; Axolotl tokenizes it.
Hi @winglian,
I have a question about fine-tuning a model without pre-tokenizing the dataset; I am not sure which configuration settings are correct for this.
If the original fine-tuning configuration, which pre-tokenizes the dataset, is as follows:
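(Sketching it with placeholder values, since my actual dataset path and prompt format are not the point; the path and type below are hypothetical:)

```yaml
# Standard fine-tuning setup: Axolotl tokenizes the dataset up front
# and caches it under dataset_prepared_path before training starts.
datasets:
  - path: my_org/my_finetune_data   # hypothetical dataset path
    type: alpaca                    # hypothetical prompt format
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
```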
Would replacing it with the following configuration be appropriate for this need:
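(Again with a placeholder path; as I understand it, `pretraining_dataset` streams and tokenizes the data on the fly rather than pre-tokenizing it, and the exact keys may differ between Axolotl versions:)

```yaml
# Streaming setup: the dataset is tokenized on the fly instead of
# being pre-tokenized and cached.
pretraining_dataset: my_org/my_finetune_data  # hypothetical dataset path
max_steps: 1000  # a streamed dataset has no known length, so a step count is typically needed
```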
Would replacing the initial configuration with the `pretraining_dataset` configuration be suitable for my purpose of fine-tuning without pre-tokenizing? Are there any specific implications or differences I should be aware of when opting for the `pretraining_dataset` configuration over the `datasets` configuration in this context?
I look forward to your guidance. Thank you in advance.