### 🤩 Prepare your own dataset

Besides the provided ShareGPT/Ultrachat datasets, you can also prepare your own dataset. We support two formats:

#### Option 1: Conversation Format

You should prepare the dataset in jsonl format and the schema should look like this:

```json
{
    ...
}
```
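
For reference, here is a minimal sketch of writing such a `jsonl` file with Python's standard library; the record contents and the file path are placeholders, so fill in the fields according to the schema above.

```python
import json

# Placeholder records for illustration only: populate each dict with the
# exact fields required by the conversation schema shown above.
records = [
    {"id": "sample-0"},  # plus your conversation fields
    {"id": "sample-1"},  # plus your conversation fields
]

# jsonl format: one JSON object per line.
with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```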

#### Option 2: Pre-formatted Text Format

If you already have conversations formatted with a specific chat template, you can use the pre-formatted text directly:

```json
{
    "id": "xxxx",
    "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there!<|im_end|>\n"
}
```

This format is useful when you have pre-formatted prompts that were used during training of the target model, together with raw generations from the target model.

To use a pre-formatted dataset, add the `--is-preformatted` flag to your training command. Note that the `--chat-template` parameter is still required and must match the template used in your pre-formatted text: it is used to identify the user/assistant tokens, determine the assistant spans, and generate the corresponding loss mask.
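
As a rough sketch of how a pre-formatted record can be produced, the snippet below renders a conversation with a Hugging Face tokenizer's `apply_chat_template` and writes it as a single `jsonl` line. The model name here is only an example; use the tokenizer of your own target model so that the rendered text matches the template you pass via `--chat-template`.

```python
import json

from transformers import AutoTokenizer

# Example tokenizer only: swap in the tokenizer of your target model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

# Render the conversation into one pre-formatted string using the
# tokenizer's built-in chat template.
text = tokenizer.apply_chat_template(conversation, tokenize=False)

with open("my_preformatted_dataset.jsonl", "w", encoding="utf-8") as f:
    record = {"id": "sample-0", "text": text}
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

You would then point your training command at this file with the `--is-preformatted` flag and the matching `--chat-template` value.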
Once you have the `jsonl` file ready, you can go straight to online training or hidden states generation for offline training.

If you have multiple datasets, you can just merge them into one jsonl file. For example, you can do something like this

#### Customize Draft Model

If you want to change the draft model configuration, you can write your own configuration file and pass its path to the `--draft-model-config` argument. Alternatively, if you do not provide the `--draft-model-config` argument, the script will automatically generate the draft model configuration based on the target model configuration. If you wish to serve your customized draft model with SGLang, make sure you implement the draft model in SGLang as well and that the architecture names match. To implement your own draft model, you can create a new class that inherits from the `Eagle3DraftModel` class in the `specforge.modeling.draft.base.py` file.
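
To give an idea of the shape of such a class, here is a very rough sketch; the constructor arguments and the methods you actually need to override are defined by the `Eagle3DraftModel` base class, so treat the names and signatures below as placeholders rather than the real API.

```python
# Sketch only: consult specforge/modeling/draft/base.py for the real
# constructor arguments and the methods Eagle3DraftModel requires.
from specforge.modeling.draft.base import Eagle3DraftModel


class MyTinyDraftModel(Eagle3DraftModel):
    """Hypothetical custom draft model. Keep the architecture name in sync
    with the draft model implementation you register in SGLang for serving."""

    def __init__(self, config, *args, **kwargs):
        # Placeholder constructor: forward the config to the base class and
        # build the layers described by your --draft-model-config here.
        super().__init__(config, *args, **kwargs)

    def forward(self, *args, **kwargs):
        # Placeholder: implement the draft forward pass expected by the trainer.
        raise NotImplementedError
```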