🔥 MFTCoder-accelerate supports finetuning most mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, and qwen.
## 2. Data Format
### 2.1 Training Data Format
The training data must be in a uniform JSONL format, where each line is a single JSON object in the following "chatML"-style format. The "chat_rounds" field is required; other fields can be added or removed based on specific needs.

We chose the "chatML" style as our training and inference data format because it is compatible with both "conversation" and "instruction/response" scenarios.
For the role keys in "chat_rounds", you can use either the "system/human/bot" tuple or the "system/user/assistant" tuple.
```json
{
    "chat_rounds": [
        {
            "role": "system",
            "content": "You are an expert in coding and help answer code questions"
        },
        {
            "role": "human",
            "content": "Write a python function of quick sort"
        },
        {
            "role": "bot",
            "content": "Below is the function of quick sort: ..."
        },
        {
            "role": "human",
            "content": "Explain the code"
        },
        {
            "role": "bot",
            "content": "OK, this code ..."
        }
    ]
}
```
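For reference, below is a minimal Python sketch (not part of MFTCoder itself; the file name and sample contents are made up for illustration) showing how samples in this format can be written to a JSONL file, one JSON object per line:

```python
import json

# Hypothetical example: two training samples in the chatML-style format.
samples = [
    {
        "chat_rounds": [
            {"role": "system", "content": "You are an expert in coding and help answer code questions"},
            {"role": "human", "content": "Write a python function of quick sort"},
            {"role": "bot", "content": "Below is the function of quick sort: ..."},
        ]
    },
    {
        "chat_rounds": [
            {"role": "user", "content": "Explain what a decorator is"},
            {"role": "assistant", "content": "A decorator is ..."},
        ]
    },
]

# Each line of the JSONL file holds exactly one JSON object.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```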
### 2.2 Default Inference Data Format
The inference data format is the actual string format consumed by tokenizers and then by LLMs. It is also the string format to which the training data is converted before tokenization.

The default inference data format consists of strings concatenated from the conversation data (system, human, and bot contents) in the training data format.
It is the data "seen" (before tokenization) by the model during training.
It is also used as input during the inference process.
Here is an example format of the inference string:

```
"""
<s>system
System instruction
...
"""
```
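To illustrate how training samples map onto such a string, here is a small Python sketch. The exact template (role markers, newlines, and any end-of-turn/EOS tokens) is defined by MFTCoder's tokenization code, so treat the format below as an assumption based on the example above, not as the authoritative template:

```python
# Hypothetical helper: concatenate chat rounds into an inference-style string.
# The "<s>{role}" prefix mirrors the example above; the real MFTCoder template
# (including EOS/end-of-turn handling) may differ.
def build_inference_string(chat_rounds):
    parts = []
    for chat_round in chat_rounds:
        parts.append(f"<s>{chat_round['role']}\n{chat_round['content']}\n")
    return "".join(parts)

chat_rounds = [
    {"role": "system", "content": "You are an expert in coding and help answer code questions"},
    {"role": "human", "content": "Write a python function of quick sort"},
]
print(build_inference_string(chat_rounds))
```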
Frequently used arguments are provided in ```configs/***_train_config``` and explained below:
- **pretrained_model_path**: Local/shared disk path or model name on HuggingFace for the pre-trained model.

- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present; "self-paced" is supported but needs hyperparameter tuning.
- **padding_mode**: How tokenized data is arranged into sequences. "padding" pads each sample up to seq_length; "pack" packs as many samples as possible into each seq_length-long sequence (see the sketch below).
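Below is a small, framework-agnostic Python sketch of the difference between the two modes; the token IDs, pad ID, and seq_length are made up for illustration, and MFTCoder's actual data pipeline may differ in details:

```python
# Illustrative only: three tokenized samples and a target sequence length.
seq_length = 8
pad_id = 0
samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

# "padding": each sample becomes its own sequence, padded to seq_length.
padded = [s + [pad_id] * (seq_length - len(s)) for s in samples]
# -> [[11, 12, 13, 0, 0, 0, 0, 0], [21, 22, 0, ...], [31, 32, 33, 34, 0, ...]]

# "pack": consecutive samples are concatenated so each sequence holds as
# many samples as fit, then padded to seq_length.
packed, current = [], []
for s in samples:
    if len(current) + len(s) > seq_length:
        packed.append(current + [pad_id] * (seq_length - len(current)))
        current = []
    current += s
if current:
    packed.append(current + [pad_id] * (seq_length - len(current)))
# -> [[11, 12, 13, 21, 22, 0, 0, 0], [31, 32, 33, 34, 0, 0, 0, 0]]
```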