
Commit 86c68a8

Commit message: mftcoder-accelerate readme

1 parent 193095f commit 86c68a8

File tree

2 files changed: +29, -41 lines

- mftcoder_accelerate/README.md
- mftcoder_accelerate/README_cn.md

mftcoder_accelerate/README.md

Lines changed: 14 additions & 17 deletions
@@ -16,13 +16,14 @@
 
 🔥 MFTCoder-accelerate supports Full-parameters/QLoRA/LoRA using the accelerate + DeepSpeed framework;
 
-🔥 MFTCoder-accelerate supports Multiple Task Finetuning, which is able to balance different tasks at the data level.
+🔥 MFTCoder-accelerate supports Multitask Fine-Tuning (MFT), which is able to balance different tasks at the data level.
 
-🔥 MFTCoder-accelerate supports finetuning multiple mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen.
+🔥 MFTCoder-accelerate supports finetuning most mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen.
 
 ## 2. Data Format
 ### 2.1 Training Data Format
-The training data is in a uniform JSONL format, in which each line of data has the following JSON format. The "chat_rounds" field is required, and other fields can be added or removed based on specific needs.
+The training data is required to be in a uniform JSONL format, in which each line of data has the following "chatML"-style JSON format. The "chat_rounds" field is required, and other fields can be added or removed based on specific needs.
+We selected the "chatML" style as our training and inference data format because it is compatible with both "conversation" and "instruction/response" scenarios.
 
 For the keys of roles in "chat_rounds", you could use the "system/human/bot" tuple or the "system/user/assistant" tuple.
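
Editor's aside (not part of the commit): a quick sketch of what "uniform JSONL" means in practice for this format, with one JSON object per line and a required chat_rounds field. The file name is illustrative only.

```python
# Illustration: each training sample is one JSON object per line ("JSONL"),
# carrying the required chat_rounds field described above.
import json

sample = {
    "chat_rounds": [
        {"role": "system", "content": "You are an expert in coding and help answer code questions"},
        {"role": "human", "content": "Write a python function of quick sort"},
        {"role": "bot", "content": "Below is the function of quick sort: ..."},
    ]
}

# "train.jsonl" is an illustrative file name, not one mandated by the repo.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```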

@@ -33,40 +34,36 @@ For the keys of roles in "chat_rounds", you could use "system/human/bot" tuple o
 "chat_rounds":[
 {
 "role": "system",
-"content": "You are an expert in coding and help answer code questions",
-"chat_round_id": 0
+"content": "You are an expert in coding and help answer code questions"
 },
 {
 "role": "human",
-"content": "Write a python function of quick sort",
-"chat_round_id": 1
+"content": "Write a python function of quick sort"
 },
 {
 "role": "bot",
-"content": "Below is the function of quick sort: ...",
-"chat_round_id": 1
+"content": "Below is the function of quick sort: ..."
 },
 {
 "role": "human",
-"content": "Explain the code",
-"chat_round_id": 2
+"content": "Explain the code"
 },
 {
 "role": "bot",
-"content": "OK, this code ...",
-"chat_round_id": 2
+"content": "OK, this code ..."
 }
 ]
 }
 ```
 
 ### 2.2 Default Inference Data Format
-The default inference data contains strings concatenated from the conversation data (system, human, and bot contents) in the training data format.
+The inference data format is the actual string format consumed by tokenizers and then by LLMs. It is also the string format into which the training data is converted before tokenization.
+The default inference data format contains strings concatenated from the conversation data (system, human, and bot contents) in the training data format.
 It is used as the data "seen" (before tokenization) by the model during the training process.
 It is used as input during the inference process as well.
-Here is an example format of the concatenated string:
+Here is an example format of the inference string:
 
-```python
+```
 """
 <s>system
 System instruction
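
Editor's aside (not part of the commit, and not the repository's actual implementation): a minimal sketch of how the chatML-style chat_rounds could be concatenated into the default inference string shown above. The role markers follow the defaults documented in this README; end-of-turn and EOS handling are simplified and may differ from the real code.

```python
# Hypothetical sketch: concatenate chatML-style chat_rounds into an inference prompt.
# Role markers match the documented defaults; EOS handling is intentionally omitted.
ROLE_MARKERS = {"system": "<s>system\n", "human": "<s>human\n", "bot": "<s>bot\n"}

def build_prompt(chat_rounds, role_markers=ROLE_MARKERS):
    """Render chat rounds as one string, leaving the bot marker open for generation."""
    text = "".join(role_markers[turn["role"]] + turn["content"] + "\n" for turn in chat_rounds)
    return text + role_markers["bot"]

example_rounds = [
    {"role": "system", "content": "You are an expert in coding and help answer code questions"},
    {"role": "human", "content": "Write a python function of quick sort"},
]
print(build_prompt(example_rounds))
```
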
@@ -225,7 +222,7 @@ Frequently used arguments are provided in ```configs/***_train_config``` and exp
 
 - **pretrained_model_path**: Local/shared disk path or model name on HuggingFace for the pre-trained model.
 
-- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present; "self-paced" is supported but needs tuning of hyper-parameters.
+- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present; "self-paced" is supported but needs tuning of hyperparameters.
 
 - **padding_mode**: How tokenized data is arranged. "padding" means padding each sample to seq_length; "pack" means packing as many samples as possible into each seq_length window.
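
Editor's aside (not part of the commit): an illustrative sketch of the two padding_mode behaviors, assuming already-tokenized samples, a pad token id of 0, and no special handling of samples longer than seq_length. This is not the repository's code.

```python
# Illustration only: the difference between "padding" and "pack" tokenization modes.
def pad_mode(samples, seq_length, pad_id=0):
    """One sample per sequence, right-padded (and truncated) to seq_length."""
    return [(s + [pad_id] * seq_length)[:seq_length] for s in samples]

def pack_mode(samples, seq_length, pad_id=0):
    """Greedily pack consecutive samples into seq_length-long sequences."""
    sequences, current = [], []
    for s in samples:
        if current and len(current) + len(s) > seq_length:
            sequences.append(current + [pad_id] * (seq_length - len(current)))
            current = []
        current.extend(s[:seq_length])  # over-long samples are simply truncated here
    if current:
        sequences.append(current + [pad_id] * (seq_length - len(current)))
    return sequences

tokenized = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pad_mode(tokenized, 8))   # 3 sequences, mostly padding
print(pack_mode(tokenized, 8))  # 2 sequences, less padding
```
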

mftcoder_accelerate/README_cn.md

Lines changed: 15 additions & 24 deletions
@@ -30,36 +30,31 @@
 "chat_rounds":[
 {
 "role": "system",
-"content": "You are an intelligent code assistant and can answer users' code-related questions",
-"chat_round_id": 0
+"content": "You are an intelligent code assistant and can answer users' code-related questions"
 },
 {
 "role": "human",
-"content": "Write a quick sort",
-"chat_round_id": 1
+"content": "Write a quick sort"
 },
 {
 "role": "bot",
-"content": "Here is a quick sort algorithm xxxxxx",
-"chat_round_id": 1
+"content": "Here is a quick sort algorithm xxxxxx"
 },
 {
 "role": "human",
-"content": "Explain this code",
-"chat_round_id": 2
+"content": "Explain this code"
 },
 {
 "role": "bot",
-"content": "OK, this code xxx",
-"chat_round_id": 2
+"content": "OK, this code xxx"
 }
 ]
 }
 ```
 
 ### 2.2 Inference Data Format
 The inference data format is the concatenated string form of the training-format data as seen by the model; it is also how the input prompt is concatenated at inference time:
-```python
+```
 """
 <s>system
 This is the System instruction
@@ -148,31 +143,27 @@ The QLoRA paper points out that this method can fine-tune a 33B model on a single V100
 
 Run the following command to perform LoRA/QLoRA/full-parameter fine-tuning:
 #### Launch via DeepSpeed
-The deepspeed configuration is in accelerate_ds_config.yaml.
+The DeepSpeed configuration is in accelerate_ds_config.yaml.
 ```bash
-accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "deepspeed"
+accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "DeepSpeed"
 ```
 Or
 
-Modify and run the following shell script:
-
-The deepspeed configuration is passed in on the command line inside the script.
+The DeepSpeed configuration is passed in on the command line inside the script.
 ```bash
 sh ds_single_launch.sh
 ```
 
 #### Launch via FSDP
-The deepspeed configuration is in accelerate_ds_config.yaml.
+The FSDP configuration is in accelerate_fsdp_config.yaml.
 ```bash
-accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "fsdp"
+accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "FSDP"
 ```
 Or
 
-Modify and run the following shell script:
-
-The deepspeed configuration is passed in on the command line inside the script.
+The FSDP configuration is passed in on the command line inside the script.
 ```bash
-sh ds_single_launch.sh
+sh fsdp_single_launch.sh
 ```
 
 #### Training arguments
@@ -209,7 +200,7 @@ _**The arguments needed for training are configured in ```configs/*_train_config```; the main arguments
 - **warmup_steps**: Number of warm-up steps over which the learning rate grows to the specified value.
 - **seed**: Random seed, used to reproduce experiment results.
 - **saving_limit**: Integer, the maximum number of checkpoints to keep; it must be set for full-parameter training. The default null means no limit.
-- **role_markers**: null means using {"system": "\<s\>system\n", "user": "\<s\>human\n", "assistant": "\<s\>bot\n}. You can customize the "system", "user" and "assistant" markers to build your own Q&A or conversation template, e.g. {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}
+- **role_markers**: null means using {"system": "\<s\>system\n", "user": "\<s\>human\n", "assistant": "\<s\>bot\n"}. You can customize the "system", "user" and "assistant" markers to build your own Q&A or conversation template, e.g. {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}
 
 ## 4. Model Usage
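
Editor's aside (not part of the commit): a small hypothetical illustration of how a custom role_markers mapping, such as the Alpaca-style example above, changes the rendered prompt. The join logic reuses the simplified concatenation idea from the earlier sketch and may not match the actual template handling.

```python
# Hypothetical illustration: rendering a prompt with custom, Alpaca-style role markers.
alpaca_markers = {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}

rounds = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a quick sort."},
]
# Leave the assistant marker open so the model continues from it.
prompt = "".join(alpaca_markers[r["role"]] + r["content"] + "\n" for r in rounds) + alpaca_markers["assistant"]
print(prompt)
```
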

@@ -288,7 +279,7 @@ CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_con
 For LoRA/QLoRA, we recommend DeepSpeed as the underlying distributed framework; it is easy to use, has good compatibility, and is very fast.
 FSDP does not support QLoRA, because bitsandbytes does not support FSDP yet.
 
-For full-parameter fine-tuning, we recommend FSDP, because it can exploit the advantage of fully sharding during full-parameter training and has reached faster training speed
+For full-parameter fine-tuning, we recommend FSDP, because it can exploit the advantage of fully sharding during full-parameter training to achieve faster training speed
 
 #### Question 6: Among the currently supported models, what are the differences?
 Chinese-developed large models such as chatglm2, chatglm3, baichuan2, qwen, and aquila2 use the modeling_xxx.py released together with the model.
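
Editor's aside (not part of the commit): in practice, models that ship their own modeling_xxx.py are loaded with trust_remote_code=True so that transformers uses the model-provided code. A minimal sketch follows; the model name is illustrative, and dtype/device settings are omitted.

```python
# Minimal sketch: loading a model that ships its own modeling_xxx.py (e.g. chatglm2, qwen).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-7B"  # illustrative only
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
```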
