Commit 5ed9915

mftcoder_accelerate readme
1 parent 4d7aee6 commit 5ed9915

3 files changed

+137
-54
lines changed

mftcoder_accelerate/README.md

Lines changed: 67 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -8,11 +8,17 @@
88

99
## 1. Updates
1010

11-
🔥 MFTCoder supports QLoRA/LoRA using Huggingface accelerate + DeepSpeed Framework;
11+
🔥 MFTCoder-accelerate supports Full-parameters/LoRA using accelerate + FSDP Framework;
1212

13-
🔥 MFTCoder supports Multiple Task Finetuning, which is able to balance different tasks at the data level.
13+
🔥 MFTCoder-accelerate supports MFT/SFT on more new mainstream open-source base models: mistral, mixtral-8x7b(Mixture of Experts), deepseek, chatglm3;
1414

15-
🔥 MFTCoder supports finetuning multiple mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen.
15+
🔥 MFTCoder-accelerate supports Self-Paced Loss for Convergence Balance;
16+
17+
🔥 MFTCoder-accelerate supports Full-parameters/QLoRA/LoRA using accelerate + DeepSpeed Framework;
18+
19+
🔥 MFTCoder-accelerate supports Multiple Task Finetuning, which is able to balance different tasks at the data level.
20+
21+
🔥 MFTCoder-accelerate supports finetuning multiple mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen.
1622

1723
## 2. Data Format
1824
### 2.1 Training Data Format
@@ -54,8 +60,8 @@ For the keys of roles in "chat_rounds", you could use "system/human/bot" tuple o
5460
}
5561
```
5662

57-
### 2.2 Inference Data Format
58-
The inference data contains strings concatenated by conversation data(system, human and bot contents) in the training data format.
63+
### 2.2 Default Inference Data Format
64+
The default inference data format is a string concatenated from the conversation data (system, human and bot contents) of the training data format.
5965
It is the data "seen" by the model (before tokenization) during training.
6066
It is used as input during the inference process as well.
6167
Here is an example format of the concatenated string:
@@ -86,7 +92,7 @@ When applying inference, you always make your input string end with ```<s>bot\n`
8692

8793

8894
## 3. Model Training
89-
Currently, the "MFTCoder_accelerate" codebase supports QLoRA instruction fine-tuning, and LoRA instruction fine-tuning and Full parameter MFT.
95+
Currently, the "MFTCoder-accelerate" codebase supports Full-parameters/LoRA/QLoRA along with Multi-Task FineTuning(MFT).
9096
In theory, this project can be used to train any publicly available model in the HuggingFace Format.
9197

9298
Here are some excellent pre-trained models weights available on Huggingface that can be finetuned with this codebase:
@@ -97,6 +103,36 @@ Here are some excellent pre-trained models weights available on Huggingface that
97103

98104
🤗 [Multilingual powerhouse, Qwen-7b](https://huggingface.co/Qwen/Qwen-7B): Suitable for multilingual tasks, including Chinese tasks, for instruction fine-tuning.
99105

106+
**mftcoder_accelerate directory structure**
107+
```
108+
mftcoder_accelerate
109+
|
110+
src
111+
configs
112+
|
113+
data
114+
|
115+
model
116+
|
117+
*pefts*
118+
|
119+
tokenizer
120+
|
121+
utils
122+
|
123+
evals
124+
```
125+
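We have extracted the various components used in training to make later extension and optimization easier; see the implementations in the ```src``` directory.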
126+
127+
The entry file for training is ```mftcoder_accelerate/src/pefts/mft_accelerate.py```
128+
129+
Configurations are stored in the ```mftcoder_accelerate/src/configs``` directory for unified management and modification.
130+
131+
**_So, before you start training, please enter the src directory_**
132+
```
133+
cd mftcoder_accelerate/src
134+
```
135+
100136
You can find the implementations in the ```mftcoder_accelerate/src``` directory.
101137
The entry directory for fine-tuning training is ```mftcoder_accelerate/src```, and the entry file for training is ```mftcoder_accelerate/src/pefts/mft_accelerate.py```.
102138
Configurations are stored in the ```mftcoder_accelerate/src/configs``` directory for easy management and modification.
@@ -107,7 +143,9 @@ cd mftcoder_accelerate/src
107143
```
108144

109145
### 3.1 Tokenization
110-
During training, we concatenate multi-turn dialogues into the following format (also known as the inference data format mentioned earlier) and then tokenize it. In this format, ```<s>human\n``` starts the user's input (i.e., prompt),```<s>bot\n``` starts the assistant's output (i.e., response)
146+
During training, we concatenate multi-turn dialogues into the following format (also known as the inference data format mentioned before) and then tokenize it.
147+
148+
In the default format, ```<s>human\n``` starts the user's input (i.e., prompt), and ```<s>bot\n``` starts the assistant's output (i.e., response).
111149

112150
```{EOS_TOKEN}``` represents the proper eos_token.
113151
The eos_tokens that fit the different base models are defined in ```src/pefts/model_mapping.py```.
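The concatenation described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `byte_tokenize` is a hypothetical stand-in for a real tokenizer, `build_ids_and_mask` is an illustrative name, and the sketch assumes the EOS token is counted as part of each target. The loss-mask line mirrors the `loss_mask += [0] * len(prompt_ids) + [1] * len(answer_ids)` pattern in `preprocess_data.py`.

```python
from typing import Callable, List, Sequence, Tuple

HUMAN = "<s>human\n"  # default marker that starts the user's input
BOT = "<s>bot\n"      # default marker that starts the assistant's output


def byte_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def build_ids_and_mask(
    rounds: Sequence[Tuple[str, str]],
    eos_token: str,
    tokenize: Callable[[str], List[int]] = byte_tokenize,
) -> Tuple[List[int], List[int]]:
    """Concatenate (input, target) rounds and mask out the prompt parts."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for user_input, target in rounds:
        prompt_ids = tokenize(f"{HUMAN}{user_input}{BOT}")
        # Assumption: EOS belongs to the target, so the model learns to stop.
        answer_ids = tokenize(f"{target}{eos_token}")
        input_ids += prompt_ids + answer_ids
        # 0 = ignored by the loss (prompt), 1 = trained on (target).
        loss_mask += [0] * len(prompt_ids) + [1] * len(answer_ids)
    return input_ids, loss_mask


ids, mask = build_ids_and_mask([("hi", "yo"), ("more", "ok")], eos_token="</s>")
```

Every target from every round contributes to the loss in a single pass, while the prompt tokens only serve as context.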
@@ -122,28 +160,41 @@ By including all target parts from multiple turns in a single training iteration
122160

123161

124162
### 3.2 LoRA/QLoRA
163+
164+
#### Intro
125165
You can refer to the LoRA paper for details about LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
166+
126167
You can refer to the QLoRA paper for details about QLoRA: [QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
127168

128169
QLoRA (Quantized LoRA) is a method that combines 4-bit nf4 quantization and additional adapters to achieve a balance between reducing GPU memory consumption and approaching the performance of full-parameter fine-tuning.
129170

130171
According to the QLoRA paper, this method enables fine-tuning of a 33B model on a single V100 GPU while achieving performance close to that of full-parameter fine-tuning.
131172

132173
To perform LoRA/QLoRA fine-tuning, you can execute the following command:
133-
```bash
134-
cd mftcoder_accelerate/src
135174

136-
accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/lora_train_config.json
175+
#### Launch via Deepspeed
176+
The DeepSpeed config is in accelerate_ds_config.yaml.
177+
```bash
178+
accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "DeepSpeed"
137179
```
138-
OR
139-
140-
You can launch the training by:
180+
or
181+
The DeepSpeed config is passed as command-line arguments in the script:
141182
```bash
142-
cd mftcoder_accelerate/src
183+
sh ds_single_launch.sh
184+
```
143185

186+
#### Launch via FSDP
187+
The FSDP config is in accelerate_fsdp_config.yaml.
188+
```bash
189+
accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "FSDP"
190+
```
191+
or
192+
The FSDP config is passed as command-line arguments in the script:
193+
```bash
144194
sh ds_single_launch.sh
145195
```
146196

197+
#### Training Arguments
147198
All arguments allowed in ***_train_config.json are defined in ```arguments.py```.
148199

149200
Frequently used arguments are provided in ```configs/***_train_config``` and explained as follows. You can modify these parameters according to your needs:
@@ -210,6 +261,8 @@ Frequently used arguments are provided in ```configs/***_train_config``` and exp
210261

211262
- **saving_limit**: maximum number of checkpoints (ckpt) to keep; must be set in Full-parameter training.
212263

264+
- **role_markers**: {"system": "\<s\>system\n", "user": "\<s\>human\n", "assistant": "\<s\>bot\n"} is used by default (null). You can set your preferred role_markers as the templates that start the "system", "user" and "assistant" turns, e.g. {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}
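As a rough sketch of how `role_markers` could shape a single-turn inference prompt: the helper names `resolve_role_markers` and `build_inference_prompt` are hypothetical and not taken from the codebase. The sketch only assumes what the README states, namely that null falls back to the default markers and that an inference string ends with the assistant marker.

```python
from typing import Dict, List, Optional, Tuple

# Defaults as documented for role_markers = null.
DEFAULT_ROLE_MARKERS: Dict[str, str] = {
    "system": "<s>system\n",
    "user": "<s>human\n",
    "assistant": "<s>bot\n",
}


def resolve_role_markers(role_markers: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Return the configured markers, or the defaults when the config is null."""
    if role_markers is None:
        return dict(DEFAULT_ROLE_MARKERS)
    return dict(role_markers)


def build_inference_prompt(
    messages: List[Tuple[str, str]],
    role_markers: Optional[Dict[str, str]] = None,
) -> str:
    """Join (role, content) pairs and end with the assistant marker."""
    markers = resolve_role_markers(role_markers)
    body = "".join(markers[role] + content for role, content in messages)
    # The prompt ends with the assistant marker so the model starts its reply.
    return body + markers["assistant"]


custom = {
    "system": "### System:\n",
    "user": "### Instruction:\n",
    "assistant": "### Response:\n",
}
default_prompt = build_inference_prompt([("user", "Write hello world")])
custom_prompt = build_inference_prompt([("user", "Write hello world")], custom)
```

Only system and user messages are shown here; earlier assistant turns in a multi-turn prompt would also need the EOS token after each response.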
265+
213266

214267
## 4. Model Usage
215268

mftcoder_accelerate/README_cn.md

Lines changed: 68 additions & 38 deletions
Original file line number | Diff line number | Diff line change
@@ -7,12 +7,17 @@
77
[**中文**] [[English]](README.md)
88

99
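## 1. Updates
10+
🔥 MFTCoder-accelerate now supports the accelerate + FSDP framework, covering full-parameter fine-tuning and LoRA;
1011

11-
🔥 MFTCoder supports QLoRA/LoRA fine-tuning with the Huggingface accelerate + DeepSpeed framework;
12+
🔥 MFTCoder-accelerate supports more of the latest mainstream open-source models: mistral, mixtral-8x7b(Mixture of Experts), deepseek, chatglm3;
1213

13-
🔥 MFTCoder supports multi-task fine-tuning in training, balancing the training of multiple tasks at the same time; the trained model supports multi-task inference;
14+
🔥 MFTCoder-accelerate adds Self-Paced Loss for convergence balance;
1415

15-
🔥 MFTCoder supports multiple base models in training: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen, etc.
16+
🔥 MFTCoder-accelerate supports full-parameter/QLoRA/LoRA fine-tuning with the accelerate + DeepSpeed framework;
17+
18+
🔥 MFTCoder-accelerate supports Multi-Task Fine-Tuning (MFT) in training, balancing the training of multiple tasks at the same time; the trained model supports multi-task inference;
19+
20+
🔥 MFTCoder-accelerate supports multiple base models in training: codellama, llama2, llama, starcoder, codegeex2, chatglm2, qwen, etc.
1621

1722
## 2. Data Format
1823
### 2.1 Training Data Format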
@@ -87,8 +92,26 @@
8792

8893
🤗 [Multilingual powerhouse, Qwen-7b](https://huggingface.co/Qwen/Qwen-7B): suitable for multilingual tasks, including Chinese tasks, when doing instruction fine-tuning.
8994

90-
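We have extracted the various components used in training to make later extension and optimization easier; see the implementations in the src directory.
91-
The root directory for fine-tuning training is ```mftcoder_accelerate/src/```,
95+
**mftcoder_accelerate directory structure**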
96+
```
97+
mftcoder_accelerate
98+
|
99+
src
100+
configs
101+
|
102+
data
103+
|
104+
model
105+
|
106+
*pefts*
107+
|
108+
tokenizer
109+
|
110+
utils
111+
|
112+
evals
113+
```
114+
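We have extracted the various components used in training to make later extension and optimization easier; see the implementations in the ```src``` directory.
92115

93116
The entry file for training is ```mftcoder_accelerate/src/pefts/mft_accelerate.py```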
94117

@@ -99,9 +122,14 @@
99122
cd mftcoder_accelerate/src
100123
```
101124

125+
126+
102127
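### 3.1 Tokenization
103-
During training, we concatenate multi-turn dialogues into the following format (also the inference string format mentioned above) and then tokenize it. Here ```<s>human\n``` is the human input marker, ```<s>bot\n``` is the bot output marker, and ```{EOS_TOKEN}``` represents the eos_token.
104-
The eos_token can be changed to suit different models.
128+
During training, we concatenate multi-turn dialogues into the following format (also the inference data format mentioned above) and then tokenize it.
129+
By default:
130+
131+
```<s>human\n``` starts the human/user turn, ```<s>bot\n``` starts the bot/assistant turn, and ```{EOS_TOKEN}``` represents the eos_token.
132+
The eos_token can be changed to suit different models. The start markers of the different roles are configurable, so that different dialogue/Q&A templates can be implemented.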
105133
```
106134
"<s>human\n{input1}<s>bot\n{target1}{EOS_TOKEN}<s>human\n{input2}<s>bot\n{target2}{EOS_TOKEN}\n"
107135
```
@@ -147,39 +175,41 @@ The DeepSpeed config is passed via command-line arguments in the script.
147175
sh ds_single_launch.sh
148176
```
149177

178+
#### Training Arguments
150179
_**The arguments needed for training are configured in ```configs/*_train_config```; the main ones are explained below:**_
151180

152-
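- load_raw_dataset: must stay true; other data modes will be supported later, and currently only jsonl input is supported
153-
- data_paths: "[path1,path2,path3]" input data paths, given as a string wrapped in [] with paths separated by ```,```; each path is a directory whose last-level name is used as the task name and which contains one or more jsonl files
154-
- output_dir: training output directory, storing checkpoints (full-parameter training), lora_adaptor (LoRA or QLoRA), etc.
155-
- tb_dir: stores tensorboard logs, etc.
156-
- model_type: "mixtral|mistral|deepseek|llama|starcoder|chatglm2|qwen|gpt_neox"
157-
- attn_implementation: "flash_attention_2" or "eager"
158-
- peft_type: lora, qlora, or null (full-parameter fine-tuning)
159-
- lora_rank: lora rank
160-
- lora_alpha: lora alpha
161-
- lora_dropout: lora dropout
162-
- target_modules: List[str], the LoRA target modules; if null, the defaults are used, see model_mapping.py
163-
- quantization: whether to quantize: "4bit", "8bit", or null; 4bit quantization is recommended for qlora
164-
- pretrained_model_path: local directory of the pre-trained model, or the model name on huggingface.
165-
- weighted_loss_mode: loss weighting mode for multi-task training; "case3" is currently recommended.
166-
- padding_mode: how samples are organized: "padding" pads each original sample to seq_length, "pack" packs as many samples as possible into each sequence of length seq_length.
167-
- num_train_epochs: number of training epochs. If the dataset is large enough, 1-2 epochs are generally recommended.
168-
- per_device_train_batch_size: training batch size per GPU.
169-
- per_device_eval_batch_size: evaluation batch size per GPU.
170-
- gradient_accumulation_steps: number of gradient accumulation steps. global batch=num_gpus * per_device_train_batch_size * gradient_accumulation_steps.
171-
- learning_rate: learning rate. For full-parameter fine-tuning a smaller value is recommended, 1e-5 or 5e-6. For qlora a larger learning rate is used, typically 1e-4 or 2e-4.
172-
- min_lr: minimum learning rate, usually one tenth of learning_rate
173-
- seq_length: maximum sequence length during training. Set it according to your hardware; longer sequences need more GPU memory.
174-
- log_interval: number of steps between train loss reports.
175-
- checkpointing_steps: number of steps between model saves.
176-
- evalation_steps: number of steps between evaluations on the validation set.
177-
- early_stopping: whether to perform early stopping
178-
- early_stopping_stall_num: stop training if there is no further convergence for this many eval points
179-
- lr_scheduler_type: learning rate schedule. "cosine" is commonly used
180-
- warmup_steps: number of warm-up steps over which the learning rate rises to the specified value.
181-
- seed: random seed, for reproducing experiment results.
182-
- saving_limit: integer, maximum number of checkpoints to keep; must be set for full-parameter training. The default null means no limit.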
181+
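- **load_raw_dataset**: must stay true; other data modes will be supported later, and currently only jsonl input is supported
182+
- **data_paths**: "[path1,path2,path3]" input data paths, given as a string wrapped in [] with paths separated by ```,```; each path is a directory whose last-level name is used as the task name and which contains one or more jsonl files
183+
- **output_dir**: training output directory, storing checkpoints (full-parameter training), lora_adaptor (LoRA or QLoRA), etc.
184+
- **tb_dir**: stores tensorboard logs, etc.
185+
- **model_type**: "mixtral|mistral|deepseek|llama|starcoder|chatglm2|qwen|gpt_neox"
186+
- **attn_implementation**: "flash_attention_2" or "eager"
187+
- **peft_type**: lora, qlora, or null (full-parameter fine-tuning)
188+
- **lora_rank**: lora rank
189+
- **lora_alpha**: lora alpha
190+
- **lora_dropout**: lora dropout
191+
- **target_modules**: List[str], the LoRA target modules; if null, the defaults are used, see model_mapping.py
192+
- **quantization**: whether to quantize: "4bit", "8bit", or null; 4bit quantization is recommended for qlora
193+
- **pretrained_model_path**: local directory of the pre-trained model, or the model name on huggingface.
194+
- **weighted_loss_mode**: loss weighting mode for multi-task training; "case3" is currently recommended.
195+
- **padding_mode**: how samples are organized: "padding" pads each original sample to seq_length, "pack" packs as many samples as possible into each sequence of length seq_length.
196+
- **num_train_epochs**: number of training epochs. If the dataset is large enough, 1-2 epochs are generally recommended.
197+
- **per_device_train_batch_size**: training batch size per GPU.
198+
- **per_device_eval_batch_size**: evaluation batch size per GPU.
199+
- **gradient_accumulation_steps**: number of gradient accumulation steps. global batch=num_gpus * per_device_train_batch_size * gradient_accumulation_steps.
200+
- **learning_rate**: learning rate. For full-parameter fine-tuning a smaller value is recommended, 1e-5 or 5e-6. For qlora a larger learning rate is used, typically 1e-4 or 2e-4.
201+
- **min_lr**: minimum learning rate, usually one tenth of learning_rate
202+
- **seq_length**: maximum sequence length during training. Set it according to your hardware; longer sequences need more GPU memory.
203+
- **log_interval**: number of steps between train loss reports.
204+
- **checkpointing_steps**: number of steps between model saves.
205+
- **evaluation_steps**: number of steps between evaluations on the validation set.
206+
- **early_stopping**: whether to perform early stopping
207+
- **early_stopping_stall_num**: stop training if there is no further convergence for this many eval points
208+
- **lr_scheduler_type**: learning rate schedule. "cosine" is commonly used
209+
- **warmup_steps**: number of warm-up steps over which the learning rate rises to the specified value.
210+
- **seed**: random seed, for reproducing experiment results.
211+
- **saving_limit**: integer, maximum number of checkpoints to keep; must be set for full-parameter training. The default null means no limit.
212+
- **role_markers**: null, which means {"system": "\<s\>system\n", "user": "\<s\>human\n", "assistant": "\<s\>bot\n"} is used. You can customize the "system", "user" and "assistant" templates to build your own Q&A or dialogue template, e.g. {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}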
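The batch-size and learning-rate relations in the list above can be checked with a few lines of arithmetic. The function names are illustrative only, and the "one tenth" rule for `min_lr` is the rule of thumb stated in the list, not a value enforced by the code as far as this diff shows.

```python
def global_batch_size(
    num_gpus: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
) -> int:
    """global batch = num_gpus * per_device_train_batch_size * gradient_accumulation_steps."""
    return num_gpus * per_device_train_batch_size * gradient_accumulation_steps


def suggested_min_lr(learning_rate: float) -> float:
    """Rule of thumb from the argument list: min_lr is about learning_rate / 10."""
    return learning_rate / 10


# Example: 8 GPUs, 2 samples per GPU, 4 accumulation steps.
batch = global_batch_size(8, 2, 4)
# Example: a typical qlora learning rate of 2e-4.
min_lr = suggested_min_lr(2e-4)
```

With these example values each optimizer step sees 64 samples, so halving the per-device batch size while doubling `gradient_accumulation_steps` keeps the global batch unchanged.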
183213

184214
## 4. Model Usage
185215

mftcoder_accelerate/src/data/preprocess_data.py

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -94,6 +94,7 @@ def is_question_response_format(data):
9494
    else:
9595
        return False
9696

97+
9798
def is_question_answer_format(data):
9899
    if "question" in data and "answer" in data:
99100
        return True
@@ -131,10 +132,10 @@ def encode(self, text):
131132

132133

133134
class UniformEncoder(Encoder):
134-
135135

136136
    def __init__(self, args, mode='sft'):
137137
        super().__init__(args)
138+
        self.verbose = False
138139
        self.mode = mode
139140
        # seq_length + 1 for shifting
140141
        if args.load_raw_dataset:
@@ -268,7 +269,6 @@ def _tokenize_fields(self, data, data_type):
268269
input_ids += prompt_ids + answer_ids
269270
loss_mask += [0] * len(prompt_ids) + [1] * len(answer_ids)
270271

271-
272272
# print(self.mode)
273273
if self.mode == 'pretrain':
274274
# change loss mask to all 1s
