
Commit f9669ce

readme
1 parent edd150b commit f9669ce

File tree

3 files changed: +78 -39 lines changed


mftcoder_accelerate/README.md

Lines changed: 14 additions & 12 deletions
@@ -1,4 +1,4 @@
-# MFTCoder Training: Huggingface accelerate + DeepSpeed Framework
+# MFTCoder-accelerate: Training Framework with accelerate and deepspeed
 [![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai)
 <a href="https://github.com/codefuse-ai/MFTCoder/blob/main/LICENSE">
     <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
@@ -64,24 +64,24 @@ Here is an example format of the concatenated string:
 """
 <s>system
 System instruction
-<s>user
+<s>human
 User 1st round input
-<s>assistant
+<s>bot
 Assistant 1st round output{EOS_TOKEN}
-<s>user
+<s>human
 User 2nd round input
-<s>assistant
+<s>bot
 Assistant 2nd round output{EOS_TOKEN}
 ...
 ...
 ...
-<s>user
+<s>human
 User nth round input
-<s>assistant
+<s>bot
 {Assistant output to be generated}{EOS_TOKEN}
 """
 ```
-When applying inference, you always make your input string end with ```<s>assistant\n``` to request the model generating answers.
+When applying inference, always make your input string end with ```<s>bot\n``` to prompt the model to generate an answer.

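To make the format concrete, here is a minimal sketch (not part of this commit) of how a multi-turn prompt could be assembled; the helper name, the message schema, and the `</s>` default are illustrative assumptions:

```python
# Illustrative helper, not project API: assemble the chat format shown above.
EOS_TOKEN = "</s>"  # assumed default; the real eos_token depends on the base model

def build_prompt(turns, system=None):
    """turns: list of (user_text, bot_text) pairs; the last bot_text may be None."""
    prompt = f"<s>system\n{system}\n" if system else ""
    for user_text, bot_text in turns:
        prompt += f"<s>human\n{user_text}"
        if bot_text is not None:
            prompt += f"<s>bot\n{bot_text}{EOS_TOKEN}\n"
    # End with "<s>bot\n" so the model generates the next assistant turn.
    return prompt + "<s>bot\n"

print(build_prompt([("write a python function of quick sort.", None)]))
```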
@@ -107,14 +107,14 @@ cd mftcoder_accelerate/src
 ```

 ### 3.1 Tokenization
-During training, we concatenate multi-turn dialogues into the following format (also known as the inference data format mentioned earlier) and then tokenize it. In this format, ```<s>user\n``` starts the user's input (i.e., prompt),```<s>assistant\n``` starts the assistant's output (i.e., response)
+During training, we concatenate multi-turn dialogues into the following format (the same as the inference data format mentioned earlier) and then tokenize it. In this format, ```<s>human\n``` starts the user's input (i.e., prompt) and ```<s>bot\n``` starts the assistant's output (i.e., response).

 ```{EOS_TOKEN}``` represents the proper eos_token.
 We have different eos_tokens in ```src/pefts/model_mapping.py``` which fit different base models.

 Here is a concrete example of the training data after formatting:
 ```
-f"<s>user\n{input1}<s>assistant\n{target1}{EOS_TOKEN}\n<s>user\n{input2}<s>assistant\ntarget2{EOS_TOKEN}\n"
+f"<s>human\n{input1}<s>bot\n{target1}{EOS_TOKEN}\n<s>human\n{input2}<s>bot\n{target2}{EOS_TOKEN}\n"
 ```
 During the calculation of loss, we use a ```loss mask``` to ensure that the loss from the input part does not contribute to parameter updates. Only the loss from the ```target{EOS_TOKEN}``` part is used for updating parameters.
 This approach takes full advantage of the benefits of model parallelism, making training more efficient. It also leverages the characteristic of decoder-only models with left-to-right attention.
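
As a rough sketch of the loss-mask idea (illustrative only; the project's actual implementation lives in ```src/data/preprocess_data.py```), the mask can be built alongside the token ids:

```python
# Illustrative sketch of loss masking over a concatenated dialogue.
# `tokenizer` is assumed to be a loaded HF tokenizer; eos_token varies per model.
def build_ids_and_loss_mask(turns, tokenizer, eos_token="</s>"):
    input_ids, loss_mask = [], []
    for user_text, target_text in turns:
        prompt_ids = tokenizer.encode(f"<s>human\n{user_text}<s>bot\n", add_special_tokens=False)
        target_ids = tokenizer.encode(f"{target_text}{eos_token}\n", add_special_tokens=False)
        input_ids += prompt_ids + target_ids
        # 0 = masked out (prompt tokens), 1 = contributes to the loss (target + EOS)
        loss_mask += [0] * len(prompt_ids) + [1] * len(target_ids)
    return input_ids, loss_mask
```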
@@ -149,6 +149,8 @@ Frequently used arguments are provided in ```configs/***_train_config``` and exp

 - **model_type**: Type of the model to train, e.g., "mixtral | llama | starcoder | chatglm2 | qwen | gpt_neox".

+- **attn_implementation**: "flash_attention_2", "eager", or "sdpa"; takes effect when the model is officially supported by transformers.
+
 - **peft_type**: either "lora" or "qlora".

 - **lora_rank**: Rank value for Lora.
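
For a concrete picture, here is a hedged sketch of setting a few of these documented fields in a train config; the file names are placeholders, and any field not shown keeps its existing value:

```python
import json

# Load an existing config, override a few documented fields, save a copy.
with open("configs/xxx_train_config.json") as f:
    cfg = json.load(f)

cfg["model_type"] = "llama"            # one of the supported types above
cfg["attn_implementation"] = "eager"   # "flash_attention_2" | "eager" | "sdpa"
cfg["peft_type"] = "qlora"
cfg["lora_rank"] = 64                  # illustrative value

with open("configs/my_train_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```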
@@ -226,8 +228,8 @@ tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("<|end▁of▁sentence|>")
 tokenizer.pad_token_id = tokenizer.eos_token_id
 model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)

-HUMAN_ROLE_START_TAG = "<s>user\n"
-BOT_ROLE_START_TAG = "<s>assistant\n"
+HUMAN_ROLE_START_TAG = "<s>human\n"
+BOT_ROLE_START_TAG = "<s>bot\n"
 texts = ["write a python function of quick sort."]
 texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts]

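The hunk ends mid-snippet; a hedged continuation (generation settings are illustrative, not taken from this commit) would run generation like this:

```python
# Illustrative continuation: tokenize the prompts and generate answers.
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,                    # illustrative budget
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
# Strip the prompt tokens and decode only the newly generated part.
gen_text = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(gen_text[0])
```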
mftcoder_accelerate/README_cn.md

Lines changed: 62 additions & 25 deletions
@@ -1,4 +1,4 @@
-# MFTCoder Training: Huggingface accelerate + DeepSpeed Framework
+# MFTCoder: Accelerate + DeepSpeed Framework
 [![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai)
 <a href="https://github.com/codefuse-ai/MFTCoder/blob/main/LICENSE">
     <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
@@ -56,16 +56,23 @@
 The inference data format is the concatenated string form of the training data format; it is also how the input prompt is concatenated at inference time:
 ```python
 """
-<|role_start|>system<|role_end|>This is the system instruction
-<|role_start|>human<|role_end|>This is the user's question in round 1
-<|role_start|>bot<|role_end|>This is the model's output in round 1</s>
-<|role_start|>human<|role_end|>This is the user's question in round 2
-<|role_start|>bot<|role_end|>This is the model's output in round 2</s>
+<s>system
+This is the system instruction
+<s>human
+This is the user's question in round 1
+<s>bot
+This is the model's output in round 1{EOS_TOKEN}
+<s>human
+This is the user's question in round 2
+<s>bot
+This is the model's output in round 2{EOS_TOKEN}
 ...
 ...
 ...
-<|role_start|>human<|role_end|>This is the user's question in round n
-<|role_start|>bot<|role_end|>{What the model should generate now}</s>
+<s>human
+This is the user's question in round n
+<s>bot
+{What the model should generate now}{EOS_TOKEN}
 """
 ```

@@ -80,15 +87,25 @@

 🤗 [Multilingual all-rounder Qwen-7b](https://huggingface.co/Qwen/Qwen-7B): suited to multilingual tasks, including Chinese, as a base for instruction fine-tuning.

-We have factored out the components used in training so they can be extended and optimized later; see the implementation under the src directory. The entry directory for fine-tuning is ```src/pefts```, the training entry file is ```src/pefts/mft_accelerate.py```, and parameter configurations are stored under ```src/pefts/configs``` for unified management and modification.
+We have factored out the components used in training so they can be extended and optimized later; see the implementation under the src directory.
+The root directory for fine-tuning is ```mftcoder_accelerate/src/```,
+
+the training entry file is ```mftcoder_accelerate/src/pefts/mft_accelerate.py```,
+
+and parameter configurations are stored under ```mftcoder_accelerate/src/configs``` for unified management and modification.
+
+**_So, before you start training, please cd into the src directory_**
+```
+cd mftcoder_accelerate/src
+```

 ### 3.1 Data Tokenization
-During training, we concatenate multi-turn dialogues into the following format (the inference string format above) and then tokenize it, where <|role_start|>human<|role_end|> marks the human input, <|role_start|>bot<|role_end|> marks the bot output, and `````</s>````` is the eos_token.
+During training, we concatenate multi-turn dialogues into the following format (the inference string format above) and then tokenize it, where ```<s>human\n``` marks the human input, ```<s>bot\n``` marks the bot output, and ```{EOS_TOKEN}``` is the eos_token.
 The eos_token can be replaced to match different base models.
 ```
-"<|role_start|>human<|role_end|>input1</s>target1</s>input2</s>target2</s>...
+"<s>human\n{input1}<s>bot\n{target1}{EOS_TOKEN}<s>human\n{input2}<s>bot\n{target2}{EOS_TOKEN}\n"
 ```
-When computing the loss, we use a loss mask: the loss on the input part does not contribute to parameter updates; only the loss on the "target</s>" part updates the parameters.
+When computing the loss, we use a loss mask: the loss on the input part does not contribute to parameter updates; only the loss on the "target{EOS_TOKEN}" part updates the parameters.
 This makes full use of parallel computation, so training is more efficient; it also exploits the left-to-right attention of decoder-only models so that every target in a multi-turn dialogue is trained in one pass, which is both thorough and efficient.

 ### 3.2 LoRA/QLoRA Fine-tuning
@@ -101,7 +118,7 @@ The QLoRA paper notes that this method can fine-tune a 33B model on a single V100

 Run the following command to start LoRA/QLoRA fine-tuning:
 ```bash
-accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --train_config configs/xxx_train_config.json
+accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json
 ```

 The main parameters in ```configs/*_train_config``` are described below; these may be adjusted as needed, and we recommend leaving the others unchanged:
@@ -110,14 +127,16 @@ accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --tr
 - output_dir: training output directory, which stores checkpoints, the lora_adaptor, etc.
 - tb_dir: directory for tensorboard logs, etc.
 - model_type: "llama|starcoder|chatglm2|qwen|gpt_neox"
+- attn_implementation: "flash_attention_2" or "eager"
 - peft_type: lora or qlora
 - lora_rank: lora rank
 - lora_alpha: lora alpha
 - lora_dropout: lora dropout
+- target_modules: List[str], LoRA target modules; if null, the defaults are used, see model_mapping.py
 - quantization: whether to quantize, "4bit", "8bit", or null; 4bit is recommended for qlora
 - pretrained_model_path: local directory of the pretrained model, or its model name on huggingface.
-- **weighted_loss_mode**: multitask loss weighting mode; "case3" is currently recommended.
-- **padding_mode**: how samples are organized; "padding" pads each original sample to seq_length, "pack" packs as many samples as possible into each seq_length sequence.
+- weighted_loss_mode: multitask loss weighting mode; "case3" is currently recommended.
+- padding_mode: how samples are organized; "padding" pads each original sample to seq_length, "pack" packs as many samples as possible into each seq_length sequence.
 - num_train_epochs: number of training epochs. If the dataset is large enough, training only 1-2 epochs is generally recommended.
 - per_device_train_batch_size: training batch size per GPU.
 - per_device_eval_batch_size: evaluation batch size per GPU.
@@ -133,11 +152,20 @@ accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --tr
 - lr_scheduler_type: learning-rate schedule; "cosine" is commonly used.
 - warmup_steps: number of warm-up steps over which the learning rate grows to the specified value.
 - seed: random seed, for reproducing experimental results.
+- saving_limit: integer, upper limit on the number of stored checkpoints; must be set for full-parameter training. The default null means no limit.

 ## 4. Model Usage

 ### 4.1 Merging Weights
-If LoRA or QLoRA was used for training, this project saves only the adapter weights and configuration files, and the adapter weights need to be merged with the base model. See the script ```src/pefts/merge_base_and_lora_to_hf.py```
+If LoRA or QLoRA was used for training, this project saves only the adapter weights and configuration files, and the adapter weights need to be merged with the base model.
+You can use the merge_base_and_lora_to_hf.py script as follows.
+```
+python pefts/merge_base_and_lora_to_hf.py \
+    --base_model_or_path model_path \
+    --adaptor_path lora_adapter_path \
+    --model_type model_type \
+    --merged_output_path output_path
+```

 ### 4.2 Model Inference
 We provide the following script for single-turn and multi-turn dialogue; it is compatible with most models in huggingface format.
@@ -146,14 +174,14 @@ from transformers import (
     AutoTokenizer,
     AutoModelForCausalLM,
 )
-tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path, trust_remote_code=True, use_fast=False, legacy=False)
-tokenizer.padding_side = "left"
-tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
-tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")
-model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, trust_remote_code=True)
-
-HUMAN_ROLE_START_TAG = "<|role_start|>human<|role_end|>"
-BOT_ROLE_START_TAG = "<|role_start|>bot<|role_end|>"
+model_name_or_path = "codefuse-ai/CodeFuse-Deepseek-33B"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, padding_side="left")
+tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("<|end▁of▁sentence|>")
+tokenizer.pad_token_id = tokenizer.eos_token_id
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
+
+HUMAN_ROLE_START_TAG = "<s>human\n"
+BOT_ROLE_START_TAG = "<s>bot\n"
 texts = ["write a python function of quick sort."]
 texts = [f"{HUMAN_ROLE_START_TAG}{text}{BOT_ROLE_START_TAG}" for text in texts]

@@ -187,9 +215,18 @@ print(gen_text)
 #### Q3: How do I specify particular GPUs for training?
 You can specify training on GPUs 0 and 1 as follows:
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --train_config configs/xxx_train_config.json
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_config.yaml mft_accelerate.py --train_config configs/xxx_train_config.json
 ```

+#### Q4: How can I train if flash attention 2 cannot be installed?
+Setting the "attn_implementation" parameter to "eager" uses naive attention.
+
+If you can set up the environment yourself with torch>=2.1.1, you can also try setting "attn_implementation" to "sdpa". This attempts to use transformers' integration of torch.nn.functional.scaled_dot_product_attention; not all models are supported.
+
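As a hedged illustration of what this setting corresponds to in transformers (the model path is a placeholder, and the ```attn_implementation``` argument requires a reasonably recent transformers release):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path; any causal LM officially supported by transformers.
model = AutoModelForCausalLM.from_pretrained(
    "your/base-model",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # or "sdpa" (torch>=2.1.1) / "flash_attention_2"
)
```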
+#### Q5: What are the differences among the currently supported models?
+Chinese models such as chatglm2, chatglm3, baichuan2, qwen, and aquila2 use the modeling_xxx.py released together with the model.
+Other models officially supported by transformers have been upgraded upstream to support flash attention and the like, so training switches entirely to the official modeling; the previous custom modeling will be deprecated.

mftcoder_accelerate/src/data/preprocess_data.py

Lines changed: 2 additions & 2 deletions
@@ -203,8 +203,8 @@ def _tokenize_fields(self, data, data_type):
                 assistant_marker = self.args.role_markers["assistant"]
             else:
                 system_marker = '<s>system\n'
-                user_marker = '<s>user\n'
-                assistant_marker = '<s>assistant\n'
+                user_marker = '<s>human\n'
+                assistant_marker = '<s>bot\n'
         elif self.mode == 'pretrain':
             system_marker = ''
             user_marker = ''
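
For context, the ```role_markers``` override read above would be a mapping like the following (a hypothetical value, inferred from the keys this function accesses):

```python
# Hypothetical custom markers; the keys mirror those read in _tokenize_fields.
# When args.role_markers is null/None, the defaults shown above are used.
role_markers = {
    "system": "<s>system\n",
    "user": "<s>human\n",
    "assistant": "<s>bot\n",
}
```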
