Commit 4d7aee6

bugfix, fsdp for lora
1 parent 78298c6 commit 4d7aee6

9 files changed, +74 -34 lines changed


mftcoder_accelerate/README.md

Lines changed: 8 additions & 2 deletions
@@ -280,10 +280,16 @@ However, this may slightly slow down the training speed.
 
 #### Q2: Install packages
 Please refer to init_env.sh and requirements.txt
-
+We highly recommend that you install Flash Attention 2 (flash_attn>=2.1.0; we used 2.3.6) first to get memory-efficient and fast training.
 
 #### Q3: How should I specify the GPUs for training?
 You can specify the visible GPUs as below:
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file accelerate_ds_config.yaml mft_accelerate.py --train_config configs/xxx_train_config.json
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json
 ```
+
+#### Q4: What is the recommended distributed training setup?
+For LoRA/QLoRA, we recommend DeepSpeed (ZeRO-2) as the underlying framework, because it is easy and stable to use and, moreover, more compatible across different settings.
+FSDP does not support quantization (integer dtypes in training).
+
+For full-parameter fine-tuning, FSDP is usually faster and may help you with very large models by sharding parameters and gradients.
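The new Q4 recommends DeepSpeed ZeRO-2 for LoRA/QLoRA and FSDP for full-parameter training, since FSDP cannot handle int-quantized weights. A minimal launcher-side guard could make the incompatible combination fail fast; this is only a sketch, and the `quantization` key is an assumed config field, not necessarily the repo's exact name:

```python
from accelerate import Accelerator
from accelerate.utils import DistributedType


def check_distributed_setup(train_config: dict) -> Accelerator:
    """Fail fast when the launch mode conflicts with the training mode (sketch)."""
    accelerator = Accelerator()
    quantization = train_config.get("quantization")  # assumed key, e.g. "4bit" for QLoRA
    if accelerator.distributed_type == DistributedType.FSDP and quantization in ("4bit", "8bit"):
        # Mirrors Q4 above: FSDP cannot train int-quantized (QLoRA) models.
        raise ValueError(
            "QLoRA (4/8-bit) training is not supported under FSDP; "
            "launch with DeepSpeed ZeRO-2 instead."
        )
    return accelerator
```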

mftcoder_accelerate/README_cn.md

Lines changed: 30 additions & 10 deletions
@@ -78,7 +78,7 @@
 
 
 ## 3. Model Training
-Currently, full-parameter instruction fine-tuning, QLoRA instruction fine-tuning, and LoRA instruction fine-tuning are supported.
+Currently, full-parameter (Full-parameters) instruction fine-tuning, QLoRA instruction fine-tuning, and LoRA instruction fine-tuning are supported.
 Some excellent pre-trained code model weights are listed below; in principle, any model open-sourced on HuggingFace can be trained with this project:
 
 🤗 [Latest code pre-training SOTA, CodeLlama](https://huggingface.co/codellama/CodeLlama-34b-Python-hf): code-llama-34b, code-llama-34b-python, the new SOTA base model.
@@ -109,18 +109,34 @@ cd mftcoder_accelerate/src
 This approach makes full use of the model's parallel computation, so training is more efficient; it also exploits the left-to-right attention of decoder-only models, letting every target segment of a multi-turn conversation participate in training in a single pass, which makes training more thorough and efficient.
 
 ### 3.2 LoRA/QLoRA Fine-tuning
+
+#### Introduction to LoRA/QLoRA Fine-tuning
 For a detailed introduction to LoRA, see the paper: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
 
 For a detailed introduction to QLoRA, see the paper: [QLORA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
 
 QLoRA uses 4-bit NF4 quantization and adds more adapters, greatly reducing GPU memory consumption while approaching the quality of full-parameter fine-tuning.
 The QLoRA paper reports that the method can fine-tune a 33B model on a single V100 with performance close to full-parameter fine-tuning.
 
-Run the following command to start LoRA/QLoRA/full-parameter fine-tuning:
+Run the following command to start LoRA/QLoRA/full-parameter fine-tuning:
+#### Launch via DeepSpeed
+The DeepSpeed configuration is in accelerate_ds_config.yaml.
+```bash
+accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "deepspeed"
+```
+Or
+
+Modify and run the following shell script:
 
+The DeepSpeed configuration is passed via the command line inside the script.
+```bash
+sh ds_single_launch.sh
+```
+
+#### Launch via FSDP
 The FSDP configuration is in accelerate_fsdp_config.yaml.
 ```bash
-accelerate launch --config_file accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json
+accelerate launch --config_file accelerate_fsdp_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "fsdp"
 ```
 Or
 
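The QLoRA description above (4-bit NF4 quantization plus adapters) corresponds to the standard transformers + bitsandbytes setup. A minimal sketch of such a load, not the repo's exact code; the model name is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute, as described in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-Python-hf",  # placeholder; any causal LM on HuggingFace
    quantization_config=bnb_config,
)
```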
@@ -226,19 +242,23 @@ print(gen_text)
 #### Q3: How do I train on specific GPUs?
 You can restrict training to GPUs 0 and 1 as follows:
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_config.yaml mft_accelerate.py --train_config configs/xxx_train_config.json
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_config.yaml pefts/mft_accelerate.py --train_config configs/xxx_train_config.json --distributed_type "deepspeed"
 ```
 
-#### Q4: How can I train if Flash Attention 2 cannot be installed?
-Setting the "attn_implementation" parameter to "eager" uses naive attention.
+#### Q4: How should Flash Attention be configured for training?
+First, we strongly recommend installing Flash Attention 2 (FA2) (>=2.1.0; 2.3.6 is more feature-complete).
+
+Setting the training parameter "attn_implementation" to "eager" uses naive attention, i.e. attention without acceleration.
 
-If you can set up your own environment and use torch>=2.1.1, you can try setting "attn_implementation" to "sdpa". This will try to use the transformers-compatible torch.nn.functional.scaled_dot_product_attention. Model coverage is incomplete.
+Setting the training parameter "attn_implementation" to "flash_attention_2" uses FA2, which is fast and saves GPU memory.
 
-#### Q5: In FSDP mode, what should I pay attention to when using LoRA + Flash Attention?
-In FSDP mode, because dtypes must be unified, FA requires adding query, key, and value to target_modules together; accommodating this does not affect the final result.
+If you can set up your own environment and use torch>=2.1.1, you can try setting "attn_implementation" to "sdpa". This will try to use the transformers-compatible torch.nn.functional.scaled_dot_product_attention. Model coverage is still incomplete.
 
-In FSDP mode, QLoRA is not supported, because support for int dtypes is not yet complete.
+#### Q5: Which distributed framework is recommended?
+For LoRA/QLoRA, we recommend DeepSpeed as the underlying distributed framework: it is easy to use, compatible with many settings, and fast.
+FSDP does not support QLoRA, because bitsandbytes does not yet support FSDP.
 
+For full-parameter fine-tuning, we recommend FSDP, because it can exploit full sharding of parameters during full training and achieve higher training speed.
 
 #### Q6: What are the differences among the currently supported models?
 Chinese models such as chatglm2, chatglm3, baichuan2, qwen, and aquila2 use the modeling_xxx.py released together with the model.
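The rewritten Q4 lists three attention backends; in transformers >= 4.36 they correspond to the `attn_implementation` argument of `from_pretrained`. A minimal sketch with a placeholder model name:

```python
import torch
from transformers import AutoModelForCausalLM

# "flash_attention_2" requires flash_attn >= 2.1.0; "sdpa" requires torch >= 2.1.1;
# "eager" is the plain, unaccelerated attention implementation.
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-Python-hf",  # placeholder
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```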

mftcoder_accelerate/src/ds_single_launch.sh

Lines changed: 6 additions & 1 deletion
@@ -1,3 +1,8 @@
+#!/bin/sh
+# Author: Chaoyu Chen
+# Last Modified: 2024/12/11
+# Description: An alternative (command line) way to launch DeepSpeed training
+
 # Launch script on single node
 N_GPU_PER_NODE=8
 
@@ -26,5 +31,5 @@ accelerate launch \
     --machine_rank 0 \
     --rdzv_backend 'static' \
     pefts/mft_accelerate.py --train_config configs/"xxx_train_config.json" \
-    --distributed_type "DeepSpeed" \
+    --distributed_type "deepspeed" \
     > MFTCoder-training-"$TODAY".log 2>&1 &

mftcoder_accelerate/src/fsdp_single_launch.sh

Lines changed: 9 additions & 3 deletions
@@ -1,3 +1,8 @@
+#!/bin/sh
+# Author: Chaoyu Chen
+# Last Modified: 2024/12/11
+# Description: An alternative (command line) way to launch FSDP training
+
 # Launch script on single node
 N_GPU_PER_NODE=8
 
@@ -7,10 +12,11 @@ export TOKENIZERS_PARALLELISM=False
 
 TODAY=$(date +%Y-%m%d-%H%M)
 
-ccelerate launch \
+# accelerate launch --config_file accelerate_fsdp_config.yaml \
+accelerate launch \
     --use_fsdp \
     --num_machines=1 \
-    --num_processes=2 \
+    --num_processes=$N_GPU_PER_NODE \
     --fsdp_sharding_strategy=1 \
     --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
     --fsdp_state_dict_type=FULL_STATE_DICT \
@@ -24,6 +30,6 @@ ccelerate launch \
     --machine_rank=0 \
     --rdzv_backend=static \
     pefts/mft_accelerate.py --train_config configs/"xxx_train_config.json" \
-    --distributed_type "FSDP" \
+    --distributed_type "fsdp" \
     > MFTCoder-training-"$TODAY".log 2>&1 &
 

mftcoder_accelerate/src/pefts/merge_base_and_lora_to_hf.py

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
 """
 # @author Chaoyu Chen
 # @date 2023/10/19
-Merge base and adaptor
+
+Merge base and lora adaptor
 """
 import os
 import sys
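The updated docstring describes merging a LoRA adapter back into its base model. With peft this is typically done via `merge_and_unload`; the sketch below uses placeholder paths and is not necessarily the exact logic of this script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base_model", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

merged = model.merge_and_unload()  # fold the LoRA weights into the base weights
merged.save_pretrained("path/to/merged_hf_model")
AutoTokenizer.from_pretrained("path/to/base_model").save_pretrained("path/to/merged_hf_model")
```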

mftcoder_accelerate/src/pefts/mft_accelerate.py

Lines changed: 3 additions & 4 deletions
@@ -1,5 +1,5 @@
 """
-# @author qumu
+# @author Chaoyu Chen
 # @date 2023/12/11
 # @module mft_accelerate.py
 
@@ -374,8 +374,7 @@ def main():
         # args.saving_limit = None
     else:
         model.gradient_checkpointing_enable()
-        assert (args.saving_limit is not None and isinstance(args.saving_limit,
-                int)), "saving_limit must be a integer in Full Training"
+        assert (args.saving_limit is not None and isinstance(args.saving_limit, int)), "saving_limit must be an integer in Full Training"
 
     # Potentially load in the lora from a previous save
     if args.peft_type:
@@ -412,7 +411,7 @@ def main():
         adam_optimizer = Adam
     elif accelerator.distributed_type == DistributedType.FSDP:
         accelerator.print("DISTRIBUTED TRAINING USING FSDP")
-        if getattr(accelerator.state, "fsdp_plugin", None) is not None:
+        if args.peft_type and getattr(accelerator.state, "fsdp_plugin", None) is not None:
             from peft.utils.other import fsdp_auto_wrap_policy
             accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
         model = accelerator.prepare(model)
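The last hunk above is the core of the bugfix: peft's `fsdp_auto_wrap_policy` is now applied only when a PEFT method (e.g. LoRA) is active. A sketch of the fixed branch, factored into a standalone helper for illustration (the helper itself is not part of the repo):

```python
from accelerate import Accelerator
from accelerate.utils import DistributedType
from peft.utils.other import fsdp_auto_wrap_policy


def prepare_model_for_fsdp(accelerator: Accelerator, model, peft_type):
    """Apply peft's auto-wrap policy only for LoRA/PEFT runs, then prepare the model."""
    if accelerator.distributed_type == DistributedType.FSDP:
        if peft_type and getattr(accelerator.state, "fsdp_plugin", None) is not None:
            # peft's policy wraps the trainable adapter layers separately from the frozen
            # base layers, so each flattened FSDP unit has a uniform requires_grad flag.
            accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
        model = accelerator.prepare(model)
    return model
```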

mftcoder_accelerate/src/pefts/model_mapping.py

Lines changed: 12 additions & 11 deletions
@@ -1,14 +1,15 @@
 """
 # @author Chaoyu Chen
-# @date 2023/10/11
+# @date 2023/12/11
+
 Manage supported models and their special token used in training.
 Default targeting modules for LoRA/QLora
 4.36 is stable now
 """
 # Models that Transformers support FA2
 from transformers import (
     AutoConfig,
-    AutoTokenizer,
+    AutoTokenizer,
     AutoModelForCausalLM,
     GPTNeoXForCausalLM,
     GPTBigCodeForCausalLM,
@@ -24,6 +25,7 @@
 from model.qwen.modeling_qwen import QWenLMHeadModel
 from model.chatglm2.modeling_chatglm import ChatGLMForConditionalGeneration as ChatGLMForConditionalGeneration2
 from model.chatglm3.modeling_chatglm import ChatGLMForConditionalGeneration as ChatGLMForConditionalGeneration3
+
 # from model.phi.modeling_mixformer_sequential import MixFormerSequentialForCausalLM
 
 MODEL_TYPES = {
@@ -43,22 +45,21 @@
 }
 
 FULL_LORA_TARGETING_MODULES = {
-    "aquila": ["q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
+    "aquila": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
     "baichuan": ["W_pack", "o_proj", "gate_proj", "down_proj", "up_proj"],
     "chatglm2": ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
     "chatglm3": ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
-    "deepseek": ["q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
-    "code_llama": ["q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
+    "deepseek": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
+    "code_llama": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
     "gpt_neox": ["query_key_value", 'dense', 'dense_h_to_4h', 'dense_4h_to_h'],
-    "llama": ["q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
+    "llama": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
     "mistral": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
     "mixtral": ["q_proj", "k_proj", "v_proj", "o_proj"],
     "phi": ["query_key_value", 'dense', 'fc1', 'fc2'],
     "qwen": ["c_proj", "c_attn", "w1", "w2"],
-    "starcoder": ["c_proj", "c_attn", "q_attn", "c_fc"],
+    "starcoder": ["c_proj", "c_attn", "q_attn", "c_fc"],
 }
 
-
 MODEL_SPECIAL_TOKENS = {
     "gpt_neox": {
 
@@ -94,7 +95,7 @@
 
         "eos_token": "<|endoftext|>",
         "pad_token": "<|extra_1|>",
-
+
     },
     "chatglm2": {
 
@@ -125,7 +126,7 @@
         "pad_token": "<|end▁of▁sentence|>",
 
     },
-    "mixtral": {
+    "mixtral": {
 
         "eos_token": "</s>",
         "pad_token": "<unk>",
@@ -137,4 +138,4 @@
         "pad_token": "<unk>",
 
     },
-}
+}
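The `FULL_LORA_TARGETING_MODULES` change adds `k_proj` to the llama-family defaults; these lists are meant to be passed as `target_modules` to peft's `LoraConfig`. A minimal sketch with hypothetical rank/alpha values:

```python
from peft import LoraConfig, get_peft_model

llama_targets = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]

lora_config = LoraConfig(
    r=64,              # hypothetical rank
    lora_alpha=32,     # hypothetical scaling factor
    lora_dropout=0.05,
    target_modules=llama_targets,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model loaded as in the sketches above
```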

mftcoder_accelerate/src/tokenizer/chat_template.py

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
-# @author qumu
+# @author Chaoyu Chen
 # @date 2023/12/25
-# @module chat_template
+
 # store possible chat_template for tokenizers to prepare input string
 # -------------------------------------------------- Import ------------------------------------------------------------
 from transformers import (
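chat_template.py stores possible chat templates that tokenizers use to build input strings. In transformers, such a template is consumed through `tokenizer.apply_chat_template`; the template string below is purely illustrative and not one of the repo's templates:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Python-hf")  # placeholder
# Illustrative Jinja template only; the real templates live in tokenizer/chat_template.py.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|> {{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|> {% endif %}"
)

messages = [{"role": "user", "content": "Write a function that reverses a string."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```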

mftcoder_accelerate/src/tokenizer/tokenizer.py

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@
 """
 # @author Chaoyu Chen
 # @date 2023/6/19
+
+Build tokenizer
 """
 
 
0 commit comments