Commit cb45fad

Support FSDP + QLoRA (#659)
1 parent 898783b commit cb45fad

File tree

9 files changed: +113, −3 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
 Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 2024.04.04: Support **QLoRA+FSDP** to train a 70B model with two 24GB GPUs, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh) to start training.
 - 🔥2024.04.03: Support the **Qwen1.5-32B** series: Qwen1.5-32B, Qwen1.5-32B-Chat, Qwen1.5-32B-Chat-GPTQ-Int4. Use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_32b_chat/lora_mp/sft.sh) to start training!
 - 🔥2024.04.02: Support fine-tuning and inference of the Mengzi3-13B-Base model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/mengzi3_13b_base/lora_ddp_ds/sft.sh) to start training!
 - 🔥2024.04.01: Support the **dbrx** series: dbrx-base and dbrx-instruct, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/dbrx-instruct/lora_mp/sft.sh) to start training!
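
Note: a rough back-of-envelope estimate (not from the commit itself) of why two 24GB cards are enough for a 70B model: the base weights are stored in 4-bit NF4, FSDP FULL_SHARD splits them across both ranks, and the bundled `fsdp_offload.json` additionally offloads parameter shards and optimizer state to CPU RAM.

```python
# Back-of-envelope memory math for QLoRA + FSDP on a 70B model.
# Assumptions: ~70e9 base parameters in 4-bit NF4 (~0.5 bytes/param, slightly
# more with double-quantization constants), sharded across 2 GPUs.
params = 70e9
nf4_bytes_per_param = 0.5
num_gpus = 2

total_weights_gb = params * nf4_bytes_per_param / 1e9   # ~35 GB in total
per_gpu_shard_gb = total_weights_gb / num_gpus           # ~17.5 GB per GPU

print(f"4-bit base weights: {total_weights_gb:.1f} GB total, "
      f"{per_gpu_shard_gb:.1f} GB per GPU shard")
# LoRA adapters, activations and gradients come on top; optimizer state and the
# parameter shards themselves are offloaded to CPU RAM by fsdp_offload.json,
# which is what keeps each 24GB card within budget.
```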

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ SWIFT supports the training, inference, … of nearly **200 LLMs and MLLMs** (multimodal large models)
 In addition, we are expanding support for other modalities; currently we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 2024.04.04: Support **QLoRA+FSDP** to train a 70B model with two 24GB GPUs, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh) to start training.
 - 🔥2024.04.03: Support the **Qwen1.5-32B** series: Qwen1.5-32B, Qwen1.5-32B-Chat, Qwen1.5-32B-Chat-GPTQ-Int4. Use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_32b_chat/lora_mp/sft.sh) to start training!
 - 🔥2024.04.02: Support inference and fine-tuning of the Mengzi3-13B-Base model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/mengzi3_13b_base/lora_ddp_ds/sft.sh) to start training!
 - 🔥2024.04.01: Support the **dbrx** series: dbrx-base and dbrx-instruct, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/dbrx-instruct/lora_mp/sft.sh) to start training!

docs/source/LLM/命令行参数.md

Lines changed: 8 additions & 2 deletions
@@ -43,6 +43,7 @@
 - `--bnb_4bit_comp_dtype`: When doing 4-bit quantization, the weights are dequantized during the model's forward and backward passes. This parameter specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_quant_type`: Quantization method for 4-bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4-bit quantization, default is `True`. Has no effect when quantization_bit is 0.
+- `--bnb_4bit_quant_storage`: Default is `None`. The storage type of the quantized parameters. Has no effect if `quantization_bit` is set to 0.
 - `--lora_target_modules`: Specify the lora modules, default is `['DEFAULT']`. If `'DEFAULT'` or `'AUTO'` is passed, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (by default qkv). If `'ALL'` is passed, all Linear layers (excluding the head) are used as lora modules. If `'EMBEDDING'` is passed, the Embedding layer is used as a lora module. If memory allows, setting 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to use all Linear and embedding layers as lora modules. Only takes effect when `sft_type` is 'lora'.
 - `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
 - `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
@@ -104,6 +105,11 @@
 - `--train_dataset_mix_ds`: Default is `['ms-bench']`. General-knowledge dataset used to prevent knowledge forgetting.
 - `--use_loss_scale`: Default is `False`. When enabled, the loss weight of some Agent fields (the Action/Action Input part) is strengthened to enhance CoT; it has no effect in regular SFT scenarios.
 
+### FSDP parameters
+
+- `--fsdp`: Default is `''`, the FSDP type; see the [original documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp) of this parameter for details.
+- `--fsdp_config`: Default is `None`, the path of the FSDP config file; `fsdp_offload` is also accepted, pointing to the default config provided by SWIFT, see [here](https://github.com/modelscope/swift/tree/main/swift/llm/fsdp_config/fsdp_offload.json) for details.
+
 ### LoRA+ fine-tuning parameters
 
 - `--lora_lr_ratio`: Default is `None`, recommended value `10~16`; specify this parameter when using lora to enable lora+.
@@ -184,6 +190,7 @@ dpo parameters inherit from the sft parameters; in addition, the following parameters are added:
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--bnb_4bit_use_double_quant`: Default is `True`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
+- `--bnb_4bit_quant_storage`: Default is `None`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
 - `--do_sample`: Whether to use sampling rather than greedy generation, default is `True`.
 - `--temperature`: Default is `0.3`. Only takes effect when `do_sample` is set to True. This value is also used as the default in the deployment parameters.
@@ -206,10 +213,10 @@ dpo parameters inherit from the sft parameters; in addition, the following parameters are added:
 - `--vllm_max_lora_rank`: Default is `16`. vllm parameter for lora support.
 - `--vllm_lora_modules`: Default is `[]`; the input format is `'{lora_name}-{lora_path}'`, e.g. `--vllm_lora_modules lora_name1=lora_path1 lora_name2=lora_path2`. `ckpt_dir` is added to args.vllm_lora_modules in the form `f'default-lora={args.ckpt_dir}'`.
 
-
 ## export parameters
 
 export parameters inherit from the infer parameters; in addition, the following parameters are added:
+
 - `--to_peft_format`: Default is `False`. Convert the lora weights from swift format to peft format.
 - `--merge_lora`: Default is `False`. This parameter is already defined in InferArguments and is not a new parameter. Whether to merge the lora weights into the base model and save the full weights. The weights are saved in a directory at the same level as `ckpt_dir`, e.g. `'/path/to/your/vx-xxx/checkpoint-xxx-merged'`.
 - `--quant_bits`: Number of bits for quantization. Default is `0`, i.e. no quantization. If you set `--quant_method awq`, you can set it to `4` for 4-bit quantization. If you set `--quant_method gptq`, you can set it to `2`, `3`, `4` or `8` for the corresponding bit-width. When quantizing the original model, the weights are saved in the `f'{args.model_type}-{args.quant_method}-int{args.quant_bits}'` directory. When quantizing a fine-tuned model, the weights are saved in a directory at the same level as `ckpt_dir`, e.g. under `f'/path/to/your/vx-xxx/checkpoint-xxx-{args.quant_method}-int{args.quant_bits}'`.
@@ -224,7 +231,6 @@ export parameters inherit from the infer parameters; in addition, the following parameters are added:
 - `--hub_private_repo`: Default is `False`. See `sft.sh command line arguments` for details.
 - `--commit_message`: Default is `'update files'`.
 
-
 ## app-ui parameters
 
 app-ui parameters inherit from the infer parameters; in addition, the following parameters are added:

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@
4343
- `--bnb_4bit_comp_dtype`: When doing 4bit quantization, we need to dequantize during model forward and backward passes. This specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
4444
- `--bnb_4bit_quant_type`: Quantization method for 4bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
4545
- `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4bit quantization, default is `True`. Has no effect when quantization_bit is 0.
46+
- `--bnb_4bit_quant_storage`: Default vlaue `None`.This sets the storage type to pack the quanitzed 4-bit prarams. Has no effect when quantization_bit is 0.
4647
- `--lora_target_modules`: Specify lora modules, default is `['DEFAULT']`. If lora_target_modules is passed `'DEFAULT'` or `'AUTO'`, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (default specifies qkv). If passed `'ALL'`, all Linear layers (excluding head) will be specified as lora modules. If passed `'EMBEDDING'`, Embedding layer will be specified as lora module. If memory allows, setting to 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to specify all Linear and embedding layers as lora modules. This parameter only takes effect when `sft_type` is 'lora'.
4748
- `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
4849
- `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
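
Note: for orientation, a minimal sketch (assuming transformers >= 4.39 and bitsandbytes >= 0.43) of how these flags map onto `BitsAndBytesConfig`; the values mirror the qlora_fsdp example script further down, and the actual wiring in SWIFT is the `swift/llm/sft.py` change in this commit.

```python
import torch
from transformers import BitsAndBytesConfig

# Illustrative values only; SWIFT builds this object from the CLI flags.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',               # --bnb_4bit_quant_type
    bnb_4bit_compute_dtype=torch.bfloat16,   # --bnb_4bit_comp_dtype bf16
    bnb_4bit_use_double_quant=True,          # --bnb_4bit_use_double_quant
    bnb_4bit_quant_storage=torch.bfloat16,   # --bnb_4bit_quant_storage bfloat16
)
# Using one uniform storage dtype lets the 4-bit Linear weights be treated as a
# flat bf16 buffer, which is what FSDP needs in order to shard them.
```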
@@ -104,6 +105,12 @@
 - `--train_dataset_mix_ds`: Default is `ms-bench`. General knowledge dataset used to prevent knowledge forgetting.
 - `--use_loss_scale`: Default is `False`. When taking effect, strengthens the loss weight of some Agent fields (the Action/Action Input part) to enhance CoT; has no effect in regular SFT scenarios.
 
+### FSDP Parameters
+
+- `--fsdp`: Default value `''`, the FSDP type; please check [this documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp) for details.
+
+- `--fsdp_config`: Default value `None`, the FSDP config file path; `fsdp_offload` is a special value pointing to the default config shipped with SWIFT, check [here](https://github.com/modelscope/swift/tree/main/swift/llm/fsdp_config/fsdp_offload.json) for details.
+
 ### LoRA+ Fine-tuning Parameters
 
 - `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using lora to enable lora+.
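
Note: these two flags are forwarded essentially unchanged into `transformers.TrainingArguments` (see the `swift/llm/utils/argument.py` change below, which also resolves the special value `fsdp_offload` to the bundled JSON file). A minimal sketch, assuming transformers >= 4.39 with accelerate installed and the repository root as the working directory so the config path resolves; the flag values are illustrative.

```python
from transformers import TrainingArguments

# Illustrative values; SWIFT fills these fields from --fsdp / --fsdp_config.
training_args = TrainingArguments(
    output_dir='output',
    fsdp='full_shard auto_wrap offload',                    # --fsdp
    fsdp_config='swift/llm/fsdp_config/fsdp_offload.json',  # --fsdp_config
)
# TrainingArguments reads the JSON at construction time, so the path must exist;
# this mirrors what argument.py does when --fsdp_config fsdp_offload is passed.
```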
@@ -184,6 +191,7 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_use_double_quant`: Default is `True`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
+- `--bnb_4bit_quant_storage`: Default value `None`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
 - `--do_sample`: Whether to use greedy generation or sampling generation, default is `True`.
 - `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as the default value in deployment parameters.

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# 2 GPU 80G memory total
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python llm_infer.py \
+    --ckpt_dir "output/llama2-70b-chat/vxx-xxx-xxxx/checkpoint-xx" \
+    --load_dataset_config true \
+    --max_new_tokens 2048 \
+    --temperature 0.1 \
+    --top_p 0.7 \
+    --repetition_penalty 1. \
+    --do_sample true \
+    --merge_lora false \

examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# 2 GPU * 24G
+# bitsandbytes>=0.43.0 needed
+nproc_per_node=2
+
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0,1 \
+accelerate launch --config_file "../../../swift/llm/fsdp_config/fsdp_offload.json" \
+    llm_sft.py \
+    --model_type llama2-70b-chat \
+    --model_revision master \
+    --sft_type lora \
+    --tuner_backend peft \
+    --template_type llama \
+    --dtype bf16 \
+    --output_dir output \
+    --dataset leetcode-python-en \
+    --train_dataset_sample -1 \
+    --num_train_epochs 1 \
+    --max_length 2048 \
+    --check_dataset_strategy warning \
+    --quantization_bit 4 \
+    --bnb_4bit_comp_dtype "bf16" \
+    --bnb_4bit_quant_storage bfloat16 \
+    --lora_rank 8 \
+    --lora_alpha 32 \
+    --lora_dtype bf16 \
+    --lora_dropout_p 0.05 \
+    --lora_target_modules DEFAULT \
+    --gradient_checkpointing true \
+    --batch_size 1 \
+    --weight_decay 0.1 \
+    --learning_rate 1e-4 \
+    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
+    --max_grad_norm 0.5 \
+    --warmup_ratio 0.03 \
+    --eval_steps 50 \
+    --save_steps 50 \
+    --save_total_limit 2 \
+    --logging_steps 10 \
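
Note: a small worked check of the batch-size arithmetic above (my own restatement): `gradient_accumulation_steps` is divided by `nproc_per_node` so the effective global batch size stays at 16 regardless of how many data-parallel ranks are launched.

```python
# Effective global batch size implied by the script above.
batch_size_per_gpu = 1                  # --batch_size 1
nproc_per_node = 2
grad_accum = 16 // nproc_per_node       # $(expr 16 / $nproc_per_node) -> 8

global_batch = batch_size_per_gpu * nproc_per_node * grad_accum
print(global_batch)  # 16
```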

swift/llm/fsdp_config/fsdp_offload.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+{
+    "compute_environment": "LOCAL_MACHINE",
+    "debug": false,
+    "distributed_type": "FSDP",
+    "downcast_bf16": "no",
+    "fsdp_config": {
+        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+        "fsdp_backward_prefetch": "BACKWARD_PRE",
+        "fsdp_cpu_ram_efficient_loading": true,
+        "fsdp_forward_prefetch": false,
+        "fsdp_offload_params": true,
+        "fsdp_sharding_strategy": "FULL_SHARD",
+        "fsdp_state_dict_type": "FULL_STATE_DICT",
+        "fsdp_sync_module_states": true,
+        "fsdp_use_orig_params": false
+    },
+    "machine_rank": 0,
+    "main_training_function": "main",
+    "mixed_precision": "no",
+    "num_machines": 1,
+    "num_processes": 2,
+    "rdzv_backend": "static",
+    "same_network": true,
+    "tpu_env": [],
+    "tpu_use_cluster": false,
+    "tpu_use_sudo": false,
+    "use_cpu": false
+}
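
Note: a tiny sketch (path assumed from this commit's layout; run from the repository root) that pulls out the settings doing the heavy lifting for the 2×24GB setup: FULL_SHARD partitioning and CPU offload of the parameter shards.

```python
import json

# fsdp_offload.json is both the accelerate launch config used by sft.sh and the
# file that --fsdp_config fsdp_offload resolves to in swift/llm/utils/argument.py.
with open('swift/llm/fsdp_config/fsdp_offload.json') as f:
    cfg = json.load(f)

fsdp = cfg['fsdp_config']
print(fsdp['fsdp_sharding_strategy'])  # FULL_SHARD: params, grads and optimizer state split across ranks
print(fsdp['fsdp_offload_params'])     # True: shards are kept in CPU RAM between uses
print(cfg['num_processes'])            # 2: one rank per 24GB GPU
```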

swift/llm/sft.py

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
             args.load_in_4bit,
             bnb_4bit_compute_dtype=args.bnb_4bit_compute_dtype,
             bnb_4bit_quant_type=args.bnb_4bit_quant_type,
+            bnb_4bit_quant_storage=args.bnb_4bit_quant_storage,
             bnb_4bit_use_double_quant=args.bnb_4bit_use_double_quant)
         logger.info(f'quantization_config: {quantization_config.__dict__}')
         model_kwargs['quantization_config'] = quantization_config

swift/llm/utils/argument.py

Lines changed: 15 additions & 1 deletion
@@ -104,6 +104,7 @@ class SftArguments:
     bnb_4bit_comp_dtype: Literal['fp16', 'bf16', 'fp32', 'AUTO'] = 'AUTO'
     bnb_4bit_quant_type: Literal['fp4', 'nf4'] = 'nf4'
     bnb_4bit_use_double_quant: bool = True
+    bnb_4bit_quant_storage: Optional[str] = None
     # lora
     lora_target_modules: List[str] = field(default_factory=lambda: ['DEFAULT'])
     lora_rank: int = 8
@@ -112,7 +113,7 @@
     lora_bias_trainable: Literal['none', 'all'] = 'none'
     # e.g. ['wte', 'ln_1', 'ln_2', 'ln_f', 'lm_head']
     lora_modules_to_save: List[str] = field(default_factory=list)
-    lora_dtype: Literal['fp16', 'bf16', 'fp32', 'AUTO'] = 'fp32'
+    lora_dtype: Literal['fp16', 'bf16', 'fp32'] = 'fp32'
     lora_lr_ratio: float = None
 
     use_rslora: bool = False
@@ -237,6 +238,11 @@ class SftArguments:
     deepspeed_config_path: Optional[str] = None
     model_cache_dir: Optional[str] = None
 
+    # fsdp option
+    fsdp: Optional[str] = ''
+    # fsdp config file
+    fsdp_config: Optional[str] = None
+
     def _prepare_target_modules(self, target_modules) -> List[str]:
         if isinstance(target_modules, str):
             target_modules = [target_modules]
@@ -283,6 +289,12 @@ def __post_init__(self) -> None:
         elif self.deepspeed == 'default-zero3':
             self.deepspeed = os.path.abspath(
                 os.path.join(ds_config_folder, 'zero3.json'))
+
+        fsdp_config_folder = os.path.join(__file__, '..', '..', 'fsdp_config')
+        if self.fsdp_config == 'fsdp_offload':
+            self.fsdp_config = os.path.abspath(
+                os.path.join(fsdp_config_folder, 'fsdp_offload.json'))
+
         handle_path(self)
         set_model_type(self)
         if isinstance(self.dataset, str):
@@ -527,6 +539,8 @@ def _init_training_args(self) -> None:
             acc_strategy=self.acc_strategy,
             save_safetensors=self.save_safetensors,
             logging_first_step=True,
+            fsdp=self.fsdp,
+            fsdp_config=self.fsdp_config,
             **kwargs)

         training_args.ddp_find_unused_parameters = self.ddp_find_unused_parameters
