Commit cb45fad

Support FSDP + QLoRA (#659)
1 parent 898783b commit cb45fad

File tree

9 files changed: +113, −3 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
 Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 2024.04.04: Support **QLoRA+FSDP** to train a 70B model with two 24GB GPUs, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh) to start training.
 - 🔥2024.04.03: Support the **Qwen1.5-32B** series: Qwen1.5-32B, Qwen1.5-32B-Chat, Qwen1.5-32B-Chat-GPTQ-Int4. Use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_32b_chat/lora_mp/sft.sh) to start training!
 - 🔥2024.04.02: Support fine-tuning and inference of the Mengzi3-13B-Base model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/mengzi3_13b_base/lora_ddp_ds/sft.sh) to start training!
 - 🔥2024.04.01: Support the **dbrx** series: dbrx-base and dbrx-instruct, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/dbrx-instruct/lora_mp/sft.sh) to start training!
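
Note: a rough back-of-envelope estimate (not from the commit itself) of why two 24GB cards are enough for a 70B model: the base weights are stored in 4-bit NF4, FSDP FULL_SHARD splits them across both ranks, and the bundled `fsdp_offload.json` additionally offloads parameter shards and optimizer state to CPU RAM.

```python
# Back-of-envelope memory math for QLoRA + FSDP on a 70B model.
# Assumptions: ~70e9 base parameters in 4-bit NF4 (~0.5 bytes/param, slightly
# more with double-quantization constants), sharded across 2 GPUs.
params = 70e9
nf4_bytes_per_param = 0.5
num_gpus = 2

total_weights_gb = params * nf4_bytes_per_param / 1e9   # ~35 GB in total
per_gpu_shard_gb = total_weights_gb / num_gpus           # ~17.5 GB per GPU

print(f"4-bit base weights: {total_weights_gb:.1f} GB total, "
      f"{per_gpu_shard_gb:.1f} GB per GPU shard")
# LoRA adapters, activations and gradients come on top; optimizer state and the
# parameter shards themselves are offloaded to CPU RAM by fsdp_offload.json,
# which is what keeps each 24GB card within budget.
```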

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ SWIFT supports the training, inference, … of nearly **200 LLMs and MLLMs** (multimodal large models)
 In addition, we are expanding support for other modalities; currently we support full-parameter training and LoRA training for AnimateDiff.
 
 ## 🎉 News
+- 2024.04.04: Support **QLoRA+FSDP** to train a 70B model with two 24GB GPUs, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh) to start training.
 - 🔥2024.04.03: Support the **Qwen1.5-32B** series: Qwen1.5-32B, Qwen1.5-32B-Chat, Qwen1.5-32B-Chat-GPTQ-Int4. Use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_32b_chat/lora_mp/sft.sh) to start training!
 - 🔥2024.04.02: Support inference and fine-tuning of the Mengzi3-13B-Base model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/mengzi3_13b_base/lora_ddp_ds/sft.sh) to start training!
 - 🔥2024.04.01: Support the **dbrx** series: dbrx-base and dbrx-instruct, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/dbrx-instruct/lora_mp/sft.sh) to start training!

docs/source/LLM/命令行参数.md

Lines changed: 8 additions & 2 deletions
@@ -43,6 +43,7 @@
 - `--bnb_4bit_comp_dtype`: When doing 4-bit quantization, the weights are dequantized during the model's forward and backward passes. This parameter specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_quant_type`: Quantization method for 4-bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
 - `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4-bit quantization, default is `True`. Has no effect when quantization_bit is 0.
+- `--bnb_4bit_quant_storage`: Default is `None`. The storage type of the quantized parameters. Has no effect if `quantization_bit` is set to 0.
 - `--lora_target_modules`: Specify the lora modules, default is `['DEFAULT']`. If `'DEFAULT'` or `'AUTO'` is passed, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (by default qkv). If `'ALL'` is passed, all Linear layers (excluding the head) are used as lora modules. If `'EMBEDDING'` is passed, the Embedding layer is used as a lora module. If memory allows, setting 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to use all Linear and embedding layers as lora modules. Only takes effect when `sft_type` is 'lora'.
 - `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
 - `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
@@ -104,6 +105,11 @@
 - `--train_dataset_mix_ds`: Default is `['ms-bench']`. General-knowledge dataset used to prevent knowledge forgetting.
 - `--use_loss_scale`: Default is `False`. When enabled, the loss weight of some Agent fields (the Action/Action Input part) is strengthened to enhance CoT; it has no effect in regular SFT scenarios.
 
+### FSDP parameters
+
+- `--fsdp`: Default is `''`, the FSDP type; see the [original documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp) of this parameter for details.
+- `--fsdp_config`: Default is `None`, the path of the FSDP config file; `fsdp_offload` is also accepted, pointing to the default config provided by SWIFT, see [here](https://github.com/modelscope/swift/tree/main/swift/llm/fsdp_config/fsdp_offload.json) for details.
+
 ### LoRA+ fine-tuning parameters
 
 - `--lora_lr_ratio`: Default is `None`, recommended value `10~16`; specify this parameter when using lora to enable lora+.
@@ -184,6 +190,7 @@ dpo parameters inherit from the sft parameters; in addition, the following parameters are added:
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--bnb_4bit_use_double_quant`: Default is `True`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
+- `--bnb_4bit_quant_storage`: Default is `None`. See `sft.sh command line arguments` for details. Has no effect if `quantization_bit` is set to 0.
 - `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
 - `--do_sample`: Whether to use sampling rather than greedy generation, default is `True`.
 - `--temperature`: Default is `0.3`. Only takes effect when `do_sample` is set to True. This value is also used as the default in the deployment parameters.
@@ -206,10 +213,10 @@ dpo parameters inherit from the sft parameters; in addition, the following parameters are added:
 - `--vllm_max_lora_rank`: Default is `16`. vllm parameter for lora support.
 - `--vllm_lora_modules`: Default is `[]`; the input format is `'{lora_name}-{lora_path}'`, e.g. `--vllm_lora_modules lora_name1=lora_path1 lora_name2=lora_path2`. `ckpt_dir` is added to args.vllm_lora_modules in the form `f'default-lora={args.ckpt_dir}'`.
 
-
 ## export parameters
 
 export parameters inherit from the infer parameters; in addition, the following parameters are added:
+
 - `--to_peft_format`: Default is `False`. Convert the lora weights from swift format to peft format.
 - `--merge_lora`: Default is `False`. This parameter is already defined in InferArguments and is not a new parameter. Whether to merge the lora weights into the base model and save the full weights. The weights are saved in a directory at the same level as `ckpt_dir`, e.g. `'/path/to/your/vx-xxx/checkpoint-xxx-merged'`.
 - `--quant_bits`: Number of bits for quantization. Default is `0`, i.e. no quantization. If you set `--quant_method awq`, you can set it to `4` for 4-bit quantization. If you set `--quant_method gptq`, you can set it to `2`, `3`, `4` or `8` for the corresponding bit-width. When quantizing the original model, the weights are saved in the `f'{args.model_type}-{args.quant_method}-int{args.quant_bits}'` directory. When quantizing a fine-tuned model, the weights are saved in a directory at the same level as `ckpt_dir`, e.g. under `f'/path/to/your/vx-xxx/checkpoint-xxx-{args.quant_method}-int{args.quant_bits}'`.
@@ -224,7 +231,6 @@ export parameters inherit from the infer parameters; in addition, the following parameters are added:
 - `--hub_private_repo`: Default is `False`. See `sft.sh command line arguments` for details.
 - `--commit_message`: Default is `'update files'`.
 
-
 ## app-ui parameters
 
 app-ui parameters inherit from the infer parameters; in addition, the following parameters are added:

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@
4343
- `--bnb_4bit_comp_dtype`: When doing 4bit quantization, we need to dequantize during model forward and backward passes. This specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
4444
- `--bnb_4bit_quant_type`: Quantization method for 4bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
4545
- `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4bit quantization, default is `True`. Has no effect when quantization_bit is 0.
46+
- `--bnb_4bit_quant_storage`: Default vlaue `None`.This sets the storage type to pack the quanitzed 4-bit prarams. Has no effect when quantization_bit is 0.
4647
- `--lora_target_modules`: Specify lora modules, default is `['DEFAULT']`. If lora_target_modules is passed `'DEFAULT'` or `'AUTO'`, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (default specifies qkv). If passed `'ALL'`, all Linear layers (excluding head) will be specified as lora modules. If passed `'EMBEDDING'`, Embedding layer will be specified as lora module. If memory allows, setting to 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to specify all Linear and embedding layers as lora modules. This parameter only takes effect when `sft_type` is 'lora'.
4748
- `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
4849
- `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
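
Note: for orientation, a minimal sketch (assuming transformers >= 4.39 and bitsandbytes >= 0.43) of how these flags map onto `BitsAndBytesConfig`; the values mirror the qlora_fsdp example script further down, and the actual wiring in SWIFT is the `swift/llm/sft.py` change in this commit.

```python
import torch
from transformers import BitsAndBytesConfig

# Illustrative values only; SWIFT builds this object from the CLI flags.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',               # --bnb_4bit_quant_type
    bnb_4bit_compute_dtype=torch.bfloat16,   # --bnb_4bit_comp_dtype bf16
    bnb_4bit_use_double_quant=True,          # --bnb_4bit_use_double_quant
    bnb_4bit_quant_storage=torch.bfloat16,   # --bnb_4bit_quant_storage bfloat16
)
# Using one uniform storage dtype lets the 4-bit Linear weights be treated as a
# flat bf16 buffer, which is what FSDP needs in order to shard them.
```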
@@ -104,6 +105,12 @@
 - `--train_dataset_mix_ds`: Default is `ms-bench`. General knowledge dataset used to prevent knowledge forgetting.
 - `--use_loss_scale`: Default is `False`. When taking effect, strengthens the loss weight of some Agent fields (the Action/Action Input part) to enhance CoT; has no effect in regular SFT scenarios.
 
+### FSDP Parameters
+
+- `--fsdp`: Default value `''`, the FSDP type; please check [this documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp) for details.
+
+- `--fsdp_config`: Default value `None`, the FSDP config file path; `fsdp_offload` is a special value pointing to the default config shipped with SWIFT, check [here](https://github.com/modelscope/swift/tree/main/swift/llm/fsdp_config/fsdp_offload.json) for details.
+
 ### LoRA+ Fine-tuning Parameters
 
 - `--lora_lr_ratio`: Default `None`, recommended value `10~16`; specify this parameter when using lora to enable lora+.
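
Note: these two flags are forwarded essentially unchanged into `transformers.TrainingArguments` (see the `swift/llm/utils/argument.py` change below, which also resolves the special value `fsdp_offload` to the bundled JSON file). A minimal sketch, assuming transformers >= 4.39 with accelerate installed and the repository root as the working directory so the config path resolves; the flag values are illustrative.

```python
from transformers import TrainingArguments

# Illustrative values; SWIFT fills these fields from --fsdp / --fsdp_config.
training_args = TrainingArguments(
    output_dir='output',
    fsdp='full_shard auto_wrap offload',                    # --fsdp
    fsdp_config='swift/llm/fsdp_config/fsdp_offload.json',  # --fsdp_config
)
# TrainingArguments reads the JSON at construction time, so the path must exist;
# this mirrors what argument.py does when --fsdp_config fsdp_offload is passed.
```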
@@ -184,6 +191,7 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--bnb_4bit_use_double_quant`: Default is `True`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
+- `--bnb_4bit_quant_storage`: Default value `None`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
 - `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
 - `--do_sample`: Whether to use greedy generation or sampling generation, default is `True`.
 - `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as the default value in deployment parameters.

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# 2 GPU 80G memory total
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python llm_infer.py \
+    --ckpt_dir "output/llama2-70b-chat/vxx-xxx-xxxx/checkpoint-xx" \
+    --load_dataset_config true \
+    --max_new_tokens 2048 \
+    --temperature 0.1 \
+    --top_p 0.7 \
+    --repetition_penalty 1. \
+    --do_sample true \
+    --merge_lora false \

examples/pytorch/llm/scripts/llama2_70b_chat/qlora_fsdp/sft.sh

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# 2 GPU * 24G
+# bitsandbytes>=0.43.0 needed
+nproc_per_node=2
+
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0,1 \
+accelerate launch --config_file "../../../swift/llm/fsdp_config/fsdp_offload.json" \
+    llm_sft.py \
+    --model_type llama2-70b-chat \
+    --model_revision master \
+    --sft_type lora \
+    --tuner_backend peft \
+    --template_type llama \
+    --dtype bf16 \
+    --output_dir output \
+    --dataset leetcode-python-en \
+    --train_dataset_sample -1 \
+    --num_train_epochs 1 \
+    --max_length 2048 \
+    --check_dataset_strategy warning \
+    --quantization_bit 4 \
+    --bnb_4bit_comp_dtype "bf16" \
+    --bnb_4bit_quant_storage bfloat16 \
+    --lora_rank 8 \
+    --lora_alpha 32 \
+    --lora_dtype bf16 \
+    --lora_dropout_p 0.05 \
+    --lora_target_modules DEFAULT \
+    --gradient_checkpointing true \
+    --batch_size 1 \
+    --weight_decay 0.1 \
+    --learning_rate 1e-4 \
+    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
+    --max_grad_norm 0.5 \
+    --warmup_ratio 0.03 \
+    --eval_steps 50 \
+    --save_steps 50 \
+    --save_total_limit 2 \
+    --logging_steps 10 \
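
Note: a small worked check of the batch-size arithmetic above (my own restatement): `gradient_accumulation_steps` is divided by `nproc_per_node` so the effective global batch size stays at 16 regardless of how many data-parallel ranks are launched.

```python
# Effective global batch size implied by the script above.
batch_size_per_gpu = 1                  # --batch_size 1
nproc_per_node = 2
grad_accum = 16 // nproc_per_node       # $(expr 16 / $nproc_per_node) -> 8

global_batch = batch_size_per_gpu * nproc_per_node * grad_accum
print(global_batch)  # 16
```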

swift/llm/fsdp_config/fsdp_offload.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+{
+    "compute_environment": "LOCAL_MACHINE",
+    "debug": false,
+    "distributed_type": "FSDP",
+    "downcast_bf16": "no",
+    "fsdp_config": {
+        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+        "fsdp_backward_prefetch": "BACKWARD_PRE",
+        "fsdp_cpu_ram_efficient_loading": true,
+        "fsdp_forward_prefetch": false,
+        "fsdp_offload_params": true,
+        "fsdp_sharding_strategy": "FULL_SHARD",
+        "fsdp_state_dict_type": "FULL_STATE_DICT",
+        "fsdp_sync_module_states": true,
+        "fsdp_use_orig_params": false
+    },
+    "machine_rank": 0,
+    "main_training_function": "main",
+    "mixed_precision": "no",
+    "num_machines": 1,
+    "num_processes": 2,
+    "rdzv_backend": "static",
+    "same_network": true,
+    "tpu_env": [],
+    "tpu_use_cluster": false,
+    "tpu_use_sudo": false,
+    "use_cpu": false
+}
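
Note: a tiny sketch (path assumed from this commit's layout; run from the repository root) that pulls out the settings doing the heavy lifting for the 2×24GB setup: FULL_SHARD partitioning and CPU offload of the parameter shards.

```python
import json

# fsdp_offload.json is both the accelerate launch config used by sft.sh and the
# file that --fsdp_config fsdp_offload resolves to in swift/llm/utils/argument.py.
with open('swift/llm/fsdp_config/fsdp_offload.json') as f:
    cfg = json.load(f)

fsdp = cfg['fsdp_config']
print(fsdp['fsdp_sharding_strategy'])  # FULL_SHARD: params, grads and optimizer state split across ranks
print(fsdp['fsdp_offload_params'])     # True: shards are kept in CPU RAM between uses
print(cfg['num_processes'])            # 2: one rank per 24GB GPU
```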

swift/llm/sft.py

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
             args.load_in_4bit,
             bnb_4bit_compute_dtype=args.bnb_4bit_compute_dtype,
             bnb_4bit_quant_type=args.bnb_4bit_quant_type,
+            bnb_4bit_quant_storage=args.bnb_4bit_quant_storage,
             bnb_4bit_use_double_quant=args.bnb_4bit_use_double_quant)
         logger.info(f'quantization_config: {quantization_config.__dict__}')
         model_kwargs['quantization_config'] = quantization_config

swift/llm/utils/argument.py

Lines changed: 15 additions & 1 deletion
@@ -104,6 +104,7 @@ class SftArguments:
     bnb_4bit_comp_dtype: Literal['fp16', 'bf16', 'fp32', 'AUTO'] = 'AUTO'
     bnb_4bit_quant_type: Literal['fp4', 'nf4'] = 'nf4'
     bnb_4bit_use_double_quant: bool = True
+    bnb_4bit_quant_storage: Optional[str] = None
     # lora
     lora_target_modules: List[str] = field(default_factory=lambda: ['DEFAULT'])
     lora_rank: int = 8
@@ -112,7 +113,7 @@
     lora_bias_trainable: Literal['none', 'all'] = 'none'
     # e.g. ['wte', 'ln_1', 'ln_2', 'ln_f', 'lm_head']
     lora_modules_to_save: List[str] = field(default_factory=list)
-    lora_dtype: Literal['fp16', 'bf16', 'fp32', 'AUTO'] = 'fp32'
+    lora_dtype: Literal['fp16', 'bf16', 'fp32'] = 'fp32'
     lora_lr_ratio: float = None
 
     use_rslora: bool = False
@@ -237,6 +238,11 @@ class SftArguments:
     deepspeed_config_path: Optional[str] = None
     model_cache_dir: Optional[str] = None
 
+    # fsdp option
+    fsdp: Optional[str] = ''
+    # fsdp config file
+    fsdp_config: Optional[str] = None
+
     def _prepare_target_modules(self, target_modules) -> List[str]:
         if isinstance(target_modules, str):
             target_modules = [target_modules]
@@ -283,6 +289,12 @@ def __post_init__(self) -> None:
         elif self.deepspeed == 'default-zero3':
             self.deepspeed = os.path.abspath(
                 os.path.join(ds_config_folder, 'zero3.json'))
+
+        fsdp_config_folder = os.path.join(__file__, '..', '..', 'fsdp_config')
+        if self.fsdp_config == 'fsdp_offload':
+            self.fsdp_config = os.path.abspath(
+                os.path.join(fsdp_config_folder, 'fsdp_offload.json'))
+
         handle_path(self)
         set_model_type(self)
         if isinstance(self.dataset, str):
@@ -527,6 +539,8 @@ def _init_training_args(self) -> None:
             acc_strategy=self.acc_strategy,
             save_safetensors=self.save_safetensors,
             logging_first_step=True,
+            fsdp=self.fsdp,
+            fsdp_config=self.fsdp_config,
             **kwargs)

         training_args.ddp_find_unused_parameters = self.ddp_find_unused_parameters
