modelscope
diff --git a/README.md b/README.md
Lines changed: 31 additions & 1 deletion
diff --git a/README_CN.md b/README_CN.md
Lines changed: 31 additions & 1 deletion
diff --git a/docs/source/LLM/LLM量化文档.md b/docs/source/LLM/LLM量化文档.md
Lines changed: 5 additions & 4 deletions
diff --git a/docs/source/LLM/NPU推理与微调最佳实践.md b/docs/source/LLM/NPU推理与微调最佳实践.md
Lines changed: 20 additions & 24 deletions
diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md
Lines changed: 1 addition & 1 deletion
@@ -340,6 +340,36 @@ swift sft \
 ```
 
 
+#### Multi-node Multi-GPU
+```shell
+# node0
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+NNODES=2 \
+NODE_RANK=0 \
+MASTER_ADDR=127.0.0.1 \
+NPROC_PER_NODE=8 \
+swift sft \
+    --model_id_or_path qwen1half-32b-chat \
+    --sft_type full \
+    --dataset blossom-math-zh \
+    --output_dir output \
+    --deepspeed default-zero3 \
+
+# node1
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+NNODES=2 \
+NODE_RANK=1 \
+MASTER_ADDR=xxx.xxx.xxx.xxx \
+NPROC_PER_NODE=8 \
+swift sft \
+    --model_id_or_path qwen1half-32b-chat \
+    --sft_type full \
+    --dataset blossom-math-zh \
+    --output_dir output \
+    --deepspeed default-zero3 \
+```
+
+
 ### Inference
 Original model:
 ```shell
@@ -406,7 +436,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
 | Model Type                                     | Model Introduction                                                     | Language           | Model Size                             | Model Type                                 |
 |------------------------------------------------|------------------------------------------------------------------------|--------------------|----------------------------------------|------------------------------------------- |
 | Qwen<br>Qwen1.5                                   | [Tongyi Qwen 1.0 and 1.5 series models](https://github.com/QwenLM)  | Chinese<br>English    | 0.5B-72B<br>including quantized versions | base model<br>chat model<br>MoE model<br>code model                      |
-| ChatGLM2<br>ChatGLM3<br>Codegeex2                    | [Zhipu ChatGLM series models](https://github.com/THUDM)               | Chinese<br>English    | 6B                                     | base model<br>chat model<br>code model  |
+| ChatGLM2<br>ChatGLM3<br>Codegeex2                    | [Zhipu ChatGLM series models](https://github.com/THUDM)               | Chinese<br>English    | 6B                                     | base model<br>chat model<br>code model<br>long text model  |
 | Baichuan/Baichuan2                             | [Baichuan 1 and Baichuan 2](https://github.com/baichuan-inc)           | Chinese<br>English    | 7B-13B<br>including quantized versions             | base model<br>chat model                       |
 | Yuan2                                          | [Langchao Yuan series models](https://github.com/IEIT-Yuan)             | Chinese<br>English    | 2B-102B                                | instruct model                                 |
 | XVerse                                         | [XVerse series models](https://github.com/xverse-ai)                    | Chinese<br>English    | 7B-65B                                 | base model<br>chat model<br>long text model<br>MoE model                |
 
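The Multi-node Multi-GPU launch added in this README hunk sets `NNODES=2` and `NPROC_PER_NODE=8` on both nodes and differs only in `NODE_RANK` and `MASTER_ADDR`. The sketch below is not taken from swift; it assumes the launcher follows the usual `torchrun` convention, in which the world size is `NNODES * NPROC_PER_NODE` and each node owns a contiguous block of global ranks. `LaunchConfig` is a hypothetical helper introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LaunchConfig:
    """Hypothetical mirror of the NNODES / NODE_RANK / NPROC_PER_NODE variables."""
    nnodes: int
    node_rank: int
    nproc_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of training processes across all nodes.
        return self.nnodes * self.nproc_per_node

    def global_ranks(self) -> List[int]:
        # Assumed torchrun-style layout: node k owns ranks
        # [k * nproc_per_node, (k + 1) * nproc_per_node).
        start = self.node_rank * self.nproc_per_node
        return list(range(start, start + self.nproc_per_node))


# Values copied from the node0 / node1 commands in the diff.
node0 = LaunchConfig(nnodes=2, node_rank=0, nproc_per_node=8)
node1 = LaunchConfig(nnodes=2, node_rank=1, nproc_per_node=8)

print(node0.world_size)      # total processes in the job
print(node0.global_ranks())  # ranks hosted on node0
print(node1.global_ranks())  # ranks hosted on node1
```

Under this convention both nodes have to agree on `NNODES` and `NPROC_PER_NODE`, and the `MASTER_ADDR` used on node1 should be an address at which node0 is reachable.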
@@ -337,6 +337,36 @@ swift sft \
     --deepspeed zero3-offload \
 ```
 
+#### Multi-node Multi-GPU
+```shell
+# node0
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+NNODES=2 \
+NODE_RANK=0 \
+MASTER_ADDR=127.0.0.1 \
+NPROC_PER_NODE=8 \
+swift sft \
+    --model_id_or_path qwen1half-32b-chat \
+    --sft_type full \
+    --dataset blossom-math-zh \
+    --output_dir output \
+    --deepspeed default-zero3 \
+
+# node1
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+NNODES=2 \
+NODE_RANK=1 \
+MASTER_ADDR=xxx.xxx.xxx.xxx \
+NPROC_PER_NODE=8 \
+swift sft \
+    --model_id_or_path qwen1half-32b-chat \
+    --sft_type full \
+    --dataset blossom-math-zh \
+    --output_dir output \
+    --deepspeed default-zero3 \
+```
+
+
 ### Inference
 Original model:
 ```shell
@@ -403,7 +433,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
 | Model Type                                          | Model Introduction                                           | Language | Model Size                | Model Type                                |
 | --------------------------------------------------- | ------------------------------------------------------------ |----------| ------------------------- |-------------------------------------------|
 | Qwen<br>Qwen1.5                                        | [Tongyi Qwen 1.0 and 1.5 series models](https://github.com/QwenLM)        | Chinese<br>English | 0.5B-72B<br>including quantized versions     | base model<br>chat model<br>MoE model<br>code model             |
-| ChatGLM2<br>ChatGLM3<br>Codegeex2                         | [Zhipu ChatGLM series models](https://github.com/THUDM/)             | Chinese<br>English | 6B                        | base model<br>chat model<br>code model                  |
+| ChatGLM2<br>ChatGLM3<br>Codegeex2                         | [Zhipu ChatGLM series models](https://github.com/THUDM/)             | Chinese<br>English | 6B                        | base model<br>chat model<br>code model<br>long text model             |
 | Baichuan<br>Baichuan2                                  | [Baichuan 1 and Baichuan 2](https://github.com/baichuan-inc)              | Chinese<br>English | 7B-13B<br>including quantized versions         | base model<br>chat model                          |
 | Yuan2                                               | [Langchao Yuan series models](https://github.com/IEIT-Yuan)               | Chinese<br>English | 2B-102B                   | instruct model                                |
 | XVerse                                              | [XVerse series models](https://github.com/xverse-ai)                 | Chinese<br>English | 7B-65B                    | base model<br>chat model<br>long text model<br>MoE model             |
 
@@ -1,5 +1,5 @@
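 # LLM Quantization Documentation
-swift supports quantizing models with the awq and gptq techniques. Both quantization techniques support inference acceleration with vllm.
+swift supports quantizing models with the awq and gptq techniques. Both quantization techniques support inference acceleration with vllm, and the quantized models support qlora fine-tuning.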
 
 
 ## Table of Contents
@@ -32,17 +32,18 @@ pip install -r requirements/llm.txt  -U
 ## Original Model
 Here we demonstrate awq and gptq quantization of qwen1half-7b-chat.
 ```bash
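-# awq-int4 quantization (takes about 18 minutes on an A100, GPU memory usage: 12GB)
+# awq-int4 quantization (takes about 18 minutes on an A100, GPU memory usage: 13GB)
 # If you hit OOM during quantization, you can moderately lower `--quant_n_samples` (default 256) and `--quant_seqlen` (default 2048).
-# gptq-int4 quantization (takes about 15 minutes on an A100, GPU memory usage: 6GB)
+# gptq-int4 quantization (takes about 20 minutes on an A100, GPU memory usage: 7GB)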
 
 # awq: use `ms-bench-mini` as the quantization dataset
 CUDA_VISIBLE_DEVICES=0 swift export \
     --model_type qwen1half-7b-chat --quant_bits 4 \
     --dataset ms-bench-mini --quant_method awq
 
 # gptq: use `ms-bench-mini` as the quantization dataset
-CUDA_VISIBLE_DEVICES=0 swift export \
+# For gptq quantization, please check this issue first: https://github.com/AutoGPTQ/AutoGPTQ/issues/439
+OMP_NUM_THREADS=14 CUDA_VISIBLE_DEVICES=0 swift export \
     --model_type qwen1half-7b-chat --quant_bits 4 \
     --dataset ms-bench-mini --quant_method gptq
 
 
@@ -8,47 +8,42 @@
 
 ## Environment Preparation
 
-Experimental environment: 8 * Ascend 910B3 64G
+Experimental environment: 8 * Ascend 910B3 64G (devices provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for the support of modelscope and swift)
 
 ```shell
 # Create a new conda virtual environment (optional)
-conda create -n npu python=3.10.12 -y
-conda activate npu
+conda create -n swift-npu python=3.10 -y
+conda activate swift-npu
+
 # Set the global pip mirror (optional, speeds up downloads)
 pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
 
 # Install ms-swift (installing from source is currently recommended; once released, it can be installed directly with pip)
 git clone https://github.com/modelscope/swift.git
 cd swift
 pip install -e '.[llm]'
+
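 # Install torch-npu
-pip install torch-npu
-# If you want to use deepspeed (controls memory usage; training speed will drop somewhat)
-pip install deepspeed -U
-# datasets==2.19.0 is not backward compatible, so version 2.18.0 must be pinned
-pip install datasets==2.18.0
-# Install packages missing from the dependencies
-pip install decorator
-
-# Environment alignment (optional, usually not needed. If you hit errors, you can run the commands below; the repository is tested with the latest environment)
+pip install torch-npu decorator
+# If you want to use deepspeed (controls memory usage; training speed will drop somewhat)
+pip install deepspeed
+
+# Environment alignment (usually not needed. If you hit errors, you can run the commands below; the repository is tested with the latest environment)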
 pip install -r requirements/framework.txt  -U
 pip install -r requirements/llm.txt  -U
-
 ```
 
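-Test whether the environment is installed correctly and whether the NPU can be loaded properly:
+Test whether the environment is installed correctly and whether the NPU can be loaded properly: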
 ```python
 from transformers.utils import is_torch_npu_available
 import torch
-import torch_npu
-
-torch.randn((10,), device='npu:0')
-torch.npu.set_device(0)
 
 print(is_torch_npu_available())  # True
 print(torch.npu.device_count())  # 8
+print(torch.randn(10, device='npu:0'))
 ```
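-Check the P2P connections between the NPUs. Here each NPU is connected to the other NPUs through 7 HCCS links
+
+Check the P2P connections between the NPUs. Here each NPU is connected to the other NPUs through 7 HCCS links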
 ```shell
 (valle) root@valle:~/src# npu-smi info -t topo
 	   NPU0       NPU1       NPU2       NPU3       NPU4       NPU5       NPU6       NPU7       CPU Affinity
@@ -70,10 +65,9 @@ Legend:
   PXB  = Path traversing multipul PCIe switches
   HCCS = Connection traversing HCCS.
   NA   = Unknown relationship.
-
 ```
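-Check the NPU status,
-[Detailed explanation of the npu-smi command](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668)
+
+Check the NPU status. For a detailed explanation of the npu-smi command, see the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668)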
 ```shell
 (valle) root@valle:~/src# npu-smi info
 +------------------------------------------------------------------------------------------------+
@@ -106,8 +100,8 @@ Legend:
 | 7     910B3               | OK            | 98.2        44                0    / 0             |
 | 0                         | 0000:42:00.0  | 0           0    / 0          3315 / 65536         |
 +===========================+===============+====================================================+
-
 ```
+
 ## Fine-tuning
 The following introduces LoRA fine-tuning. For full-parameter fine-tuning, simply set `--sft_type full`.
 
@@ -122,6 +116,7 @@ Legend:
 | 14B  | 8     | None        | 8 * 51 GB |
 | 14B  | 8     | zero2       | 8 * 49 GB |
 | 14B  | 8     | zero3       | 8 * 31 GB |
+
 ### Single-card Training
 
 Start single-card fine-tuning with the following command:
@@ -140,7 +135,8 @@ swift sft \
 ```
 
 
-### Data-parallel training, 4-card ddp, qwen1.5-7B-Chat
+### Data-parallel Training
+We use 4 of the cards for ddp training
 
 ```shell
 # Experimental environment: 4 * Ascend 910B3
 
@@ -224,7 +224,7 @@ export parameters inherit from the infer parameters, and additionally add the following parameters:
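 - `--quant_bits`: The number of bits for quantization. Defaults to `0`, i.e. no quantization. If you set `--quant_method awq`, you can set it to `4` for 4-bit quantization. If you set `--quant_method gptq`, you can set it to `2`, `3`, `4`, or `8` for quantization at the corresponding bit width. When quantizing the original model, the weights are saved in the `f'{args.model_type}-{args.quant_method}-int{args.quant_bits}'` directory. When quantizing a fine-tuned model, the weights are saved in a sibling directory of `ckpt_dir`, e.g. `f'/path/to/your/vx-xxx/checkpoint-xxx-{args.quant_method}-int{args.quant_bits}'`.
 - `--quant_method`: The quantization method. Defaults to `'awq'`. You can choose 'awq' or 'gptq'.
 - `--dataset`: This parameter is already defined in InferArguments; for export it means the quantization dataset. Defaults to `[]`. Setting `--dataset ms-bench-mini` is recommended. This dataset contains high-quality multilingual content (mainly Chinese) and works well for quantizing Chinese models. You can also set `--dataset pileval` to use the default autoawq quantization dataset, which is in English. For more details, including how to customize the quantization dataset, see the [LLM Quantization Documentation](LLM量化文档.md).
-- `--quant_n_samples`: Quantization parameter. Defaults to `None`, which is set to `256` for awq quantization and `1024` for gptq quantization. With `--quant_method awq`, if OOM occurs during quantization, you can moderately lower `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` usually does not hit OOM during quantization.
+- `--quant_n_samples`: Quantization parameter. Defaults to `256`. With `--quant_method awq`, if OOM occurs during quantization, you can moderately lower `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` usually does not hit OOM during quantization.
 - `--quant_seqlen`: Quantization parameter. Defaults to `2048`.
 - `--quant_device_map`: Defaults to `'cpu'` to save GPU memory. You can specify 'cuda:0', 'auto', 'cpu', etc., indicating the device onto which the model is loaded during quantization.
 - `--push_to_hub`: Defaults to `False`. Whether to push the final `ckpt_dir` to the ModelScope Hub. If you specify `merge_lora`, the full parameters are pushed; if you also specify `quant_bits`, the quantized model is pushed.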
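
The `--quant_bits` entry above describes where the quantized weights end up and which bit widths each method accepts. The following is a minimal sketch of that rule as stated in this document, not swift's actual implementation; `quant_output_dir` and `ALLOWED_BITS` are hypothetical names introduced only for illustration.

```python
from typing import Optional

# Bit widths per method, as listed in the `--quant_bits` description.
ALLOWED_BITS = {'awq': (4,), 'gptq': (2, 3, 4, 8)}


def quant_output_dir(quant_method: str, quant_bits: int,
                     model_type: Optional[str] = None,
                     ckpt_dir: Optional[str] = None) -> str:
    """Return the directory name implied by the documented naming pattern."""
    if quant_method not in ALLOWED_BITS:
        raise ValueError(f'unknown quant_method: {quant_method!r}')
    if quant_bits not in ALLOWED_BITS[quant_method]:
        raise ValueError(
            f'{quant_method} expects quant_bits in {ALLOWED_BITS[quant_method]}')
    suffix = f'{quant_method}-int{quant_bits}'
    if ckpt_dir is not None:
        # Fine-tuned model: a sibling of ckpt_dir with the suffix appended.
        base = ckpt_dir.rstrip('/')
        return f'{base}-{suffix}'
    if model_type is None:
        raise ValueError('either model_type or ckpt_dir is required')
    # Original model: named after the model type.
    return f'{model_type}-{suffix}'


print(quant_output_dir('awq', 4, model_type='qwen1half-7b-chat'))
print(quant_output_dir('gptq', 4, ckpt_dir='/path/to/your/vx-xxx/checkpoint-xxx'))
```

The placeholder path `/path/to/your/vx-xxx/checkpoint-xxx` is the one used in the parameter description and stands in for a real checkpoint directory.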