88
99## Environment Setup
1010
11- Test environment: 8 * Ascend 910B3
11+ Test environment: 8 * Ascend 910B3 64G
1212
1313``` shell
14- pip install ms-swift -U
14+ # Create a new conda virtual environment (optional)
15+ conda create -n npu python=3.10.12 -y
16+ conda activate npu
17+ # Set the global pip mirror (optional, speeds up downloads)
18+ pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
19+
20+ # Install ms-swift (installing from source is currently recommended; once a release is published, it can be installed directly with pip)
21+ git clone https://github.com/modelscope/swift.git
22+ cd swift
23+ pip install -e '.[llm]'
24+ # Install torch-npu
1525pip install torch-npu
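26+ # If you want to use deepspeed (limits memory usage, at some cost in training speed)
27+ pip install deepspeed -U
28+ # datasets==2.19.0 is not backward compatible, so pin version 2.18.0
29+ pip install datasets==2.18.0
30+ # Install a missing dependency
31+ pip install decorator
32+
33+ # Environment alignment (optional, usually not needed. If you hit errors, run the commands below; the repository is tested with the latest environment)
34+ pip install -r requirements/framework.txt -U
35+ pip install -r requirements/llm.txt -U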
36+
1637```
1738
18- Check that the environment is installed correctly:
39+ Check that the environment is installed correctly and that the NPU can be loaded:
1940``` python
2041from transformers.utils import is_torch_npu_available
2142import torch
43+ import torch_npu
44+
45+ torch.randn((10,), device='npu:0')
46+ torch.npu.set_device(0)
2247
2348print(is_torch_npu_available())  # True
2449print(torch.npu.device_count())  # 8
2550```
51+ Check the P2P connections between NPUs. Here, each NPU is connected to the other NPUs through 7 HCCS links.
52+ ``` shell
53+ (valle) root@valle:~/src# npu-smi info -t topo
54+ NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
55+ NPU0 X HCCS HCCS HCCS HCCS HCCS HCCS HCCS 144-167
56+ NPU1 HCCS X HCCS HCCS HCCS HCCS HCCS HCCS 144-167
57+ NPU2 HCCS HCCS X HCCS HCCS HCCS HCCS HCCS 96-119
58+ NPU3 HCCS HCCS HCCS X HCCS HCCS HCCS HCCS 96-119
59+ NPU4 HCCS HCCS HCCS HCCS X HCCS HCCS HCCS 0-23
60+ NPU5 HCCS HCCS HCCS HCCS HCCS X HCCS HCCS 0-23
61+ NPU6 HCCS HCCS HCCS HCCS HCCS HCCS X HCCS 48-71
62+ NPU7 HCCS HCCS HCCS HCCS HCCS HCCS HCCS X 48-71
63+
64+ Legend:
65+
66+ X = Self
67+ SYS = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
68+ PHB = Path traversing PCIe and the PCIe host bridge of a CPU.
69+ PIX = Path traversing a single PCIe switch
70+ PXB = Path traversing multipul PCIe switches
71+ HCCS = Connection traversing HCCS.
72+ NA = Unknown relationship.
73+
74+ ```
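The statement that each NPU reaches every other NPU over HCCS can be checked mechanically. The following is a minimal sketch, assuming the whitespace-separated matrix layout shown above; `count_hccs_links` and `SAMPLE` are hypothetical names, and the sample is a shortened 4-NPU matrix written for illustration, not real command output.

```python
def count_hccs_links(topo_text: str) -> dict[str, int]:
    """Count HCCS entries per NPU row in `npu-smi info -t topo` style text."""
    links = {}
    for line in topo_text.splitlines():
        fields = line.split()
        # Data rows start with an NPU name and contain the self marker "X".
        # The header row has no "X"; legend rows do not start with "NPU".
        if fields and fields[0].startswith("NPU") and "X" in fields:
            links[fields[0]] = fields.count("HCCS")
    return links


# Illustrative 4-NPU matrix (an assumption, not captured output).
SAMPLE = """\
       NPU0 NPU1 NPU2 NPU3 CPU Affinity
NPU0   X    HCCS HCCS HCCS 0-23
NPU1   HCCS X    HCCS HCCS 0-23
NPU2   HCCS HCCS X    HCCS 24-47
NPU3   HCCS HCCS HCCS X    24-47

Legend:
  X    = Self
  HCCS = Connection traversing HCCS.
"""

links = count_hccs_links(SAMPLE)
expected = len(links) - 1  # a fully connected node links to every other NPU
fully_connected = all(n == expected for n in links.values())
print(links, fully_connected)
```

On the 8-card machine above, a fully connected topology would report 7 HCCS links for each of NPU0 through NPU7.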
75+ Check the NPU status. For details, see the
76+ [npu-smi command reference](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668)
77+ ``` shell
78+ (valle) root@valle:~/src# npu-smi info
79+ +------------------------------------------------------------------------------------------------+
80+ | npu-smi 24.1.rc1.b030 Version: 24.1.rc1.b030 |
81+ +---------------------------+---------------+----------------------------------------------------+
82+ | NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
83+ | Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
84+ +===========================+===============+====================================================+
85+ | 0 910B3 | OK | 101.8 43 0 / 0 |
86+ | 0 | 0000:C1:00.0 | 0 0 / 0 3318 / 65536 |
87+ +===========================+===============+====================================================+
88+ | 1 910B3 | OK | 92.0 39 0 / 0 |
89+ | 0 | 0000:C2:00.0 | 0 0 / 0 3314 / 65536 |
90+ +===========================+===============+====================================================+
91+ | 2 910B3 | OK | 102.0 40 0 / 0 |
92+ | 0 | 0000:81:00.0 | 0 0 / 0 3314 / 65536 |
93+ +===========================+===============+====================================================+
94+ | 3 910B3 | OK | 99.8 40 0 / 0 |
95+ | 0 | 0000:82:00.0 | 0 0 / 0 3314 / 65536 |
96+ +===========================+===============+====================================================+
97+ | 4 910B3 | OK | 98.6 45 0 / 0 |
98+ | 0 | 0000:01:00.0 | 0 0 / 0 3314 / 65536 |
99+ +===========================+===============+====================================================+
100+ | 5 910B3 | OK | 99.7 44 0 / 0 |
101+ | 0 | 0000:02:00.0 | 0 0 / 0 3314 / 65536 |
102+ +===========================+===============+====================================================+
103+ | 6 910B3 | OK | 103.8 45 0 / 0 |
104+ | 0 | 0000:41:00.0 | 0 0 / 0 3314 / 65536 |
105+ +===========================+===============+====================================================+
106+ | 7 910B3 | OK | 98.2 44 0 / 0 |
107+ | 0 | 0000:42:00.0 | 0 0 / 0 3315 / 65536 |
108+ +===========================+===============+====================================================+
26109
110+ ```
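To compare idle HBM usage across cards without reading the table by eye, the status output can be parsed. This is a rough sketch that assumes the row layout shown above, where the row carrying the Bus-Id ends with the `HBM-Usage(MB)` pair; `hbm_usage`, `BUS_ID`, `HBM_PAIR`, and `SAMPLE` are hypothetical names, and the layout may differ between `npu-smi` versions.

```python
import re

# A PCI bus id such as 0000:C1:00.0 marks the per-chip row that holds HBM usage.
BUS_ID = re.compile(r"[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]")
# The last "used / total" pair before the closing "|" is the HBM-Usage column.
HBM_PAIR = re.compile(r"(\d+)\s*/\s*(\d+)\s*\|\s*$")


def hbm_usage(info_text: str) -> list[tuple[int, int]]:
    """Return (used_mb, total_mb) for each chip row found in the text."""
    usage = []
    for line in info_text.splitlines():
        if not BUS_ID.search(line):
            continue  # skip borders, headers, and the name/health rows
        match = HBM_PAIR.search(line)
        if match:
            usage.append((int(match.group(1)), int(match.group(2))))
    return usage


# Two cards copied from the output above, with borders omitted.
SAMPLE = """\
| 0 910B3 | OK | 101.8 43 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3318 / 65536 |
| 1 910B3 | OK | 92.0 39 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3314 / 65536 |
"""

usage = hbm_usage(SAMPLE)
for used_mb, total_mb in usage:
    print(f"{used_mb} / {total_mb} MB ({used_mb / total_mb:.1%} used)")
```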
27111## Fine-tuning
28112The following describes LoRA fine-tuning. For full-parameter fine-tuning, set `--sft_type full`.
29113
30-
114+ | Model size | Number of NPUs | DeepSpeed type | Peak memory usage |
115+ | ------| -------| -------------| -----------|
116+ | 7B | 1 | None | 1 * 28 GB |
117+ | 7B | 4 | None | 4 * 22 GB |
118+ | 7B | 4 | zero2 | 4 * 28 GB |
119+ | 7B | 4 | zero3 | 4 * 22 GB |
120+ | 7B | 8 | None | 8 * 22 GB |
121+ | 14B | 1 | None | 1 * 45 GB |
122+ | 14B | 8 | None | 8 * 51 GB |
123+ | 14B | 8 | zero2 | 8 * 49 GB |
124+ | 14B | 8 | zero3 | 8 * 31 GB |
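The table's per-card figures can be turned into total usage and remaining headroom with simple arithmetic. The sketch below assumes 64 GB of HBM per card (from the "910B3 64G" environment line) and copies the per-card peaks from the table; `PEAK_GB` and `summarize` are hypothetical names used for illustration.

```python
HBM_PER_NPU_GB = 64  # assumed per-card capacity, from "910B3 64G"

# (model size, NPU count, deepspeed type) -> peak memory per NPU in GB,
# copied from the table above.
PEAK_GB = {
    ("7B", 1, "None"): 28,
    ("7B", 4, "None"): 22,
    ("7B", 4, "zero2"): 28,
    ("7B", 4, "zero3"): 22,
    ("7B", 8, "None"): 22,
    ("14B", 1, "None"): 45,
    ("14B", 8, "None"): 51,
    ("14B", 8, "zero2"): 49,
    ("14B", 8, "zero3"): 31,
}


def summarize(peak_gb, capacity_gb=HBM_PER_NPU_GB):
    """Compute total memory across cards and per-card headroom for each row."""
    rows = []
    for (model, npus, deepspeed), per_npu in peak_gb.items():
        rows.append({
            "model": model,
            "npus": npus,
            "deepspeed": deepspeed,
            "total_gb": npus * per_npu,          # summed over all cards
            "headroom_gb": capacity_gb - per_npu,  # free memory left per card
        })
    return rows


rows = summarize(PEAK_GB)
for row in rows:
    print(row)
```

For example, the 14B run on 8 cards without DeepSpeed uses 8 * 51 = 408 GB in total and leaves 64 - 51 = 13 GB free on each card.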
31125### Single-card training
32126
33- Start single-card fine-tuning with the following command:
127+ Start single-card fine-tuning with the following command:
34128
35129``` shell
36130# Test environment: Ascend 910B3
37- # Memory requirement: 25GB
131+ # Memory requirement: 28 GB
38132# Runtime: 8 hours
39133ASCEND_RT_VISIBLE_DEVICES=0 \
40134swift sft \
@@ -46,11 +140,11 @@ swift sft \
46140```
47141
48142
49- ### Data-parallel training
143+ ### Data-parallel training, 4-card DDP, qwen1.5-7B-Chat
50144
51145``` shell
52146# Test environment: 4 * Ascend 910B3
53- # Memory requirement: 4 * 30GB
147+ # Memory requirement: 4 * 22 GB
54148# Runtime: 2 hours
55149NPROC_PER_NODE=4 \
56150ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
@@ -69,7 +163,7 @@ ZeRO2:
69163``` shell
70164# Test environment: 4 * Ascend 910B3
71165# Memory requirement: 4 * 28GB
72- # Runtime: 3 hours
166+ # Runtime: 3.5 hours
73167NPROC_PER_NODE=4 \
74168ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
75169swift sft \
@@ -84,8 +178,8 @@ swift sft \
84178ZeRO3:
85179``` shell
86180# Test environment: 4 * Ascend 910B3
87- # Memory requirement: 4 * 25GB
88- # Runtime: 8 hours
181+ # Memory requirement: 4 * 22 GB
182+ # Runtime: 8.5 hours
89183NPROC_PER_NODE=4 \
90184ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
91185swift sft \