
Commit 1a8c099

gpt fused attention and feedforward (#2277)
* gpt add fuse attn ffn
* add fuse args
* update doc
* use FusedFeedForward and FusedMultiHeadAttention from paddle
* pre-commit
1 parent 38efa79 commit 1a8c099

4 files changed: 225 additions (+), 58 deletions (-)

Lines changed: 89 additions & 2 deletions
@@ -1,11 +1,98 @@
 ## Deploying Very Large Models
 
-TBD
+Contents:
+- [Introduction](#introduction)
+- [Environment Setup](#environment-setup)
+- [Model Export](#model-export)
+- [Automatic Partitioning](#automatic-partitioning)
+- [Inference Deployment](#inference-deployment)
+
+Very large models, with their huge parameter counts and heavy GPU/host memory usage, make efficient inference challenging. PaddlePaddle provides an end-to-end deployment solution covering distributed inference, large-model compression, and serving. Distributed inference uses [tensor model parallelism and pipeline parallelism](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/distributed_introduction.html), the same techniques commonly used to train very large models. Inference differs from training in several ways: the hardware characteristics, the number of devices, and the communication environment. To use the inference hardware efficiently, PaddlePaddle's adaptive parallel training technology is also applied to inference, partitioning the model adaptively for the inference hardware topology and environment.
+
+Model compression aims to improve inference efficiency and save deployment resources. PaddleSlim, PaddlePaddle's model compression toolkit, offers a rich set of methods such as quantization and sparsification, which can greatly shrink the model (and thus the number of deployment devices) while also reducing latency and increasing throughput. Compressing very large models still poses challenges. On the algorithm side, these models are usually very deep, so quantization error accumulates, and the high sparsity ratios required by sparsification easily cause accuracy loss. On the tooling side, because very large models occupy a lot of GPU memory, the compression tools must also work with the parallel training techniques, collecting quantization scales and supporting sparse-mask training on top of tensor model parallelism, Sharding parallelism, and pipeline parallelism. On the accuracy/efficiency trade-off side, quantization may target weights only or weights and activations, use 8-bit or 4-bit precision, and be applied partially or to the whole model; sparsification may be unstructured or semi-structured. The strategy must be chosen by weighing accuracy against inference speed and GPU/host memory, and quantized or sparse inference must be supported on top of distributed inference.
+
+For cloud deployment of very large models, PaddlePaddle also provides PaddleServing, which makes it easy to deploy across multiple machines and GPUs and automatically handles request batching, fault-tolerant scheduling, and so on.
+
+This tutorial uses GPT-3 as an example of deploying a very large model, focusing on model export, automatic partitioning, and inference deployment; model compression will be covered later. The serving tutorial is presented with other pretrained models.
+
+### Environment Setup
+
+Version requirements:
+
+Paddle: >= 2.3.0
+
+PaddleNLP: develop branch
+
+The previously released PaddlePaddle Python packages do not include distributed inference, so Paddle currently has to be built from source (this step will be streamlined later). Follow the [build-from-source guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/compile/linux-compile.html); NCCL must be installed, and the cmake command can be set as below. Note the `-DWITH_DISTRIBUTE=ON` option:
+
+```
+cmake .. -DPY_VERSION=3.7 -DWITH_GPU=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_DISTRIBUTE=ON
+```
+
 
 ### Model Export
 
-### Automatic Partitioning
+In PaddleNLP, GPT-3 supports both static-graph and dynamic-graph training. This tutorial is based on the static-graph training code; dynamic-graph support will follow.
+
+First, download PaddleNLP:
+
+```
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/examples/language_model/gpt-3/static/
+```
+
+The export script below defaults to a tensor-model-parallel degree of 1; set it according to the number of GPUs you plan to deploy on:
+
+```
+run_gen.sh
+```
+
+The key parameters are described below; you can also run `python run_generation.py --help` to see the full argument list and help messages.
+
+- gpus: number of GPUs to use, i.e. the degree of parallelism
+- model_type: model type
+- mp_degree: tensor model parallel degree
+- max_seq_len: maximum input length
+- max_dec_len: maximum output (decoding) length
+
+Note: with automatic partitioning, mp_degree does not need to be set; the automatic partitioning content will be added later.
+
+Run `bash run_gen.sh` and the model is exported to the current directory: `inference_model_pp1mp1` when mp_degree is 1, and `inference_model_pp1mp2` when mp_degree is 2.
 
 ### Inference Deployment
+Before deployment, make sure the model has been exported by following the export steps above.
+```
+cd PaddleNLP/examples/language_model/gpt-3/static/
+bash run_gen.sh  # export the model
+```
+Once the model is exported, inference can be run with the high-performance inference script. Taking two-card tensor model parallelism as an example, point `model_path` at the exported model directory and
+use the following command to run high-performance prediction with Paddle Inference:
+```
+cd ../deploy/python
+python -m paddle.distributed.launch \
+    --gpus 0,1 \
+    inference.py --model_type gpt \
+    --model_path ../../static/inference_model_pp1mp2/
+```
+
+#### Serving Deployment
+TBD
+
 
 ### Benchmark
+TBD
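As a complement to the deployment doc above, the following is a minimal, hedged sketch of loading a single-card export (mp_degree=1) with Paddle Inference's Python API, which is what `deploy/python/inference.py` builds on. The file names `gpt.pdmodel`/`gpt.pdiparams` are assumptions for illustration only; check the actual contents of the exported directory, and run multi-card models through the `paddle.distributed.launch` command shown above instead.

```
from paddle.inference import Config, create_predictor

# Hypothetical file names inside the exported directory; inspect
# inference_model_pp1mp1/ after running `bash run_gen.sh` for the real ones.
config = Config("inference_model_pp1mp1/gpt.pdmodel",
                "inference_model_pp1mp1/gpt.pdiparams")
config.enable_use_gpu(1000, 0)   # 1000 MB initial GPU memory pool on GPU 0
config.switch_ir_optim(True)     # enable IR graph optimizations

predictor = create_predictor(config)
# Inspect the feed/fetch interface of the exported program; the provided
# inference.py fills these handles with token ids before calling run().
print(predictor.get_input_names())
print(predictor.get_output_names())
```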

examples/language_model/gpt-3/static/args.py

Lines changed: 6 additions & 1 deletion
@@ -293,9 +293,14 @@ def parse_args(MODEL_CLASSES):
         help="The hyper-parameter in beam search.")
     parser.add_argument(
         "--save_inference_model_then_exist",
-        type=bool,
+        type=str2bool,
         default=False,
         help="save_inference_model_then_exist")
+    parser.add_argument(
+        "--fuse",
+        type=str2bool,
+        default=False,
+        help="Whether to enable fused_attention and fused_feedforward.")
 
     args = parser.parse_args()
     args.test_iters = args.eval_iters * 10
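Both flags are parsed with `str2bool` rather than `bool`, since argparse would otherwise treat any non-empty string (including "False") as true. The helper itself is not shown in this hunk; a typical implementation, sketched here only for reference (the actual one in args.py may differ), looks like this:

```
import argparse

def str2bool(v):
    """Parse common textual booleans so `--fuse false` works as expected."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Unsupported bool value: %s" % v)
```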

examples/language_model/gpt-3/static/modeling.py

Lines changed: 127 additions & 54 deletions
@@ -364,6 +364,8 @@ class TransformerDecoderLayer(nn.Layer):
     It contains multiheadattention and some linear layers.
     """
 
+    Cache = collections.namedtuple("Cache", ["kv"])
+
     def __init__(self,
                  d_model,
                  nhead,
@@ -375,7 +377,8 @@ def __init__(self,
                  normalize_before=True,
                  weight_attr=None,
                  bias_attr=None,
-                 topo=None):
+                 topo=None,
+                 **kwargs):
         self._config = locals()
         self._config.pop("self")
         self._config.pop("__class__", None)  # py3
@@ -388,45 +391,94 @@ def __init__(self,
         weight_attrs = _convert_param_attr_to_list(weight_attr, 3)
         bias_attrs = _convert_param_attr_to_list(bias_attr, 3)
 
-        self.self_attn = MultiHeadAttention(
-            d_model,
-            nhead,
-            dropout=attn_dropout,
-            weight_attr=weight_attrs[0],
-            bias_attr=bias_attrs[0],
-            topo=topo)
-        if topo is None or topo.mp_info.size == 1:
-            self.linear1 = nn.Linear(
+        self._fuse = kwargs.get('fuse', False)
+        if self._fuse:
+            nranks, ring_id = 1, -1
+            if topo is not None and topo.mp_info.size > 1:
+                nranks = topo.mp_info.size
+                ring_id = 0
+            self.self_attn = incubate.nn.FusedMultiHeadAttention(
                 d_model,
-                dim_feedforward,
-                weight_attrs[2],
-                bias_attr=bias_attrs[2])
-            self.linear2 = nn.Linear(
-                dim_feedforward,
+                nhead,
+                dropout_rate=dropout,
+                attn_dropout_rate=attn_dropout,
+                normalize_before=normalize_before,
+                qkv_weight_attr=weight_attrs[0],
+                qkv_bias_attr=bias_attrs[0],
+                linear_weight_attr=weight_attrs[0],
+                linear_bias_attr=bias_attrs[0],
+                epsilon=1e-5,
+                nranks=nranks,
+                ring_id=ring_id)
+            self.ffn = incubate.nn.FusedFeedForward(
                 d_model,
-                weight_attrs[2],
-                bias_attr=bias_attrs[2])
+                dim_feedforward,
+                dropout_rate=act_dropout,
+                epsilon=1e-5,
+                activation=activation,
+                normalize_before=normalize_before,
+                act_dropout_rate=0.0,
+                linear1_weight_attr=weight_attrs[2],
+                linear1_bias_attr=bias_attrs[2],
+                linear2_weight_attr=weight_attrs[2],
+                linear2_bias_attr=bias_attrs[2],
+                nranks=nranks,
+                ring_id=ring_id)
         else:
-            self.linear1 = paddlenlp.ops.ColumnParallelLiner(
-                (d_model, dim_feedforward),
-                topo.mp_info.size,
-                gather_out=False,
-                param_attr=weight_attrs[2],
-                bias_attr=bias_attrs[2])
-            self.linear2 = paddlenlp.ops.RowParallelLiner(
-                (dim_feedforward, d_model),
-                topo.mp_info.size,
-                input_is_parallel=True,
-                param_attr=weight_attrs[2],
-                bias_attr=bias_attrs[2])
+            self.self_attn = MultiHeadAttention(
+                d_model,
+                nhead,
+                dropout=attn_dropout,
+                weight_attr=weight_attrs[0],
+                bias_attr=bias_attrs[0],
+                topo=topo)
+            if topo is None or topo.mp_info.size == 1:
+                self.linear1 = nn.Linear(
+                    d_model,
+                    dim_feedforward,
+                    weight_attrs[2],
+                    bias_attr=bias_attrs[2])
+                self.linear2 = nn.Linear(
+                    dim_feedforward,
+                    d_model,
+                    weight_attrs[2],
+                    bias_attr=bias_attrs[2])
+            else:
+                self.linear1 = paddlenlp.ops.ColumnParallelLiner(
+                    (d_model, dim_feedforward),
+                    topo.mp_info.size,
+                    gather_out=False,
+                    param_attr=weight_attrs[2],
+                    bias_attr=bias_attrs[2])
+                self.linear2 = paddlenlp.ops.RowParallelLiner(
+                    (dim_feedforward, d_model),
+                    topo.mp_info.size,
+                    input_is_parallel=True,
+                    param_attr=weight_attrs[2],
+                    bias_attr=bias_attrs[2])
 
-        self.norm1 = nn.LayerNorm(d_model, epsilon=1e-5)
-        self.norm2 = nn.LayerNorm(d_model, epsilon=1e-5)
-        self.dropout1 = nn.Dropout(dropout, mode="upscale_in_train")
-        self.dropout2 = nn.Dropout(act_dropout, mode="upscale_in_train")
-        self.activation = getattr(F, activation)
+            self.norm1 = nn.LayerNorm(d_model, epsilon=1e-5)
+            self.norm2 = nn.LayerNorm(d_model, epsilon=1e-5)
+            self.dropout1 = nn.Dropout(dropout, mode="upscale_in_train")
+            self.dropout2 = nn.Dropout(act_dropout, mode="upscale_in_train")
+            self.activation = getattr(F, activation)
 
     def forward(self, tgt, memory, tgt_mask=None, use_cache=False, cache=None):
+        if self._fuse:
+            if isinstance(cache, self.Cache):
+                attn_output, cache_kv_out = self.self_attn(
+                    tgt, attn_mask=tgt_mask, cache=cache.kv)
+
+                ## if not assign here, update caches in While loop
+                # layers.assign(cache_kv_out, cache.kv)
+                if use_cache:
+                    cache = self.Cache(cache_kv_out)
+            else:
+                attn_output = self.self_attn(tgt, attn_mask=tgt_mask)
+
+            enc_out = self.ffn(attn_output)
+            return (enc_out, cache) if use_cache else enc_out
+
         residual = tgt
 
         if self.normalize_before:
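For context on the fused layers introduced above, here is a minimal, self-contained sketch of `paddle.incubate.nn.FusedMultiHeadAttention` and `FusedFeedForward` running on random data. The sizes are made up, the configuration is not the exact one modeling.py builds, and the fused operators require a CUDA build of Paddle (>= 2.3.0, as the deployment doc notes):

```
import paddle
from paddle.incubate.nn import FusedFeedForward, FusedMultiHeadAttention

# Toy sizes for illustration only.
batch, seq_len, d_model, nhead, dim_ffn = 2, 8, 64, 4, 256

# Each fused layer bundles pre-LayerNorm, the projections, dropout and the
# residual add into fused GPU kernels, replacing the separate nn.Linear /
# nn.LayerNorm / nn.Dropout sublayers of the non-fused branch.
attn = FusedMultiHeadAttention(
    d_model, nhead, dropout_rate=0.0, attn_dropout_rate=0.0,
    normalize_before=True)
ffn = FusedFeedForward(
    d_model, dim_ffn, dropout_rate=0.0, activation="gelu",
    normalize_before=True)

x = paddle.randn([batch, seq_len, d_model])
y = ffn(attn(x))   # one decoder block, minus the attention mask
print(y.shape)     # [2, 8, 64]
```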
@@ -687,7 +739,8 @@ def __init__(self,
                  eos_token_id=7,
                  bos_token_id=0,
                  eol_token_id=3,
-                 topo=None):
+                 topo=None,
+                 **kwargs):
         super(GPTModel, self).__init__()
 
         self.pad_token_id = pad_token_id
@@ -727,7 +780,8 @@ def __init__(self,
                     initializer=nn.initializer.Normal(
                         mean=0.0, std=self.initializer_range)),
                 bias_attr=None,
-                topo=topo))
+                topo=topo,
+                fuse=kwargs.get('fuse', False)))
 
         if self.pipline_mode:
             Decoder = paddlenlp.ops.guard('gpu:{}'.format(
@@ -866,7 +920,8 @@ def __init__(self,
                  temperature=1.0,
                  top_k=0,
                  top_p=1.0,
-                 eos_id=None):
+                 eos_id=None,
+                 **kwargs):
         super(GPTForGeneration, self).__init__()
         self.gpt = gpt
         self.apply(self.init_weights)
@@ -879,32 +934,43 @@ def __init__(self,
         self.temperature = temperature
         self.topk = top_k
         self.topp = top_p
-        self._fuse = False
         self._init_gen_cache = False
-        self.generation_caches = []
+        self.generation_caches = None
         self._dtype = "float32"
+        self._fuse = kwargs.get("fuse", False)
 
     def _init_generation_caches(self, src_ids):
-        if self._init_gen_cache:
+        # not fuse, return None
+        if self._init_gen_cache or self._fuse is False:
            return self.generation_caches
 
+        self.generation_caches = []
         num_heads = self.gpt.num_attention_heads
         num_layers = self.gpt.num_hidden_layers
         mp_n_head = num_heads // self.gpt.topo.mp_info.size
         hidden_size = self.gpt.hidden_size
         head_size = hidden_size // num_heads
         for i in range(num_layers):
-            k = layers.fill_constant_batch_size_like(
-                input=src_ids,
-                shape=[-1, mp_n_head, 0, head_size],
-                dtype=self._dtype,
-                value=0)
-            v = layers.fill_constant_batch_size_like(
-                input=src_ids,
-                shape=[-1, mp_n_head, 0, head_size],
-                dtype=self._dtype,
-                value=0)
-            self.generation_caches.append(MultiHeadAttention.Cache(k, v))
+            if self._fuse:
+                kv = layers.fill_constant_batch_size_like(
+                    input=src_ids,
+                    shape=[2, -1, mp_n_head, 0, head_size],
+                    dtype=self._dtype,
+                    value=0,
+                    output_dim_idx=1)
+                self.generation_caches.append(TransformerDecoderLayer.Cache(kv))
+            else:
+                k = layers.fill_constant_batch_size_like(
+                    input=src_ids,
+                    shape=[-1, mp_n_head, 0, head_size],
+                    dtype=self._dtype,
+                    value=0)
+                v = layers.fill_constant_batch_size_like(
+                    input=src_ids,
+                    shape=[-1, mp_n_head, 0, head_size],
+                    dtype=self._dtype,
+                    value=0)
+                self.generation_caches.append(MultiHeadAttention.Cache(k, v))
         self._init_gen_cache = True
         return self.generation_caches
 
@@ -1011,10 +1077,14 @@ def forward(self, inputs, use_cache=False, cache=None):
 
         # if cached_kvs are assigned to next step in _prepare_qkv of MultiHeadAttention,
         # need to init the global caches here
-        #gen_caches = self._init_generation_caches(input_ids)
+        gen_caches = self._init_generation_caches(input_ids)
 
         logits, cached_kvs = self.model(
-            input_ids, position_ids, encode_mask, use_cache=True)
+            input_ids,
+            position_ids,
+            encode_mask,
+            use_cache=True,
+            cache=gen_caches)
 
         next_id = paddle.argmax(logits[:, -1, :], axis=-1).reshape([-1, 1])
         ####################################
@@ -1092,7 +1162,10 @@ def forward(self, inputs, use_cache=False, cache=None):
             paddle.assign(layers.cast(cond, dtype='bool'), cond)
             if attention_mask:
                 paddle.assign(decode_mask, attention_mask)
-                for i in range(len(decode_cached_kvs)):
+            for i in range(len(decode_cached_kvs)):
+                if self._fuse:
+                    paddle.assign(decode_cached_kvs[i].kv, cached_kvs[i].kv)
+                else:
                     paddle.assign(decode_cached_kvs[i].k, cached_kvs[i].k)
                     paddle.assign(decode_cached_kvs[i].v, cached_kvs[i].v)
 
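The main data-structure change above is the generation cache: the non-fused path stores separate K and V tensors per layer (`MultiHeadAttention.Cache`), while the fused path stores a single stacked KV tensor per layer (`TransformerDecoderLayer.Cache`), the layout `FusedMultiHeadAttention` expects for its `cache` argument. A small sketch with made-up sizes:

```
import paddle

# Made-up sizes for illustration.
batch, num_heads, head_size, cached_len = 2, 16, 64, 1

# Non-fused cache: MultiHeadAttention.Cache(k, v), each tensor shaped
# [batch, num_heads, cached_len, head_size].
k = paddle.zeros([batch, num_heads, cached_len, head_size])
v = paddle.zeros([batch, num_heads, cached_len, head_size])

# Fused cache: TransformerDecoderLayer.Cache(kv), one tensor shaped
# [2, batch, num_heads, cached_len, head_size] -- the same layout
# _init_generation_caches builds with fill_constant_batch_size_like.
kv = paddle.stack([k, v], axis=0)
print(kv.shape)  # [2, 2, 16, 1, 64]
```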

examples/language_model/gpt-3/static/run_generation.py

Lines changed: 3 additions & 1 deletion
@@ -203,14 +203,16 @@ def do_generation(args):
         model_config[
             "attention_probs_dropout_prob"] = args.attention_probs_dropout_prob
         model_config["topo"] = topo
+        model_config["fuse"] = args.fuse
         model = GPTForGeneration(
             GPTModel(**model_config),
             max_length=args.max_dec_len,
             decoding_strategy=args.decoding_strategy,
             temperature=args.temperature,
             top_k=args.topk,
             top_p=args.topp,
-            eos_id=eos_id)
+            eos_id=eos_id,
+            fuse=args.fuse)
     else:
         logger.error("No checkpoint load.")
     model.eval()
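Because `--fuse` relies on operators that only exist in a CUDA build of a sufficiently new Paddle, a quick sanity check along these lines (not part of this commit) can be run before enabling the flag:

```
import paddle

assert paddle.is_compiled_with_cuda(), "fused attention/FFN need a CUDA build of Paddle"

# Both layers live under paddle.incubate.nn; an ImportError here means the
# installed Paddle is too old to support --fuse.
from paddle.incubate.nn import FusedFeedForward, FusedMultiHeadAttention  # noqa: F401

print("Paddle", paddle.__version__, "- fused attention/FFN available")
```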

0 commit comments
