Commit 41e9c77

Merge branch 'master' of https://github.com/PaddlePaddle/PaddleRec into maml
2 parents d2061aa + 064a118 commit 41e9c77

74 files changed: +4459 -259 lines changed

doc/fleet_mode.md

Lines changed: 80 additions & 5 deletions
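@@ -1,12 +1,13 @@
 # Introduction to Distributed Modes
 
-Once the model or data size hits the bottleneck of single-machine training, distributed training becomes the inevitable choice. PaddleRec currently provides two distributed training modes.
-Parameter server: a parallel training approach commonly used in recommender systems. The ParameterServer mode provides parameter-server-based distributed training .
-Multi-machine GPU training: if you want to train on multiple machines with multiple GPUs, the Collective mode provides single-machine multi-GPU and multi-machine multi-GPU training with PaddlePaddle.
-This tutorial explains how to use the two modes above. To learn more about Paddle's distributed training, see [Introduction to Distributed Deep Learning](ps_background.md)
+Once the model or data size hits the bottleneck of single-machine training, distributed training becomes the inevitable choice. PaddleRec currently provides three distributed training modes.
+Parameter server: a parallel training approach commonly used in recommender systems. The ParameterServer mode provides parameter-server-based distributed training.
+Multi-machine GPU training: if you want to train on multiple machines with multiple GPUs, the Collective mode provides single-machine multi-GPU and multi-machine multi-GPU training with PaddlePaddle.
+GPU parameter server (GPUBox): if your recommendation task has large sparse parameters and the GPU Collective mode cannot meet your performance or GPU memory requirements, we recommend the new GPU parameter server training mode, which implements parameter-server-based distributed training on multi-level GPU and CPU storage.
+This tutorial explains how to use the three modes above. To learn more about Paddle's distributed training, see [Introduction to Distributed Deep Learning](ps_background.md)
 
 ## Version Requirements
-Before writing a distributed training program, make sure that paddlepaddle-2.0.0-rc-cpu, paddlepaddle-2.0.0-rc-gpu, or a later version of the PaddlePaddle open-source framework is installed.
+Before writing a distributed training program, make sure that paddlepaddle-2.0.0-rc-cpu, paddlepaddle-2.0.0-rc-gpu, or a later version of the PaddlePaddle open-source framework is installed.
 
 ## ParameterServer Mode
 Distributed training emerged to improve training efficiency, and parameter-server-based distributed training is a common way to synchronize parameters through centralized sharing. Unlike single-machine training, each node in parameter-server training plays a distinct role: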
@@ -114,3 +115,77 @@ python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2
 # Run training with the static graph
 python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2,3,4,5,6,7 ../../../tools/static_trainer.py -m config.yaml
 ```
+
+## GPU Parameter Server (GPUBox) Mode
+If your recommendation task has large sparse parameters and the GPU Collective mode cannot meet your performance or GPU memory requirements, we recommend the new GPU parameter server training mode. For the principles and usage, see [GPUBox principles and usage](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/parameter_server/performance/heterps.html)
+
+Starting distributed training in GPUBox mode on PaddleRec takes four steps:
+1. Add the distributed-training parameters to the yaml configuration
+2. Change the reader type
+3. Change the embedding used by the network
+4. Pass the relevant configuration in the launch command and start training
+
+### Add the yaml configuration
+Compared with single-machine mode, GPUBox mode needs some extra configuration. First, add the use_fleet parameter to the model's yaml configuration and set it to True.
+Also set use_gpu to True and sync_mode to gpubox
+```yaml
+runner:
+  # common settings omitted
+  ...
+  # use fleet
+  use_fleet: True
+  use_gpu: True
+  sync_mode: "gpubox"
+```
+### Change the reader
+GPUBox mode currently supports only the InmemoryDataset mode. You can change reader_type in the yaml configuration
+```yaml
+runner:
+  # common settings omitted
+  ...
+  reader_type: "InmemoryDataset"
+
+```
+
+### Change the embedding used by the network
+The embedding interface used by GPUBox mode is currently incompatible with the other modes, so change the embedding interface in net.py under models/:
+```python
+def forward(self, sparse_inputs, dense_inputs):
+
+    sparse_embs = []
+    for s_input in sparse_inputs:
+        if self.sync_mode == "gpubox":
+            emb = paddle.fluid.contrib.sparse_embedding(
+                input=s_input,
+                size=[
+                    self.sparse_feature_number, self.sparse_feature_dim
+                ],
+                param_attr=paddle.ParamAttr(name="embedding"))
+        else:
+            emb = self.embedding(s_input)
+        emb = paddle.reshape(emb, shape=[-1, self.sparse_feature_dim])
+        sparse_embs.append(emb)
+
+    # rest omitted ....
+```
+
+### Single-machine GPU launch command
+The following uses the dnn model as an example to show how to start training. The command can be run from any directory; the command below assumes the repository root:
+```bash
+sh tools/run_gpubox.sh
+
+```
+
+The parameters in run_gpubox.sh that you need to check and set are:
+```bash
+
+# set free port if 29011 is occupied
+export PADDLE_PSERVERS_IP_PORT_LIST="127.0.0.1:29011"
+export PADDLE_PSERVER_PORT_ARRAY=(29011)
+
+# set gpu numbers according to your device
+export FLAGS_selected_gpus="0,1,2,3,4,5,6,7"
+
+# set your model yaml
+SC="tools/static_gpubox_trainer.py -m models/rank/dnn/config_gpubox.yaml"
+```
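
The GPUBox settings that the fleet_mode.md changes above ask for can be collected into a small sanity check. This is a minimal sketch, not part of PaddleRec: `check_gpubox_config` and `REQUIRED_GPUBOX_SETTINGS` are hypothetical names, and the sketch assumes the `runner` section of the yaml has already been loaded into a plain dict.

```python
# Hypothetical helper (not a PaddleRec API): checks that a loaded `runner`
# section carries the four settings the GPUBox documentation asks for.
REQUIRED_GPUBOX_SETTINGS = {
    "use_fleet": True,
    "use_gpu": True,
    "sync_mode": "gpubox",
    "reader_type": "InmemoryDataset",
}


def check_gpubox_config(runner):
    """Return a list of readable problems; an empty list means all four match."""
    problems = []
    for key, expected in REQUIRED_GPUBOX_SETTINGS.items():
        actual = runner.get(key)
        if actual != expected:
            problems.append(
                "{} should be {!r}, got {!r}".format(key, expected, actual))
    return problems


# Example runner section with one deliberately wrong value.
runner = {
    "use_fleet": True,
    "use_gpu": True,
    "sync_mode": "gpubox",
    "reader_type": "QueueDataset",
}
for problem in check_gpubox_config(runner):
    print(problem)
```

A check like this only covers the yaml side; the embedding change in net.py and the environment variables in run_gpubox.sh still have to be set by hand.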
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+import yaml
+import six
+import os
+import copy
+import paddle.distributed.fleet as fleet
+import logging
+
+logging.basicConfig(
+    format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+class Reader(fleet.MultiSlotStringDataGenerator):
+    def init(self, config):
+        self.config = config
+        self.neg_num = self.config.get("hyper_parameters.neg_num")
+
+    def line_process(self, line):
+        data = line.rstrip('\n').split('\t')
+        ins_id = [data[0]]
+        content = [data[1]]
+        features = data[2:]
+        query = features[0].split(',')
+        pos_doc = features[1].split(',')
+
+        neg_doc_list = []
+        for i in range(self.neg_num):
+            neg_doc_list.append(features[i + 2].split(','))
+
+        return [ins_id, content, query, pos_doc] + neg_doc_list
+
+    def generate_sample(self, line):
+        "Dataset Generator"
+
+        def reader():
+            input_data = self.line_process(line)
+            feature_name = ["insid", "content", "query", "pos_doc"]
+            for i in range(self.neg_num):
+                feature_name.append("neg_doc_{}".format(i))
+            yield zip(feature_name, input_data)
+
+        return reader
+
+    def dataloader(self, file_list):
+        "DataLoader Pyreader Generator"
+
+        def reader():
+            for file in file_list:
+                with open(file, 'r') as f:
+                    for line in f:
+                        input_data = self.line_process(line)
+                        yield input_data
+
+        return reader
+
+
+if __name__ == "__main__":
+    yaml_path = sys.argv[1]
+    utils_path = sys.argv[2]
+    sys.path.append(utils_path)
+    import common
+    yaml_helper = common.YamlHelper()
+    config = yaml_helper.load_yaml(yaml_path)
+
+    r = Reader()
+    r.init(config)
+    # r.init(None)
+    r.run_from_stdin()
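
The input format that `line_process` expects is easier to see without the Paddle dependency. The sketch below mirrors the parsing logic of the reader above as standalone functions so it can be tried on a single line. `parse_line` and `slot_names` are illustrative names that do not appear in the commit, and the sample values are made up.

```python
def parse_line(line, neg_num):
    """Standalone mirror of Reader.line_process: split one tab-separated line."""
    data = line.rstrip('\n').split('\t')
    ins_id = [data[0]]      # first column: unique sample id
    content = [data[1]]     # second column: tag naming the doc
    features = data[2:]     # remaining columns: query, pos_doc, neg_docs
    query = features[0].split(',')
    pos_doc = features[1].split(',')
    neg_docs = [features[i + 2].split(',') for i in range(neg_num)]
    return [ins_id, content, query, pos_doc] + neg_docs


def slot_names(neg_num):
    """Slot names in the same order that generate_sample pairs them with values."""
    return ["insid", "content", "query", "pos_doc"] + [
        "neg_doc_{}".format(i) for i in range(neg_num)
    ]


# Made-up sample: id, content tag, query, positive doc, one negative doc.
sample = "1\titem_1\t0.1,0.2\t0.3,0.4\t0.5,0.6\n"
for name, values in zip(slot_names(1), parse_line(sample, 1)):
    print(name, values)
```

Note that every value stays a string, which matches the reader's use of `MultiSlotStringDataGenerator`.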
Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
runner:
16+
train_data_dir: "data/train_with_insid"
17+
test_data_dir: "data/test_with_insid"
18+
# train_reader_path: "bq_reader_train" # importlib format
19+
days: "{20210803..20210804}"
20+
pass_per_day: 1
21+
train_batch_size: 8
22+
test_batch_size: 8
23+
model_save_path: "output_model_dssm"
24+
25+
reader_type: "InmemoryDataset" # DataLoader / QueueDataset / RecDataset / InmemoryDataset
26+
pipe_command: "python3 bq_reader_train_insid.py"
27+
28+
sync_mode: "async"
29+
# thread_num: 1
30+
train_thread_num: 1
31+
test_thread_num: 1
32+
33+
use_gpu: False
34+
epochs: 1
35+
print_interval: 1
36+
37+
dataset_debug: False
38+
39+
# when you need to prune net, please set need_prune to True,
40+
# and need to set prune_feed_vars and prune_target_var in static_model.py
41+
need_prune: True
42+
43+
parse_ins_id: True
44+
parse_content: True
45+
46+
# when you need to dump fileds and params in training, please set need_train_dump to True,
47+
# and need to set train_dump_fields and train_dump_params in static_model.py
48+
need_train_dump: True
49+
# train_dump_fields_dir: "afs:/xxx"
50+
train_dump_fields_dir: "./train_dump_data"
51+
52+
# when you need to dump fileds in inference, please set need_infer_dump to True,
53+
# and need to set infer_dump_fields in static_model.py
54+
need_infer_dump: True
55+
# infer_dump_fields_dir: "afs:/xxx"
56+
infer_dump_fields_dir: "./infer_dump_data"
57+
58+
fs_name: "afs://xxx"
59+
fs_ugi: "xxx,xxx"
60+
61+
hyper_parameters:
62+
optimizer:
63+
class: adam
64+
learning_rate: 0.001
65+
strategy: sync
66+
trigram_d: 2900
67+
neg_num: 1
68+
slice_end: 8
69+
fc_sizes: [300, 300, 128]
70+
fc_acts: ['relu', 'relu', 'relu']
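
The `days` and `pass_per_day` settings above line up with the `<day>/<pass>` directories that the data preparation script in this commit creates. The sketch below enumerates those directories. It rests on assumptions: the commit does not show how PaddleRec expands `days`, so `expand_days` and `pass_dirs` are hypothetical helpers that treat the value as an inclusive calendar-date range and number passes from 1.

```python
import re
from datetime import datetime, timedelta


def expand_days(spec):
    """Expand '{YYYYMMDD..YYYYMMDD}' into an inclusive list of day strings.

    Assumption: the range is read as calendar dates, not as plain integers.
    """
    match = re.fullmatch(r"\{(\d{8})\.\.(\d{8})\}", spec)
    if match is None:
        raise ValueError("unsupported days spec: {!r}".format(spec))
    current = datetime.strptime(match.group(1), "%Y%m%d")
    end = datetime.strptime(match.group(2), "%Y%m%d")
    days = []
    while current <= end:
        days.append(current.strftime("%Y%m%d"))
        current += timedelta(days=1)
    return days


def pass_dirs(data_dir, days_spec, pass_per_day):
    """List <data_dir>/<day>/<pass> for every day and every pass of that day."""
    return ["{}/{}/{}".format(data_dir, day, p)
            for day in expand_days(days_spec)
            for p in range(1, pass_per_day + 1)]


for path in pass_dirs("data/train_with_insid", "{20210803..20210804}", 1):
    print(path)
```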
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
#!/bin/bash
15+
16+
17+
cat train/train.txt | awk -F'\t' 'BEGIN{OFS="\t"}{print NR, "item_"NR, $0}' > data_with_lineid
18+
for i in 20210803 20210804
19+
do
20+
for j in 1
21+
do
22+
mkdir -p train_with_insid/$i/$j
23+
cp data_with_lineid train_with_insid/$i/$j
24+
mkdir -p test_with_insid/$i/$j
25+
cp data_with_lineid test_with_insid/$i/$j
26+
done
27+
done
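
The `awk` one-liner in the script above prepends two columns to every training line: the line number, which serves as the insid, and an `item_<n>` tag, which serves as the content. For readers less familiar with `awk`, a Python sketch of the same transformation follows. `add_ins_id_and_content` is an illustrative name and the sample rows are made up.

```python
def add_ins_id_and_content(lines):
    """Prefix each line with its 1-based number and an 'item_<n>' tag, tab-separated."""
    for number, line in enumerate(lines, start=1):
        yield "{}\titem_{}\t{}".format(number, number, line.rstrip("\n"))


# Made-up rows in the query / pos_doc / neg_doc layout.
raw = ["0.1,0.2\t0.3,0.4\t0.5,0.6\n", "0.7,0.8\t0.9,1.0\t1.1,1.2\n"]
for row in add_ins_id_and_content(raw):
    print(row)
```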

models/match/dssm/net.py

Lines changed: 4 additions & 0 deletions
@@ -71,10 +71,14 @@ def forward(self, input_data, is_infer):
         query_fc = input_data[0]
         for n_layer in self._query_layers:
             query_fc = n_layer(query_fc)
+        self.query_fc = query_fc
 
         doc_pos_fc = input_data[1]
         for n_layer in self._doc_layers:
             doc_pos_fc = n_layer(doc_pos_fc)
+        self.doc_pos_fc = doc_pos_fc
+
+        self.params = [self._query_layers[-2].bias]
 
         R_Q_D_p = F.cosine_similarity(
             query_fc, doc_pos_fc, axis=1).reshape([-1, 1])

models/match/dssm/readme.md

Lines changed: 27 additions & 0 deletions
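@@ -105,5 +105,32 @@ bash run.sh # train and test with the dynamic graph, then report the metrics
 ```
 
 ## Advanced Usage
+As a vector-recall method in recommender systems, DSSM usually requires the doc-side vectors to be computed in advance and loaded into a vector search engine (for example, milvus), while the saved model contains only the query side. At serving time, the query-side data is fed in, the query-side vector is computed, and the matching docs are recalled directly through the vector search engine.
+Typically, an inference stage is added to the training process to dump all doc-side vectors, which requires the following changes:
+1. To tell the dumped vectors apart, the data used in the inference stage needs two extra fields, insid and content: insid uniquely identifies a sample and content identifies the corresponding doc. Parse both fields in the data processing script; see bq_reader_train_insid.py for details.
+2. Choose InmemoryDataset as the dataset and set
+```python
+dataset.set_parse_ins_id(True)
+dataset.set_parse_content(True)
+```
+3. In static_model.py, configure the variables to dump (the top-level doc-side output)
+```python
+self.infer_dump_fields = [dssm_model.doc_pos_fc]
+```
+4. In the configuration file, enable dumping in the inference stage and configure the dump path
+```bash
+need_infer_dump: True
+infer_dump_fields_dir: "./infer_dump_data"
+```
+When saving the model, only the query-side network needs to be saved
+1. In the configuration file, turn on the network pruning switch
+```bash
+need_prune: True
+```
+2. In static_model.py, configure the inputs and outputs of the pruned network
+```python
+self.prune_feed_vars = [query]
+self.prune_target_var = dssm_model.query_fc
+```
 
 ## FAQ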
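
The serving flow that the readme change describes (precomputed doc-side vectors, a query-side vector computed online, recall by similarity) can be illustrated with a brute-force search. This is a sketch under stated assumptions, not how milvus or PaddleRec works internally: the dict `doc_index` stands in for the vector search engine, the vectors are made up, and `cosine` and `recall` are illustrative names. Cosine similarity is used because net.py scores query and doc with `F.cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall(query_vec, doc_index, top_k=1):
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, cosine(query_vec, vec))
              for doc_id, vec in doc_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in for dumped doc-side vectors, keyed by the `content` field.
doc_index = {
    "item_1": [1.0, 0.0],
    "item_2": [0.0, 1.0],
    "item_3": [1.0, 1.0],
}
print(recall([0.9, 0.1], doc_index, top_k=2))
```

A real deployment would replace the linear scan with the search engine's own index, which is what makes recall over a large doc set practical.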

models/match/dssm/static_model.py

Lines changed: 5 additions & 0 deletions
@@ -36,6 +36,7 @@ def _init_hyper_parameters(self):
     def create_feeds(self, is_infer=False):
         query = paddle.static.data(
             name="query", shape=[-1, self.trigram_d], dtype='float32')
+        self.prune_feed_vars = [query]
 
         doc_pos = paddle.static.data(
             name="doc_pos", shape=[-1, self.trigram_d], dtype='float32')
@@ -58,6 +59,10 @@ def net(self, input, is_infer=False):
         R_Q_D_p, hit_prob = dssm_model(input, is_infer)
 
         self.inference_target_var = R_Q_D_p
+        self.prune_target_var = dssm_model.query_fc
+        self.train_dump_fields = [dssm_model.query_fc, R_Q_D_p]
+        self.train_dump_params = dssm_model.params
+        self.infer_dump_fields = [dssm_model.doc_pos_fc]
         if is_infer:
             fetch_dict = {'query_doc_sim': R_Q_D_p}
             return fetch_dict

models/match/readme.md

Lines changed: 1 addition & 2 deletions
@@ -1,8 +1,7 @@
 # Matching Model Library
 
 ## Introduction
-We provide PaddleRec implementations of model algorithms commonly used in matching tasks, including single-machine training and inference metrics for both dynamic and static graphs. The implemented models include [DSSM](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/dssm)[MultiView-Simnet](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/multiview-simnet)[match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid)
-
+We provide PaddleRec implementations of model algorithms commonly used in matching tasks, including single-machine training and inference metrics for both dynamic and static graphs. The implemented models include [DSSM](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/dssm)[MultiView-Simnet](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/multiview-simnet)[match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid)
 The model library is continuously growing; stay tuned.
 
 ## Contents

models/multitask/esmm/readme.md

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ python -u ../../../tools/static_infer.py -m config.yaml
 ESMM was published at SIGIR'2018 in the paper [Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate]( https://arxiv.org/abs/1804.07931 ). Building on multi-task learning, the paper proposes a new CVR estimation model, ESMM, which effectively addresses two key problems that CVR estimation faces in real-world scenarios: data sparsity and sample selection bias. The main network structure of the model is as follows:
 [ESMM](https://arxiv.org/abs/1804.07931):
 <p align="center">
-<img align="center" src="../../doc/imgs/esmm.png">
+<img align="center" src="../../../doc/imgs/esmm.png">
 <p>
 
 ### Reproducing the Results

models/multitask/maml/config_bigdata.yaml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ runner:
   infer_batch_size: 32
   infer_load_path: "output_model_all_maml"
   infer_start_epoch: 90
-  infer_end_epoch: 91
+  infer_end_epoch: 100
 
 # hyper parameters of user-defined network
 hyper_parameters:
