PaddlePaddle
diff --git a/doc/fleet_mode.md b/doc/fleet_mode.md
Lines changed: 80 additions & 5 deletions
diff --git a/models/match/dssm/bq_reader_train_insid.py b/models/match/dssm/bq_reader_train_insid.py
Lines changed: 82 additions & 0 deletions
diff --git a/models/match/dssm/config_online.yaml b/models/match/dssm/config_online.yaml
Lines changed: 70 additions & 0 deletions
diff --git a/models/match/dssm/data/prepare_dump_data.sh b/models/match/dssm/data/prepare_dump_data.sh
Lines changed: 27 additions & 0 deletions
diff --git a/models/match/dssm/net.py b/models/match/dssm/net.py
Lines changed: 4 additions & 0 deletions
diff --git a/models/match/dssm/readme.md b/models/match/dssm/readme.md
Lines changed: 27 additions & 0 deletions
diff --git a/models/match/dssm/static_model.py b/models/match/dssm/static_model.py
Lines changed: 5 additions & 0 deletions
diff --git a/models/match/readme.md b/models/match/readme.md
Lines changed: 1 addition & 2 deletions
diff --git a/models/multitask/esmm/readme.md b/models/multitask/esmm/readme.md
Lines changed: 1 addition & 1 deletion
diff --git a/models/multitask/maml/config_bigdata.yaml b/models/multitask/maml/config_bigdata.yaml
Lines changed: 1 addition & 1 deletion
@@ -1,12 +1,13 @@
 # Introduction to Distributed Training Modes
 
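-Once the model or the data outgrows what a single machine can train, distributed training becomes the natural choice. PaddleRec currently provides two distributed training modes.  
-Parameter server: a parallel training approach widely used in recommender systems. The ParameterServer mode provides distributed training based on parameter servers.  
-Multi-machine GPU training: if you want to train on multiple machines with multiple GPUs, the Collective mode supports single-machine multi-GPU and multi-machine multi-GPU training with PaddlePaddle.  
-This tutorial explains how to use the two modes above. To learn more about Paddle's distributed training, see [Introduction to Distributed Deep Learning](ps_background.md).
+Once the model or the data outgrows what a single machine can train, distributed training becomes the natural choice. PaddleRec currently provides three distributed training modes.  
+Parameter server: a parallel training approach widely used in recommender systems. The ParameterServer mode provides distributed training based on parameter servers.
+Multi-machine GPU training: if you want to train on multiple machines with multiple GPUs, the Collective mode supports single-machine multi-GPU and multi-machine multi-GPU training with PaddlePaddle. 
+GPU parameter server (GPUBox): if the sparse parameters in your recommendation task are large and the GPU Collective mode cannot meet your performance or GPU memory requirements, we recommend the newer GPU parameter server mode, which uses multi-level storage across GPU and CPU to implement parameter-server-based distributed training.
+This tutorial explains how to use the three modes above. To learn more about Paddle's distributed training, see [Introduction to Distributed Deep Learning](ps_background.md).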
 
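 ## Version Requirements
-Before writing a distributed training program, make sure you have installed the PaddlePaddle open-source framework, version paddlepaddle-2.0.0-rc-cpu or paddlepaddle-2.0.0-rc-gpu or later.  
+Before writing a distributed training program, make sure you have installed the PaddlePaddle open-source framework, version paddlepaddle-2.0.0-rc-cpu or paddlepaddle-2.0.0-rc-gpu or later.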
 
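 ## ParameterServer Mode
 Distributed training was introduced to improve training efficiency, and parameter-server-based training is a common approach that synchronizes through centrally shared parameters. Unlike single-machine training, each node in parameter server training plays a distinct role:  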
@@ -114,3 +115,77 @@ python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2
 # Run training with a static graph
 python -m paddle.distributed.launch --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus 0,1,2,3,4,5,6,7 ../../../tools/static_trainer.py -m config.yaml
 ```
+
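+## GPU Parameter Server (GPUBox) Mode
+If the sparse parameters in your recommendation task are large and the GPU Collective mode cannot meet your performance or GPU memory requirements, we recommend the newer GPU parameter server mode. For the underlying principles and usage, see [GPUBox: Principles and Usage](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/parameter_server/performance/heterps.html)
+
+Starting distributed training in GPUBox mode on PaddleRec takes four steps:
+1. Add the distributed-training parameters to the yaml configuration
+2. Change the reader type
+3. Change the embedding used by the network
+4. Pass the relevant configuration in the launch command and start training
+
+### Adding the yaml Configuration
+Compared with single-machine mode, GPUBox mode needs a few extra settings. First, add the use_fleet parameter to the model's yaml configuration and set it to True.  
+Also set use_gpu to True and sync_mode to gpubox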
+```yaml
+runner:
+  # common settings omitted
+  ...
+  # use fleet
+  use_fleet: True
+  use_gpu: True
+  sync_mode: "gpubox"
+```
+### Changing the Reader
+GPUBox mode currently supports only the InmemoryDataset reader; set reader_type in the yaml configuration accordingly
+```yaml
+runner:
+  # common settings omitted
+  ...
+  reader_type: "InmemoryDataset"
+  
+```
+
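+### Changing the Embedding Used by the Network
+The embedding interface used by GPUBox mode is not yet compatible with the other modes, so change the embedding call in the model's net.py under models/: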
+```python
+def forward(self, sparse_inputs, dense_inputs):
+
+  sparse_embs = []
+  for s_input in sparse_inputs:
+      if self.sync_mode == "gpubox":
+          emb = paddle.fluid.contrib.sparse_embedding(
+              input=s_input,
+              size=[
+                  self.sparse_feature_number, self.sparse_feature_dim
+              ],
+              param_attr=paddle.ParamAttr(name="embedding"))
+      else:
+          emb = self.embedding(s_input)
+      emb = paddle.reshape(emb, shape=[-1, self.sparse_feature_dim])
+      sparse_embs.append(emb)
+
+  # remainder omitted ....
+```
+
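+### Launch Command for Single-Machine GPU Training
+The following uses the dnn model as an example of how to start training. The script can be run from any directory; the command below assumes the repository root: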
+```bash
+sh tools/run_gpubox.sh
+
+```
+
+The parameters in run_gpubox.sh that you need to check and set are:
+```bash
+
+# set free port if 29011 is occupied
+export PADDLE_PSERVERS_IP_PORT_LIST="127.0.0.1:29011"
+export PADDLE_PSERVER_PORT_ARRAY=(29011)
+
+# set gpu numbers according to your device
+export FLAGS_selected_gpus="0,1,2,3,4,5,6,7"
+
+# set your model yaml
+SC="tools/static_gpubox_trainer.py -m models/rank/dnn/config_gpubox.yaml"
+```
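The GPUBox settings described in the documentation hunk above can be checked before launching. The following is a minimal sketch, assuming the `runner` section of the yaml has already been loaded into a plain dict; the helper name `check_gpubox_runner` and the `REQUIRED_GPUBOX_SETTINGS` table are hypothetical and not part of PaddleRec.

```python
# Hypothetical pre-flight check, not part of PaddleRec: verifies that a
# runner section carries the four settings the GPUBox docs ask for.
REQUIRED_GPUBOX_SETTINGS = {
    "use_fleet": True,
    "use_gpu": True,
    "sync_mode": "gpubox",
    "reader_type": "InmemoryDataset",
}


def check_gpubox_runner(runner):
    """Return a list of problems; an empty list means the settings match."""
    problems = []
    for key, expected in REQUIRED_GPUBOX_SETTINGS.items():
        actual = runner.get(key)
        if actual != expected:
            problems.append(
                "runner.{} should be {!r}, got {!r}".format(key, expected, actual))
    return problems


good = {
    "use_fleet": True,
    "use_gpu": True,
    "sync_mode": "gpubox",
    "reader_type": "InmemoryDataset",
}
# Same settings, but left in async CPU mode.
bad = dict(good, sync_mode="async", use_gpu=False)

print(check_gpubox_runner(good))
for problem in check_gpubox_runner(bad):
    print(problem)
```

Running such a check before `sh tools/run_gpubox.sh` surfaces a forgotten setting early, rather than after the trainer has started.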
@@ -0,0 +1,82 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+import yaml
+import six
+import os
+import copy
+import paddle.distributed.fleet as fleet
+import logging
+
+logging.basicConfig(
+    format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+class Reader(fleet.MultiSlotStringDataGenerator):
+    def init(self, config):
+        self.config = config
+        self.neg_num = self.config.get("hyper_parameters.neg_num")
+
+    def line_process(self, line):
+        data = line.rstrip('\n').split('\t')
+        ins_id = [data[0]]
+        content = [data[1]]
+        features = data[2:]
+        query = features[0].split(',')
+        pos_doc = features[1].split(',')
+
+        neg_doc_list = []
+        for i in range(self.neg_num):
+            neg_doc_list.append(features[i + 2].split(','))
+
+        return [ins_id, content, query, pos_doc] + neg_doc_list
+
+    def generate_sample(self, line):
+        "Dataset Generator"
+
+        def reader():
+            input_data = self.line_process(line)
+            feature_name = ["insid", "content", "query", "pos_doc"]
+            for i in range(self.neg_num):
+                feature_name.append("neg_doc_{}".format(i))
+            yield zip(feature_name, input_data)
+
+        return reader
+
+    def dataloader(self, file_list):
+        "DataLoader Pyreader Generator"
+
+        def reader():
+            for file in file_list:
+                with open(file, 'r') as f:
+                    for line in f:
+                        input_data = self.line_process(line)
+                        yield input_data
+
+        return reader
+
+
+if __name__ == "__main__":
+    yaml_path = sys.argv[1]
+    utils_path = sys.argv[2]
+    sys.path.append(utils_path)
+    import common
+    yaml_helper = common.YamlHelper()
+    config = yaml_helper.load_yaml(yaml_path)
+
+    r = Reader()
+    r.init(config)
+    # r.init(None)
+    r.run_from_stdin()
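The input format expected by `line_process` above can be illustrated without Paddle installed. This is a standalone sketch that mirrors the parsing logic of the reader: tab-separated columns, with `insid` and `content` first, followed by comma-separated feature columns. The function name `parse_line`, the length check, and the sample line are illustrative assumptions.

```python
def parse_line(line, neg_num):
    """Split one tab-separated sample into (slot name, string values) pairs."""
    data = line.rstrip("\n").split("\t")
    ins_id = [data[0]]      # unique sample id
    content = [data[1]]     # identifies the doc this sample refers to
    features = data[2:]
    if len(features) < 2 + neg_num:
        raise ValueError("expected query, pos_doc and {} negative docs".format(neg_num))
    query = features[0].split(",")
    pos_doc = features[1].split(",")
    neg_docs = [features[i + 2].split(",") for i in range(neg_num)]

    names = ["insid", "content", "query", "pos_doc"]
    names += ["neg_doc_{}".format(i) for i in range(neg_num)]
    values = [ins_id, content, query, pos_doc] + neg_docs
    return list(zip(names, values))


# Made-up sample: id, content tag, query, positive doc, one negative doc.
sample = "1\titem_1\t0,1,0\t1,0,0\t0,0,1\n"
for name, value in parse_line(sample, neg_num=1):
    print(name, value)
```

All values stay strings, which matches the string-based generator the reader subclasses.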
@@ -0,0 +1,70 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+runner:
+  train_data_dir: "data/train_with_insid"
+  test_data_dir: "data/test_with_insid"
+  # train_reader_path: "bq_reader_train"  # importlib format
+  days: "{20210803..20210804}"
+  pass_per_day: 1
+  train_batch_size: 8
+  test_batch_size: 8
+  model_save_path: "output_model_dssm"
+
+  reader_type: "InmemoryDataset"  # DataLoader / QueueDataset / RecDataset / InmemoryDataset
+  pipe_command: "python3 bq_reader_train_insid.py"
+
+  sync_mode: "async"
+  # thread_num: 1
+  train_thread_num: 1
+  test_thread_num: 1
+
+  use_gpu: False
+  epochs: 1
+  print_interval: 1
+
+  dataset_debug: False
+
+  # when you need to prune net, please set need_prune to True,
+  # and need to set prune_feed_vars and prune_target_var in static_model.py
+  need_prune: True
+
+  parse_ins_id: True
+  parse_content: True
+  
+  # when you need to dump fields and params in training, please set need_train_dump to True,
+  # and need to set train_dump_fields and train_dump_params in static_model.py
+  need_train_dump: True
+  # train_dump_fields_dir: "afs:/xxx"
+  train_dump_fields_dir: "./train_dump_data"
+
+  # when you need to dump fields in inference, please set need_infer_dump to True,
+  # and need to set infer_dump_fields in static_model.py
+  need_infer_dump: True
+  # infer_dump_fields_dir: "afs:/xxx"
+  infer_dump_fields_dir: "./infer_dump_data"
+
+  fs_name: "afs://xxx"
+  fs_ugi: "xxx,xxx"
+  
+hyper_parameters:
+  optimizer:
+    class: adam
+    learning_rate: 0.001
+    strategy: sync
+  trigram_d: 2900
+  neg_num: 1
+  slice_end: 8
+  fc_sizes: [300, 300, 128]
+  fc_acts: ['relu', 'relu', 'relu']
@@ -0,0 +1,27 @@
+#!/bin/bash
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+cat train/train.txt | awk -F'\t' 'BEGIN{OFS="\t"}{print NR, "item_"NR, $0}' > data_with_lineid
+for i in 20210803 20210804
+do
+    for j in 1
+    do
+        mkdir -p train_with_insid/$i/$j
+        cp data_with_lineid train_with_insid/$i/$j
+        mkdir -p test_with_insid/$i/$j
+        cp data_with_lineid test_with_insid/$i/$j
+    done
+done
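The awk command in the script above prefixes every training line with its line number and an `item_<line number>` tag, which become the `insid` and `content` columns the reader parses. A rough Python equivalent of that one transformation, for readers less familiar with awk, might look like the following; the function name `add_line_ids` is hypothetical and the sample lines are made up.

```python
def add_line_ids(lines):
    """Prefix each line with its 1-based number and "item_<number>", tab-separated."""
    out = []
    for nr, line in enumerate(lines, start=1):
        # The original line is kept unchanged after the two new columns.
        out.append("\t".join([str(nr), "item_{}".format(nr), line.rstrip("\n")]))
    return out


raw = ["0,1,0\t1,0,0\t0,0,1\n", "1,1,0\t0,1,0\t1,0,1\n"]
for row in add_line_ids(raw):
    print(row)
```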
@@ -71,10 +71,14 @@ def forward(self, input_data, is_infer):
         query_fc = input_data[0]
         for n_layer in self._query_layers:
             query_fc = n_layer(query_fc)
+        self.query_fc = query_fc
 
         doc_pos_fc = input_data[1]
         for n_layer in self._doc_layers:
             doc_pos_fc = n_layer(doc_pos_fc)
+        self.doc_pos_fc = doc_pos_fc
+
+        self.params = [self._query_layers[-2].bias]
 
         R_Q_D_p = F.cosine_similarity(
             query_fc, doc_pos_fc, axis=1).reshape([-1, 1])
 
@@ -105,5 +105,32 @@ bash run.sh # dynamic-graph training and testing; reports the final metrics
 ```
 
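 ## Advanced Usage
+DSSM is a vector-recall method in recommender systems. Typically the doc-side vectors are computed in advance and loaded into a vector search engine (for example, milvus), and only the query-side model is saved. At serving time, the query-side input is fed to the model to compute the query vector, which is then used to recall the matching docs directly from the vector search engine.
+Typically an inference stage is added to training to dump the full set of doc-side vectors. This requires the following changes:
+1. To tell the dumped vectors apart, the data used in the inference stage needs two extra fields, insid and content: insid uniquely identifies a sample and content identifies the corresponding doc. The data-processing script must parse both fields; see bq_reader_train_insid.py for details.
+2. Use InmemoryDataset as the dataset, and set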
+```python
+dataset.set_parse_ins_id(True)
+dataset.set_parse_content(True)
+```
+3. In static_model.py, configure the variable to dump (the top-layer output of the doc side)
+```python
+self.infer_dump_fields = [dssm_model.doc_pos_fc]
+```
+4. In the configuration file, enable dumping in the inference stage and set the dump path
+```yaml
+need_infer_dump: True
+infer_dump_fields_dir: "./infer_dump_data"
+```
+When saving the model, only the query-side network needs to be saved
+1. In the configuration file, turn on the network pruning switch
+```yaml
+need_prune: True
+```
+2. In static_model.py, configure the inputs and outputs of the pruned network
+```python
+self.prune_feed_vars = [query]
+self.prune_target_var = dssm_model.query_fc
+```
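The serving flow described in this section (precomputed doc vectors, a query vector computed online, recall by similarity) can be sketched in plain Python. This is only an illustration under stated assumptions: a brute-force cosine scan stands in for a real vector search engine such as milvus, and `doc_index`, `cosine` and `recall_top_k` are hypothetical names with made-up vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_top_k(query_vec, doc_index, k):
    """Return the k content ids whose doc vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), content) for content, vec in doc_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _, content in scored[:k]]


# Stand-in for dumped doc-side vectors, keyed by the content field.
doc_index = {
    "item_1": [1.0, 0.0],
    "item_2": [0.0, 1.0],
    "item_3": [0.7, 0.7],
}
# Stand-in for the output of the pruned query-side network.
query_vec = [0.9, 0.1]
print(recall_top_k(query_vec, doc_index, k=2))
```

A production setup would replace the linear scan with an approximate nearest-neighbor index, but the data flow (dump doc vectors once, embed the query per request) stays the same.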
 
 ## FAQ
@@ -36,6 +36,7 @@ def _init_hyper_parameters(self):
     def create_feeds(self, is_infer=False):
         query = paddle.static.data(
             name="query", shape=[-1, self.trigram_d], dtype='float32')
+        self.prune_feed_vars = [query]
 
         doc_pos = paddle.static.data(
             name="doc_pos", shape=[-1, self.trigram_d], dtype='float32')
@@ -58,6 +59,10 @@ def net(self, input, is_infer=False):
         R_Q_D_p, hit_prob = dssm_model(input, is_infer)
 
         self.inference_target_var = R_Q_D_p
+        self.prune_target_var = dssm_model.query_fc
+        self.train_dump_fields = [dssm_model.query_fc, R_Q_D_p]
+        self.train_dump_params = dssm_model.params
+        self.infer_dump_fields = [dssm_model.doc_pos_fc]
         if is_infer:
             fetch_dict = {'query_doc_sim': R_Q_D_p}
             return fetch_dict
 
@@ -1,8 +1,7 @@
 # Matching Model Library
 
 ## Introduction
-We provide PaddleRec implementations of models commonly used in matching tasks, including single-machine training and inference metrics for both dynamic and static graphs. The implemented models include [DSSM](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/dssm), [MultiView-Simnet](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/multiview-simnet), and [match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid).
-
+We provide PaddleRec implementations of models commonly used in matching tasks, including single-machine training and inference metrics for both dynamic and static graphs. The implemented models include [DSSM](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/dssm), [MultiView-Simnet](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/multiview-simnet), and [match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid).  
 The model library is continuously growing; stay tuned.
 
 ## Table of Contents
 
@@ -72,7 +72,7 @@ python -u ../../../tools/static_infer.py -m config.yaml
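 ESMM comes from the SIGIR'2018 paper ["Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate"](https://arxiv.org/abs/1804.07931). Building on multi-task learning, the paper proposes a new CVR estimation model, ESMM, that effectively addresses two key problems CVR estimation faces in real-world scenarios: data sparsity and sample selection bias. The main network structure of the model is shown below: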
 [ESMM](https://arxiv.org/abs/1804.07931):
 <p align="center">
-<img align="center" src="../../doc/imgs/esmm.png">
+<img align="center" src="../../../doc/imgs/esmm.png">
 <p>
 
 ### Reproducing the Results
 
@@ -28,7 +28,7 @@ runner:
   infer_batch_size: 32
   infer_load_path: "output_model_all_maml"
   infer_start_epoch: 90
-  infer_end_epoch: 91
+  infer_end_epoch: 100
 
 # hyper parameters of user-defined network
 hyper_parameters: