
Commit f33c832

Merge pull request #719 from wangzhen38/add_dcn_v2
add dcn_v2
2 parents f8892c0 + 7039e82 commit f33c832

File tree: 14 files changed, +1033 -1 lines

README_CN.md

Lines changed: 2 additions & 1 deletion
@@ -162,7 +162,8 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # static-graph training
| Rank | [Fibinet](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/fibinet/) | - ||| [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.09433.pdf) |
| Rank | [FLEN](models/rank/flen/) | - ||| >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction](https://arxiv.org/pdf/1911.04690.pdf) |
| Rank | [DeepRec](models/rank/deeprec/) | - ||| >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
- | Rank | [AutoFIS](models/rank/autofis/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf)
+ | Rank | [AutoFIS](models/rank/autofis/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf)
+ | Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
| Multi-Task | [PLE](models/multitask/ple/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
| Multi-Task | [ESMM](models/multitask/esmm/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |

README_EN.md

Lines changed: 1 addition & 0 deletions
@@ -153,6 +153,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
| Rank | [FLEN](models/rank/flen/) | - ||| >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction](https://arxiv.org/pdf/1911.04690.pdf) |
| Rank | [DeepRec](models/rank/deeprec/) | - ||| >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
| Rank | [AutoFIS](models/rank/autofis/) | - ||| >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
+ | Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
| Multi-Task | [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
| Multi-Task | [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |

contributor.md

Lines changed: 1 addition & 0 deletions
@@ -19,5 +19,6 @@
| [MIND](models/recall/mind/) | [duyiqi17](https://github.com/duyiqi17) | https://github.com/PaddlePaddle/PaddleRec/pull/398 | Other |
| [FLEN](models/rank/flen/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/685 | Paper Reproduction Challenge, Round 5 |
| [MHCN](models/recall/mhcn/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/679 | Paper Reproduction Challenge, Round 5 |
+ | [DCN_V2](models/rank/dcn_v2/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/677 | Paper Reproduction Challenge, Round 5 |

</div>

datasets/criteo_dcn_v2/download.sh

Lines changed: 16 additions & 0 deletions
wget --no-check-certificate https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2

wget --no-check-certificate https://fleet.bj.bcebos.com/ctr_data.tar.gz

tar -zxvf ctr_data.tar.gz
mv ./raw_data ./train_data_full
mkdir train_data && cd train_data
cp ../train_data_full/part-0 ../train_data_full/part-1 ./ && cd ..
mv ./test_data ./test_data_full
mkdir test_data && cd test_data
cp ../test_data_full/part-220 ./ && cd ..
echo "Complete data download."
echo "Full train data stored in ./train_data_full"
echo "Full test data stored in ./test_data_full"
echo "Rapid verification train data stored in ./train_data"
echo "Rapid verification test data stored in ./test_data"
datasets/criteo_dcn_v2/get_slot_data.py

Lines changed: 106 additions & 0 deletions
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
try:
    import cPickle as pickle
except ImportError:
    import pickle

import paddle.fluid.incubate.data_generator as dg


class Reader(dg.MultiSlotDataGenerator):
    def __init__(self, config):
        dg.MultiSlotDataGenerator.__init__(self)

    def init(self):
        # DCN_V2 log-normalizes the 13 continuous features:
        # log(x+4) for dense feature 2, log(x+1) for the others.

        # self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        # self.cont_max_ = [
        #     5775, 257675, 65535, 969, 23159456, 431037, 56311, 6047, 29019, 46,
        #     231, 4008, 7393
        # ]
        # self.cont_diff_ = [
        #     self.cont_max_[i] - self.cont_min_[i]
        #     for i in range(len(self.cont_min_))
        # ]

        self.continuous_range_ = range(1, 14)
        self.categorical_range_ = range(14, 40)
        # load preprocessed feature dict
        self.feat_dict_name = "deepfm%2Ffeat_dict_10.pkl2"
        self.feat_dict_ = pickle.load(open(self.feat_dict_name, 'rb'))

    def _process_line(self, line):
        features = line.rstrip('\n').split('\t')
        feat_idx = []
        feat_value = []
        # log normalize
        for idx in self.continuous_range_:
            if features[idx] == '':
                # feat_idx.append(0)
                feat_value.append(0.0)
            else:
                # feat_idx.append(self.feat_dict_[idx])
                if idx == 2:  # log(x+4)
                    feat_value.append(np.log(float(features[idx]) + 4))
                else:  # log(x+1)
                    feat_value.append(np.log(float(features[idx]) + 1))

                # feat_idx.append(self.feat_dict_[idx])
                # feat_value.append(
                #     (float(features[idx]) - self.cont_min_[idx - 1]) /
                #     self.cont_diff_[idx - 1])

        for idx in self.categorical_range_:
            if features[idx] == '' or features[idx] not in self.feat_dict_:
                feat_idx.append(0)
                # feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[features[idx]])
                # feat_value.append(1.0)
        label = [int(features[0])]
        return label, feat_value, feat_idx

    def generate_sample(self, line):
        """
        Read the data line by line and process it as a dictionary
        """

        def data_iter():
            label, feat_value, feat_idx = self._process_line(line)
            s = ""
            for i in [('click', label), ('dense_feature', feat_value),
                      ('feat_idx', feat_idx)]:
                k = i[0]
                v = i[1]
                for n, j in enumerate(v):
                    if k == "feat_idx":
                        s += " " + str(n + 1) + ":" + str(j)
                    else:
                        s += " " + k + ":" + str(j)
            print(s.strip())  # print the slot-format line for data preprocessing
            yield None

        return data_iter


reader = Reader("../config.yaml")
reader.init()
reader.run_from_stdin()
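For orientation, each raw Criteo line piped into this script is emitted on stdout as one space-separated slot line, roughly of the shape below (a made-up illustration; the actual 13 dense values and 26 feature indices depend on the input line and on the downloaded feat_dict_10.pkl2). run.sh redirects this output into the slot_* directories.

```
click:0 dense_feature:0.6931 dense_feature:2.0794 ... dense_feature:0.0 1:37 2:951 ... 26:402
```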

datasets/criteo_dcn_v2/run.sh

Lines changed: 24 additions & 0 deletions
sh download.sh
mkdir slot_train_data_full
for i in `ls ./train_data_full`
do
    cat train_data_full/$i | python get_slot_data.py > slot_train_data_full/$i
done

mkdir slot_test_data_full
for i in `ls ./test_data_full`
do
    cat test_data_full/$i | python get_slot_data.py > slot_test_data_full/$i
done

mkdir slot_train_data
for i in `ls ./train_data`
do
    cat train_data/$i | python get_slot_data.py > slot_train_data/$i
done

mkdir slot_test_data
for i in `ls ./test_data`
do
    cat test_data/$i | python get_slot_data.py > slot_test_data/$i
done

models/rank/dcn_v2/README.md

Lines changed: 128 additions & 0 deletions
# A CTR prediction model based on DCN_V2

The directory layout of this example is as follows:

```
├── data # sample data
    ├── sample_data # sample data
        ├── train
            ├── sample_train.txt # sample training data
├── __init__.py
├── README.md # this document
├── config.yaml # configuration for the sample data
├── config_bigdata.yaml # configuration for the full dataset
├── net.py # core network definition (shared by dynamic and static graph)
├── reader.py # data reader
├── dygraph_model.py # builds the dynamic-graph model
```

Note: before reading this example, we recommend going through the following first:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

## Contents

- [Model Introduction](#model-introduction)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Network Architecture](#network-architecture)
- [Reproducing the Results](#reproducing-the-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)

## Model Introduction

CTR (Click-Through Rate) is a key metric in recommender systems and computational advertising, and estimating it underpins decisions such as which items to push or which ads to serve. In short, CTR prediction estimates, for each impression, whether the user will click. A CTR model combines many features and is trained on large amounts of historical data to support these business decisions. This model implements DCN_V2 from the following paper:

```text
@article{DCN_V2_2020,
  title={DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems},
  author={Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, Ed H. Chi},
  journal={arXiv preprint arXiv:2008.13535v2},
  year={2020},
  url={https://arxiv.org/pdf/2008.13535v2.pdf},
}
```

## Data Preparation

The training and test sets are the Criteo dataset used in the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/). The dataset has two parts: the training set covers a period of Criteo traffic, and the test set covers ad click traffic from the day following the training data.
Each line has the following format:
```
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```
Here ```<label>``` indicates whether the ad was clicked: 1 for clicked, 0 for not clicked. ```<integer feature>``` denotes the 13 numerical (continuous) features and ```<categorical feature>``` denotes the 26 categorical (discrete) features. Adjacent features are separated by ```\t```, and missing features are left empty. The ```<label>``` column has been removed from the test set.
Sample data for a quick run is provided in the data directory of the model; to use the full dataset, see [Reproducing the Results](#reproducing-the-results) below.

## Runtime Environment

PaddlePaddle >= 2.0

python 2.7/3.5/3.6/3.7

os: windows/linux/macos

## Quick Start

Sample data is provided so you can try the model quickly; the commands can be run from any directory. The quick-start commands for the dcn_v2 model directory are:
```bash
# enter the model directory
# cd models/rank/dcn_v2 # can be run from any directory
# dynamic-graph training
python -u ../../../tools/trainer.py -m config.yaml # use config_bigdata.yaml for the full dataset
# dynamic-graph inference
python -u ../../../tools/infer.py -m config.yaml
```

## Network Architecture

The DCN_V2 network is defined in `net.py`. Its main components are an embedding layer, a CrossNetwork, an MLP, and the loss and AUC computation for the classification task. Building on DCN, DCN_V2 offers two ways of composing the CrossNetwork and the MLP, a Stacked and a Parallel structure; a minimal sketch of the two wirings follows the architecture figure below.

<img align="center" src="https://wx4.sinaimg.cn/mw2000/0073e4AWgy1gyao6r7ovbj30ov0guqac.jpg">

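The sketch below illustrates the stacked and parallel compositions described above. It is only an illustration with hypothetical module names and sizes, not the actual classes or config keys used in `net.py`; the cross layer it uses is the one defined in equation (1) of the next subsection.

```python
# Illustrative only: hypothetical names/sizes, not the repo's net.py API.
import paddle
import paddle.nn as nn

class CrossLayer(nn.Layer):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l (equation (1) below)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl

dim, num_cross, hidden = 16, 3, 32
x0 = paddle.randn([8, dim])                  # concatenated embeddings + dense features
cross_out = x0
for layer in [CrossLayer(dim) for _ in range(num_cross)]:
    cross_out = layer(x0, cross_out)

mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

# Stacked:  input -> cross network -> MLP -> logit
stacked_logit = nn.Linear(hidden, 1)(mlp(cross_out))
# Parallel: cross network and MLP run side by side, outputs concatenated -> logit
parallel_logit = nn.Linear(dim + hidden, 1)(paddle.concat([cross_out, mlp(x0)], axis=-1))
```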
### **CrossNetwork layer**

The core of the CrossNetwork is building explicit feature crosses: every layer crosses its input with the original features, and each layer's output is the input of the next layer. The computation is given in equation (1) and visualized in the figure below:

<img align="center" src="https://wx2.sinaimg.cn/mw2000/0073e4AWgy1gyaotiqopbj30hh01nt8w.jpg">

<img align="center" src="https://wx3.sinaimg.cn/mw2000/0073e4AWgy1gyaohnd39zj30hb06i0ts.jpg">

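Since equation (1) is embedded above as an image, here is the cross layer written out as in the DCN-V2 paper, where x_0 is the embedding-layer output, x_l the input to layer l, and ⊙ the element-wise product:

```
x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l, \qquad W_l \in \mathbb{R}^{d \times d},\; b_l, x_l \in \mathbb{R}^{d}
```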
As the number of layers and the feature dimension grow, computing these explicit crosses becomes expensive; the paper therefore designs the cheaper CrossMix (mixture of low-rank cross) structure, introduced below.

### **Cost-Effective Mixture of Low-Rank DCN**

As shown in equation (1) above, the weight matrix W is a high-rank matrix. The paper factorizes W into two low-rank matrices U and V, which effectively reduces the computational cost, as shown in equation (2) below, at a slight loss of accuracy compared with the full-rank version.

<img align="center" src="https://wx4.sinaimg.cn/mw2000/0073e4AWgy1gyap3vkyq1j30k301xwev.jpg">

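Equation (2) above is likewise embedded as an image; written out following the DCN-V2 paper (ignoring the mixture-of-experts gating), one low-rank cross layer is:

```
x_{l+1} = x_0 \odot \big(U_l (V_l^{\top} x_l) + b_l\big) + x_l, \qquad U_l, V_l \in \mathbb{R}^{d \times r},\; r \ll d
```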
### **Loss and AUC computation**

- To obtain, for each sample, the probabilities of the negative and positive class, the prediction is concatenated with `1-predict` into `predict_2d`, which is what the `auc` computation expects.
- The per-sample loss is the negative log loss; the label is cast to float before being fed in.
- The batch loss `avg_cost` is the sum of the per-sample losses.
- The AUC of the predictions is computed as well; a minimal sketch of these steps is shown after this list.

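The following is only an illustrative sketch of the steps listed above, using standalone tensors instead of the model's outputs; see `dygraph_model.py` for the actual implementation.

```python
import paddle
import paddle.nn.functional as F

predict = paddle.to_tensor([[0.8], [0.3], [0.6]])         # sigmoid outputs, shape [batch, 1]
label = paddle.to_tensor([[1], [0], [1]], dtype='int64')  # click labels, shape [batch, 1]

# concatenate 1-p and p so the AUC metric sees per-class probabilities
predict_2d = paddle.concat([1 - predict, predict], axis=-1)

# negative log loss per sample; the label is cast to float
cost = F.log_loss(input=predict, label=paddle.cast(label, 'float32'))
avg_cost = paddle.sum(cost)        # batch loss: sum of the per-sample losses

# AUC over the batch
auc_metric = paddle.metric.Auc()
auc_metric.update(preds=predict_2d.numpy(), labels=label.numpy())
print(float(avg_cost), auc_metric.accumulate())
```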
## Reproducing the Results

To make it easy to run each model end to end, every model ships with sample data. To reproduce the results reported in this README, follow the steps below.
On the full dataset the model reaches:

| Model | auc | logloss | batch_size | epoch_num | Time per epoch |
| :------ | :------ | :------ | :------ | :------ | :------ |
| DCN_V2 | 0.8026 | 0.4384 | 512 | 1 | about 3 hours |

1. Make sure your current directory is PaddleRec/models/rank/dcn_v2.
2. Go to PaddleRec/datasets/criteo_dcn_v2 and run the script below; it downloads the full Criteo dataset from a mirror hosted in mainland China, preprocesses it, and places it in the expected folders.
```bash
cd ../../../datasets/criteo_dcn_v2
sh run.sh
```
3. Switch back to the model directory and run training and inference on the full dataset:
```bash
cd - # back to the model directory
# dynamic-graph training
python -u ../../../tools/trainer.py -m config_bigdata.yaml # config_bigdata.yaml selects the full dataset
python -u ../../../tools/infer.py -m config_bigdata.yaml # config_bigdata.yaml selects the full dataset
```

## Advanced Usage

## FAQ

models/rank/dcn_v2/__init__.py

Lines changed: 13 additions & 0 deletions
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
