
Commit f33c832

Merge pull request #719 from wangzhen38/add_dcn_v2
add dcn_v2
2 parents f8892c0 + 7039e82 commit f33c832

File tree: 14 files changed, +1033 -1 lines

README_CN.md

Lines changed: 2 additions & 1 deletion
@@ -162,7 +162,8 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # static-graph training
| Rank | [Fibinet](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/fibinet/) | - ||| [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [RecSys19][FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.09433.pdf) |
| Rank | [FLEN](models/rank/flen/) | - ||| >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction](https://arxiv.org/pdf/1911.04690.pdf) |
| Rank | [DeepRec](models/rank/deeprec/) | - ||| >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
- | Rank | [AutoFIS](models/rank/autofis/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf)
+ | Rank | [AutoFIS](models/rank/autofis/) | - | ✓ | ✓ | >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf)
+ | Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
| Multi-Task | [PLE](models/multitask/ple/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
| Multi-Task | [ESMM](models/multitask/esmm/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/) ([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |

README_EN.md

Lines changed: 1 addition & 0 deletions
@@ -153,6 +153,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
| Rank | [FLEN](models/rank/flen/) | - ||| >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction](https://arxiv.org/pdf/1911.04690.pdf) |
| Rank | [DeepRec](models/rank/deeprec/) | - ||| >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
| Rank | [AutoFIS](models/rank/autofis/) | - ||| >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
+ | Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
| Multi-Task | [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
| Multi-Task | [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
| Multi-Task | [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |

contributor.md

Lines changed: 1 addition & 0 deletions
@@ -19,5 +19,6 @@
| [MIND](models/recall/mind/) | [duyiqi17](https://github.com/duyiqi17) | https://github.com/PaddlePaddle/PaddleRec/pull/398 | Other |
| [FLEN](models/rank/flen/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/685 | Paper Reproduction Challenge, Round 5 |
| [MHCN](models/recall/mhcn/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/679 | Paper Reproduction Challenge, Round 5 |
+ | [DCN_V2](models/rank/dcn_v2/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/677 | Paper Reproduction Challenge, Round 5 |

</div>

datasets/criteo_dcn_v2/download.sh

Lines changed: 16 additions & 0 deletions
wget --no-check-certificate https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2

wget --no-check-certificate https://fleet.bj.bcebos.com/ctr_data.tar.gz

tar -zxvf ctr_data.tar.gz
mv ./raw_data ./train_data_full
mkdir train_data && cd train_data
cp ../train_data_full/part-0 ../train_data_full/part-1 ./ && cd ..
mv ./test_data ./test_data_full
mkdir test_data && cd test_data
cp ../test_data_full/part-220 ./ && cd ..
echo "Complete data download."
echo "Full train data stored in ./train_data_full"
echo "Full test data stored in ./test_data_full"
echo "Rapid verification train data stored in ./train_data"
echo "Rapid verification test data stored in ./test_data"
datasets/criteo_dcn_v2/get_slot_data.py

Lines changed: 106 additions & 0 deletions
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
try:
    import cPickle as pickle
except ImportError:
    import pickle

import paddle.fluid.incubate.data_generator as dg


class Reader(dg.MultiSlotDataGenerator):
    def __init__(self, config):
        dg.MultiSlotDataGenerator.__init__(self)

    def init(self):
        # DCN_V2 log-normalizes the 13 continuous features:
        # log(x+4) for dense feature 2, log(x+1) for the others.

        # self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        # self.cont_max_ = [
        #     5775, 257675, 65535, 969, 23159456, 431037, 56311, 6047, 29019, 46,
        #     231, 4008, 7393
        # ]
        # self.cont_diff_ = [
        #     self.cont_max_[i] - self.cont_min_[i]
        #     for i in range(len(self.cont_min_))
        # ]

        self.continuous_range_ = range(1, 14)
        self.categorical_range_ = range(14, 40)
        # load preprocessed feature dict
        self.feat_dict_name = "deepfm%2Ffeat_dict_10.pkl2"
        self.feat_dict_ = pickle.load(open(self.feat_dict_name, 'rb'))

    def _process_line(self, line):
        features = line.rstrip('\n').split('\t')
        feat_idx = []
        feat_value = []
        # log normalize
        for idx in self.continuous_range_:
            if features[idx] == '':
                # feat_idx.append(0)
                feat_value.append(0.0)
            else:
                # feat_idx.append(self.feat_dict_[idx])
                if idx == 2:  # log(x+4)
                    feat_value.append(np.log(float(features[idx]) + 4))
                else:  # log(x+1)
                    feat_value.append(np.log(float(features[idx]) + 1))

                # feat_idx.append(self.feat_dict_[idx])
                # feat_value.append(
                #     (float(features[idx]) - self.cont_min_[idx - 1]) /
                #     self.cont_diff_[idx - 1])

        for idx in self.categorical_range_:
            if features[idx] == '' or features[idx] not in self.feat_dict_:
                feat_idx.append(0)
                # feat_value.append(0.0)
            else:
                feat_idx.append(self.feat_dict_[features[idx]])
                # feat_value.append(1.0)
        label = [int(features[0])]
        return label, feat_value, feat_idx

    def generate_sample(self, line):
        """
        Read the data line by line and process it as a dictionary
        """

        def data_iter():
            label, feat_value, feat_idx = self._process_line(line)
            s = ""
            for i in [('click', label), ('dense_feature', feat_value),
                      ('feat_idx', feat_idx)]:
                k = i[0]
                v = i[1]
                for n, j in enumerate(v):
                    if k == "feat_idx":
                        s += " " + str(n + 1) + ":" + str(j)
                    else:
                        s += " " + k + ":" + str(j)
            print(s.strip())  # print the slot-format line for data preprocessing
            yield None

        return data_iter


reader = Reader("../config.yaml")
reader.init()
reader.run_from_stdin()
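For orientation, each raw Criteo line piped into this script is emitted on stdout as one space-separated slot line, roughly of the shape below (a made-up illustration; the actual 13 dense values and 26 feature indices depend on the input line and on the downloaded feat_dict_10.pkl2). run.sh redirects this output into the slot_* directories.

```
click:0 dense_feature:0.6931 dense_feature:2.0794 ... dense_feature:0.0 1:37 2:951 ... 26:402
```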

datasets/criteo_dcn_v2/run.sh

Lines changed: 24 additions & 0 deletions
sh download.sh
mkdir slot_train_data_full
for i in `ls ./train_data_full`
do
    cat train_data_full/$i | python get_slot_data.py > slot_train_data_full/$i
done

mkdir slot_test_data_full
for i in `ls ./test_data_full`
do
    cat test_data_full/$i | python get_slot_data.py > slot_test_data_full/$i
done

mkdir slot_train_data
for i in `ls ./train_data`
do
    cat train_data/$i | python get_slot_data.py > slot_train_data/$i
done

mkdir slot_test_data
for i in `ls ./test_data`
do
    cat test_data/$i | python get_slot_data.py > slot_test_data/$i
done

models/rank/dcn_v2/README.md

Lines changed: 128 additions & 0 deletions
# A CTR prediction model based on DCN_V2

The directory layout of this example is as follows:

```
├── data # sample data
    ├── sample_data # sample data
        ├── train
            ├── sample_train.txt # sample training data
├── __init__.py
├── README.md # this document
├── config.yaml # configuration for the sample data
├── config_bigdata.yaml # configuration for the full dataset
├── net.py # core network definition (shared by dynamic and static graph)
├── reader.py # data reader
├── dygraph_model.py # builds the dynamic-graph model
```

Note: before reading this example, we recommend going through the following first:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

## Contents

- [Model Introduction](#model-introduction)
- [Data Preparation](#data-preparation)
- [Runtime Environment](#runtime-environment)
- [Quick Start](#quick-start)
- [Network Architecture](#network-architecture)
- [Reproducing the Results](#reproducing-the-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)

## Model Introduction

CTR (Click-Through Rate) is a key metric in recommender systems and computational advertising, and estimating it underpins decisions such as which items to push or which ads to serve. In short, CTR prediction estimates, for each impression, whether the user will click. A CTR model combines many features and is trained on large amounts of historical data to support these business decisions. This model implements DCN_V2 from the following paper:

```text
@article{DCN_V2_2020,
  title={DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems},
  author={Ruoxi Wang, Rakesh Shivanna, Derek Z. Cheng, Sagar Jain, Dong Lin, Lichan Hong, Ed H. Chi},
  journal={arXiv preprint arXiv:2008.13535v2},
  year={2020},
  url={https://arxiv.org/pdf/2008.13535v2.pdf},
}
```

## Data Preparation

The training and test sets are the Criteo dataset used in the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/). The dataset has two parts: the training set covers a period of Criteo traffic, and the test set covers ad click traffic from the day following the training data.
Each line has the following format:
```
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```
Here ```<label>``` indicates whether the ad was clicked: 1 for clicked, 0 for not clicked. ```<integer feature>``` denotes the 13 numerical (continuous) features and ```<categorical feature>``` denotes the 26 categorical (discrete) features. Adjacent features are separated by ```\t```, and missing features are left empty. The ```<label>``` column has been removed from the test set.
Sample data for a quick run is provided in the data directory of the model; to use the full dataset, see [Reproducing the Results](#reproducing-the-results) below.

## Runtime Environment

PaddlePaddle >= 2.0

python 2.7/3.5/3.6/3.7

os: windows/linux/macos

## Quick Start

Sample data is provided so you can try the model quickly; the commands can be run from any directory. The quick-start commands for the dcn_v2 model directory are:
```bash
# enter the model directory
# cd models/rank/dcn_v2 # can be run from any directory
# dynamic-graph training
python -u ../../../tools/trainer.py -m config.yaml # use config_bigdata.yaml for the full dataset
# dynamic-graph inference
python -u ../../../tools/infer.py -m config.yaml
```

## Network Architecture

The DCN_V2 network is defined in `net.py`. Its main components are an embedding layer, a CrossNetwork, an MLP, and the loss and AUC computation for the classification task. Building on DCN, DCN_V2 offers two ways of composing the CrossNetwork and the MLP, a Stacked and a Parallel structure; a minimal sketch of the two wirings follows the architecture figure below.

<img align="center" src="https://wx4.sinaimg.cn/mw2000/0073e4AWgy1gyao6r7ovbj30ov0guqac.jpg">

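The sketch below illustrates the stacked and parallel compositions described above. It is only an illustration with hypothetical module names and sizes, not the actual classes or config keys used in `net.py`; the cross layer it uses is the one defined in equation (1) of the next subsection.

```python
# Illustrative only: hypothetical names/sizes, not the repo's net.py API.
import paddle
import paddle.nn as nn

class CrossLayer(nn.Layer):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l (equation (1) below)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl

dim, num_cross, hidden = 16, 3, 32
x0 = paddle.randn([8, dim])                  # concatenated embeddings + dense features
cross_out = x0
for layer in [CrossLayer(dim) for _ in range(num_cross)]:
    cross_out = layer(x0, cross_out)

mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

# Stacked:  input -> cross network -> MLP -> logit
stacked_logit = nn.Linear(hidden, 1)(mlp(cross_out))
# Parallel: cross network and MLP run side by side, outputs concatenated -> logit
parallel_logit = nn.Linear(dim + hidden, 1)(paddle.concat([cross_out, mlp(x0)], axis=-1))
```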
### **CrossNetwork layer**

The core of the CrossNetwork is building explicit feature crosses: every layer crosses its input with the original features, and each layer's output is the input of the next layer. The computation is given in equation (1) and visualized in the figure below:

<img align="center" src="https://wx2.sinaimg.cn/mw2000/0073e4AWgy1gyaotiqopbj30hh01nt8w.jpg">

<img align="center" src="https://wx3.sinaimg.cn/mw2000/0073e4AWgy1gyaohnd39zj30hb06i0ts.jpg">

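Since equation (1) is embedded above as an image, here is the cross layer written out as in the DCN-V2 paper, where x_0 is the embedding-layer output, x_l the input to layer l, and ⊙ the element-wise product:

```
x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l, \qquad W_l \in \mathbb{R}^{d \times d},\; b_l, x_l \in \mathbb{R}^{d}
```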
As the number of layers and the feature dimension grow, computing these explicit crosses becomes expensive; the paper therefore designs the cheaper CrossMix (mixture of low-rank cross) structure, introduced below.

### **Cost-Effective Mixture of Low-Rank DCN**

As shown in equation (1) above, the weight matrix W is a high-rank matrix. The paper factorizes W into two low-rank matrices U and V, which effectively reduces the computational cost, as shown in equation (2) below, at a slight loss of accuracy compared with the full-rank version.

<img align="center" src="https://wx4.sinaimg.cn/mw2000/0073e4AWgy1gyap3vkyq1j30k301xwev.jpg">

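Equation (2) above is likewise embedded as an image; written out following the DCN-V2 paper (ignoring the mixture-of-experts gating), one low-rank cross layer is:

```
x_{l+1} = x_0 \odot \big(U_l (V_l^{\top} x_l) + b_l\big) + x_l, \qquad U_l, V_l \in \mathbb{R}^{d \times r},\; r \ll d
```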
### **Loss and AUC computation**

- To obtain, for each sample, the probabilities of the negative and positive class, the prediction is concatenated with `1-predict` into `predict_2d`, which is what the `auc` computation expects.
- The per-sample loss is the negative log loss; the label is cast to float before being fed in.
- The batch loss `avg_cost` is the sum of the per-sample losses.
- The AUC of the predictions is computed as well; a minimal sketch of these steps is shown after this list.

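The following is only an illustrative sketch of the steps listed above, using standalone tensors instead of the model's outputs; see `dygraph_model.py` for the actual implementation.

```python
import paddle
import paddle.nn.functional as F

predict = paddle.to_tensor([[0.8], [0.3], [0.6]])         # sigmoid outputs, shape [batch, 1]
label = paddle.to_tensor([[1], [0], [1]], dtype='int64')  # click labels, shape [batch, 1]

# concatenate 1-p and p so the AUC metric sees per-class probabilities
predict_2d = paddle.concat([1 - predict, predict], axis=-1)

# negative log loss per sample; the label is cast to float
cost = F.log_loss(input=predict, label=paddle.cast(label, 'float32'))
avg_cost = paddle.sum(cost)        # batch loss: sum of the per-sample losses

# AUC over the batch
auc_metric = paddle.metric.Auc()
auc_metric.update(preds=predict_2d.numpy(), labels=label.numpy())
print(float(avg_cost), auc_metric.accumulate())
```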
## Reproducing the Results

To make it easy to run each model end to end, every model ships with sample data. To reproduce the results reported in this README, follow the steps below.
On the full dataset the model reaches:

| Model | auc | logloss | batch_size | epoch_num | Time per epoch |
| :------ | :------ | :------ | :------ | :------ | :------ |
| DCN_V2 | 0.8026 | 0.4384 | 512 | 1 | about 3 hours |

1. Make sure your current directory is PaddleRec/models/rank/dcn_v2.
2. Go to PaddleRec/datasets/criteo_dcn_v2 and run the script below; it downloads the full Criteo dataset from a mirror hosted in mainland China, preprocesses it, and places it in the expected folders.
```bash
cd ../../../datasets/criteo_dcn_v2
sh run.sh
```
3. Switch back to the model directory and run training and inference on the full dataset:
```bash
cd - # back to the model directory
# dynamic-graph training
python -u ../../../tools/trainer.py -m config_bigdata.yaml # config_bigdata.yaml selects the full dataset
python -u ../../../tools/infer.py -m config_bigdata.yaml # config_bigdata.yaml selects the full dataset
```

## Advanced Usage

## FAQ

models/rank/dcn_v2/__init__.py

Lines changed: 13 additions & 0 deletions
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
