Commit 72f1378

Merge branch 'master' of https://github.com/PaddlePaddle/PaddleRec into aitm
2 parents 8671863 + 8ce9dbd commit 72f1378

27 files changed, +1530 -1 lines changed

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # 静态图训
 | Rank | [AutoFIS](models/rank/autofis/) | - ||| >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
 | Rank | [DCN_V2](models/rank/dcn_v2/) | - ||| >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)|
 | Rank | [AITM](models/rank/aitm/) | - ||| >=2.1.0 | [KDD 2021][Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489v2.pdf) |
+| Rank | [DSIN](models/rank/dsin/) | - ||| >=2.1.0 | [IJCAI 2019][Deep Session Interest Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.06482v1.pdf) |
 | Multi-Task | [PLE](models/multitask/ple/)([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
 | Multi-Task | [ESMM](models/multitask/esmm/)([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
 | Multi-Task | [MMOE](models/multitask/mmoe/)([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |

README_EN.md

Lines changed: 1 addition & 0 deletions
@@ -161,6 +161,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 | Rank | [AutoFIS](models/rank/autofis/) | - ||| >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
 | Rank | [DCN_V2](models/rank/dcn_v2/) | - ||| >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)|
 | Rank | [AITM](models/rank/aitm/) | - ||| >=2.1.0 | [KDD 2021][Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489v2.pdf) |
+| Rank | [DSIN](models/rank/dsin/) | - ||| >=2.1.0 | [IJCAI 2019][Deep Session Interest Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.06482v1.pdf) |
 | Multi-Task | [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
 | Multi-Task | [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
 | Multi-Task | [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
mkdir raw_data
cd raw_data
wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz
tar -zxvf user_profile.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz
tar -zxvf raw_sample.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz
tar -zxvf behavior_log.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz
tar -zxvf ad_feature.csv.tar.gz
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Ali_Display_Ad_Click dataset
[Ali_Display_Ad_Click](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) is a click-through-rate prediction dataset for Taobao display advertising, provided by Alibaba.

## Raw dataset
- Raw sample skeleton raw_sample: ad display/click logs (26 million records) of 1.14 million users randomly sampled from the Taobao website over 8 days, forming the raw sample skeleton
1. user: anonymized user ID;
2. adgroup_id: anonymized ad unit ID;
3. time_stamp: timestamp;
4. pid: resource position (ad slot);
5. nonclk: 1 means not clicked; 0 means clicked;
6. clk: 0 means not clicked; 1 means clicked;

```
user,time_stamp,adgroup_id,pid,nonclk,clk
581738,1494137644,1,430548_1007,1,0
```

- Ad basic information table ad_feature: covers the basic information of all ads in raw_sample
1. adgroup_id: anonymized ad ID;
2. cate_id: anonymized product category ID;
3. campaign_id: anonymized ad campaign ID;
4. customer: anonymized advertiser ID;
5. brand: anonymized brand ID;
6. price: price of the item
```
adgroup_id,cate_id,campaign_id,customer,brand,price
63133,6406,83237,1,95471,170.0
```

- User basic information table user_profile: covers the basic information of all users in raw_sample
1. userid: anonymized user ID;
2. cms_segid: micro-group ID;
3. cms_group_id: cms_group_id;
4. final_gender_code: gender, 1: male, 2: female;
5. age_level: age level; 1234
6. pvalue_level: consumption level, 1: low, 2: mid, 3: high;
7. shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user
8. occupation: whether the user is a college student, 1: yes, 0: no
9. new_user_class_level: city tier
```
userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level
234,0,5,2,5,,3,0,3
```

- User behavior log behavior_log: covers the shopping behavior of all users in raw_sample over 22 days
1. user: anonymized user ID;
2. time_stamp: timestamp;
3. btag: behavior type, one of four: (pv: browse), (cart: add to cart), (fav: favorite), (buy: purchase)
4. cate: anonymized product category ID;
5. brand: anonymized brand ID;
```
user,time_stamp,btag,cate,brand
558157,1493741625,pv,6250,91286
```
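
As a minimal sketch of how the raw_sample columns fit together, the snippet below parses the header and sample row quoted above with Python's standard `csv` module and checks that `nonclk` and `clk` are complementary. The function name `read_raw_sample` and the choice of output fields are illustrative assumptions, not part of the PaddleRec preprocessing code.

```python
import csv
import io

# Header and sample row copied from the raw_sample description.
RAW_SAMPLE_CSV = """user,time_stamp,adgroup_id,pid,nonclk,clk
581738,1494137644,1,430548_1007,1,0
"""


def read_raw_sample(text):
    """Parse raw_sample rows into dicts with an integer click label.

    Hypothetical helper for illustration only.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        clk = int(row["clk"])
        nonclk = int(row["nonclk"])
        # nonclk and clk encode the same event, so exactly one of them is 1.
        if clk + nonclk != 1:
            raise ValueError(f"inconsistent labels in row: {row}")
        rows.append({
            "user": row["user"],
            "adgroup_id": row["adgroup_id"],
            "time_stamp": int(row["time_stamp"]),
            "pid": row["pid"],
            "clk": clk,
        })
    return rows


samples = read_raw_sample(RAW_SAMPLE_CSV)
print(samples[0]["clk"])  # prints 0: the sample impression was not clicked
```

Because the two label columns carry the same information, a reader only needs to keep `clk` once the consistency check has passed.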
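
## Preprocessed dataset
The four files of the raw dataset are processed following [the data preprocessing of the original paper](https://github.com/shenweichen/DSIN/tree/master/code), producing a dataset that meets the conditions of the DSIN paper and can be read directly by the reader.
The dataset consists of eight pkl files, four each for the training set and the test set. Taking the training set as an example, the four files are train_feat_input.pkl, train_sess_input.pkl, train_session_length.pkl and train_label.pkl. After sampling at a ratio of 0.25, they store, respectively, the user and item feature inputs, the user session feature inputs, the user session lengths, and the labels.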
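
A minimal sketch of loading one split with the standard `pickle` module follows. The file-name suffixes are taken from the download script in this commit (`*_feat_input.pkl`, `*_sess_input.pkl`, `*_session_length.pkl`, `*_label.pkl`); the helper name `load_split` is hypothetical, and the structure of the unpickled objects is not specified here, so the sketch makes no assumption about it.

```python
import os
import pickle

# Suffixes as they appear in the download script of this dataset.
SUFFIXES = ("feat_input", "sess_input", "session_length", "label")


def load_split(directory, split):
    """Load the four pkl files of one split ("train" or "test") into a dict.

    Hypothetical helper; the returned values are whatever objects were pickled.
    """
    data = {}
    for suffix in SUFFIXES:
        path = os.path.join(directory, f"{split}_{suffix}.pkl")
        with open(path, "rb") as f:
            data[suffix] = pickle.load(f)
    return data
```

For example, `load_split("big_train", "train")` would read the training files once they have been downloaded. Since `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source.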
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
mkdir big_train
mkdir big_test
# The URL is quoted because it contains "&", which an unquoted shell treats as a command separator.
wget -O model_input.tar.gz "https://bj.bcebos.com/v1/ai-studio-online/53e61a9bcfc54e0581044883d0f876d9841cb4d0a68848f1a1d568a84591da6f?responseContentDisposition=attachment%3B%20filename%3Dmodel_input.tar.gz&authorization=bce-auth-v1%2F0ef6765c1e494918bc0d4c3ca3e5c6d1%2F2022-04-21T01%3A43%3A00Z%2F-1%2F%2F665a728726f0569e1ef9dd423adfa40a2a5e798f86a8d5d68804a2f21cc03624"
tar -zxvf model_input.tar.gz
mv model_input/test_feat_input.pkl big_test/
mv model_input/test_label.pkl big_test/
mv model_input/test_sess_input.pkl big_test/
mv model_input/test_session_length.pkl big_test/
mv model_input/train_feat_input.pkl big_train/
mv model_input/train_label.pkl big_train/
mv model_input/train_sess_input.pkl big_train/
mv model_input/train_session_length.pkl big_train/
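
After the script above has run, the eight files should sit in `big_train/` and `big_test/`. The following is a small sketch, under that assumption, of a check that reports which expected files are absent; `missing_files` and the two constants are hypothetical names introduced only for this illustration.

```python
from pathlib import Path

# Folder-to-split mapping and suffixes, read off the mv commands above.
EXPECTED = {"big_train": "train", "big_test": "test"}
SUFFIXES = ("feat_input", "label", "sess_input", "session_length")


def missing_files(root="."):
    """Return the expected pkl paths that do not exist under root."""
    root = Path(root)
    missing = []
    for folder, split in EXPECTED.items():
        for suffix in SUFFIXES:
            path = root / folder / f"{split}_{suffix}.pkl"
            if not path.is_file():
                missing.append(str(path))
    return missing
```

An empty list from `missing_files()` suggests the download and the moves completed; a non-empty list names the files to look for.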

doc/imgs/dsin.png

141 KB

doc/source/models/rank/dsin.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
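# dsin (Deep Session Interest Network for Click-Through Rate Prediction)

For the code, see [dsin](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/rank/dsin).
If our code is useful to you, please give us a star.

## Contents

- [Model introduction](#model-introduction)
- [Data preparation](#data-preparation)
- [Runtime environment](#runtime-environment)
- [Quick start](#quick-start)
- [Model architecture](#model-architecture)
- [Reproducing the results](#reproducing-the-results)
- [Advanced usage](#advanced-usage)
- [FAQ](#FAQ)

## Model introduction
This model focuses on the user's historical session behavior. It learns from the historical sessions with Self-Attention and a BiLSTM, then uses an Activation Unit to obtain the final session representation vector, which is combined with the other features and fed into an MLP to compute the final CTR score. The paper [Deep Session Interest Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.06482v1.pdf) uses a Transformer and a BiLSTM to learn the user's Session Interest Interacting, which improves the model's expressive power.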
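
To make the pooling step concrete, here is a simplified sketch of attention pooling over session vectors with respect to a target item vector. It is not the PaddleRec implementation: the Activation Unit in the paper scores each session through a learned weight matrix, which this sketch replaces with a plain dot product (equivalent to taking that matrix as the identity) purely for illustration. The name `attention_pool` and the toy vectors are assumptions.

```python
import math


def attention_pool(sessions, target):
    """Pool session vectors with softmax weights from their dot product with target.

    Simplified stand-in for an activation unit: no learned parameters.
    """
    scores = [sum(s_i * t_i for s_i, t_i in zip(s, target)) for s in sessions]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_score) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(target)
    # Weighted sum of the session vectors, one output dimension at a time.
    pooled = [sum(w * s[d] for w, s in zip(weights, sessions)) for d in range(dim)]
    return weights, pooled


# Toy example: three 2-dimensional session vectors and one target item vector.
sessions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [1.0, 0.0]
weights, pooled = attention_pool(sessions, target)
```

Sessions that align with the target item receive larger weights, so the pooled vector is dominated by the sessions most relevant to the candidate ad.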
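
## Data preparation
This model uses the dataset from the paper, the Alimama Dataset, processed following [the original authors' data preprocessing](https://github.com/shenweichen/DSIN/tree/master/code). Sample data for a quick run is provided in the data directory under the model directory; to use the full dataset, see the [Reproducing the results](#reproducing-the-results) section below.

## Runtime environment
PaddlePaddle>=2.0

python 3.5/3.6/3.7

os : windows/linux/macos

## Quick start
Sample data is provided so that you can try the model quickly; the commands can be run from any directory. The quick-run commands from the DSIN model directory are as follows:
```bash
# Enter the model directory
# cd models/rank/dsin # can be run from any directory
# Dynamic-graph training
python -u ../../../tools/trainer.py -m config.yaml # use config_bigdata.yaml to run on the full dataset
# Dynamic-graph inference
python -u ../../../tools/infer.py -m config.yaml

# Static-graph training
python -u ../../../tools/static_trainer.py -m config.yaml # use config_bigdata.yaml to run on the full dataset
# Static-graph inference
python -u ../../../tools/static_infer.py -m config.yaml
```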
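
The four commands above differ only in the script name and the config file. As a rough sketch of that pattern, the helper below assembles the argument list for a given graph mode and stage. The script names, the `-u` and `-m` flags, and the relative `tools` path are taken from the commands above; `build_command` and `SCRIPTS` are hypothetical names, not part of PaddleRec.

```python
# Script names as used in the quick-start commands.
SCRIPTS = {
    ("dynamic", "train"): "trainer.py",
    ("dynamic", "infer"): "infer.py",
    ("static", "train"): "static_trainer.py",
    ("static", "infer"): "static_infer.py",
}


def build_command(graph, stage, config="config.yaml", tools_dir="../../../tools"):
    """Return the argument list for one quick-start command.

    Hypothetical helper; raises KeyError for an unknown graph/stage pair.
    """
    script = SCRIPTS[(graph, stage)]
    return ["python", "-u", f"{tools_dir}/{script}", "-m", config]


print(" ".join(build_command("static", "train", "config_bigdata.yaml")))
```

The returned list can be passed to `subprocess.run` from the model directory, which avoids retyping the relative path for each of the four variants.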

## Model architecture
The network structure from the paper [Deep Session Interest Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.06482v1.pdf) is shown in the figure:
<p align="center">
<img align="center" src="../../../doc/imgs/dsin.png">
</p>

## Reproducing the results
To help users get every model running quickly, we provide sample data with each model. To reproduce the results in the readme, follow the steps below in order.
The metrics of the model on the full dataset are as follows:

| Model | auc | batch_size | epoch_num | Time of each epoch |
| :------| :------ | :------ | :------| :------ |
| DSIN | 0.6356 | 4096 | 1 | about 10 minutes |
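
For readers unfamiliar with the auc column, the sketch below computes the area under the ROC curve with the rank-sum (Mann-Whitney U) formulation in plain Python. It only illustrates what the metric measures and is not the metric code used by the PaddleRec trainer; the function name `auc` and the toy inputs are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation, with tied scores averaged."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of equal scores pairs[i:j] and give it the average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of clicked over non-clicked impressions.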
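
1. Make sure your current directory is PaddleRec/models/rank/dsin
2. Go to the paddlerec/datasets/Ali_Display_Ad_Click_DSIN directory and run the script there. It downloads our preprocessed full Alimama dataset from a server in China and extracts it into the designated folders. If you prefer to process the raw dataset yourself, see the readme in that directory.

``` bash
cd ../../../datasets/Ali_Display_Ad_Click_DSIN
sh run.sh
```
3. Switch back to the model directory and run on the full dataset

```bash
cd - # switch back to the model directory
# Dynamic-graph training
python -u ../../../tools/trainer.py -m config_bigdata.yaml # run on the full dataset with config_bigdata.yaml
# Dynamic-graph inference
python -u ../../../tools/infer.py -m config_bigdata.yaml # run on the full dataset with config_bigdata.yaml
```

For the reproduction process, see the [AI Studio project](https://aistudio.baidu.com/aistudio/projectdetail/3850087).

Note: the runtime environment is the premium ("至尊") GPU.

## Advanced usage

## FAQ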

doc/source/readme.md

Lines changed: 1 addition & 0 deletions
@@ -49,3 +49,4 @@
 [deeprec](https://paddlerec.readthedocs.io/en/latest/models/rank/deeprec.html)
 [autofis](https://paddlerec.readthedocs.io/en/latest/models/rank/autofis.html)
 [aitm](https://paddlerec.readthedocs.io/en/latest/models/rank/aitm.html)
+[dsin](https://paddlerec.readthedocs.io/en/latest/models/rank/dsin.html)

models/rank/dnn/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ runner:
 infer_load_path: "output_model_dnn"
 infer_start_epoch: 0
 infer_end_epoch: 3
+num_workers: 0

 # distribute_config
 sync_mode: "async"

models/rank/dnn/config_bigdata.yaml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ runner:
 infer_load_path: "output_model_dnn_all"
 infer_start_epoch: 0
 infer_end_epoch: 4
+num_workers: 0

 #thread_num: 5
 #reader_type: "QueueDataset" # DataLoader / QueueDataset / RecDataset
