Commit ec5a1a4

Add MetaHeac model
PR Add new model metaheac;
1 parent 8004404 commit ec5a1a4

15 files changed, 494 additions and 135 deletions

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -180,6 +180,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # static graph train
 | Multi-Task | [Maml](models/multitask/maml/)([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412) | x | x | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf) |
 | Multi-Task | [DSelect_K](models/multitask/dselect_k/)([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html)) | - | x | x | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf) |
 | Multi-Task | [ESCM2](models/multitask/escm2/) | - | x | x | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf) |
+| Multi-Task | [MetaHeac](models/multitask/metaheac/) | - | x | x | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf) |
 | Re-Rank | [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/) | - || x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |

README_EN.md

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 | Multi-Task | [Maml](models/multitask/maml/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412) | x | x | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf) |
 | Multi-Task | [DSelect_K](models/multitask/dselect_k/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html)) | - | x | x | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf) |
 | Multi-Task | [ESCM2](models/multitask/escm2/) | - | x | x | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf) |
+| Multi-Task | [MetaHeac](models/multitask/metaheac/) | - | x | x | >=2.1.0 | [KDD 2021][Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688.pdf) |
 | Re-Rank | [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/) | - || x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |

 <h2 align="center">Community</h2>

contributor.md

Lines changed: 1 addition & 0 deletions
@@ -21,5 +21,6 @@
 | [MHCN](models/recall/mhcn/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/679 | Paper Reproduction Challenge, Round 5 |
 | [DCN_V2](models/rank/dcn_v2/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/677 | Paper Reproduction Challenge, Round 5 |
 | [SIGN](models/rank/sign/) | [BamLubi](https://github.com/BamLubi) | https://github.com/PaddlePaddle/PaddleRec/pull/748 | Paper Reproduction Challenge, Round 6 |
+| [MetaHeac](models/multitask/metaheac/) | [simuler](https://github.com/simuler) | https://github.com/PaddlePaddle/PaddleRec/pull/788 | Paper Reproduction Challenge, Round 6 |

 </div>

doc/imgs/metaheac.png

68.7 KB

doc/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -91,6 +91,7 @@
    models/multitask/ple.md
    models/multitask/share_bottom.md
    models/multitask/dselect_k.md
+   models/multitask/metaheac.md

 .. toctree::
    :maxdepth: 1

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# MetaHeac

The directory structure of this example, with a short description of each entry:

```
├── data # sample data
    ├── train # training data
        ├── train_stage1.pkl
    ├── test # test data
        ├── test_stage1.pkl
        ├── test_stage2.pkl
├── net.py # core model network
├── config.yaml # configuration for the sample data
├── config_big.yaml # configuration for the full dataset
├── dygraph_model.py # builds the dynamic graph
├── reader_train.py # training data reader
├── reader_test.py # infer data reader
├── readme.md # documentation
```

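Note: before reading this example, we recommend getting familiar with the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

## Contents

- [Model introduction](#model-introduction)
- [Data preparation](#data-preparation)
- [Environment](#environment)
- [Quick start](#quick-start)
- [Model network](#model-network)
- [Reproducing the results](#reproducing-the-results)
- [Infer notes](#infer-notes)
- [Advanced usage](#advanced-usage)
- [FAQ](#FAQ)
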
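## Model introduction
On recommender systems and advertising platforms, marketers want to promote products, content, or ads to potential users through media channels such as video or social feeds. Audience expansion (look-alike modeling) is an effective approach, but it usually faces two challenges: (1) a company may run hundreds of marketing campaigns per day, promoting content from entirely different categories; (2) the seed set of a campaign covers only a limited number of users, so a customized model trained on those few seed users tends to overfit severely. To address these challenges, the paper "Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising" proposes a two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which uses meta-learning to train a generalized initialization model that can quickly adapt to promotion tasks for new content categories.
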
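
The idea of "a shared initialization that adapts quickly" can be illustrated with a toy example. The sketch below is not the MetaHeac training code: it assumes a one-parameter regression model, a first-order approximation of the meta-gradient, and made-up learning rates and task slopes, purely to show the inner (per-task) and outer (shared) update loops.

```python
import random


def loss_and_grad(w, batch):
    """Mean squared error of the model y = w * x, and its gradient with respect to w."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2 * x * (w * x - y) for x, y in batch) / n
    return loss, grad


def adapt(w, support, local_lr=0.1, steps=1):
    """Inner loop: a few gradient steps on one task's small support set."""
    for _ in range(steps):
        _, grad = loss_and_grad(w, support)
        w = w - local_lr * grad
    return w


def meta_train(tasks, meta_lr=0.05, epochs=200, w=0.0):
    """Outer loop (first-order): move the shared init along the query-set
    gradient measured after the task-specific adaptation."""
    for _ in range(epochs):
        for support, query in tasks:
            w_task = adapt(w, support)
            _, grad = loss_and_grad(w_task, query)
            w = w - meta_lr * grad
    return w


def make_task(slope, rng, n=8):
    """Hypothetical task: noise-free points on y = slope * x, split into support and query."""
    xs = [rng.uniform(-1, 1) for _ in range(2 * n)]
    points = [(x, slope * x) for x in xs]
    return points[:n], points[n:]


rng = random.Random(2021)
train_tasks = [make_task(slope, rng) for slope in (1.5, 2.0, 2.5)]
w_init = meta_train(train_tasks)

# A new task is handled by one cheap adaptation step from the shared init.
new_support, new_query = make_task(2.2, rng)
w_adapted = adapt(w_init, new_support)
loss_before, _ = loss_and_grad(w_init, new_query)
loss_after, _ = loss_and_grad(w_adapted, new_query)
print(f"shared init w={w_init:.3f}, query loss {loss_before:.4f} -> {loss_after:.4f}")
```

The shared parameter settles among the training slopes, so a single step on a handful of new samples already reduces the loss on the new task.
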
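## Data preparation
This example uses the Tencent Look-alike Dataset, which contains several hundred seed audiences, user features for a massive candidate population, and ad features for each seed audience. For business data security, all data has been anonymized. This reproduction uses a preprocessed dataset, which can be downloaded directly: [preprocessed data](https://drive.google.com/file/d/11gXgf_yFLnbazjx24ZNb_Ry41MI5Ud1g/view?usp=sharing). The metaheac/data/ directory holds a small sample drawn from the full dataset, used to check model alignment.
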
## Environment
PaddlePaddle>=2.0

python 2.7/3.5/3.6/3.7

os : windows/linux/macos

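## Quick start
Sample data is provided so that you can try the model quickly; the commands can be run from any directory. From the MetaHeac model directory, the quick-start commands are:
```bash
# Enter the model directory
# cd PaddleRec/models/multitask/metaheac/ # can be run from any directory
# Dynamic-graph training
python -u ../../../tools/trainer.py -m config.yaml # for the full dataset, use config_big.yaml
# Dynamic-graph inference
python -u ./infer_meta.py -m config.yaml
```
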
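## Model network
MetaHeac was published at KDD 2021 in the paper [Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising](https://arxiv.org/pdf/2105.14688). The paper proposes a two-stage framework, Meta Hybrid Experts and Critics (MetaHeac), which addresses two key problems in real-world scenarios: it is hard to build a single generalized model that expands high-quality audience candidates across all content domains, and a customized model trained on limited seed users tends to overfit severely. The main network structure of the model is shown below:
[MetaHeac](https://arxiv.org/pdf/2105.14688):
<p align="center">
<img align="center" src="../../../doc/imgs/metaheac.png">
</p>
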
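
The text above does not spell out the layers, so the following is only a generic sketch of how a gated mixture of experts combines several expert outputs. It is not the code in net.py: the vector sizes, the random weights, and the choice to drive the gate with a task vector are assumptions made for illustration; only the expert count mirrors `num_expert: 8` from config.yaml.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a rows x cols weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def mix_experts(user_vec, task_vec, expert_weights, gate_weights):
    """Return the gate-weighted sum of all expert outputs, plus the gate itself."""
    gate = softmax(linear(gate_weights, task_vec))  # one weight per expert
    outputs = [linear(w, user_vec) for w in expert_weights]
    out_dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gate, outputs)) for i in range(out_dim)]
    return mixed, gate


NUM_EXPERT = 8  # mirrors num_expert: 8 in config.yaml
IN_DIM, TASK_DIM, OUT_DIM = 4, 3, 2  # illustrative sizes, not taken from net.py

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


expert_weights = [rand_matrix(OUT_DIM, IN_DIM) for _ in range(NUM_EXPERT)]
gate_weights = rand_matrix(NUM_EXPERT, TASK_DIM)
user_vec = [rng.uniform(-1, 1) for _ in range(IN_DIM)]
task_vec = [rng.uniform(-1, 1) for _ in range(TASK_DIM)]

mixed, gate = mix_experts(user_vec, task_vec, expert_weights, gate_weights)
print("gate weights sum to", round(sum(gate), 6), "and the mixed output has", len(mixed), "values")
```

Because the gate depends on the task vector, different tasks can lean on different experts while all of them share the same expert parameters.
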
## Reproducing the results
To help users get each model running quickly, sample data is provided under every model. To reproduce the results in this readme, follow the steps below in order.
Metrics on the full dataset (obtained with paddle.seed = 2021 in train.py):

| Model | AUC | batch_size | epoch_num | Time per epoch |
|:------|:-------| :------ | :------| :------ |
| MetaHeac | 0.7112 | 1024 | 1 | about 3 hours |

1. Make sure your current directory is PaddleRec/models/multitask/metaheac
2. Go to the paddlerec/datasets/ directory and run the script below; it downloads our preprocessed full Lookalike dataset from a server hosted in China and extracts it to the target folder.
``` bash
cd ../../../datasets/Lookalike
sh run.sh
```
3. Switch back to the model directory and run on the full dataset
```bash
cd ../../models/multitask/metaheac/ # switch back to the model directory
# Dynamic-graph training
# step1: train
python -u ../../../tools/trainer.py -m config_big.yaml
# Dynamic-graph inference
# step2: infer (the test dataset is hot at this point)
python -u ./infer_meta.py -m config_big.yaml
# step3: change test_data_dir in config_big.yaml to the cold path
# python -u ./infer_meta.py -m config_big.yaml
```
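
Step 3 is a manual edit of config_big.yaml. If you prefer to script it, the following is one possible sketch: `switch_test_split` is a hypothetical helper (not part of PaddleRec) that rewrites only the active `test_data_dir` line and leaves the commented-out alternative untouched. It assumes the two directories are named `test_hot_data` and `test_cold_data`, as they are in config_big.yaml.

```python
import re

# Matches only the active (uncommented) test_data_dir line.
_ACTIVE_TEST_DIR = re.compile(
    r'^([ \t]*test_data_dir:[ \t]*"[^"\n]*/test_)(?:hot|cold)(_data")',
    re.MULTILINE,
)


def switch_test_split(config_text, split):
    """Return config_text with the active test_data_dir pointing at the given split."""
    if split not in ("hot", "cold"):
        raise ValueError("split must be 'hot' or 'cold'")
    return _ACTIVE_TEST_DIR.sub(lambda m: m.group(1) + split + m.group(2), config_text)


# Excerpt shaped like the runner section of config_big.yaml.
sample = '''runner:
  test_data_dir: "../../../datasets/Lookalike/test_hot_data"
  # test_data_dir: "../../../datasets/Lookalike/test_cold_data"
'''

cold_config = switch_test_split(sample, "cold")
print(cold_config)
```

To apply it to the real file, read config_big.yaml into a string, pass it through the helper, and write the result back before running the infer command again.
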
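
## Infer notes
### Dataset notes
To test how the model performs on content-targeting promotion tasks of different scales, the dataset is split by the candidate set size of each task into two parts: larger than T and smaller than T. For the Look-alike dataset from the Tencent Ads Competition 2018, T is set to 4000; the hot dataset contains the tasks whose candidate set is larger than T, and the cold dataset contains the tasks whose candidate set is smaller than T.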
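
The split rule can be written down in a few lines. This is a minimal sketch, not the preprocessing script used for the dataset: the task ids and sizes are invented, and because the text only defines "larger than T" and "smaller than T", a task whose size equals T is left out of both parts here.

```python
def split_hot_cold(candidate_sizes, threshold=4000):
    """Split tasks by candidate set size: hot if size > threshold, cold if size < threshold."""
    hot = {task: size for task, size in candidate_sizes.items() if size > threshold}
    cold = {task: size for task, size in candidate_sizes.items() if size < threshold}
    return hot, cold


# Hypothetical task ids and candidate set sizes.
sizes = {"task_a": 12000, "task_b": 350, "task_c": 4001, "task_d": 3999}
hot, cold = split_hot_cold(sizes)
print("hot:", sorted(hot), "cold:", sorted(cold))
```
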
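### Notes on infer_meta.py
infer_meta.py is a tool for running infer on meta-learning models. Keep the following points in mind when using it:
1. When running infer (the same works for train), you can comment out runner.infer_batch_size. This disables the DataLoader's automatic batching, so you can batch the data yourself.
2. Because meta-learning infer first trains on a small amount of task-specific data, the infer_dataloader in infer_meta.py receives the full infer dataset of a single sub-task at a time (both its training data and its test data).
3. The actual batching happens in infer.py: after the data of a single sub-task is fetched, the batch_size parameter is read from the config, the training and test data are batched, and infer_train_forward and infer_forward in dygraph_model.py are called for training and testing respectively.
4. Unlike ordinary infer, each sub-task needs a small amount of train and test, so every sub-task loads the generalized model produced by the train stage.
5. During infer on a single sub-task, a local paddle.metric.Auc("ROC") is created so that the AUC of each sub-task can be inspected; the global metric maintains the AUC over all sub-tasks.
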
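
The points above can be sketched as a plain-Python loop. This is not the code of infer_meta.py: `train_step`, `predict`, the dictionary model, and the two toy tasks are hypothetical stand-ins for infer_train_forward, infer_forward, and the Paddle model, and the AUC is computed by pair counting instead of paddle.metric.Auc.

```python
import copy


def iter_batches(samples, batch_size):
    """Manual batching: yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def infer_tasks(generic_model, tasks, batch_size, train_step, predict):
    """Per sub-task: copy the generic model, fine-tune on the task's train split,
    score its test split, and track both a local and a global AUC."""
    all_labels, all_scores, per_task_auc = [], [], {}
    for task_id, (train_data, test_data) in tasks.items():
        model = copy.deepcopy(generic_model)  # every sub-task starts from the same init
        for batch in iter_batches(train_data, batch_size):
            train_step(model, batch)
        labels, scores = [], []
        for batch in iter_batches(test_data, batch_size):
            labels += [y for _, y in batch]
            scores += predict(model, batch)
        per_task_auc[task_id] = auc(labels, scores)
        all_labels += labels
        all_scores += scores
    return per_task_auc, auc(all_labels, all_scores)


def train_step(model, batch):
    """Toy update: push w toward the sign that separates positives from negatives."""
    model["w"] += 0.1 * sum((2 * y - 1) * x for x, y in batch) / len(batch)


def predict(model, batch):
    """Toy scorer: one score per (feature, label) sample."""
    return [model["w"] * x for x, _ in batch]


generic_model = {"w": 0.0}
tasks = {
    "A": ([(1.0, 1), (-1.0, 0)], [(2.0, 1), (-2.0, 0), (0.5, 1), (-0.5, 0)]),
    "B": ([(-1.0, 1), (1.0, 0)], [(-2.0, 1), (2.0, 0)]),
}
per_task_auc, global_auc = infer_tasks(generic_model, tasks, 2, train_step, predict)
print(per_task_auc, global_auc)
```

The two toy tasks need opposite signs of `w`, which is why each one adapts its own copy; the generic model itself is never modified.
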
## Advanced usage

## FAQ

models/multitask/metaheac/config.yaml

Lines changed: 5 additions & 6 deletions
@@ -16,17 +16,16 @@
 runner:
   train_data_dir: "./data/train"
   train_reader_path: "reader_train" # importlib format
-  use_gpu: True
+  use_gpu: False
   use_auc: True
   # train_batch_size: 32
   epochs: 1
-  print_interval: 100
-  #model_init_path: "output_model_esmm/2" # init model
-  model_save_path: "output_model_esmm"
+  print_interval: 1
+  model_save_path: "output_model_metaheac"
   test_data_dir: "./data/test"
   # infer_batch_size: 32
   infer_reader_path: "reader_infer" # importlib format
-  infer_load_path: "output_model_esmm"
+  infer_load_path: "output_model_metaheac"
   infer_start_epoch: 0
   infer_end_epoch: 1
   #use inference save model
@@ -41,7 +40,7 @@ hyper_parameters:
   num_expert: 8
   num_output: 5
   task_count: 5
-  batch_size: 2
+  batch_size: 32

   optimizer:
     class: adam

models/multitask/metaheac/config_big.yaml

Lines changed: 4 additions & 5 deletions
@@ -21,13 +21,12 @@ runner:
   # train_batch_size: 32
   epochs: 1
   print_interval: 100
-  #model_init_path: "output_model_esmm/2" # init model
-  model_save_path: "output_model_esmm"
-  # test_data_dir: "../../../datasets/Lookalike/test_hot_data"
-  test_data_dir: "../../../datasets/Lookalike/test_cold_data"
+  model_save_path: "output_model_metaheac_all"
+  test_data_dir: "../../../datasets/Lookalike/test_hot_data"
+  # test_data_dir: "../../../datasets/Lookalike/test_cold_data"
   # infer_batch_size: 32
   infer_reader_path: "reader_infer" # importlib format
-  infer_load_path: "output_model_esmm"
+  infer_load_path: "output_model_metaheac_all"
   infer_start_epoch: 0
   infer_end_epoch: 1
   #use inference save model
-10.4 KB
Binary file not shown.
-10.4 KB
Binary file not shown.
