|
1 | | -# MMOE |
2 | | - |
3 | | - A brief overview of this example's directory structure: |
4 | | - |
5 | | -``` |
6 | | -├── data # data directory |
7 | | - ├── train # training data |
8 | | - ├── train_data.txt |
9 | | - ├── test # test data |
10 | | - ├── test_data.txt |
11 | | - ├── run.sh |
12 | | - ├── data_preparation.py |
13 | | -├── __init__.py |
14 | | -├── config.yaml # configuration file |
15 | | -├── census_reader.py # data reader |
16 | | -├── model.py # model definition |
17 | | -``` |
18 | | - |
19 | | -Note: before reading this example, we recommend familiarizing yourself with the following: |
20 | | - |
21 | | -[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md) |
22 | | - |
23 | | -## Contents |
24 | | - |
25 | | -- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#模型简介) |
26 | | -- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#数据准备) |
27 | | -- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#运行环境) |
28 | | -- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#快速开始) |
29 | | -- [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) |
30 | | -- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#进阶使用) |
31 | | -- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#FAQ) |
32 | | - |
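33 | | -## Model overview |
34 | | - |
35 | | -Multi-task models can improve the learning efficiency and quality of each task by learning the connections and differences between tasks. Multi-task learning frameworks widely adopt the shared-bottom structure, in which different tasks share the bottom hidden layers. This structure inherently reduces the risk of overfitting, but its performance may suffer from task differences and data distribution. The paper [Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts]( https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture- ) proposes a multi-task learning structure called Multi-gate Mixture-of-Experts (MMOE). MMOE models task relationships and learns task-specific functions on top of shared representations, without a significant increase in parameters. |
36 | | - |
37 | | -We define the MMOE network structure in PaddlePaddle and validate the model on the open-source Census-income Data dataset. The AUC for the two tasks is: |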
38 | | - |
39 | | -1.income |
40 | | - |
41 | | -> max_mmoe_test_auc_income:0.94937 |
42 | | -> |
43 | | -> mean_mmoe_test_auc_income:0.94465 |
44 | | -
|
45 | | -2.marital |
46 | | - |
47 | | -> max_mmoe_test_auc_marital:0.99419 |
48 | | -> |
49 | | -> mean_mmoe_test_auc_marital:0.99324 |
50 | | -
|
51 | | -To verify accuracy, see the [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) section. |
52 | | - |
53 | | -Supported features |
| 1 | +# MMOE |
| 2 | + |
| 3 | + A brief overview of this example's directory structure: |
| 4 | + |
| 5 | +``` |
| 6 | +├── data # data directory |
| 7 | + ├── train # training data |
| 8 | + ├── train_data.txt |
| 9 | + ├── test # test data |
| 10 | + ├── test_data.txt |
| 11 | + ├── run.sh |
| 12 | + ├── data_preparation.py |
| 13 | +├── __init__.py |
| 14 | +├── config.yaml # configuration file |
| 15 | +├── census_reader.py # data reader |
| 16 | +├── model.py # model definition |
| 17 | +``` |
| 18 | + |
| 19 | +Note: before reading this example, we recommend familiarizing yourself with the following: |
| 20 | + |
| 21 | +[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md) |
| 22 | + |
| 23 | +## Contents |
| 24 | + |
| 25 | +- [Model overview](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#模型简介) |
| 26 | +- [Data preparation](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#数据准备) |
| 27 | +- [Runtime environment](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#运行环境) |
| 28 | +- [Quick start](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#快速开始) |
| 29 | +- [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) |
| 30 | +- [Advanced usage](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#进阶使用) |
| 31 | +- [FAQ](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#FAQ) |
| 32 | + |
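| 33 | +## Model overview |
| 34 | + |
| 35 | +Multi-task models can improve the learning efficiency and quality of each task by learning the connections and differences between tasks. Multi-task learning frameworks widely adopt the shared-bottom structure, in which different tasks share the bottom hidden layers. This structure inherently reduces the risk of overfitting, but its performance may suffer from task differences and data distribution. The paper [Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts]( https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture- ) proposes a multi-task learning structure called Multi-gate Mixture-of-Experts (MMOE). MMOE models task relationships and learns task-specific functions on top of shared representations, without a significant increase in parameters. |
| 36 | + |
| 37 | +We define the MMOE network structure in PaddlePaddle and validate the model on the open-source Census-income Data dataset. |
| 38 | + |
| 39 | +To verify accuracy, see the [Paper reproduction](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/multitask/mmoe#论文复现) section. |
| 40 | + |
| 41 | +Supported features |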
54 | 42 |
|
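55 | 43 | Training: single-machine CPU, single-machine single-GPU, single-machine multi-GPU, local simulated parameter-server training, and incremental training; for configuration see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md) |
56 | 44 | Inference: single-machine CPU, single-machine single-GPU; for configuration see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md) |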
57 | 45 |
|
58 | 46 | ## Data preparation |
59 | 47 |
|
60 | | -Data source: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz ) |
| 48 | +Data source: [Census-income Data](https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz ) |
| 49 | + |
| 50 | + |
| 51 | +The generated data is comma-separated: |
| 52 | + |
| 53 | +``` |
| 54 | +0,0,73,0,0,0,0,1700.09,0,0 |
| 55 | +``` |
| 56 | + |
| 57 | +For the full dataset, see the paper reproduction section. |
61 | 58 |
|
62 | | -After extracting the data, add the file paths to the run.sh script and run it. |
| 59 | +## Runtime environment |
63 | 60 |
|
64 | | -```sh |
65 | | -mkdir train_data |
66 | | -mkdir test_data |
67 | | -mkdir data |
68 | | -train_path="data/census-income.data" |
69 | | -test_path="data/census-income.test" |
70 | | -train_data_path="train_data/" |
71 | | -test_data_path="test_data/" |
72 | | -pip install -r requirements.txt |
73 | | -wget -P data/ https://archive.ics.uci.edu/ml/machine-learning-databases/census-income-mld/census.tar.gz |
74 | | -tar -zxvf data/census.tar.gz -C data/ |
| 61 | +PaddlePaddle>=1.7.2 |
75 | 62 |
|
76 | | -python data_preparation.py --train_path ${train_path} \ |
77 | | - --test_path ${test_path} \ |
78 | | - --train_data_path ${train_data_path}\ |
79 | | - --test_data_path ${test_data_path} |
| 63 | +python 2.7/3.5/3.6/3.7 |
80 | 64 |
|
| 65 | +PaddleRec >=0.1 |
| 66 | + |
| 67 | +os : windows/linux/macos |
| 68 | + |
| 69 | +## Quick start |
| 70 | + |
| 71 | +### Single-machine training |
| 72 | + |
| 73 | +CPU environment |
| 74 | + |
| 75 | +Set the device, epochs, and other parameters in config.yaml. |
| 76 | + |
| 77 | +``` |
| 78 | +dataset: |
| 79 | +- name: dataset_train |
| 80 | + batch_size: 5 |
| 81 | + type: QueueDataset |
| 82 | + data_path: "{workspace}/data/train" |
| 83 | + data_converter: "{workspace}/census_reader.py" |
| 84 | +- name: dataset_infer |
| 85 | + batch_size: 5 |
| 86 | + type: QueueDataset |
| 87 | + data_path: "{workspace}/data/train" |
| 88 | + data_converter: "{workspace}/census_reader.py" |
81 | 89 | ``` |
82 | 90 |
|
83 | | -The generated data is comma-separated: |
| 91 | +### Single-machine inference |
84 | 92 |
|
| 93 | +CPU environment |
| 94 | + |
| 95 | +Set epochs, device, and other parameters in config.yaml. |
85 | 96 | ``` |
86 | | -0,0,73,0,0,0,0,1700.09,0,0 |
| 97 | +- name: infer_runner |
| 98 | + class: infer |
| 99 | + init_model_path: "increment/0" |
| 100 | + device: cpu |
87 | 101 | ``` |
88 | 102 |
|
| 103 | +## Paper reproduction |
| 104 | + |
| 105 | +Data download: we provide preprocessed data hosted on Baidu Cloud that can be used for training directly. |
| 106 | + |
| 107 | +``` |
| 108 | +wget https://paddlerec.bj.bcebos.com/mmoe/train_data.csv |
| 109 | +wget https://paddlerec.bj.bcebos.com/mmoe/test_data.csv |
| 110 | +wget https://paddlerec.bj.bcebos.com/mmoe/config_all.yaml |
| 111 | +``` |
| 112 | + |
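| 113 | +To reproduce the paper's results with the full dataset from the original paper, change settings in config.yaml such as batch_size=32 and the GPU configuration; see config_all.yaml for reference. |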
89 | 114 |
|
90 | | -## Runtime environment |
91 | | - |
92 | | -PaddlePaddle>=1.7.2 |
93 | | - |
94 | | -python 2.7/3.5/3.6/3.7 |
95 | | - |
96 | | -PaddleRec >=0.1 |
97 | | - |
98 | | -os : windows/linux/macos |
99 | | - |
100 | | -## Quick start |
101 | | - |
102 | | -### Single-machine training |
103 | | - |
104 | | -CPU environment |
105 | | - |
106 | | -Set the device, epochs, and other parameters in config.yaml. |
107 | | - |
108 | | -``` |
109 | | -dataset: |
110 | | -- name: dataset_train |
111 | | - batch_size: 5 |
112 | | - type: QueueDataset |
113 | | - data_path: "{workspace}/data/train" |
114 | | - data_converter: "{workspace}/census_reader.py" |
115 | | -- name: dataset_infer |
116 | | - batch_size: 5 |
117 | | - type: QueueDataset |
118 | | - data_path: "{workspace}/data/train" |
119 | | - data_converter: "{workspace}/census_reader.py" |
120 | | -``` |
121 | | - |
122 | | -### Single-machine inference |
123 | | - |
124 | | -CPU environment |
125 | | - |
126 | | -Set epochs, device, and other parameters in config.yaml. |
127 | | - |
128 | | -``` |
129 | | -- name: infer_runner |
130 | | - class: infer |
131 | | - init_model_path: "increment/0" |
132 | | - device: cpu |
133 | | -``` |
134 | | - |
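135 | | -## Paper reproduction |
136 | | - |
137 | | -To reproduce the paper's results with the full dataset from the original paper, set batch_size=1000, thread_num=8, and epoch_num=4 in config.yaml. |
138 | | - |
139 | 115 | Training on a single P100 GPU takes 6.5 h; test AUC: best 0.9940, mean 0.9932 |
140 | | - |
141 | | -How to run after these changes: set 'workspace' in config.yaml to the directory containing config.yaml, then run |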
142 | | - |
143 | | -``` |
144 | | -python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: pass the absolute path of the local config directly |
145 | | -``` |
146 | | - |
147 | | -## Advanced usage |
148 | | - |
| 116 | + |
| 117 | +``` |
| 118 | +python -m paddlerec.run -m /home/your/dir/config_all.yaml # debug mode: pass the absolute path of the local config directly |
| 119 | +``` |
| 120 | + |
| 121 | +## Advanced usage |
| 122 | + |
149 | 123 | ## FAQ |
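
Note on the model overview in this README: the description of MMOE (shared experts, one gate per task) can be illustrated with a minimal forward-pass sketch. This is not the code in `model.py`; it is a plain-Python illustration under assumed, hypothetical sizes (`feature_dim`, `expert_dim`, `num_experts`, `num_tasks`), random untrained weights, single-layer experts, and a single-unit sigmoid tower per task. It only shows how each task mixes the shared expert outputs through its own softmax gate.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def mmoe_forward(x, experts, gates, towers):
    """Return one probability per task for a single input vector x."""
    # All tasks share the same experts.
    expert_outs = [relu(linear(w, x)) for w in experts]
    width = len(expert_outs[0])
    outputs = []
    for gate_w, tower_w in zip(gates, towers):
        # Each task has its own gate: one softmax weight per expert.
        mix = softmax(linear(gate_w, x))
        # Task-specific representation: gate-weighted sum of expert outputs.
        mixed = [sum(g * e[j] for g, e in zip(mix, expert_outs)) for j in range(width)]
        # Hypothetical single-unit tower with a sigmoid output.
        logit = sum(w * m for w, m in zip(tower_w, mixed))
        outputs.append(1.0 / (1.0 + math.exp(-logit)))
    return outputs


# Hypothetical sizes chosen for illustration only.
rng = random.Random(0)
feature_dim, expert_dim, num_experts, num_tasks = 8, 4, 3, 2
experts = [random_matrix(expert_dim, feature_dim, rng) for _ in range(num_experts)]
gates = [random_matrix(num_experts, feature_dim, rng) for _ in range(num_tasks)]
towers = [[rng.uniform(-0.5, 0.5) for _ in range(expert_dim)] for _ in range(num_tasks)]
x = [rng.uniform(0.0, 1.0) for _ in range(feature_dim)]
probs = mmoe_forward(x, experts, gates, towers)
```

With two tasks (here standing in for income and marital status), `probs` holds two values in (0, 1). Because the experts are shared and only the gates and towers are per-task, adding a task adds few parameters, which matches the README's claim about avoiding a significant parameter increase.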