Commit 1cded00

Merge branch 'master' into multiview-simnet
2 parents c7fd7a7 + 03cec6d

10 files changed: +530 −221 lines changed

Lines changed: 146 additions & 0 deletions
# YouTube-DNN

Below is the directory structure of this example, with brief descriptions:

```
├── data                    # sample data
    ├── train
        ├── data.txt
    ├── test
        ├── data.txt
├── generate_ramdom_data    # random training-data generator
├── __init__.py
├── README.md               # this document
├── model.py                # model definition
├── config.yaml             # configuration file
├── data_prepare.sh         # one-click data-preparation script
├── reader.py               # data reader
├── infer.py                # inference script
```
Note: before reading this example, we recommend that you first go through the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

---
## Contents

- [Model Overview](#model-overview)
- [Data Preparation](#data-preparation)
- [Environment](#environment)
- [Quick Start](#quick-start)
- [Reproducing Paper Results](#reproducing-paper-results)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)
## Model Overview

[*Deep Neural Networks for YouTube Recommendations*](https://link.zhihu.com/?target=https%3A//static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) is the Google YouTube team's application of DNNs to recommendation, and a classic vectorized recall model. It learns embedding vectors for users and items and scores user–item similarity with an inner product to produce the final candidate set. YouTube uses a two-stage deep network for the full recommendation pipeline:

1. The first stage, the **Candidate Generation Model**, quickly filters candidate videos, reducing the candidate set from millions to hundreds.

2. The second stage, the **Ranking Model**, finely ranks those few hundred candidates.

This project implements the recall stage of the YouTube DNN, the Candidate Generation Model, in PaddlePaddle. It produces vector representations for users and items, so that items can subsequently be recommended to users by other means (e.g. cosine similarity between user and item vectors).

Since the original paper did not release its dataset, this project constructs random data to verify that the network is correct.

Supported features:

Training: single-machine CPU, single-machine single-GPU, locally simulated parameter-server training, and incremental training; for configuration see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md)

Inference: single-machine CPU, single-machine single-GPU; for configuration see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md)
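The candidate-generation tower can be summarized as: concatenate the three 64-dimensional input vectors, pass them through the fully connected stack configured as `layers: [128, 64, 32]`, and use the final layer's output as the user vector (the `l3` variable named in `config.yaml`). The following is only an illustrative numpy sketch of that forward pass with random weights; ReLU is assumed (as in the paper), and `model.py` remains the authoritative definition:

```python
import numpy as np

def forward(watch_vec, search_vec, other_feat, layer_sizes=(128, 64, 32), seed=0):
    """Illustrative forward pass of the candidate-generation tower.

    Input/output shapes follow config.yaml; weights are random and
    the ReLU activation is an assumption based on the original paper.
    """
    rng = np.random.default_rng(seed)
    # Concatenate the three dense features: (batch, 64 * 3)
    x = np.concatenate([watch_vec, search_vec, other_feat], axis=1)
    for size in layer_sizes:
        w = rng.normal(scale=0.1, size=(x.shape[1], size))
        x = np.maximum(x @ w, 0.0)  # fully connected layer + ReLU
    return x  # (batch, 32): the user vector ("l3" in the config)

user_vec = forward(*(np.random.rand(2, 64) for _ in range(3)))
print(user_vec.shape)  # (2, 32)
```

During training, this 32-dimensional vector is projected to `output_size: 100` classes and trained with a softmax over the label; at serving time only the user and item vectors are kept.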
## Data Preparation
Run `python generate_ramdom_data.py` to generate random training data. Each line has the following format:
```
#watch_vec;search_vec;other_feat;label
0.01,0.02,...,0.09;0.01,0.02,...,0.09;0.01,0.02,...,0.09;20
```
For convenience, a one-click data-preparation script is provided:
```
sh data_prepare.sh
```
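The line format above (three comma-joined float vectors plus an integer label, separated by `;`) can be produced with a few lines of Python. This is only a hedged sketch of what the repo's `generate_ramdom_data.py` does; the function name `make_line` is hypothetical:

```python
import random

def make_line(watch_dim=64, search_dim=64, other_dim=64, num_classes=100):
    """Build one sample line: watch_vec;search_vec;other_feat;label."""
    fields = []
    for dim in (watch_dim, search_dim, other_dim):
        vec = [round(random.random(), 4) for _ in range(dim)]
        fields.append(",".join(str(v) for v in vec))
    label = random.randint(0, num_classes - 1)  # random class in [0, 100)
    return ";".join(fields) + ";" + str(label)

# Small-dimension example so the line is readable:
print(make_line(watch_dim=3, search_dim=3, other_dim=3))
```

The dimensions (64/64/64) and the label range (100 classes) match `watch_vec_size`, `search_vec_size`, `other_feat_size`, and `output_size` in `config.yaml`.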

## Environment

PaddlePaddle >= 1.7.2

python 2.7/3.5/3.6/3.7

PaddleRec >= 0.1

OS: windows/linux/macos
## Quick Start

### Single-machine Training

```
mode: [cpu_single_train]

runner:
- name: cpu_single_train
  class: train
  device: cpu # to use a GPU, set this to gpu
  epochs: 20
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment_youtubednn"
  save_inference_path: "inference_youtubednn"
  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
  save_inference_fetch_varnames: ["l3.tmp_2"]
  print_interval: 1
```

### Single-machine Inference
Recommend the top-k videos to each user by computing the cosine similarity between every user vector and every video vector:

CPU inference:
```
python infer.py --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```

GPU inference:
```
python infer.py --use_gpu 1 --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```
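The cosine-similarity top-k step that `infer.py` performs can be sketched as follows. This is a minimal numpy illustration under assumed names (`topk_by_cosine` is hypothetical), not the script's actual implementation:

```python
import numpy as np

def topk_by_cosine(user_vecs, video_vecs, k=5):
    """Return, for each user, the indices of the k most similar videos.

    After L2-normalizing both sides, a plain dot product equals
    cosine similarity, so one matrix multiply scores every pair.
    """
    u = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    sim = u @ v.T                        # (num_users, num_videos)
    return np.argsort(-sim, axis=1)[:, :k]  # highest similarity first

rng = np.random.default_rng(0)
recs = topk_by_cosine(rng.normal(size=(3, 32)),    # 3 user vectors
                      rng.normal(size=(100, 32)))  # 100 video vectors
print(recs.shape)  # (3, 5)
```

Here 32 is the user/item vector size produced by the last layer of `layers: [128, 64, 32]`, and 100 matches `output_size` in `config.yaml`.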
### Run
```
python -m paddlerec.run -m paddlerec.models.recall.youtube_dnn
```

### Results

Training output on the sample data:

```
Running SingleStartup.
Running SingleRunner.
batch: 1, acc: [0.03125]
batch: 2, acc: [0.0625]
batch: 3, acc: [0.]
...
epoch 0 done, use time: 0.0605320930481, global metrics: acc=[0.]
...
epoch 19 done, use time: 0.33447098732, global metrics: acc=[0.]
```
Prediction output on the sample data:
```
user:0, top K videos:[40, 31, 4, 33, 93]
user:1, top K videos:[35, 57, 58, 40, 17]
user:2, top K videos:[35, 17, 88, 40, 9]
user:3, top K videos:[73, 35, 39, 58, 38]
user:4, top K videos:[40, 31, 57, 4, 73]
user:5, top K videos:[38, 9, 7, 88, 22]
user:6, top K videos:[35, 73, 14, 58, 28]
user:7, top K videos:[35, 73, 58, 38, 56]
user:8, top K videos:[38, 40, 9, 35, 99]
user:9, top K videos:[88, 73, 9, 35, 28]
user:10, top K videos:[35, 52, 28, 54, 73]
```

## Advanced Usage

## FAQ

models/recall/youtube_dnn/config.yaml

Lines changed: 15 additions & 15 deletions

```diff
@@ -17,11 +17,10 @@ workspace: "models/recall/youtube_dnn"
 
 dataset:
 - name: dataset_train
-  batch_size: 5
-  type: DataLoader
-  #type: QueueDataset
+  batch_size: 32
+  type: DataLoader # or QueueDataset
   data_path: "{workspace}/data/train"
-  data_converter: "{workspace}/random_reader.py"
+  data_converter: "{workspace}/reader.py"
 
 hyper_parameters:
   watch_vec_size: 64
@@ -30,22 +29,23 @@ hyper_parameters:
   output_size: 100
   layers: [128, 64, 32]
   optimizer:
-    class: adam
-    learning_rate: 0.001
-    strategy: async
+    class: SGD
+    learning_rate: 0.01
 
-mode: train_runner
+mode: [cpu_single_train]
 
 runner:
-- name: train_runner
+- name: cpu_single_train
   class: train
   device: cpu
-  epochs: 3
-  save_checkpoint_interval: 2
-  save_inference_interval: 4
-  save_checkpoint_path: "increment"
-  save_inference_path: "inference"
-  print_interval: 10
+  epochs: 20
+  save_checkpoint_interval: 1
+  save_inference_interval: 1
+  save_checkpoint_path: "increment_youtubednn"
+  save_inference_path: "inference_youtubednn"
+  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
+  save_inference_fetch_varnames: ["l3.tmp_2"]
+  print_interval: 1
 
 phase:
 - name: train
```