# YouTube-DNN

The directory structure of this example is outlined below:

```
├── data                        # sample data
    ├── train
        ├── data.txt
    ├── test
        ├── data.txt
├── generate_ramdom_data.py     # generates random training data
├── __init__.py
├── README.md                   # documentation
├── model.py                    # model definition
├── config.yaml                 # configuration file
├── data_prepare.sh             # one-step data preparation script
├── reader.py                   # data reader
├── infer.py                    # inference script
```

Note: before reading this example, we recommend working through the following:

[PaddleRec getting started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)


---
## Contents

- [Model Overview](#model-overview)
- [Data Processing](#data-processing)
- [Environment](#environment)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [FAQ](#faq)

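## Model Overview
[Deep Neural Networks for YouTube Recommendations](https://link.zhihu.com/?target=https%3A//static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) describes the work of Google's YouTube team on applying DNNs to recommender systems. It is a classic embedding-based recall model: the model learns interest vectors for users and items, scores the similarity between a user and an item by their inner product, and uses those scores to produce the final candidate set. YouTube's recommendation pipeline uses two deep networks in sequence:

1. The **Candidate Generation Model** quickly filters the candidate videos, reducing the candidate set from millions of videos to a few hundred.

2. The **Ranking Model** performs fine-grained ranking of those few hundred candidates.

This project implements the recall stage of YouTube DNN, the Candidate Generation Model, on PaddlePaddle. It produces vector representations of users and items, which other methods (such as the cosine similarity between user and item vectors) can then use to recommend items to users.

Because the original paper did not release a dataset, this project uses randomly generated data to verify that the network is correct.

Supported features:

Training: single-machine CPU, single-machine single-GPU, locally simulated parameter-server training, and incremental training. For configuration, see [Start Training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).

Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec Offline Inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).
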
## Data Processing
Run `python generate_ramdom_data.py` to generate random training data. Each line has the following format:
```
#watch_vec;search_vec;other_feat;label
0.01,0.02,...,0.09;0.01,0.02,...,0.09;0.01,0.02,...,0.09;20
```
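
As a minimal sketch of how one such record could be read, the helper below splits a line on `;` and `,` and converts the fields to numbers. The function name `parse_line` is hypothetical and is not taken from this example's `reader.py`; it assumes the three vectors are comma-separated floats and the label is an integer, as in the sample line above.

```python
def _to_floats(field):
    """Convert a comma-separated string such as '0.01,0.02' to a list of floats."""
    return [float(x) for x in field.split(",")]


def parse_line(line):
    """Split one 'watch_vec;search_vec;other_feat;label' record into typed fields.

    Hypothetical helper for illustration; the real reader may differ.
    """
    watch, search, other, label = line.strip().split(";")
    return _to_floats(watch), _to_floats(search), _to_floats(other), int(label)
```

Calling `parse_line` on a record returns three lists of floats followed by the integer label.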

For convenience, a one-step data generation script is provided:
```
sh data_prepare.sh
```
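
For illustration only, the following sketch produces lines in the `watch_vec;search_vec;other_feat;label` format using the Python standard library. It is not the actual `generate_ramdom_data.py`; the names `make_line` and `make_dataset`, the default vector size of 64 (borrowed from the `--watch_vec_size 64` style flags used by the inference commands in this README), and the 100 label classes are all assumptions.

```python
import random


def make_line(rng, vec_size=64, num_classes=100):
    """Build one random record: three float vectors and an integer label."""
    def rand_vec():
        # Comma-separated floats drawn uniformly from [0, 1).
        return ",".join("%.4f" % rng.random() for _ in range(vec_size))

    label = rng.randrange(num_classes)  # assumed: labels are video IDs in [0, num_classes)
    return ";".join([rand_vec(), rand_vec(), rand_vec(), str(label)])


def make_dataset(num_lines, seed=0, vec_size=64, num_classes=100):
    """Return a reproducible list of random records."""
    rng = random.Random(seed)
    return [make_line(rng, vec_size, num_classes) for _ in range(num_lines)]
```

Seeding the generator keeps the synthetic data reproducible, which helps when the goal is only to check that the network runs end to end.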

## Environment

PaddlePaddle >= 1.7.2

Python 2.7/3.5/3.6/3.7

PaddleRec >= 0.1

OS: Windows/Linux/macOS

## Quick Start

### Single-Machine Training

The training runner is configured in `config.yaml`:

```
mode: [cpu_single_train]

runner:
- name: cpu_single_train
  class: train
  device: cpu # set to gpu to train on a GPU
  epochs: 20
  save_checkpoint_interval: 1
  save_inference_interval: 1
  save_checkpoint_path: "increment_youtubednn"
  save_inference_path: "inference_youtubednn"
  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed variables of the saved inference model
  save_inference_fetch_varnames: ["l3.tmp_2"]
  print_interval: 1
```

### Single-Machine Inference
Recommend the top-k videos to each user by computing the cosine similarity between every user and every item:

CPU inference:
```
python infer.py --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```

GPU inference:
```
python infer.py --use_gpu 1 --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
```
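
The cosine-similarity ranking used at inference time can be sketched in plain Python. This is a hedged illustration of top-k retrieval, not the code in `infer.py`; the names `cosine` and `top_k_videos` are hypothetical, and each video ID is assumed to be the index of its vector in the list.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_videos(user_vec, video_vecs, k):
    """Return the indices of the k videos most similar to user_vec."""
    scores = [(cosine(user_vec, vec), vid) for vid, vec in enumerate(video_vecs)]
    # Highest similarity first; ties are broken by the smaller video index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [vid for _, vid in scores[:k]]
```

Scoring every video for every user is linear in the number of videos, which is acceptable for the small sample data; a large catalogue would typically call for an approximate nearest-neighbor index instead.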

### Run
```
python -m paddlerec.run -m paddlerec.models.recall.youtube_dnn
```

### Results

Training output on the sample data:

```
Running SingleStartup.
Running SingleRunner.
batch: 1, acc: [0.03125]
batch: 2, acc: [0.0625]
batch: 3, acc: [0.]
...
epoch 0 done, use time: 0.0605320930481, global metrics: acc=[0.]
...
epoch 19 done, use time: 0.33447098732, global metrics: acc=[0.]
```

Inference output on the sample data:
```
user:0, top K videos:[40, 31, 4, 33, 93]
user:1, top K videos:[35, 57, 58, 40, 17]
user:2, top K videos:[35, 17, 88, 40, 9]
user:3, top K videos:[73, 35, 39, 58, 38]
user:4, top K videos:[40, 31, 57, 4, 73]
user:5, top K videos:[38, 9, 7, 88, 22]
user:6, top K videos:[35, 73, 14, 58, 28]
user:7, top K videos:[35, 73, 58, 38, 56]
user:8, top K videos:[38, 40, 9, 35, 99]
user:9, top K videos:[88, 73, 9, 35, 28]
user:10, top K videos:[35, 52, 28, 54, 73]
```

## Advanced Usage

## FAQ