
Commit c888822

Merge branch 'master' of https://github.com/PaddlePaddle/PaddleRec into fix_collective_files_partition
2 parents 4798f2e + 3061d46 commit c888822

File tree

6 files changed: +282 −11 lines changed


doc/train.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ python -m paddlerec.run -m paddlerec.models.xxx.yyy
 For example, to launch the default configuration of the `word2vec` model under `recall`:
 
 ```shell
-python -m paddlerec.run -m models/recall/word2vec
+python -m paddlerec.run -m models/recall/word2vec/config.yaml
 ```
 
 ### 2. Launch personalized-configuration training of a built-in model

models/rank/dnn/README.md

Lines changed: 130 additions & 0 deletions
@@ -259,3 +259,133 @@ auc_var, batch_auc_var, auc_states = fluid.layers.auc(
```

After completing the network above, training ultimately yields two key metrics: `avg_cost` and `auc`.

## Streaming Training (Online Learning): Task Launch and Configuration

### Introduction to streaming training

Streaming training receives and processes data in order: each time an example arrives, the model predicts on it, updates itself with it, and then moves on to the next example. In scenarios such as news feeds, short video, and e-commerce, large volumes of new data arrive every day, so each day's (each moment's) new data is predicted on, and the model updated, starting from the previous day's (previous moment's) model.
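To make the receive-predict-update cycle concrete, the loop below is a minimal, framework-agnostic sketch in plain Python; `stream`, `model.predict`, and `model.update` are hypothetical names for illustration, not PaddleRec APIs:

```python
# Minimal sketch of the streaming (online learning) cycle described above.
# `stream` yields (features, label) pairs in arrival order; `model` is any
# object exposing predict() and update() -- hypothetical names.
def online_train(model, stream):
    for features, label in stream:
        prediction = model.predict(features)  # predict on the new example first
        model.update(features, label)         # then fold it into the model
        yield prediction
```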
Large-scale streaming training places corresponding demands on the deep learning framework, namely:
* Support for large-scale distributed training: the data volume is enormous, so strong distributed training and scaling capabilities are required to meet the time constraints on training.
* Support for ultra-large embeddings: tables with billions or even hundreds of billions of entries, together with a practical way to export parameters, so the model can be delivered quickly to other online systems.
* Hash-mapped embedding feature IDs with no fixed ID encoding required: the table can grow automatically, admit new features under configurable conditions (previously unseen features are created when appropriate), and periodically evict stale features according to an expiry policy (see the sketch after this list).
* Finally, a complete streaming `trainer.py` built on top of the framework, implementing the full streaming-training flow.
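As a toy illustration of the admission and eviction policies above (assumed semantics for illustration only, not PaddleRec's parameter-server implementation):

```python
import time

# Toy hash-keyed embedding table with feature admission and eviction,
# sketching the policy described above; real frameworks implement this
# inside the parameter server.
class StreamingEmbeddingTable:
    def __init__(self, dim, admit_after=3, expire_seconds=7 * 24 * 3600):
        self.dim = dim
        self.counts = {}   # feature_id -> times seen (admission counter)
        self.table = {}    # feature_id -> (vector, last_seen_timestamp)
        self.admit_after = admit_after
        self.expire_seconds = expire_seconds

    def lookup(self, feature_id):
        self.counts[feature_id] = self.counts.get(feature_id, 0) + 1
        if feature_id not in self.table:
            if self.counts[feature_id] < self.admit_after:
                return [0.0] * self.dim       # not admitted yet: zero vector
            self.table[feature_id] = ([0.0] * self.dim, time.time())
        vec, _ = self.table[feature_id]
        self.table[feature_id] = (vec, time.time())  # refresh last-seen time
        return vec

    def evict_expired(self):
        now = time.time()
        stale = [k for k, (_, seen) in self.table.items()
                 if now - seen > self.expire_seconds]
        for k in stale:
            del self.table[k]
```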
### Training the model with ctr-dnn online learning

PaddleRec now implements this streaming-training flow on top of the PaddlePaddle distributed training framework, for your reference and use. We adapted `models/rank/ctr-dnn` into an online_training version to make it easier to understand and follow.

**Note**
1. Online learning requires the latest develop build of Paddle. You can obtain it from https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-dev ; uninstall the currently installed PaddlePaddle first, and download the package that matches your Python environment.
2. Online learning requires the latest development version of PaddleRec. You can get it via git clone https://github.com/PaddlePaddle/PaddleRec.git and install it yourself.

### Launch steps
1. In config.yaml, set hyper_parameters.distributed_embedding=1 to turn on large-scale sparse mode.
2. In config.yaml, change `single_cpu_train` in mode: [single_cpu_train, single_cpu_infer] to online_learning_cluster, which selects the run mode for online learning.
3. Prepare the training data. The online-learning training mode used by ctr-dnn is day-level training, with each day divided into 24 hours, so the training data must be organized into a day--hour directory structure (a helper for creating this layout is sketched after these steps).
Taking two days of training data, 2020-08-10 and 2020-08-11, as an example, the directory structure to prepare is as follows:
```
train_data/
|-- 20200810
|   |-- 00
|   |   `-- train.txt
|   |-- 01
|   |   `-- train.txt
|   |-- 02
|   |   `-- train.txt
|   |-- 03
|   |   `-- train.txt
|   |-- 04
|   |   `-- train.txt
|   |-- 05
|   |   `-- train.txt
|   |-- 06
|   |   `-- train.txt
|   |-- 07
|   |   `-- train.txt
|   |-- 08
|   |   `-- train.txt
|   |-- 09
|   |   `-- train.txt
|   |-- 10
|   |   `-- train.txt
|   |-- 11
|   |   `-- train.txt
|   |-- 12
|   |   `-- train.txt
|   |-- 13
|   |   `-- train.txt
|   |-- 14
|   |   `-- train.txt
|   |-- 15
|   |   `-- train.txt
|   |-- 16
|   |   `-- train.txt
|   |-- 17
|   |   `-- train.txt
|   |-- 18
|   |   `-- train.txt
|   |-- 19
|   |   `-- train.txt
|   |-- 20
|   |   `-- train.txt
|   |-- 21
|   |   `-- train.txt
|   |-- 22
|   |   `-- train.txt
|   `-- 23
|       `-- train.txt
`-- 20200811
    |-- 00
    |   `-- train.txt
    |-- 01
    |   `-- train.txt
    |-- 02
    |   `-- train.txt
    |-- 03
    |   `-- train.txt
    |-- 04
    |   `-- train.txt
    |-- 05
    |   `-- train.txt
    |-- 06
    |   `-- train.txt
    |-- 07
    |   `-- train.txt
    |-- 08
    |   `-- train.txt
    |-- 09
    |   `-- train.txt
    |-- 10
    |   `-- train.txt
    |-- 11
    |   `-- train.txt
    |-- 12
    |   `-- train.txt
    |-- 13
    |   `-- train.txt
    |-- 14
    |   `-- train.txt
    |-- 15
    |   `-- train.txt
    |-- 16
    |   `-- train.txt
    |-- 17
    |   `-- train.txt
    |-- 18
    |   `-- train.txt
    |-- 19
    |   `-- train.txt
    |-- 20
    |   `-- train.txt
    |-- 21
    |   `-- train.txt
    |-- 22
    |   `-- train.txt
    `-- 23
        `-- train.txt
```
4. Once the data is ready, streaming training proceeds with the standard training flow:
```shell
python -m paddlerec.run -m models/rank/ctr-dnn/config.yaml
```
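The following helper is a small sketch (not part of PaddleRec) for creating the expected day--hour skeleton before placing the `train.txt` files:

```python
import os

# Sketch: create the day--hour directory skeleton expected by the online
# learning data layout, e.g. train_data/20200810/00 ... train_data/20200811/23.
def make_day_hour_dirs(root, days):
    for day in days:                      # days like ["20200810", "20200811"]
        for hour in range(24):
            os.makedirs(os.path.join(root, day, "%02d" % hour), exist_ok=True)

make_day_hour_dirs("train_data", ["20200810", "20200811"])
```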

models/rank/dnn/config.yaml

Lines changed: 13 additions & 0 deletions
@@ -49,6 +49,7 @@ hyper_parameters:
   sparse_feature_dim: 9
   dense_input_dim: 13
   fc_sizes: [512, 256, 128, 32]
+  distributed_embedding: 0
 
 # select runner by name
 mode: [single_cpu_train, single_cpu_infer]
@@ -90,6 +91,18 @@ runner:
   print_interval: 1
   phases: [phase1]
 
+- name: online_learning_cluster
+  class: cluster_train
+  runner_class_path: "{workspace}/online_learning_runner.py"
+  epochs: 2
+  device: cpu
+  fleet_mode: ps
+  save_checkpoint_interval: 1 # save model interval of epochs
+  save_checkpoint_path: "increment_dnn" # save checkpoint path
+  init_model_path: "" # load model path
+  print_interval: 1
+  phases: [phase1]
 
 - name: collective_cluster
   class: cluster_train
   epochs: 2

models/rank/dnn/model.py

Lines changed: 30 additions & 10 deletions
@@ -25,8 +25,16 @@ def __init__(self, config):
         ModelBase.__init__(self, config)
 
     def _init_hyper_parameters(self):
-        self.is_distributed = True if envs.get_fleet_mode().upper(
-        ) == "PSLIB" else False
+        self.is_distributed = False
+        self.distributed_embedding = False
+
+        if envs.get_fleet_mode().upper() == "PSLIB":
+            self.is_distributed = True
+
+        if envs.get_global_env("hyper_parameters.distributed_embedding",
+                               0) == 1:
+            self.distributed_embedding = True
+
         self.sparse_feature_number = envs.get_global_env(
             "hyper_parameters.sparse_feature_number")
         self.sparse_feature_dim = envs.get_global_env(
@@ -40,14 +48,26 @@ def net(self, input, is_infer=False):
         self.label_input = self._sparse_data_var[0]
 
         def embedding_layer(input):
-            emb = fluid.layers.embedding(
-                input=input,
-                is_sparse=True,
-                is_distributed=self.is_distributed,
-                size=[self.sparse_feature_number, self.sparse_feature_dim],
-                param_attr=fluid.ParamAttr(
-                    name="SparseFeatFactors",
-                    initializer=fluid.initializer.Uniform()), )
+            if self.distributed_embedding:
+                emb = fluid.contrib.layers.sparse_embedding(
+                    input=input,
+                    size=[
+                        self.sparse_feature_number, self.sparse_feature_dim
+                    ],
+                    param_attr=fluid.ParamAttr(
+                        name="SparseFeatFactors",
+                        initializer=fluid.initializer.Uniform()))
+            else:
+                emb = fluid.layers.embedding(
+                    input=input,
+                    is_sparse=True,
+                    is_distributed=self.is_distributed,
+                    size=[
+                        self.sparse_feature_number, self.sparse_feature_dim
+                    ],
+                    param_attr=fluid.ParamAttr(
+                        name="SparseFeatFactors",
+                        initializer=fluid.initializer.Uniform()))
             emb_sum = fluid.layers.sequence_pool(input=emb, pool_type='sum')
             return emb_sum

models/rank/dnn/online_learning_runner.py

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function

import os
import time
import warnings
import numpy as np
import logging
import paddle.fluid as fluid

from paddlerec.core.utils import envs
from paddlerec.core.metric import Metric
from paddlerec.core.trainers.framework.runner import RunnerBase

logging.basicConfig(
    format='%(asctime)s - %(levelname)s: %(message)s', level=logging.INFO)


class OnlineLearningRunner(RunnerBase):
    def __init__(self, context):
        print("Running OnlineLearningRunner.")

    def run(self, context):
        epochs = int(
            envs.get_global_env("runner." + context["runner_name"] +
                                ".epochs"))
        model_dict = context["env"]["phase"][0]
        model_class = context["model"][model_dict["name"]]["model"]
        metrics = model_class._metrics

        # NOTE: `days`, `path`, `fleet`, `hdfs_ls`, `create_dataset` and
        # `use_var` are assumed to be supplied by the surrounding
        # data-preparation context (one dataset per day of the day--hour
        # training data).
        dataset_list = []
        for day_index in range(len(days)):
            day = days[day_index]
            cur_path = "%s/%s" % (path, str(day))
            filelist = fleet.split_files(hdfs_ls([cur_path]))
            dataset = create_dataset(use_var, filelist)
            dataset_list.append(dataset)

        # One epoch per day: train on that day's data, then checkpoint.
        for epoch in range(len(days)):
            day = days[epoch]
            begin_time = time.time()
            result = self._run(context, model_dict)
            end_time = time.time()
            seconds = end_time - begin_time
            message = "epoch {} done, use time: {}".format(epoch, seconds)

            # TODO, wait for PaddleCloudRoleMaker supports gloo
            from paddle.fluid.incubate.fleet.base.role_maker import GeneralRoleMaker
            if context["fleet"] is not None and isinstance(context["fleet"],
                                                           GeneralRoleMaker):
                metrics_result = []
                for key in metrics:
                    if isinstance(metrics[key], Metric):
                        _str = metrics[key].calc_global_metrics(
                            context["fleet"],
                            context["model"][model_dict["name"]]["scope"])
                        metrics_result.append(_str)
                    elif result is not None:
                        _str = "{}={}".format(key, result[key])
                        metrics_result.append(_str)
                if len(metrics_result) > 0:
                    message += ", global metrics: " + ", ".join(metrics_result)
            print(message)
            with fluid.scope_guard(context["model"][model_dict["name"]][
                    "scope"]):
                train_prog = context["model"][model_dict["name"]][
                    "main_program"]
                startup_prog = context["model"][model_dict["name"]][
                    "startup_program"]
                with fluid.program_guard(train_prog, startup_prog):
                    self.save(epoch, context, True)

        context["status"] = "terminal_pass"

models/treebased/tdm/build_tree.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@

wget https://paddlerec.bj.bcebos.com/utils/tree_build_utils.tar.gz --no-check-certificate

# input_path: path to the embedding file
# emb_shape: dimensionality of the value in each embedding key-value pair
# embedding file format: embedding_id(int64),embedding(float),embedding(float),......,embedding(float)
# cluster_threads: number of threads used for the tree-building clustering
python_172_anytree/bin/python -u main.py --input_path=./gen_emb/item_emb.txt --output_path=./ --emb_shape=24 --cluster_threads=4

The tree-building flow is: 1. read the embeddings -> 2. k-means clustering -> 3. organize the clustering result into a tree -> 4. derive from the tree structure the four files the model needs (a conceptual sketch appears at the end of this document):
1 Layer_list: records which nodes are on each layer. Used for training.
2 Travel_list: records the travel path of each leaf node. Used for training.
3 Tree_Info: records the information of each node, mainly: whether it is an item / its item_id, its layer, its parent, and its children. Used for retrieval.
4 Tree_Embedding: records the embeddings of all nodes. Used for training and retrieval.

Note whether the items fed as training input are the item ids used before tree building, the tree-based node ids, or the leaf ids; in tdm_reader.py you can load a dictionary to map between them.
The output folder produced by the internal tree-building tool contains a mapping file named id2nodeid.txt, whose format is "hash value" + "tree node ID" + "leaf node ID" (the index of the leaf node, which is the input required by the tdm_sampler op).
Another file, id2bidword.txt, contains the mapping "hash value" + "original item ID"; it stores entries for leaf nodes only.
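To make the flow concrete, here is a minimal, self-contained sketch of steps 1-3 (recursive bisecting 2-means over item embeddings); it is illustrative only and is not the bundled tool, which additionally emits the four files listed above:

```python
import numpy as np

# Conceptual sketch of the tree-building flow: recursively split item
# embeddings with 2-means until each cluster holds one item, yielding a
# binary tree whose leaves are the items.
def two_means(emb, iters=10):
    centers = emb[np.random.choice(len(emb), 2, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if (assign == k).any():
                centers[k] = emb[assign == k].mean(axis=0)
    return assign

def build_tree(item_ids, emb):
    if len(item_ids) == 1:
        return {"item_id": int(item_ids[0])}     # leaf: a real item
    assign = two_means(emb)
    if assign.all() or not assign.any():         # degenerate split: halve
        assign = (np.arange(len(item_ids)) >= len(item_ids) // 2).astype(int)
    return {"children": [
        build_tree(item_ids[assign == k], emb[assign == k]) for k in range(2)
    ]}

items = np.arange(8)
tree = build_tree(items, np.random.rand(8, 24))  # 24 matches --emb_shape above
print(tree)
```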
