Commit 182e4f3

MrChengmo and fuyinno4 authored
add gen_tree (#214)
Co-authored-by: wuzhihua <[email protected]>
1 parent 98c9498 commit 182e4f3

11 files changed: +957 -38 lines changed

models/treebased/tdm/README.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ cd paddle-rec
 
 python -m paddlerec.run -m models/treebased/tdm/config.yaml
 ```
+3. For details on tree building and custom training, see [TDM-Demo: Tree Building and Training](./gen_tree/README.md)
 
 ## Preparing the tree structure
 ### Terminology

models/treebased/tdm/build_tree.md

Lines changed: 0 additions & 19 deletions
This file was deleted.

models/treebased/tdm/config.yaml

Lines changed: 9 additions & 19 deletions
@@ -59,57 +59,47 @@ hyper_parameters:
     tree_emb_path: "{workspace}/tree/tree_emb.npy"
 
 # select runner by name
-mode: runner1
-# config of each runner.
-# runner is a kind of paddle training class, which wraps the train/infer process.
+mode: [runner1]
+
 runner:
 - name: runner1
   class: train
   startup_class_path: "{workspace}/tdm_startup.py"
-  # num of epochs
   epochs: 10
-  # device to run training or infer
   device: cpu
   save_checkpoint_interval: 2 # save model interval of epochs
-  save_inference_interval: 4 # save inference
   save_checkpoint_path: "increment" # save checkpoint path
-  save_inference_path: "inference" # save inference path
-  save_inference_feed_varnames: [] # feed vars of save inference
-  save_inference_fetch_varnames: [] # fetch vars of save inference
   init_model_path: "" # load model path
   print_interval: 10
+  phases: [phase1]
 
 - name: runner2
   class: infer
   startup_class_path: "{workspace}/tdm_startup.py"
-  # device to run training or infer
   device: cpu
   init_model_path: "increment/0" # load model path
   print_interval: 1
+  phases: [phase2]
 
 - name: runner3
   class: local_cluster_train
   startup_class_path: "{workspace}/tdm_startup.py"
   fleet_mode: ps
   epochs: 10
-  # device to run training or infer
   device: cpu
   save_checkpoint_interval: 2 # save model interval of epochs
-  save_inference_interval: 4 # save inference
   save_checkpoint_path: "increment" # save checkpoint path
-  save_inference_path: "inference" # save inference path
-  save_inference_feed_varnames: [] # feed vars of save inference
-  save_inference_fetch_varnames: [] # fetch vars of save inference
   init_model_path: "init_model" # load model path
   print_interval: 10
+  phases: [phase1]
 
 # runner will run all the phase in each epoch
 phase:
 - name: phase1
   model: "{workspace}/model.py" # user-defined model
   dataset_name: dataset_train # select dataset by name
   thread_num: 1
-# - name: phase2
-# model: "{workspace}/model.py"
-# dataset_name: dataset_infer
-# thread_num: 2
+- name: phase2
+  model: "{workspace}/model.py"
+  dataset_name: dataset_infer
+  thread_num: 2
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# TDM-Demo: Tree Building and Training

## Environment required for tree building
Requirements:
- python >= 2.7
- paddlepaddle >= 1.7.2 (1.7.2 recommended)
- paddle-rec (clone PaddleRec from GitHub, then run `python setup.py install`)
- sklearn
- anytree


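## Tree-building workflow

### Generating the embeddings needed for tree building

- Generate fake embeddings

```shell
cd gen_tree
python -u emb_util.py
```

The generated embedding has shape [13, 64]: there are 13 items, each with a 64-dimensional embedding. The generated item_emb is written to `gen_tree/item_emb.txt`.

Each line has the format `emb_value_0(float) <space> emb_value_1(float) ... emb_value_63(float) \t item_id `

In the demo, item IDs must start at 0 and fall in the range [0, item_nums-1].

In real-world scenarios, this requirement can be met with some form of hash mapping.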
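
The line format and the ID requirement above can be illustrated with a small sketch. The helper names `parse_emb_line` and `remap_item_ids` are hypothetical and are not part of `emb_util.py`; the remapping shown is a plain dictionary lookup in order of first appearance, which is just one simple way to obtain contiguous IDs, not necessarily the mapping a production system would use.

```python
def parse_emb_line(line):
    """Parse one line: space-separated floats, a tab, then the item id."""
    values, item_id = line.rstrip("\n").split("\t")
    emb = [float(v) for v in values.split()]
    return emb, int(item_id.strip())


def remap_item_ids(raw_ids):
    """Map arbitrary raw ids to contiguous ids in [0, item_nums - 1].

    Ids are assigned in order of first appearance (an assumption made
    for this sketch; any stable one-to-one mapping would do).
    """
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping


# A shortened, made-up line with 3 values instead of 64.
emb, item_id = parse_emb_line("0.1 0.2 0.3\t7 \n")
mapping = remap_item_ids(["sku-42", "sku-7", "sku-42", "sku-99"])
print(emb, item_id, mapping)
```

Keeping the mapping as a dictionary also makes it easy to invert later, when tree leaves have to be translated back to the original item identifiers.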

### Building the tree by clustering the item embeddings

Run:

```shell
cd gen_tree
# emd_path: path to item_emb
# emb_size: second dimension of item_emb, i.e. the embedding size of each item (64 in this example)
# threads: number of threads used for multi-threaded tree building
# n_clusters: branching factor of the resulting tree; set to 2 here for a binary tree
python gen_tree.py --emd_path item_emb.txt --emb_size 64 --output_dir ./output --threads 1 --n_clusters 2
```
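
To give a feel for how recursively partitioning embeddings yields an n-ary tree, here is a rough sketch. It is not the algorithm in `gen_tree.py`: as a stand-in for clustering, it sorts the items along the highest-variance dimension and cuts them into `n_clusters` nearly equal chunks. The names `build_tree`, `layer_sizes`, and `leaves`, and the fake 2-dimensional embeddings, are all assumptions made for illustration.

```python
def build_tree(items, n_clusters=2):
    """Recursively split (item_id, embedding) pairs into a nested-list tree.

    Internal nodes are lists of children; leaves are item ids.
    """
    if len(items) == 1:
        return items[0][0]
    dims = len(items[0][1])

    def variance(d):
        col = [emb[d] for _, emb in items]
        mean = sum(col) / len(col)
        return sum((x - mean) ** 2 for x in col)

    # Stand-in for k-means: order items along the most spread-out dimension.
    split_dim = max(range(dims), key=variance)
    ordered = sorted(items, key=lambda pair: pair[1][split_dim])
    k = min(n_clusters, len(ordered))
    base, extra = divmod(len(ordered), k)
    children, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        children.append(build_tree(ordered[start:start + size], n_clusters))
        start += size
    return children


def layer_sizes(tree):
    """Count the nodes on each level below the root."""
    sizes, level = [], [tree]
    while True:
        level = [c for node in level if isinstance(node, list) for c in node]
        if not level:
            return sizes
        sizes.append(len(level))


def leaves(tree):
    """Collect the item ids stored at the leaves."""
    if not isinstance(tree, list):
        return [tree]
    return [x for child in tree for x in leaves(child)]


# 13 fake items, mirroring the demo's item count.
items = [(i, [float(i % 5), float(i)]) for i in range(13)]
tree = build_tree(items, n_clusters=2)
print(layer_sizes(tree))  # [2, 4, 8, 10]
```

With 13 items and balanced binary splits, the per-level node counts come out as `[2, 4, 8, 10]`, which is the same shape as the `layer_node_num_list` used in the demo configuration.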
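
The tree-structure files needed for training are generated under `gen_tree/output`:
```shell
.
├── id2item.json # mapping from tree node id to item id
├── layer_list.txt # the nodes on each level of the tree
├── travel_list.npy # traversal path from root to leaf for each item, sorted by item order
├── travel_list.txt # plain-text version of the previous file
├── tree_embedding.txt # embeddings of all nodes, ordered by node id
├── tree_emb.npy # .npy version of the previous file
├── tree_info.npy # per node: whether it maps to an item / parent / level / child nodes, ordered by node id
├── tree_info.txt # plain-text version of the previous file
└── tree.pkl # tree structure produced by clustering
```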

Of these, the following four files produced by tree building are used in network training; see `models/treebased/tdm/config.yaml`:

1. layer_list.txt
2. travel_list.npy
3. tree_info.npy
4. tree_emb.npy


### Running training

- Update the settings in `config.yaml`

First change:
```yaml
hyper_parameters:
  # ...
  tree:
    # For single-machine training, load the tree from numpy only once, save it as a paddle tensor, then warm-start from the paddle model
    # For distributed training, each trainer must load the tree independently
    # For inference, also load from the paddle model
    load_tree_from_numpy: True # only once
    load_paddle_model: False # train & infer need, after load from npy, change it to True
    tree_layer_path: "{workspace}/tree/layer_list.txt"
    tree_travel_path: "{workspace}/tree/travel_list.npy"
    tree_info_path: "{workspace}/tree/tree_info.npy"
    tree_emb_path: "{workspace}/tree/tree_emb.npy"
```
Change the paths above to the locations of the files produced by tree building.
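
Before launching a run, it can help to confirm that all four paths resolve to real files. The sketch below is a hypothetical pre-flight check, not part of PaddleRec; it assumes `{workspace}` is a plain string placeholder that can be substituted with the workspace directory, and the helper name `missing_tree_files` is made up for this example.

```python
import os
import tempfile


def missing_tree_files(tree_cfg, workspace):
    """Return the config keys whose resolved path does not exist on disk."""
    missing = []
    for key, template in tree_cfg.items():
        # Assumption: "{workspace}" is substituted textually.
        path = template.replace("{workspace}", workspace)
        if not os.path.isfile(path):
            missing.append(key)
    return missing


tree_cfg = {
    "tree_layer_path": "{workspace}/tree/layer_list.txt",
    "tree_travel_path": "{workspace}/tree/travel_list.npy",
    "tree_info_path": "{workspace}/tree/tree_info.npy",
    "tree_emb_path": "{workspace}/tree/tree_emb.npy",
}

# Demonstrate against a throwaway workspace holding only one of the four files.
with tempfile.TemporaryDirectory() as workspace:
    os.makedirs(os.path.join(workspace, "tree"))
    open(os.path.join(workspace, "tree", "layer_list.txt"), "w").close()
    result = missing_tree_files(tree_cfg, workspace)
print(result)
```

In this demonstration only `layer_list.txt` exists, so the three `.npy` keys are reported as missing; against a real workspace an empty list means every path is in place.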
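
Then change:
```yaml
hyper_parameters:
  max_layers: 4 # number of tree layers, excluding the root
  node_nums: 26 # total number of tree nodes; equals the number of rows in the tree_info file
  leaf_node_nums: 13 # total number of leaf nodes
  layer_node_num_list: [2, 4, 8, 10] # number of nodes on each layer of the tree
  child_nums: 2 # maximum number of children per node (the branching factor)
  neg_sampling_list: [1, 2, 3, 4] # number of negative samples per layer; a user-defined training parameter
```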
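
These values have to agree with one another, so a quick consistency check can catch copy errors. The sketch below is a hypothetical helper (`check_tdm_hparams` is not a PaddleRec function), and the rules it applies are assumptions inferred from the comments above: one entry per layer in both lists, no layer wider than `child_nums ** depth`, and fewer negative samples than nodes on each layer, on the assumption that negatives are drawn from the other nodes of that layer. It does not check `node_nums`, which should be taken from the tree_info file.

```python
def check_tdm_hparams(hp):
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    layers = hp["layer_node_num_list"]
    negs = hp["neg_sampling_list"]
    if len(layers) != hp["max_layers"]:
        problems.append("max_layers != len(layer_node_num_list)")
    if len(negs) != len(layers):
        problems.append("neg_sampling_list needs one entry per layer")
    for depth, count in enumerate(layers, start=1):
        # A layer at this depth can hold at most child_nums ** depth nodes.
        if count > hp["child_nums"] ** depth:
            problems.append("layer %d is wider than child_nums allows" % depth)
    for depth, (neg, count) in enumerate(zip(negs, layers), start=1):
        # Assumed rule: negatives come from the layer's other nodes.
        if neg >= count:
            problems.append("layer %d: too many negative samples" % depth)
    if hp["leaf_node_nums"] > sum(layers):
        problems.append("more leaves than non-root nodes")
    return problems


demo = {
    "max_layers": 4,
    "leaf_node_nums": 13,
    "layer_node_num_list": [2, 4, 8, 10],
    "child_nums": 2,
    "neg_sampling_list": [1, 2, 3, 4],
}
print(check_tdm_hparams(demo))
```

The demo values pass every check, so the helper returns an empty list; raising the first entry of `neg_sampling_list` to 2 would be flagged, because the first layer has only two nodes.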

If you do not know the exact values for the parameters above, do a trial run: after PaddleRec reads the files generated by tree building, it prints the details to the screen, as shown below:
```shell
...
File_list: ['models/treebased/tdm/data/train/demo_fake_input.txt']
2020-09-10 15:17:19,259 - INFO - Run TDM Trainer Startup Pass
2020-09-10 15:17:19,283 - INFO - load tree from numpy
2020-09-10 15:17:19,284 - INFO - TDM Tree leaf node nums: 13
2020-09-10 15:17:19,284 - INFO - TDM Tree max layer: 4
2020-09-10 15:17:19,284 - INFO - TDM Tree layer_node_num_list: [2, 4, 8, 10]
2020-09-10 15:17:19,285 - INFO - Begin Save Init model.
2020-09-10 15:17:19,394 - INFO - End Save Init model.
Running SingleRunner.
...
```
Copy these values into the config.

- Train

Run:
```
cd /PaddleRec # root directory of the PaddleRec clone
python -m paddlerec.run -m models/treebased/tdm/config.yaml
```
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from . import cluster

__all__ = []
__all__ += cluster.__all__
