
Commit fe0ed2d

test=develop, debug
1 parent 3e028e1 commit fe0ed2d

File tree

8 files changed: +93 −74 lines

models/treebased/README.md

Lines changed: 27 additions & 19 deletions
@@ -1,28 +1,36 @@
-# Paddle-TDM
+# Paddle TDM Solution
 
-The TDM retrieval method comes from the paper [Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf), published by the Alimama team at `KDD2018`. This example provides a PaddlePaddle implementation of the tree-based recommendation/search algorithm, with the following components:
+This example provides a PaddlePaddle implementation of the [TDM](https://arxiv.org/pdf/1801.02294.pdf) recommendation/search algorithm. TDM is a recommendation solution designed for large-scale recommender systems that can host arbitrarily advanced models to retrieve user interests efficiently. Built on a tree structure, it proposes a methodology for hierarchically modeling and retrieving user-interest measures, so that the system can apply advanced deep learning models directly to interest retrieval over the full corpus. The basic idea is to index all items in the corpus with a tree and then train a deep model that supports layer-by-layer retrieval over that tree, reducing the complexity of full-corpus retrieval in large-scale recommendation from O(n) (n being the number of items) to O(log n).
 
-- A paddle-tdm model built on a fake dataset, suitable for quick debugging. It is mainly intended to help you understand the design of paddle-tdm and efficiently start designing a model suited to your own use case.
 
-The above content will be updated continuously as Paddle iterates; you are welcome to follow this repository.
+## Quick Start
 
-## TDM Design
+Get started quickly with the TDM model on the demo dataset, in preparation for designing a model suited to your particular use case.
 
-### Basic Concepts
-TDM is a recommendation solution designed for large-scale recommender systems that can host arbitrarily advanced models to retrieve user interests efficiently. Built on a tree structure, TDM proposes a methodology for hierarchically modeling and retrieving user-interest measures, so that the system can apply advanced deep learning models directly to interest retrieval over the full corpus. The basic idea is to index all items in the corpus with a tree and train a deep model that supports layer-by-layer retrieval over that tree, reducing the complexity of full-corpus retrieval in large-scale recommendation from O(n) (n being the number of items) to O(log n).
+Assume PaddleRec is located at ${PaddleRec_Home}.
 
-### Core Questions
+- Step 1: Enter the tree-based model directory and complete the preparation work on the demo dataset, such as splitting the data and building the tree.
 
-1. How to build the tree structure?
-2. How to train a deep learning model on top of the tree structure?
-3. How to retrieve efficiently based on the tree and the model?
+```shell
+cd ${PaddleRec_Home}/models/treebased/
+./data_prepare.sh demo
+```
+The one-line preprocessing command for the demo dataset is `./data_prepare.sh demo`. If you are interested in the details of data processing and tree building, see the `data_prepare.sh` script. After this step, you will find a directory named `demo_data` under `${PaddleRec_Home}/models/treebased/`, structured as follows:
 
-### PaddlePaddle's TDM Solution
+```
+├── treebased
+├── demo_data
+|   ├── samples                    required by the JTM Tree-Learning algorithm;
+|   |   ├── samples_{item_id}.json records all training samples related to `item_id`.
+|   ├── train_data                 training-set directory
+|   ├── test_data                  test-set directory
+|   ├── ItemCate.txt               category information for every item, used to initialize the tree.
+|   ├── Stat.txt                   per-item frequency in the training set, used for sampling.
+|   ├── tree.pb                    the initial tree file
+```
 
-1. Tree data comes from each business's real scenario and is constructed in different ways. Phase one of paddle-TDM does not provide a unified tree-construction pipeline, but it does standardize the format of the data fed into the Paddle network once the tree has been built. You can construct your tree with any tool, produce the specified data format, and use it for TDM network training.
-2. There are three core questions in network training:
-
-   - How to build the network? Answer: Paddle wraps a large number of deep-learning OPs, and users can design their own network structure as needed.
-   - How to organize the training data? Answer: TDM training data mainly consists of positive `user/query emb` and `item` pairs, where each `item` maps to a leaf node of the tree. Users only need to prepare data in this form. Negative samples are generated efficiently from the user-provided tree structure by Paddle's `tdm-sampler op`, which also attaches the corresponding labels for training the deep model inside TDM.
-   - How to train with large-scale data and models? Answer: Paddle's parameter-server distributed capability enables efficient large-scale distributed training. The paddle-fleet API keeps the learning curve low and flexibly supports incremental training, streaming training, and similar business needs.
-3. Once the model is trained, retrieval and scoring can be fused into the Paddle network, producing an inference_model and parameter files for fast deployment and efficient retrieval with the PaddlePaddle inference library or PaddleLite.
+- Step 2: Quick run. All training hyperparameters are configured in config.yaml, and the model runs the same way as other PaddleRec static-graph models. Tree-based models do not support dynamic-graph mode yet.
+
+```shell
+python -u ../../../tools/static_trainer.py -m config.yaml
+```
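Note: the README's complexity claim (O(n) full-corpus scan down to O(log n)) rests on layer-wise beam search over the tree, as described in the TDM paper. Below is a minimal Python sketch of that retrieval pattern, assuming a full binary tree addressed by the code arithmetic used in this repo's builder, with a hypothetical `score(user, node)` standing in for the deep model; it illustrates the idea and is not the Paddle inference path.

```python
# Sketch of TDM's layer-wise beam-search retrieval over a full binary tree.
# Node codes follow the builder's arithmetic: code c has children 2*c+1 and
# 2*c+2; with num_leaves leaves, leaf codes start at num_leaves - 1.
# score(user, node) is a hypothetical stand-in for the trained deep model.

def retrieve(user, num_leaves, score, beam=2, topk=2):
    first_leaf = num_leaves - 1
    frontier = [0]  # start at the root
    while frontier[0] < first_leaf:
        # expand the kept nodes into their children on the next layer
        children = [c for n in frontier for c in (2 * n + 1, 2 * n + 2)]
        # keep only the `beam` best-scoring children: the model is evaluated
        # O(beam * branch) times per layer, over O(log n) layers in total
        children.sort(key=lambda c: score(user, c), reverse=True)
        frontier = children[:beam]
    return sorted(frontier, key=lambda c: score(user, c), reverse=True)[:topk]

# toy run: 8 leaves (codes 7..14), scoring each node by its code value
print(retrieve(user=None, num_leaves=8, score=lambda u, c: c))  # [14, 13]
```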

models/treebased/builder/tree_index_builder.py

Lines changed: 7 additions & 7 deletions
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from paddle.fluid.proto import index_dataset_pb2
+from paddle.distributed.fleet.proto import index_dataset_pb2
 import numpy as np
 import struct
 import argparse
@@ -97,20 +97,20 @@ def build(self, output_filename, ids, codes, data=None, id_offset=None):
         min_code = 0
         max_code = codes[-1]
         while max_code > 0:
-            min_code = min_code * 2 + 1
-            max_code = int((max_code - 1) / 2)
+            min_code = min_code * self.branch + 1
+            max_code = int((max_code - 1) / self.branch)
 
         for i in range(len(codes)):
             while codes[i] < min_code:
-                codes[i] = codes[i] * 2 + 1
+                codes[i] = codes[i] * self.branch + 1
 
         filter_set = set()
         max_level = 0
         tree_meta = index_dataset_pb2.TreeMeta()
 
         with open(output_filename, 'wb') as f:
             for id, code in zip(ids, codes):
-                node = index_dataset_pb2.Node()
+                node = index_dataset_pb2.IndexNode()
                 node.id = id
                 node.is_leaf = True
                 node.probability = 1.0
@@ -126,7 +126,7 @@ def build(self, output_filename, ids, codes, data=None, id_offset=None):
             for ancessor in ancessors:
                 if ancessor not in filter_set:
-                    node = index_dataset_pb2.Node()
+                    node = index_dataset_pb2.IndexNode()
                     node.id = id_offset + ancessor  # id = id_offset + code
                     node.is_leaf = False
                     node.probability = 1.0
@@ -146,7 +146,7 @@ def build(self, output_filename, ids, codes, data=None, id_offset=None):
     def _ancessors(self, code):
         ancs = []
         while code > 0:
-            code = int((code - 1) / 2)
+            code = int((code - 1) / self.branch)
             ancs.append(code)
         return ancs
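Note: these edits generalize the builder from a hardcoded binary tree to an arbitrary branch factor `self.branch`. The code scheme behind the arithmetic is that, in a level-order layout, node `c` has children `k*c+1 … k*c+k` and parent `(c-1)//k`, which is exactly what the updated `build` and `_ancessors` compute. A small self-contained check of that arithmetic (plain Python, independent of the builder; the `children`/`ancestors` helpers are illustrative, not part of the repo):

```python
# Self-contained check of the k-ary code arithmetic that self.branch
# generalizes: node c has children k*c+1 .. k*c+k and parent (c-1)//k.

def children(code, k):
    return [code * k + i for i in range(1, k + 1)]

def ancestors(code, k):  # mirrors _ancessors with branch = k
    ancs = []
    while code > 0:
        code = (code - 1) // k
        ancs.append(code)
    return ancs

k = 3
assert children(0, k) == [1, 2, 3]
assert children(2, k) == [7, 8, 9]
assert ancestors(8, k) == [2, 0]  # 8 -> parent 2 -> root 0
# build() pushes a leaf one level deeper via code*k + 1, i.e. onto its
# leftmost child, which is exactly the min_code / while-loop update above
assert children(2, k)[0] == 2 * k + 1
```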

models/treebased/data_prepare.sh

Lines changed: 4 additions & 0 deletions
@@ -8,5 +8,9 @@ then
     python builder/tree_index_builder.py --mode "by_category" --branch 2 --input "demo_data/ItemCate.txt" --output "demo_data/tree.pb"
 elif [[ ${type} = "user_behaviour" ]]
 then
+    # wget --no-check-certificate https://paddlerec.bj.bcebos.com/tree-based/data/UserBehavior.csv.zip -O data/UserBehavior.csv.zip
+    # unzip -d data/ data/UserBehavior.csv.zip
+    # python data/data_cutter.py --input "./data/UserBehavior.csv" --train "./data/ub_train.csv" --test "./data/ub_test.csv" --number 10000
+    python data/data_generator.py --train_file "data/ub_train.csv" --test_file "data/ub_test.csv" --item_cate_filename "ub_data_new/ItemCate.txt" --stat_file "ub_data_new/Stat.txt" --train_dir "ub_data_new/train_data" --test_dir "ub_data_new/test_data" --sample_dir "ub_data_new/samples" --parall 32 --train_sample_seg_cnt 400 --seq_len 70 --min_seq_len 6
     echo "ub"
 fi

models/treebased/tdm/config.yaml

Lines changed: 3 additions & 2 deletions
@@ -30,7 +30,7 @@ runner:
 
   train_batch_size: 100 # 30000
   epochs: 5
-  print_interval: 1000 # 1000
+  print_interval: 10 # 1000
   model_save_path: "tdm_demo_output"
 
 # hyper parameters of user-defined network
@@ -40,7 +40,8 @@ hyper_parameters:
     class: Adam
     learning_rate: 0.001
     strategy: async
-
+
+  with_att: False
   # tree
   sparse_feature_num: 5171136
   node_emb_size: 24

models/treebased/tdm/config_ub.yaml

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ hyper_parameters:
     learning_rate: 0.001
     strategy: async
 
+  with_att: False
   # tree
   sparse_feature_num: 9357374
   node_emb_size: 24

models/treebased/tdm/model.py

Lines changed: 34 additions & 26 deletions
@@ -140,6 +140,7 @@ def dnn_model_define(user_input,
                      fea_groups="20,20,10,10,2,2,2,1,1,1",
                      active_op='prelu',
                      use_batch_norm=True,
+                     with_att=False,
                      is_infer=False,
                      topk=10):
     fea_groups = [int(s) for s in fea_groups.split(',')]
@@ -148,30 +149,37 @@ def dnn_model_define(user_input,
 
     layer_data = []
     # start att
-    att_user_input = paddle.concat(
-        user_input, axis=1)  # [bs, total_group_length, emb_size]
-    att_node_input = fluid.layers.expand(
-        unit_id_emb, expand_times=[1, total_group_length, 1])
-    att_din = paddle.concat(
-        [att_user_input, att_user_input * att_node_input, att_node_input],
-        axis=2)
-
-    att_active_op = 'prelu'
-    att_layer_arr = []
-    att_layer1 = FullyConnected3D(
-        3 * node_emb_size, 36, active_op=att_active_op, version=1)
-    att_layer_arr.append(att_layer1)
-    att_layer2 = FullyConnected3D(36, 1, active_op=att_active_op, version=2)
-    att_layer_arr.append(att_layer2)
-
-    layer_data.append(att_din)
-    for layer in att_layer_arr:
-        layer_data.append(layer.call(layer_data[-1]))
-    att_dout = layer_data[-1]
-
-    att_dout = fluid.layers.expand(
-        att_dout, expand_times=[1, 1, node_emb_size])
-    user_input = att_user_input * att_dout
+    if with_att:
+        print("TDM Attention DNN")
+        att_user_input = paddle.concat(
+            user_input, axis=1)  # [bs, total_group_length, emb_size]
+        att_node_input = fluid.layers.expand(
+            unit_id_emb, expand_times=[1, total_group_length, 1])
+        att_din = paddle.concat(
+            [att_user_input, att_user_input * att_node_input, att_node_input],
+            axis=2)
+
+        att_active_op = 'prelu'
+        att_layer_arr = []
+        att_layer1 = FullyConnected3D(
+            3 * node_emb_size, 36, active_op=att_active_op, version=1)
+        att_layer_arr.append(att_layer1)
+        att_layer2 = FullyConnected3D(
+            36, 1, active_op=att_active_op, version=2)
+        att_layer_arr.append(att_layer2)
+
+        layer_data.append(att_din)
+        for layer in att_layer_arr:
+            layer_data.append(layer.call(layer_data[-1]))
+        att_dout = layer_data[-1]
+
+        att_dout = fluid.layers.expand(
+            att_dout, expand_times=[1, 1, node_emb_size])
+        user_input = att_user_input * att_dout
+    else:
+        print("TDM DNN")
+        user_input = paddle.concat(
+            user_input, axis=1)  # [bs, total_group_length, emb_size]
     # end att
 
     idx = 0
@@ -207,13 +215,13 @@ def dnn_model_define(user_input,
     layer_arr.append(layer2)
     layer3 = paddle_dnn_layer(
         64,
-        32,
+        24,
         active_op=active_op,
         use_batch_norm=use_batch_norm,
         version="%d_%s" % (3, net_version))
     layer_arr.append(layer3)
     layer4 = paddle_dnn_layer(
-        32,
+        24,
         2,
         active_op='',
         use_batch_norm=False,
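Note: this refactor turns the always-on DIN-style attention into an opt-in branch. With `with_att=False` (the new default, wired through config.yaml and static_model.py) the user embeddings are simply concatenated; with `with_att=True` each position of the concatenated user input is gated by a learned scalar derived from the candidate node embedding. Below is a NumPy sketch of the shape flow in the attention branch, with random projections standing in as hypothetical substitutes for the two `FullyConnected3D` layers and NumPy broadcasting replacing the `fluid.layers.expand` calls; shapes only, not the Paddle graph.

```python
# NumPy sketch of the with_att=True shape flow. total_group_length = 69
# matches the sum of the default fea_groups "20,20,10,10,2,2,2,1,1,1".
import numpy as np

bs, total_group_length, emb_size = 4, 69, 24

att_user_input = np.random.randn(bs, total_group_length, emb_size)
unit_id_emb = np.random.randn(bs, 1, emb_size)
att_node_input = np.broadcast_to(unit_id_emb, att_user_input.shape)

# concat([u, u*n, n], axis=2) -> [bs, total_group_length, 3*emb_size]
att_din = np.concatenate(
    [att_user_input, att_user_input * att_node_input, att_node_input], axis=2)

w1 = np.random.randn(3 * emb_size, 36)       # stand-in for FullyConnected3D(3*emb, 36)
w2 = np.random.randn(36, 1)                  # stand-in for FullyConnected3D(36, 1)
att_dout = np.maximum(att_din @ w1, 0) @ w2  # [bs, total_group_length, 1]

# the per-position scalar gates every channel of the user input
user_input = att_user_input * att_dout       # [bs, total_group_length, emb_size]
print(user_input.shape)  # (4, 69, 24)
```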

models/treebased/tdm/reader.py

Lines changed: 14 additions & 19 deletions
@@ -22,7 +22,7 @@
 import sys
 import paddle.distributed.fleet as fleet
 import logging
-from paddle.distributed.fleet.data_generator import TreeIndex
+from paddle.distributed.fleet.dataset import TreeIndex
 
 logging.basicConfig(
     format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
@@ -41,27 +41,23 @@ def init(self, config):
         self.with_hierachy = config.get("hyper_parameters.with_hierachy", True)
         self.seed = config.get("hyper_parameters.seed", 0)
 
-        self.set_tree_layerwise_sampler(
-            self.tree_name,
-            self.sample_layer_counts,
-            range(self.item_nums),
-            self.item_nums,
-            self.item_nums + 1,
-            start_sample_layer=self.start_sample_layer,
-            seed=self.seed,
-            with_hierarchy=self.with_hierachy)
+        self.tree = TreeIndex(
+            config.get("hyper_parameters.tree_name"),
+            config.get("hyper_parameters.tree_path"))
+        self.tree.init_layerwise_sampler(self.sample_layer_counts,
+                                         self.start_sample_layer, self.seed)
 
     def line_process(self, line):
-        history_ids = [[0]] * (self.item_nums + 2)
+        history_ids = [0] * (self.item_nums)
         features = line.strip().split("\t")
         item_id = int(features[1])
         for item in features[2:]:
             slot, feasign = item.split(":")
             slot_id = int(slot.split("_")[1])
-            history_ids[slot_id - 1] = [int(feasign)]
-        history_ids[-2] = [item_id]
-        history_ids[-1] = [1]
-        return history_ids
+            history_ids[slot_id - 1] = int(feasign)
+        res = self.tree.layerwise_sample([history_ids], [item_id],
+                                         self.with_hierachy)
+        return res
 
     def generate_sample(self, line):
         "Dataset Generator"
@@ -73,7 +69,9 @@ def reader():
             feature_name.append("item_" + str(i + 1))
         feature_name.append("unit_id")
         feature_name.append("label")
-        yield zip(feature_name, output_list)
+        for _ in output_list:
+            output = [[item] for item in _]
+            yield zip(feature_name, output)
 
     return reader
 
@@ -87,8 +85,5 @@
     config = yaml_helper.load_yaml(yaml_path)
 
     r = MyDataset()
-    tree = TreeIndex(
-        config.get("hyper_parameters.tree_name"),
-        config.get("hyper_parameters.tree_path"))
     r.init(config)
     r.run_from_stdin()
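Note: after this change the reader owns its `TreeIndex` and performs the layer-wise sampling itself: `line_process` returns one row per sampled tree node rather than a single padded record, and `generate_sample` wraps every scalar into a single-element list before yielding. A minimal sketch of that output format with hypothetical values (the real `feature_name` holds `item_nums` history slots before `unit_id` and `label`):

```python
# Sketch of the reader's new per-node output: layerwise_sample yields one
# row per sampled node, and each scalar is wrapped as a one-element list.
feature_name = ["item_1", "item_2", "unit_id", "label"]
output_list = [
    [1001, 1002, 7, 1],   # a positive node on some tree layer
    [1001, 1002, 9, 0],   # a sampled negative node on the same layer
]
for row in output_list:
    output = [[v] for v in row]
    print(list(zip(feature_name, output)))
# [('item_1', [1001]), ('item_2', [1002]), ('unit_id', [7]), ('label', [1])]
# [('item_1', [1001]), ('item_2', [1002]), ('unit_id', [9]), ('label', [0])]
```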

models/treebased/tdm/static_model.py

Lines changed: 3 additions & 1 deletion
@@ -35,6 +35,7 @@ def _init_hyper_parameters(self):
         self.item_nums = self.config.get("hyper_parameters.item_nums", 69)
         self.fea_group = self.config.get("hyper_parameters.fea_group",
                                          "20,20,10,10,2,2,2,1,1,1")
+        self.with_att = self.config.get("hyper_parameters.with_att", False)
 
     def create_feeds(self, is_infer=False):
         user_input = [
@@ -80,7 +81,8 @@ def net(self, input, is_infer=False):
             unit_id_emb,
             input[-1],
             node_emb_size=self.node_emb_size,
-            fea_groups=self.fea_group)
+            fea_groups=self.fea_group,
+            with_att=self.with_att)
         self._cost = avg_cost
 
         self.inference_target_var = softmax_prob
