|
| 1 | +# TDM Demo: Tree Building and Training |
| 2 | + |
| 3 | +## Environment for Tree Building |
| 4 | +Requirements: |
| 5 | +- python >= 2.7 |
| 6 | +- paddlepaddle >= 1.7.2 (1.7.2 recommended) |
| 7 | +- paddle-rec (clone paddlerec from GitHub, then run `python setup.py install`) |
| 8 | +- sklearn |
| 9 | +- anytree |
| 10 | + |
| 11 | + |
| 12 | +## Tree-Building Workflow |
| 13 | + |
| 14 | +### Generate the Embeddings Needed for Tree Building |
| 15 | + |
| 16 | +- Generate fake embeddings |
| 17 | + |
| 18 | +```shell |
| 19 | +cd gen_tree |
| 20 | +python -u emb_util.py |
| 21 | +``` |
| 22 | + |
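| 23 | +The generated embedding has shape [13, 64]: there are 13 items, and each item has a 64-dimensional embedding. The generated item_emb is written to `gen_tree/item_emb.txt`. |
| 24 | + |
| 25 | +Each line has the format `emb_value_0(float) <space> emb_value_1(float) ... emb_value_63(float) \t item_id ` |
| 26 | + |
| 27 | +In the demo, item ids must start at 0 and fall in the range [0, item_nums-1]. |
| 28 | + |
| 29 | +In real-world scenarios, various hash mappings can be used to meet this requirement. |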
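
The line format and the id constraint above can be sketched as follows. This is only an illustration, not the demo's `emb_util.py`: the helper names `remap_item_ids` and `format_emb_line` and the raw ids are hypothetical, a plain dictionary stands in for whatever mapping a real system uses, and it assumes the spaces around `\t` in the format description are there for readability only.

```python
def remap_item_ids(raw_ids):
    """Assign each distinct raw id a contiguous index in [0, item_nums - 1]."""
    mapping = {}
    for raw in raw_ids:
        if raw not in mapping:
            # First time we see this id: give it the next free index.
            mapping[raw] = len(mapping)
    return mapping


def format_emb_line(values, item_id):
    """Space-separated floats, then a tab, then the item id."""
    return " ".join(str(float(v)) for v in values) + "\t" + str(item_id)


# Hypothetical raw ids; duplicates map to the same index.
mapping = remap_item_ids(["sku_901", "sku_17", "sku_901", "sku_44"])
# A 3-dimensional embedding keeps the example short (the demo uses 64).
line = format_emb_line([0.5, -1.25, 2.0], mapping["sku_17"])
```

Keeping the mapping around (for example as a JSON file) lets you translate tree results back to the original item ids later.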
| 30 | + |
| 31 | +### Cluster the Item Embeddings to Build the Tree |
| 32 | + |
| 33 | +Run: |
| 34 | + |
| 35 | +```shell |
| 36 | +cd gen_tree |
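| 37 | +# emd_path: path to the item_emb file |
| 38 | +# emb_size: second dimension of item_emb, i.e. the embedding size of each item (64 in this example) |
| 39 | +# threads: number of threads used for multi-threaded tree building |
| 40 | +# n_clusters: branching factor of the final tree; set to 2 here for a binary tree |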
| 41 | +python gen_tree.py --emd_path item_emb.txt --emb_size 64 --output_dir ./output --threads 1 --n_clusters 2 |
| 42 | +``` |
| 43 | + |
| 44 | +The tree-structure files needed for training are generated in `gen_tree/output`: |
| 45 | +```shell |
| 46 | +. |
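| 47 | +├── id2item.json # mapping from tree node id to item id |
| 48 | +├── layer_list.txt # the nodes in each layer of the tree |
| 49 | +├── travel_list.npy # each item's traversal path from root to leaf, ordered by item |
| 50 | +├── travel_list.txt # plain-text version of the previous file |
| 51 | +├── tree_embedding.txt # embeddings of all nodes, ordered by node id |
| 52 | +├── tree_emb.npy # .npy version of the previous file |
| 53 | +├── tree_info.npy # per node: whether it maps to an item / parent / layer / children, ordered by node |
| 54 | +├── tree_info.txt # plain-text version of the previous file |
| 55 | +└── tree.pkl # tree structure produced by clustering |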
| 56 | +``` |
| 57 | + |
| 58 | +Of the files produced by tree building, the following four are used for network training; see `models/treebased/tdm/config.yaml`: |
| 59 | + |
| 60 | +1. layer_list.txt |
| 61 | +2. travel_list.npy |
| 62 | +3. tree_info.npy |
| 63 | +4. tree_emb.npy |
| 64 | + |
| 65 | + |
| 66 | +### Run Training |
| 67 | + |
| 68 | +- Update the settings in `config.yaml` |
| 69 | + |
| 70 | +First, change: |
| 71 | +```yaml |
| 72 | +hyper_parameters: |
| 73 | +  # ... |
| 74 | +  tree: |
| 75 | +    # For single-machine training, load the tree only once, save it as paddle tensors, then warm-start from the paddle model |
| 76 | +    # In distributed training, each trainer must load the tree independently |
| 77 | +    # For inference, also load from the paddle model |
| 78 | +    load_tree_from_numpy: True # only once |
| 79 | +    load_paddle_model: False # needed for train & infer; set to True after the first load from npy |
| 80 | +    tree_layer_path: "{workspace}/tree/layer_list.txt" |
| 81 | +    tree_travel_path: "{workspace}/tree/travel_list.npy" |
| 82 | +    tree_info_path: "{workspace}/tree/tree_info.npy" |
| 83 | +    tree_emb_path: "{workspace}/tree/tree_emb.npy" |
| 84 | +``` |
| 85 | +Set the paths above to the location of the files produced by tree building. |
| 86 | + |
| 87 | +Then change: |
| 88 | +```yaml |
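| 89 | +hyper_parameters: |
| 90 | +  max_layers: 4 # number of tree layers, excluding the root |
| 91 | +  node_nums: 26 # total number of nodes in the tree; equals the number of rows in the tree_info file |
| 92 | +  leaf_node_nums: 13 # total number of leaf nodes in the tree |
| 93 | +  layer_node_num_list: [2, 4, 8, 10] # number of nodes in each layer of the tree |
| 94 | +  child_nums: 2 # maximum number of children per node (the branching factor) |
| 95 | +  neg_sampling_list: [1, 2, 3, 4] # number of negative samples per layer; a user-defined training parameter |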
| 96 | +``` |
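
Before starting a run, the tree-related hyperparameters can be sanity-checked against each other. The sketch below is a hypothetical helper (`check_tree_config` is not part of PaddleRec) that encodes only relationships implied by the comments in the config: one entry per layer, at most `child_nums ** depth` nodes at a given depth, and, as an assumption on our part, fewer negative samples than nodes in each layer so that negatives can differ from the positive node.

```python
def check_tree_config(cfg):
    """Return a list of problems found; an empty list means the values agree."""
    problems = []
    layers = cfg["layer_node_num_list"]
    negs = cfg["neg_sampling_list"]

    # Both per-layer lists need exactly one entry per layer (root excluded).
    if len(layers) != cfg["max_layers"]:
        problems.append("layer_node_num_list length != max_layers")
    if len(negs) != cfg["max_layers"]:
        problems.append("neg_sampling_list length != max_layers")

    for depth, (n_nodes, n_neg) in enumerate(zip(layers, negs), start=1):
        # A tree with branching factor k has at most k**depth nodes at that depth.
        if n_nodes > cfg["child_nums"] ** depth:
            problems.append("layer %d has too many nodes" % depth)
        # Assumption: negatives are sampled from the other nodes of the layer.
        if n_neg >= n_nodes:
            problems.append("layer %d has too many negative samples" % depth)

    # The per-layer counts cover every non-root node, so they bound these totals.
    if cfg["node_nums"] < sum(layers):
        problems.append("node_nums is smaller than the per-layer total")
    if cfg["leaf_node_nums"] > sum(layers):
        problems.append("leaf_node_nums exceeds the per-layer total")
    return problems


demo_cfg = {
    "max_layers": 4,
    "node_nums": 26,
    "leaf_node_nums": 13,
    "layer_node_num_list": [2, 4, 8, 10],
    "child_nums": 2,
    "neg_sampling_list": [1, 2, 3, 4],
}
problems = check_tree_config(demo_cfg)
```

For the demo values shown above the returned list is empty; changing, say, `max_layers` without updating the two lists would be reported.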
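| 97 | + |
| 98 | +If you do not know the exact values for the parameters above, do a trial run: after paddlerec reads the files produced by tree building, it prints the details to the screen, as shown below: |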
| 99 | +```shell |
| 100 | +... |
| 101 | +File_list: ['models/treebased/tdm/data/train/demo_fake_input.txt'] |
| 102 | +2020-09-10 15:17:19,259 - INFO - Run TDM Trainer Startup Pass |
| 103 | +2020-09-10 15:17:19,283 - INFO - load tree from numpy |
| 104 | +2020-09-10 15:17:19,284 - INFO - TDM Tree leaf node nums: 13 |
| 105 | +2020-09-10 15:17:19,284 - INFO - TDM Tree max layer: 4 |
| 106 | +2020-09-10 15:17:19,284 - INFO - TDM Tree layer_node_num_list: [2, 4, 8, 10] |
| 107 | +2020-09-10 15:17:19,285 - INFO - Begin Save Init model. |
| 108 | +2020-09-10 15:17:19,394 - INFO - End Save Init model. |
| 109 | +Running SingleRunner. |
| 110 | +... |
| 111 | +``` |
| 112 | +Copy these values into the configuration. |
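
If you prefer not to copy the numbers by hand, the three `TDM Tree ...` log lines can be parsed with a few lines of Python. This is a rough sketch that assumes the log format is exactly as printed in the sample above; `parse_tree_log` and `KEY_MAP` are hypothetical names, not PaddleRec utilities.

```python
import ast
import re

# Matches the three "TDM Tree <name>: <value>" lines from the sample log.
LOG_PATTERN = re.compile(
    r"TDM Tree (leaf node nums|max layer|layer_node_num_list): (.+)$"
)

# Maps the name printed in the log to the key used in config.yaml.
KEY_MAP = {
    "leaf node nums": "leaf_node_nums",
    "max layer": "max_layers",
    "layer_node_num_list": "layer_node_num_list",
}


def parse_tree_log(lines):
    """Collect the tree statistics printed during the startup pass."""
    values = {}
    for line in lines:
        match = LOG_PATTERN.search(line.strip())
        if match:
            # literal_eval safely turns "13" into 13 and "[2, 4]" into a list.
            values[KEY_MAP[match.group(1)]] = ast.literal_eval(match.group(2))
    return values


sample_log = [
    "2020-09-10 15:17:19,283 - INFO - load tree from numpy",
    "2020-09-10 15:17:19,284 - INFO - TDM Tree leaf node nums: 13",
    "2020-09-10 15:17:19,284 - INFO - TDM Tree max layer: 4",
    "2020-09-10 15:17:19,284 - INFO - TDM Tree layer_node_num_list: [2, 4, 8, 10]",
]
parsed = parse_tree_log(sample_log)
```

The resulting dictionary uses the same key names as `hyper_parameters`, so its values can be pasted into `config.yaml` directly.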
| 113 | + |
| 114 | +- Train |
| 115 | + |
| 116 | +Run: |
| 117 | +``` |
| 118 | +cd /PaddleRec # root directory of the PaddleRec clone |
| 119 | +python -m paddlerec.run -m models/treebased/tdm/config.yaml |
| 120 | +``` |