Commit ba26394

Merge pull request #790 from renmada/kim
Add Kim model
2 parents 08c367f + d40a89d commit ba26394

24 files changed: 6830 additions & 0 deletions

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -125,6 +125,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # 静态图训
125125
| Match | [DSSM](models/match/dssm/)([docs](https://paddlerec.readthedocs.io/en/latest/models/match/dssm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238124) || x | >=2.1.0 | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
126126
| Match | [Match-Pyramid](models/match/match-pyramid/)([docs](https://paddlerec.readthedocs.io/en/latest/models/match/match-pyramid.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238192) || x | >=2.1.0 | [AAAI 2016][Text Matching as Image Recognition](https://arxiv.org/pdf/1602.06359.pdf) |
127127
| Match | [MultiView-Simnet](models/match/multiview-simnet/)([docs](https://paddlerec.readthedocs.io/en/latest/models/match/multiview-simnet.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238206) || x | >=2.1.0 | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
128+
| Match | [KIM](models/match/kim/)([docs](https://paddlerec.readthedocs.io/en/latest/models/match/kim.html)) | - | x | x | >=2.1.0 | [SIGIR 2021][Personalized News Recommendation with Knowledge-aware Interactive Matching](https://arxiv.org/pdf/2104.10083.pdf) |
128129
| Recall | [TDM](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/treebased/tdm/) | - || >=1.8.0 | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [KDD 2018][Learning Tree-based Deep Model for Recommender Systems](https://arxiv.org/pdf/1801.02294.pdf) |
129130
| Recall | [FastText](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/recall/fasttext/) | - | x | x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [EACL 2017][Bag of Tricks for Efficient Text Classification](https://www.aclweb.org/anthology/E17-2068.pdf) |
130131
| Recall | [MIND](models/recall/mind/)([docs](https://paddlerec.readthedocs.io/en/latest/models/recall/mind.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3239088) | x | x | >=2.1.0 | [2019][Multi-Interest Network with Dynamic Routing for Recommendation at Tmall](https://arxiv.org/pdf/1904.08030.pdf) |

README_EN.md

Lines changed: 211 additions & 0 deletions
Large diffs are not rendered by default.

datasets/kim/run.sh

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
wget https://paddlerec.bj.bcebos.com/datasets/kim/kim.zip
unzip kim.zip

doc/imgs/kim.png

66.8 KB

doc/source/models/match/kim.md

Lines changed: 98 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
1+
# kim文本匹配模型
2+
3+
4+
以下是本例的简要目录结构及说明:
5+
6+
```
7+
├── data #样例数据
8+
├── sample_data #样例数据
9+
├── docs.tsv #新闻文本
10+
├── entity2id.txt #实体映射表
11+
├── glove.840B.300d.txt #glove词向量
12+
├── KGGraph #知识图谱关系
13+
├── KGGraph #知识图谱关系
14+
├── train.tsv #训练数据样例
15+
├── test.txt #测试数据样例
16+
├── __init__.py
17+
├── README.md #文档
18+
├── config.yaml # sample数据配置
19+
├── config_bigdata.yaml # 全量数据配置
20+
├── dygraph_model.py # 构建动态图
21+
├── net.py # 模型核心组网(动静统一)
22+
├── mind_reader.py #数据读取程序
23+
├── eval_utils #评估函数
24+
```

Note: before reading this example, we recommend familiarizing yourself with the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
[kim](https://paddlerec.readthedocs.io/en/latest/models/match/kim.html)

## Contents

- [Model overview](#模型简介)
- [Data preparation](#数据准备)
- [Runtime environment](#运行环境)
- [Quick start](#快速开始)
- [Model architecture](#模型组网)
- [Reproducing the results](#效果复现)
- [Advanced usage](#进阶使用)
- [FAQ](#FAQ)


## Data preparation
Training and testing use the MIND news dataset together with knowledge graph data; the glove.840B.300d word vectors are used to initialize the embedding layer.
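
As a rough illustration of how pretrained word vectors can seed an embedding layer, the sketch below parses the GloVe text format (one token followed by its vector on each line) and fills an embedding table for a small vocabulary. The function name `build_embedding_table`, the toy vocabulary, the reserved padding row 0, and the random fallback for words missing from GloVe are all assumptions made for illustration; this is not the code in mind_reader.py.

```python
import io
import random


def build_embedding_table(glove_lines, vocab, dim, seed=0):
    """Return one vector per vocabulary index, as a list of lists.

    glove_lines: iterable of lines in GloVe text format, "token v1 ... v_dim".
    vocab: dict mapping word -> row index in 1..len(vocab) (hypothetical layout;
           row 0 is assumed to be reserved for padding).
    """
    rng = random.Random(seed)
    # Assumption: padding stays all-zero, unseen words get small random vectors.
    table = [[0.0] * dim] + [
        [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(len(vocab))
    ]
    for line in glove_lines:
        # Split from the right so that a token containing spaces stays intact.
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        word = parts[0]
        if word in vocab:
            table[vocab[word]] = [float(x) for x in parts[1:]]
    return table


# Toy stand-in for glove.840B.300d.txt, with 3 dimensions instead of 300.
toy_glove = io.StringIO("news 0.1 0.2 0.3\nsports 0.4 0.5 0.6\n")
vocab = {"news": 1, "sports": 2, "paddle": 3}
table = build_embedding_table(toy_glove, vocab, dim=3)
print(table[1])  # [0.1, 0.2, 0.3]
```

With the real file, `glove_lines` would be an open file handle and `dim` would be 300, matching `embedding_size: 300` in the configuration.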
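
## Runtime environment
PaddlePaddle>=2.0
nltk>=3.7
Python 3.7

OS: Windows/Linux/macOS

## Quick start
**Before starting, make sure the nltk English tokenizer model has been downloaded to your home directory.**
This example ships with sample data so that you can try the model quickly; the commands can be run from any directory. The quick-start commands, run from the kim model directory, are:
```bash
# Enter the model directory
# cd models/match/kim # can be run from any directory
# Dynamic-graph training
python -u trainer.py -m config.yaml -o mode=train # use config_bigdata.yaml for the full dataset
# Dynamic-graph inference
python -u infer.py -m config.yaml -o mode=test
```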
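
## Model architecture
The core of personalized news recommendation is accurate matching between candidate news and user interest. Most existing news recommendation methods model candidate news from its textual content and model user interest from the user's clicked news, independently of each other. However, a news article may cover multiple aspects and entities, and a user may have several kinds of interest, so modeling the two sides independently can lead to inferior matching between news and users.

The KIM paper proposes a knowledge-aware interactive matching framework for personalized news recommendation. The method models candidate news and user interest interactively to learn a user-aware candidate news representation and a candidate-news-aware user interest representation, which facilitates accurate matching between the two. More specifically, a knowledge co-encoder interactively learns knowledge-based representations of clicked news and candidate news by capturing the relatedness between their entities with the help of a knowledge graph. A text co-encoder interactively learns text-based representations of clicked news and candidate news by modeling the semantic relatedness between their texts. Finally, a user-news co-encoder learns the candidate-news-aware user interest representation and the user-aware candidate news representation from the knowledge-based and text-based representations of the candidate news and clicked news, for better interest matching. Extensive experiments on two real-world datasets show that the method effectively improves news recommendation performance. The overall architecture is shown below:
<p align="center">
<img align="center" src="../../../doc/imgs/kim.png">
</p>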
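
To make the idea of interactive modeling more concrete, here is a minimal sketch in which each side is re-expressed in terms of the other before the two are compared. It uses plain scaled dot-product cross-attention followed by mean pooling, a deliberately simple stand-in chosen for brevity; it is not the co-encoder implementation in net.py, and the toy vectors and all function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys):
    """For each query vector, return an attention-weighted sum of the key vectors."""
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(q))])
    return out


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


clicked = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word vectors of a clicked news title
candidate = [[1.0, 0.0], [0.5, 0.5]]            # toy word vectors of a candidate news title

# Clicked-news words attend over the candidate, and vice versa, so neither
# representation is computed in isolation from the other.
candidate_aware_clicked = mean_pool(attend(clicked, candidate))
user_aware_candidate = mean_pool(attend(candidate, clicked))
score = dot(candidate_aware_clicked, user_aware_candidate)
print(score)
```

The point of the sketch is only the data flow: the representation of the clicked news depends on which candidate is being scored, and the representation of the candidate depends on the user's history.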


## Reproducing the results
To help users get each model running quickly, sample data is provided with every model. To reproduce the results reported in the README, follow the steps below in order.
On the full dataset the model reaches the following metrics:

| Model | AUC | MRR | nDCG@5 | nDCG@10 | batch_size | epoch_num | Time per epoch |
|-----|-----|-----|-----|-----|------------|-----------|--------------------|
| kim | 0.6681 | 0.3164 | 0.3484 | 0.4132 | 16 | 7 | 2h |
| kim | 0.6696 | 0.3192 | 0.3515 | 0.4158 | 16 | 8 | 2h |
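
For readers unfamiliar with the four metrics in the table, the sketch below computes them for a single impression (one user, one list of scored candidates), following the definitions commonly used for the MIND benchmark. Whether eval_utils uses exactly these formulas is an assumption; the function names and the toy labels and scores are illustrative only.

```python
import math


def auc_score(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ranked_labels(labels, scores):
    """Labels reordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr_score(labels, scores):
    """Mean of 1/rank over the clicked items."""
    ranked = ranked_labels(labels, scores)
    return sum(l / (i + 1) for i, l in enumerate(ranked)) / sum(ranked)


def dcg_at_k(ranked, k):
    return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ranked[:k]))


def ndcg_score(labels, scores, k):
    """DCG@k of the predicted ranking divided by DCG@k of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked_labels(labels, scores), k) / ideal


labels = [1, 0, 0, 1, 0]            # 1 = clicked
scores = [0.9, 0.8, 0.1, 0.3, 0.2]  # model scores for the same candidates
print(round(auc_score(labels, scores), 4))  # 0.8333
print(round(mrr_score(labels, scores), 4))  # 0.6667
print(ndcg_score(labels, scores, 5), ndcg_score(labels, scores, 10))
```

The reported numbers would then be averages of these per-impression values over all test impressions.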
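
1. Make sure your current directory is PaddleRec/models/match/kim.
2. Change to the paddlerec/datasets/kim directory and run the script there. It downloads the preprocessed full KIM dataset from a mirror server in China and unpacks it into the target folder.
```bash
cd ../../../datasets/kim
bash run.sh
```
3. Switch back to the model directory and run training and inference with the full-data configuration:
```bash
cd ../../models/match/kim
python -u trainer.py -m config_bigdata.yaml -o mode=train
python -u infer.py -m config_bigdata.yaml -o mode=test
```

## Advanced usage

## FAQ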

models/match/kim/config.yaml

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
# global settings
16+
17+
runner:
18+
train_data_dir: "./data/sample_data"
19+
train_reader_path: "mind_reader" # importlib format
20+
use_gpu: False
21+
use_auc: False
22+
train_batch_size: 1
23+
epochs: 1
24+
print_interval: 1
25+
#model_init_path: "output_model/0" # init model
26+
model_save_path: "output_model_kim"
27+
test_data_dir: "./data/sample_data"
28+
infer_reader_path: "mind_reader" # importlib format
29+
infer_batch_size: 1
30+
infer_load_path: "output_model_kim"
31+
infer_start_epoch: 0
32+
infer_end_epoch: 1
33+
random_emb: true
34+
35+
# hyper parameters of user-defined network
36+
hyper_parameters:
37+
# optimizer config
38+
optimizer:
39+
class: Adam
40+
learning_rate: 0.00005
41+
# user-defined <key, value> pairs
42+
max_sentence: 30
43+
max_sents: 50
44+
max_entity_num: 10
45+
npratio: 4
46+
hidden_size: 400
47+
embedding_size: 300
48+
vocab_size: 1891
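
The configuration does not explain its hyperparameters, so the following sketch shows one plausible reading of three of them. It assumes that `npratio` is the number of non-clicked candidates sampled for each clicked one (a common convention in MIND-style training), that `max_sentence` caps the number of tokens per title, and that `max_entity_num` caps the number of entities per news item. These interpretations, the padding ID 0, and both helper names are assumptions rather than facts taken from mind_reader.py.

```python
import random


def sample_negatives(negatives, npratio, rng):
    """Pick npratio non-clicked news IDs to accompany one clicked news item."""
    if len(negatives) >= npratio:
        return rng.sample(negatives, npratio)
    # Too few negatives: keep them all and fill up with the padding ID (assumed 0).
    return negatives + [0] * (npratio - len(negatives))


def pad_or_truncate(ids, max_len, pad_id=0):
    """Force a token or entity ID list to exactly max_len entries."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


rng = random.Random(0)
# One training sample: the clicked item first, followed by npratio negatives.
candidates = [101] + sample_negatives([7, 8, 9, 10, 11, 12], npratio=4, rng=rng)
labels = [1, 0, 0, 0, 0]

title = pad_or_truncate([5, 17, 3], max_len=30)             # max_sentence: 30
entities = pad_or_truncate(list(range(1, 15)), max_len=10)  # max_entity_num: 10
print(len(candidates), len(title), len(entities))  # 5 30 10
```

Under this reading, every training sample has a fixed shape regardless of how long the title is or how many negatives the impression contains, which is what allows samples to be batched.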
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# global settings

runner:
  train_data_dir: "../../../datasets/kim/data/whole_data"
  train_reader_path: "mind_reader" # importlib format
  use_gpu: True
  use_auc: False
  train_batch_size: 16
  epochs: 8
  print_interval: 50
  #model_init_path: "output_model/0" # init model
  model_save_path: "output_model_kim_all"
  test_data_dir: "../../../datasets/kim/data/whole_data"
  infer_reader_path: "mind_reader" # importlib format
  infer_batch_size: 64
  infer_load_path: "output_model_kim_all"
  infer_start_epoch: 6
  infer_end_epoch: 8
  random_emb: false

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.00005
  # user-defined <key, value> pairs
  max_sentence: 30
  max_sents: 50
  max_entity_num: 10
  npratio: 4
  hidden_size: 400
  embedding_size: 300
  vocab_size: 42055
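
The `infer_start_epoch` and `infer_end_epoch` settings above select which saved checkpoints are evaluated. The sketch below assumes a half-open range and one subdirectory per zero-based epoch under `infer_load_path`, which is consistent with the commented-out `model_init_path: "output_model/0"` but is not confirmed by this diff; the helper name is hypothetical.

```python
import os


def checkpoint_dirs(load_path, start_epoch, end_epoch):
    """List the per-epoch checkpoint directories an inference run would visit.

    Assumes checkpoints live in <load_path>/<epoch> and that end_epoch is exclusive.
    """
    return [os.path.join(load_path, str(epoch)) for epoch in range(start_epoch, end_epoch)]


dirs = checkpoint_dirs("output_model_kim_all", 6, 8)
print(dirs)
```

If that assumption holds, the values 6 and 8 evaluate the checkpoints written after the seventh and eighth training epochs.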

models/match/kim/data/sample_data/KGGraph.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
