Commit 838928d: change pyramid

1 parent 59acb6b

7 files changed: 20 additions, 14 deletions
models/match/dssm/data/preprocess.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -63,7 +63,7 @@
 # split the queries into train and test sets
 query_list = list(pos_dict.keys())
 #print(len(query_list))
-random.shuffle(query_list)
+#random.shuffle(query_list)
 train_query = query_list[:11600]
 test_query = query_list[11600:]
```
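
Commenting out the shuffle makes the 11600-query train/test split depend only on the order of pos_dict's keys, so repeated runs produce the same partition (on Python 3.7+, where dict order is insertion order). A minimal alternative sketch, in case a reproducible shuffled split is wanted; the seed value is hypothetical, not part of this commit:

```python
import random

query_list = list(range(20))  # stand-in for the real query ids
random.seed(2020)             # hypothetical fixed seed; any constant works
random.shuffle(query_list)    # still shuffled, but identical on every run
train_query = query_list[:16]
test_query = query_list[16:]
```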

models/match/dssm/readme.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -157,7 +157,7 @@ (the labels for the test set, from label.txt)
 Change slice_end under hyper_parameters from 8 to 128. Whenever you change the batch size, this parameter has to change with it.
 Change data_path under dataset_train to {workspace}/data/big_train
 Change data_path under dataset_infer to {workspace}/data/big_test
-Change trigram_d under hyper_parameters to 6327
+Change trigram_d under hyper_parameters to 5913

 5. Run the script to start training. The script runs python -m paddlerec.run -m ./config.yaml to launch training and writes the output to the result file, then runs transform.py to consolidate the data, and finally computes the positive/negative order ratio metric:
 ```
````
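
The four edits above can be applied in one step. A sketch using PyYAML, assuming config.yaml has the key layout the steps imply (a dataset list plus a hyper_parameters map); this helper script is not part of the repo:

```python
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["hyper_parameters"]["slice_end"] = 128   # must track the batch size
cfg["hyper_parameters"]["trigram_d"] = 5913  # input dim for the big dataset
for ds in cfg["dataset"]:                    # assumed: a list of named datasets
    if ds["name"] == "dataset_train":
        ds["data_path"] = "{workspace}/data/big_train"
    elif ds["name"] == "dataset_infer":
        ds["data_path"] = "{workspace}/data/big_test"

with open("config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```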

models/match/match-pyramid/data/process.py
Lines changed: 2 additions & 2 deletions

```diff
@@ -106,7 +106,7 @@ def make_train():
         pair_list.append((d1, high_d2, low_d2))
     print('Pair Instance Count:', len(pair_list))

-    f = open("./data/train/train.txt", "w")
+    f = open("./data/big_train/train.txt", "w")
     for batch in range(800):
         X1 = np.zeros((batch_size * 2, data1_maxlen), dtype=np.int32)
         X2 = np.zeros((batch_size * 2, data2_maxlen), dtype=np.int32)
@@ -131,7 +131,7 @@ def make_train():
 def make_test():
     rel = read_relation(filename=os.path.join(Letor07Path,
                                               'relation.test.fold1.txt'))
-    f = open("./data/test/test.txt", "w")
+    f = open("./data/big_test/test.txt", "w")
     for label, d1, d2 in rel:
         X1 = np.zeros(data1_maxlen, dtype=np.int32)
         X2 = np.zeros(data2_maxlen, dtype=np.int32)
```
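
process.py now writes into ./data/big_train and ./data/big_test, so it relies on data_process.sh having created those directories first (the mkdir lines in the next file). A small hardening sketch, not part of this commit, that removes the ordering dependency (Python 3):

```python
import os

# Create the output directories up front so process.py also works when
# run on its own, without the mkdir calls in data_process.sh.
os.makedirs("./data/big_train", exist_ok=True)
os.makedirs("./data/big_test", exist_ok=True)
```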

models/match/match-pyramid/data_process.sh
Lines changed: 3 additions & 1 deletion

```diff
@@ -3,7 +3,9 @@
 echo "...........load data................."
 wget --no-check-certificate 'https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz'
 mv ./match_pyramid_data.tar.gz ./data
-rm -rf ./data/relation.test.fold1.txt ./data/realtion.train.fold1.txt
+rm -rf ./data/relation.test.fold1.txt
 tar -xvf ./data/match_pyramid_data.tar.gz
+mkdir ./data/big_train
+mkdir ./data/big_test
 echo "...........data process..............."
 python ./data/process.py
```

models/match/match-pyramid/eval.py
Lines changed: 2 additions & 2 deletions

```diff
@@ -49,8 +49,8 @@ def eval_MAP(pred, gt):
 pred = []
 for line in open(filename):
     line = line.strip().split(",")
-    line[1] = line[1].split(":")
-    line = line[1][1].strip(" ")
+    line[3] = line[3].split(":")
+    line = line[3][1].strip(" ")
     line = line.strip("[")
     line = line.strip("]")
     pred.append(float(line))
```
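
The switch from line[1] to line[3] tracks a change in where the prediction value sits in the logged output: after splitting on commas, the prediction:[...] field is now the fourth field rather than the second. A trace on a hypothetical log line (the surrounding field names are an assumption, not taken from the repo):

```python
# hypothetical infer log line; the real format comes from paddlerec's output
line = "batch: 1, epoch: 0, time: 0.01s, prediction:[0.87362]"
line = line.strip().split(",")  # four comma-separated fields
line[3] = line[3].split(":")    # [' prediction', '[0.87362]']
line = line[3][1].strip(" ")    # '[0.87362]'
line = line.strip("[")          # '0.87362]'
line = line.strip("]")          # '0.87362'
print(float(line))              # 0.87362
```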

models/match/match-pyramid/readme.md
Lines changed: 8 additions & 6 deletions

````diff
@@ -56,10 +56,10 @@
 4. Embedding file: we store the pre-trained word vectors in an embedding file, for example embed_wiki-pdc_d50_norm

 ## Runtime environment
-PaddlePaddle>=1.7.2
-python 2.7/3.5/3.6/3.7
-PaddleRec >=0.1
-os : windows/linux/macos
+PaddlePaddle>=1.7.2
+python 2.7/3.5/3.6/3.7
+PaddleRec >=0.1
+os : windows/linux/macos

 ## Quick start

@@ -72,7 +72,7 @@ python -m paddlerec.run -m models/match/match-pyramid/config.yaml
 ## Reproducing the paper
 1. Make sure your current working directory is PaddleRec/models/match/match-pyramid
 2. We provide a download of the original dataset and a one-click preprocessing script that generates the training and test data; you can run it directly: bash data_process.sh
-   Running the script downloads the Letor07 dataset from a mirror server in China, deletes the existing relation.test.fold1.txt and relation.train.fold1.txt from the data folder, and unpacks the full dataset into the data folder. It then runs process.py, which places the full training data under `./data/train` and the full test data under `./data/test`, and generates the embedding.npy file used to initialize the embedding layer
+   Running the script downloads the Letor07 dataset from a mirror server in China and unpacks the full dataset into the data folder. It then runs process.py, which places the full training data under `./data/big_train` and the full test data under `./data/big_test`, and generates the embedding.npy file used to initialize the embedding layer
 The ideal output of running the script is:
 ```
 bash data_process.sh
@@ -123,6 +123,8 @@ data/embed_wiki-pdc_d50_norm
 3. Open the file config.yaml and change its parameters

 Change workspace to your current absolute path. (You can get it with the pwd command)
+Change the data_path parameter under dataset_train to {workspace}/data/big_train
+Change the data_path parameter under dataset_infer to {workspace}/data/big_test

 4. Then simply run: bash run.sh to reproduce the results from the paper
 This script runs python -m paddlerec.run -m ./config.yaml to train and test the model, saves the test output to the result.txt file, and finally runs eval.py to evaluate it and obtain the map metric
@@ -131,7 +133,7 @@ data/embed_wiki-pdc_d50_norm
 ..............test.................
 13651
 336
-('map=', 0.420878322843591)
+('map=', 0.3993127885738651)
 ```
 ## Advanced usage

````
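
For reference, the map value in the output above is a mean average precision over the test queries. A minimal sketch of the standard computation from (score, relevance) pairs grouped by query; eval.py's own grouping and tie handling may differ:

```python
def average_precision(ranked_labels):
    # mean of precision@k, taken at each position holding a relevant item
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(queries):
    # queries: one list of (score, relevance) pairs per query
    aps = []
    for pairs in queries:
        ranked = [rel for _, rel in sorted(pairs, key=lambda p: -p[0])]
        aps.append(average_precision(ranked))
    return sum(aps) / len(aps)

# toy example with two queries
print(mean_average_precision([
    [(0.9, 1), (0.3, 0), (0.7, 1)],  # AP = (1/1 + 2/2) / 2 = 1.0
    [(0.8, 0), (0.6, 1)],            # AP = (1/2) / 1 = 0.5
]))  # MAP = 0.75
```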

models/match/match-pyramid/run.sh
Lines changed: 3 additions & 1 deletion

```diff
@@ -1,6 +1,8 @@
 #!/bin/bash
 echo "................run................."
 python -m paddlerec.run -m ./config.yaml &>result1.txt
-grep -i "prediction" ./result1.txt >./result.txt
+grep -i "prediction" ./result1.txt >./result2.txt
+sed '$d' result2.txt >result.txt
+rm -f result2.txt
 rm -f result1.txt
 python eval.py
```
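
The new sed '$d' stage deletes the last line of the grep output before eval.py reads result.txt, presumably dropping a trailing prediction line that does not belong with the per-batch test output. A Python sketch of what the grep-and-sed pipeline does:

```python
# equivalent of: grep -i "prediction" result1.txt | sed '$d' > result.txt
with open("result1.txt") as f:
    matches = [line for line in f if "prediction" in line.lower()]
with open("result.txt", "w") as out:
    out.writelines(matches[:-1])  # sed '$d' drops the final matched line
```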
