7 files changed: +20 −14

----
#Split into training and test sets
query_list = list(pos_dict.keys())
#print(len(query_list))
- random.shuffle(query_list)
+ # random.shuffle(query_list)
train_query = query_list[:11600]
test_query = query_list[11600:]

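The change above comments out the shuffle, so the split at index 11600 depends only on the (stable) insertion order of `pos_dict` and is identical across runs. A minimal sketch of that effect, using a made-up toy dict and a stand-in cutoff:

```python
# Sketch: deterministic train/test split once random.shuffle is disabled.
# pos_dict here is a hypothetical toy stand-in; 15 stands in for 11600.
pos_dict = {f"query_{i}": [f"doc_{i}"] for i in range(20)}
query_list = list(pos_dict.keys())

# With random.shuffle(query_list) commented out, repeated runs
# produce the same train/test partition.
cutoff = 15
train_query = query_list[:cutoff]
test_query = query_list[cutoff:]
```

Re-enabling the shuffle would randomize which queries land in each partition on every run, which makes reported metrics harder to reproduce.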
----
@@ -157,7 +157,7 @@ the test-set labels corresponding to label.txt
Change slice_end in hyper_parameters from 8 to 128. When you change the batch size, this parameter must change with it.
Change data_path under dataset_train to {workspace}/data/big_train
Change data_path under dataset_infer to {workspace}/data/big_test
- Change trigram_d in hyper_parameters to 6327
+ Change trigram_d in hyper_parameters to 5913

5. Run the script to start training. The script runs python -m paddlerec.run -m ./config.yaml to launch training and writes the output to the result file, then runs transform.py to consolidate the data, and finally computes the positive/negative pair-order metric:
```
----
@@ -106,7 +106,7 @@ def make_train():
        pair_list.append((d1, high_d2, low_d2))
    print('Pair Instance Count:', len(pair_list))

-   f = open("./data/train/train.txt", "w")
+   f = open("./data/big_train/train.txt", "w")
    for batch in range(800):
        X1 = np.zeros((batch_size * 2, data1_maxlen), dtype=np.int32)
        X2 = np.zeros((batch_size * 2, data2_maxlen), dtype=np.int32)
@@ -131,7 +131,7 @@ def make_train():
def make_test():
    rel = read_relation(filename=os.path.join(Letor07Path,
                        'relation.test.fold1.txt'))
-   f = open("./data/test/test.txt", "w")
+   f = open("./data/big_test/test.txt", "w")
    for label, d1, d2 in rel:
        X1 = np.zeros(data1_maxlen, dtype=np.int32)
        X2 = np.zeros(data2_maxlen, dtype=np.int32)
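For context, make_test above fills fixed-length int32 buffers with token ids, zero-padding the remainder. A minimal sketch of that padding pattern (the helper `pad_ids`, the sample ids, and the stand-in value of data1_maxlen are all hypothetical):

```python
import numpy as np

data1_maxlen = 10  # stand-in value; the real one comes from the model config

def pad_ids(ids, maxlen):
    # Copy up to maxlen token ids into a zero-initialized buffer,
    # mirroring how X1/X2 are prepared in make_test.
    buf = np.zeros(maxlen, dtype=np.int32)
    n = min(len(ids), maxlen)
    buf[:n] = ids[:n]
    return buf

X1 = pad_ids([5, 7, 9], data1_maxlen)
```

Sequences longer than maxlen are truncated, shorter ones are padded with the id 0, which the model treats as padding.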
----
echo "...........load data................."
wget --no-check-certificate 'https://paddlerec.bj.bcebos.com/match_pyramid/match_pyramid_data.tar.gz'
mv ./match_pyramid_data.tar.gz ./data
- rm -rf ./data/relation.test.fold1.txt ./data/realtion.train.fold1.txt
+ rm -rf ./data/relation.test.fold1.txt
tar -xvf ./data/match_pyramid_data.tar.gz
+ mkdir ./data/big_train
+ mkdir ./data/big_test
echo "...........data process..............."
python ./data/process.py
----
@@ -49,8 +49,8 @@ def eval_MAP(pred, gt):
pred = []
for line in open(filename):
    line = line.strip().split(",")
-   line[1] = line[1].split(":")
-   line = line[1][1].strip(" ")
+   line[3] = line[3].split(":")
+   line = line[3][1].strip(" ")
    line = line.strip("[")
    line = line.strip("]")
    pred.append(float(line))
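The index change above (1 → 3) selects a different comma-separated field of each result line before splitting on ":". A hedged sketch of that parsing chain on a made-up log line (the real format of result.txt may differ):

```python
# Hypothetical log line; the real result.txt layout may differ.
line = "epoch: 2, batch: 10, loss: 0.31, prediction: [0.4208]"

parts = line.strip().split(",")        # parts[3] == " prediction: [0.4208]"
field = parts[3].split(":")            # ["  prediction", " [0.4208]"]
value = field[1].strip(" ")            # "[0.4208]"
value = value.strip("[").strip("]")    # "0.4208"
pred = float(value)
```

If an extra field (such as the loss here) is inserted before the prediction, the old index 1 would point at the wrong field, which is what this fix addresses.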
----
4. Embedding file: we store the pretrained word vectors in an embedding file, e.g. embed_wiki-pdc_d50_norm

## Runtime environment
-PaddlePaddle>=1.7.2
-python 2.7/3.5/3.6/3.7
-PaddleRec >=0.1
-os : windows/linux/macos
+PaddlePaddle>=1.7.2
+python 2.7/3.5/3.6/3.7
+PaddleRec >=0.1
+os : windows/linux/macos

## Quick start

@@ -72,7 +72,7 @@ python -m paddlerec.run -m models/match/match-pyramid/config.yaml
## Reproducing the paper
1. Make sure your current directory is PaddleRec/models/match/match-pyramid
2. This project provides a download of the original dataset and a one-click preprocessing script that generates the training and test data; you can simply run: bash data_process.sh
- Running this script downloads the Letor07 dataset from a mirror server in China, deletes the existing relation.test.fold1.txt and relation.train.fold1.txt from the data folder, and extracts the full dataset into the data folder. It then runs process.py to place the full training data in `./data/train` and the full test data in `./data/test`, and generates the embedding.npy file used to initialize the embedding layer.
+ Running this script downloads the Letor07 dataset from a mirror server in China and extracts the full dataset into the data folder. It then runs process.py to place the full training data in `./data/big_train` and the full test data in `./data/big_test`, and generates the embedding.npy file used to initialize the embedding layer.
The expected output of the script is:
```
bash data_process.sh
@@ -123,6 +123,8 @@ data/embed_wiki-pdc_d50_norm
3. Open the file config.yaml and change its parameters

Change workspace to your current absolute path. (You can get the absolute path with the pwd command.)
+ Change the data_path parameter under dataset_train to {workspace}/data/big_train
+ Change the data_path parameter under dataset_infer to {workspace}/data/big_test

4. Then simply run: bash run.sh to reproduce the results reported in the paper
Running this script executes python -m paddlerec.run -m ./config.yaml to train and test the model, saves the test results to result.txt, and finally runs eval.py to evaluate and obtain the MAP metric on the data
@@ -131,7 +133,7 @@ data/embed_wiki-pdc_d50_norm
..............test.................
13651
336
- ('map=', 0.420878322843591)
+ ('map=', 0.3993127885738651)
```
## Advanced usage

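The eval_MAP function referenced in eval.py is not shown in this diff. As background for the map value printed above, mean average precision over a ranked list can be sketched generically like this (a standard AP implementation, not necessarily the one in this repo):

```python
def average_precision(scores, labels):
    # AP for one query: rank docs by score (descending), then average
    # the precision measured at each rank where a relevant doc appears.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ap = average_precision([0.9, 0.2, 0.8], [1, 0, 0])
```

MAP is then the mean of these per-query AP values, which is why the evaluation needs both the parsed predictions and the query grouping.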
----
#!/bin/bash
echo "................run................."
python -m paddlerec.run -m ./config.yaml &> result1.txt
- grep -i "prediction" ./result1.txt > ./result.txt
+ grep -i "prediction" ./result1.txt > ./result2.txt
+ sed '$d' result2.txt > result.txt
+ rm -f result2.txt
rm -f result1.txt
python eval.py
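The added sed '$d' step simply drops the final matched line before evaluation. An equivalent sketch of the grep-then-drop-last pipeline in Python (the sample log lines are hypothetical):

```python
# Hypothetical log contents; only the pipeline behavior matters.
lines = ["prediction: [0.1]", "noise", "prediction: [0.2]", "prediction: [0.9]"]

# grep -i "prediction" keeps case-insensitive matches;
# sed '$d' then deletes the last line of that output.
matches = [l for l in lines if "prediction" in l.lower()]
kept = matches[:-1]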