
Commit 47a2ea5

Integrate Neural Search models into Pipelines (#3172)
* Integrate Neural Search models into Pipelines
* Adjust the format
* Update Neural Search Recall and Upgrade docx for Pipelines
1 parent 920e1e8 commit 47a2ea5


43 files changed, with 816 additions and 245 deletions.

applications/neural_search/recall/in_batch_negative/README.md

Lines changed: 43 additions & 26 deletions
@@ -42,7 +42,7 @@ In-batch Negatives 策略的训练数据为语义相似的 Pair 对,策略核
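 
 ### Technical approach
 
-A two-tower model warm-started from ERNIE 1.0. The In-batch Negatives strategy is introduced during recall training, and an index is built with hnswlib for recall testing.
+A two-tower model. The In-batch Negatives strategy is introduced during recall training, and an index is built with hnswlib for recall testing.
 
 
 ### Evaluation metrics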
@@ -53,10 +53,10 @@ Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序
 
 **Evaluation results**
 
-| Model | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |Strategy summary|
+| Strategy | Model | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |
 | ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- |
-| In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| Inbatch-negative supervised training|
-
+| In-batch Negatives | ernie 1.0 | 51.301 | 65.309| 69.878| 73.996|78.881|
+| In-batch Negatives | rocketqa-zh-base-query-encoder | **59.622** | **75.089**| **79.668**| **83.404**|**87.773**|
 
 
 <a name="环境依赖"></a>
@@ -166,10 +166,10 @@ Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序
 
 |Model|Training parameter configuration|Hardware|MD5|
 | ------------ | ------------ | ------------ |-----------|
-|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x v100-16g</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|
+|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">ernie 1.0 margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x v100-16g</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|
 
-### Training environment
 
+### Training environment
 
 - NVIDIA Driver Version: 440.64.00
 - Ubuntu 16.04.6 LTS (Docker)
@@ -185,7 +185,7 @@ Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序
 Then run the following command to train on GPU and obtain the semantic indexing model:
 
 ```
-root_path=recall
+root_path=inbatch
 python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
 train_batch_neg.py \
 --device gpu \
@@ -194,11 +194,11 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
 --learning_rate 5E-5 \
 --epochs 3 \
 --output_emb_size 256 \
+--model_name_or_path rocketqa-zh-base-query-encoder \
 --save_steps 10 \
 --max_seq_length 64 \
 --margin 0.2 \
 --train_set_file recall/train.csv \
---evaluate \
 --recall_result_dir "recall_result_dir" \
 --recall_result_file "recall_result.txt" \
 --hnsw_m 100 \
@@ -217,6 +217,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
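 * `learning_rate`: learning rate used for training
 * `epochs`: number of training epochs
 * `output_emb_size`: dimension of the text vector output by the top Transformer layer
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
 * `save_steps`: number of steps between saved model checkpoints
 * `max_seq_length`: maximum length of the input sequence
 * `margin`: target gap between the positive-sample similarity and the negative-sample similarity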
@@ -234,7 +235,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
 Alternatively, use the bash script:
 
 ```
-sh scripts/train_batch_neg.sh
+sh scripts/train.sh
 ```
 
 
@@ -270,6 +271,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
 --recall_result_dir "recall_result_dir" \
 --recall_result_file "recall_result.txt" \
 --params_path "${root_dir}/model_40/model_state.pdparams" \
+--model_name_or_path rocketqa-zh-base-query-encoder \
 --hnsw_m 100 \
 --hnsw_ef 100 \
 --batch_size 64 \
@@ -280,16 +282,17 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
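 --corpus_file "recall/corpus.csv"
 ```
 Parameter descriptions
-* `device`: whether to train on cpu or gpu
-* `recall_result_dir`: directory where recall results are stored
-* `recall_result_file`: file name of the recall results
+* `device` whether to train on cpu or gpu
+* `recall_result_dir` directory where recall results are stored
+* `recall_result_file` file name of the recall results
 * `params_path`: parameter file of the model to evaluate
-* `hnsw_m`: hnsw algorithm parameter; keeping the default is fine
-* `hnsw_ef`: hnsw algorithm parameter; keeping the default is fine
-* `output_emb_size`: dimension of the text vector output by the top Transformer layer
-* `recall_num`: number of similar texts recalled for one text
-* `similar_text_pair`: evaluation set made up of similar text pairs
-* `corpus_file`: recall corpus data (corpus_file)
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
+* `hnsw_m`: hnsw algorithm parameter; keeping the default is fine
+* `hnsw_ef`: hnsw algorithm parameter; keeping the default is fine
+* `output_emb_size`: dimension of the text vector output by the top Transformer layer
+* `recall_num`: number of similar texts recalled for one text
+* `similar_text_pair`: evaluation set made up of similar text pairs
+* `corpus_file`: recall corpus data (corpus_file)
 
 Alternatively, use the bash script below:
 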
@@ -383,10 +386,11 @@ python inference.py
 ```
 root_dir="checkpoints/inbatch"
 
-python -u -m paddle.distributed.launch --gpus "3" \
+python -u -m paddle.distributed.launch --gpus "0" \
 predict.py \
 --device gpu \
 --params_path "${root_dir}/model_40/model_state.pdparams" \
+--model_name_or_path rocketqa-zh-base-query-encoder \
 --output_emb_size 256 \
 --batch_size 128 \
 --max_seq_length 64 \
@@ -396,6 +400,7 @@ python -u -m paddle.distributed.launch --gpus "3" \
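 Parameter descriptions
 * `device`: whether to train on cpu or gpu
 * `params_path`: parameter file of the pretrained model
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
 * `output_emb_size`: dimension of the text vector output by the top Transformer layer
 * `text_pair_file`: dataset of text pairs to run prediction on
 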
@@ -423,7 +428,9 @@ predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本
 First convert the dynamic-graph model to a static graph:
 
 ```
-python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams --output_path=./output
+python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \
+--model_name_or_path rocketqa-zh-base-query-encoder \
+--output_path=./output
 ```
 Alternatively, run the bash script below:
 
@@ -449,7 +456,9 @@ corpus_list=[['中西方语言与文化的差异','中西方文化差异以及
 Then use PaddleInference
 
 ```
-python deploy/python/predict.py --model_dir=./output
+python deploy/python/predict.py \
+--model_dir=./output \
+--model_name_or_path rocketqa-zh-base-query-encoder
 ```
 Alternatively, run the bash script below:
 
@@ -501,9 +510,16 @@ Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,
 
 #### Pipeline mode
 
-Start the Pipeline Server:
+Update the `Tokenizer` that the model needs
+
+```
+self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder")
+```
+
+Then start the Pipeline Server:
 
 ```
+cd deploy/python
 python web_service.py
 ```
 
@@ -520,7 +536,7 @@ list_data = [
 Then run:
 
 ```
-python rpc_client.py
+python deploy/python/rpc_client.py
 ```
 The model output is:
 
@@ -547,12 +563,12 @@ python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_i
 Alternatively, use the script:
 
 ```
-sh deploy/C++/start_server.sh
+sh deploy/cpp/start_server.sh
 ```
 The client can use either http or rpc. The rpc approach is:
 
 ```
-python deploy/C++/rpc_client.py
+python deploy/cpp/rpc_client.py
 ```
 The output is:
 ```
@@ -571,7 +587,7 @@ time to cost :0.3960278034210205 seconds
 Or use the http client access mode:
 
 ```
-python deploy/C++/http_client.py
+python deploy/cpp/http_client.py
 ```
 The output is:
 
@@ -599,6 +615,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
 train_batch_neg.py \
 --device gpu \
 --save_dir ./checkpoints/simcse_inbatch_negative \
+--model_name_or_path rocketqa-zh-base-query-encoder \
 --batch_size 64 \
 --learning_rate 5E-5 \
 --epochs 3 \
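
The README hunks above pass `--margin 0.2` and list `scale:30` in the training configuration. As a rough illustration of how those two values usually enter an in-batch negatives objective, here is a minimal pure-Python sketch. It assumes the common formulation (cosine similarity matrix, margin subtracted from the positive pair, logits multiplied by the scale, cross-entropy against the diagonal). The function names are hypothetical and this is not taken from `train_batch_neg.py`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_negative_loss(query_embs, title_embs, margin=0.2, scale=30.0):
    """Mean cross-entropy where query i's positive is title i.

    Every other title in the batch serves as a negative for query i.
    Assumed formulation for illustration only.
    """
    queries = [l2_normalize(q) for q in query_embs]
    titles = [l2_normalize(t) for t in title_embs]
    total = 0.0
    for i, query in enumerate(queries):
        logits = []
        for j, title in enumerate(titles):
            sim = sum(a * b for a, b in zip(query, title))
            if i == j:
                sim -= margin  # penalise the positive pair by the margin
            logits.append(sim * scale)
        # Numerically stable log-sum-exp, then negative log-likelihood of column i.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum - logits[i]
    return total / len(queries)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = in_batch_negative_loss(aligned, aligned)
loss_bad = in_batch_negative_loss(aligned, swapped)
print(loss_good < loss_bad)
```

With a larger batch, each query sees more negatives at no extra encoding cost, which is the usual reason for pairing this objective with a large `batch_size`.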

applications/neural_search/recall/in_batch_negative/deploy/C++/http_client.py renamed to applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ def convert_example(example,
 print(fetch_names)
 
 # Create the tokenizer
-tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
 max_seq_len = 64
 
 # Data preprocessing

applications/neural_search/recall/in_batch_negative/deploy/C++/rpc_client.py renamed to applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ def convert_example(example,
 print(fetch_names)
 
 # Create the tokenizer
-tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
 max_seq_len = 64
 
 # Data preprocessing
File renamed without changes.

applications/neural_search/recall/in_batch_negative/deploy/python/predict.py

Lines changed: 17 additions & 18 deletions
@@ -40,7 +40,7 @@
     help="Batch size per GPU/CPU for training.")
 parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu",
     help="Select which device to train model, defaults to gpu.")
-
+parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.")
 parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False],
     help='Enable to use tensorrt to speed up.')
 parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
@@ -156,22 +156,21 @@ def __init__(self,
 if args.benchmark:
     import auto_log
     pid = os.getpid()
-    self.autolog = auto_log.AutoLogger(model_name="ernie-3.0-medium-zh",
-        model_precision=precision,
-        batch_size=self.batch_size,
-        data_shape="dynamic",
-        save_path=args.save_log_path,
-        inference_config=config,
-        pids=pid,
-        process_name=None,
-        gpu_ids=0,
-        time_keys=[
-            'preprocess_time',
-            'inference_time',
-            'postprocess_time'
-        ],
-        warmup=0,
-        logger=logger)
+    self.autolog = auto_log.AutoLogger(
+        model_name=args.model_name_or_path,
+        model_precision=precision,
+        batch_size=self.batch_size,
+        data_shape="dynamic",
+        save_path=args.save_log_path,
+        inference_config=config,
+        pids=pid,
+        process_name=None,
+        gpu_ids=0,
+        time_keys=[
+            'preprocess_time', 'inference_time', 'postprocess_time'
+        ],
+        warmup=0,
+        logger=logger)
 
 def extract_embedding(self, data, tokenizer):
     """
"""
@@ -279,7 +278,7 @@ def predict(self, data, tokenizer):
279278

280279
# ErnieTinyTokenizer is special for ernie-tiny pretained model.
281280
output_emb_size = 256
282-
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
281+
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
283282
id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
284283
corpus_list = [{idx: text} for idx, text in id2corpus.items()]
285284
res = predictor.extract_embedding(corpus_list, tokenizer)

applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py

Lines changed: 3 additions & 2 deletions
@@ -40,7 +40,8 @@ class ErnieOp(Op):
 
 def init_op(self):
     from paddlenlp.transformers import AutoTokenizer
-    self.tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
+    self.tokenizer = AutoTokenizer.from_pretrained(
+        "rocketqa-zh-base-query-encoder")
 
 def preprocess(self, input_dicts, data_id, log_id):
     from paddlenlp.data import Stack, Tuple, Pad
4647
from paddlenlp.data import Stack, Tuple, Pad
@@ -56,7 +57,7 @@ def preprocess(self, input_dicts, data_id, log_id):
5657
batchify_fn = lambda samples, fn=Tuple(
5758
Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
5859
), # input
59-
Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
60+
Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"
6061
), # segment
6162
): fn(samples)
6263
input_ids, segment_ids = batchify_fn(examples)
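
The second `web_service.py` hunk changes the pad value of the segment field from `pad_token_id` to `pad_token_type_id`. The sketch below shows why the two fields want separate pad values. It is a plain-Python stand-in for the batching step, not the `paddlenlp.data.Pad` implementation, and the pad ids are hypothetical values chosen to differ so that a mix-up would be visible.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the longest length in the batch."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (width - len(seq)) for seq in sequences]


# Hypothetical ids for illustration; real values come from the tokenizer.
PAD_TOKEN_ID = 1
PAD_TOKEN_TYPE_ID = 0

examples = [
    ([101, 7, 8, 102], [0, 0, 0, 0]),  # (input_ids, segment_ids)
    ([101, 9, 102], [0, 0, 0]),
]

input_ids = pad_batch([ids for ids, _ in examples], PAD_TOKEN_ID)
segment_ids = pad_batch([seg for _, seg in examples], PAD_TOKEN_TYPE_ID)

print(input_ids)
print(segment_ids)
```

When the tokenizer happens to use the same id for both pads, the old code behaves identically. The change matters for tokenizers where the two ids differ.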

applications/neural_search/recall/in_batch_negative/evaluate.py

Lines changed: 0 additions & 3 deletions
@@ -76,8 +76,6 @@ def recall(rs, N=10):
     relevance_labels.append(1)
 else:
     relevance_labels.append(0)
-# print(len(rs))
-# print(rs[:50])
 
 recall_N = []
 recall_num = [1, 5, 10, 20, 50]
@@ -92,4 +90,3 @@ def recall(rs, N=10):
 print('recall@{}={}'.format(key, val))
 res.append(str(val))
 result.write('\t'.join(res) + '\n')
-# print("\t".join(recall_N))
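
`evaluate.py` builds a list of 0/1 `relevance_labels` per query and reports recall at `recall_num = [1, 5, 10, 20, 50]`. A minimal sketch of that metric follows, assuming each query has at most one relevant text in the corpus. The helper name `recall_at_k` and the toy data are hypothetical and not part of the script.

```python
def recall_at_k(relevance_lists, k):
    """Fraction of queries whose top-k ranked results contain the relevant text.

    Each inner list holds 0/1 labels for one query's ranked results and is
    assumed to contain at most one 1.
    """
    hits = [1 if any(labels[:k]) else 0 for labels in relevance_lists]
    return sum(hits) / len(hits)


ranked = [
    [1, 0, 0, 0, 0],  # relevant text ranked first
    [0, 0, 1, 0, 0],  # relevant text ranked third
    [0, 0, 0, 0, 0],  # relevant text not retrieved
]

for k in (1, 3, 5):
    print("recall@{}={:.3f}".format(k, recall_at_k(ranked, k)))
```

The README table appears to report the same kind of quantity on a 0 to 100 scale.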

applications/neural_search/recall/in_batch_negative/export_model.py

Lines changed: 3 additions & 2 deletions
@@ -28,15 +28,16 @@
 parser = argparse.ArgumentParser()
 parser.add_argument("--params_path", type=str, required=True,
     default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
+parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select model to train, defaults to rocketqa-zh-base-query-encoder.")
 parser.add_argument("--output_path", type=str, default='./output',
     help="The path of model parameter in static graph to be saved.")
 args = parser.parse_args()
 # yapf: enable
 
 if __name__ == "__main__":
     output_emb_size = 256
-    pretrained_model = AutoModel.from_pretrained("ernie-1.0")
-    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
+    pretrained_model = AutoModel.from_pretrained(args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
     model = SemanticIndexBaseStatic(pretrained_model,
                                     output_emb_size=output_emb_size)
     if args.params_path and os.path.isfile(args.params_path):
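
Before this commit the scripts hard-coded different names (`ernie-1.0` in `export_model.py` and `web_service.py`, `ernie-3.0-medium-zh` in `inference.py` and the clients). The hunk above moves `export_model.py` to a single `--model_name_or_path` flag. The sketch below shows that pattern with `argparse` alone. `resolve_names` is a hypothetical helper that stands in for the real `AutoModel` and `AutoTokenizer` calls, so that it runs without PaddleNLP installed.

```python
import argparse


def build_parser():
    """Argument parser mirroring the three flags shown in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--params_path", type=str, required=True)
    parser.add_argument("--model_name_or_path", type=str,
                        default="rocketqa-zh-base-query-encoder")
    parser.add_argument("--output_path", type=str, default="./output")
    return parser


def resolve_names(argv):
    """Return the one name that both the model and tokenizer loaders would receive."""
    args = build_parser().parse_args(argv)
    return {"model": args.model_name_or_path,
            "tokenizer": args.model_name_or_path}


default_names = resolve_names(["--params_path", "model_state.pdparams"])
override_names = resolve_names(["--params_path", "model_state.pdparams",
                                "--model_name_or_path", "ernie-1.0"])
print(default_names)
print(override_names)
```

Reading the name once and passing it to both loaders keeps the tokenizer vocabulary and the model weights from drifting apart when a checkpoint is swapped.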

applications/neural_search/recall/in_batch_negative/inference.py

Lines changed: 3 additions & 2 deletions
@@ -26,9 +26,10 @@
 batch_size = 1
 params_path = 'checkpoints/inbatch/model_40/model_state.pdparams'
 id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
+model_name_or_path = "rocketqa-zh-base-query-encoder"
 paddle.set_device(device)
 
-tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
 trans_func = partial(convert_example,
                      tokenizer=tokenizer,
                      max_seq_length=max_seq_length)
@@ -38,7 +39,7 @@
     Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # text_segment
 ): [data for data in fn(samples)]
 
-pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
+pretrained_model = AutoModel.from_pretrained(model_name_or_path)
 
 model = SemanticIndexBaseStatic(pretrained_model,
                                 output_emb_size=output_emb_size)
