Commit 6213573

Authored by w5688414, ZeyuChen, and tianxin
Update neural search readme and Add Paddle Serving Support (#1558)
* add recall inference similarity
* update examples
* update readme
* update dir name
* update neural search readme
* update milvus readme
* update domain adaptive pretraining readme
* fix the mistakes
* update readme
* add recall Paddle Serving Support
* update readme
* update readme and format the code
* reformat the files
* move the files
* reformat the code
* remove redundant code

Co-authored-by: Zeyu Chen <[email protected]>
Co-authored-by: tianxin <[email protected]>
1 parent 2a620ba commit 6213573

File tree

100 files changed: +1240 −866 lines. Large commits have some content hidden by default; only a subset of the changed files is shown below.

applications/neural_search/recall/in_batch_negative/README.md

Lines changed: 77 additions & 7 deletions
@@ -91,15 +91,20 @@ Recall@K is the recall rate over the top-K predictions (top-k meaning, of the final score-ranked…
     |—— export_model.py  # convert the dynamic graph model to a static graph
     |—— scripts
         |—— export_model.sh  # script for converting the dynamic graph model to a static graph
         |—— predict.sh  # prediction (bash version)
         |—— evaluate.sh  # evaluation (bash version)
         |—— run_build_index.sh  # index building (bash version)
         |—— train_batch_neg.sh  # training (bash version)
+        |—— export_to_serving.sh  # bash script for converting Paddle Inference models to Serving format
     |—— deploy
         |—— python
             |—— predict.py  # Paddle Inference prediction
             |—— deploy.sh  # Paddle Inference deployment script
+            |—— rpc_client.py  # Paddle Serving client side
+            |—— web_service.py  # Paddle Serving server side
+            |—— config_nlp.yml  # Paddle Serving configuration file
     |—— inference.py  # extract embeddings with the dynamic graph model
+    |—— export_to_serving.py  # convert the static graph model to Serving format

 ```

@@ -237,7 +242,7 @@ c. Get the query embedding and retrieve similar results

 d. Evaluation

-Compute the evaluation metric Recall@K (K = 1, 5, 10, 20, 50) from the evaluation set `same_semantic.tsv` and the recall results in `recall_result`.
+Compute the evaluation metric Recall@K (K = 1, 5, 10, 20, 50) from the evaluation set `dev.csv` and the recall results in `recall_result`.

 Run the following command to build the ANN index, perform recall, and produce the recall result file `recall_result`:

@@ -267,7 +272,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
 * `hnsw_ef`: HNSW algorithm parameter; the default is fine
 * `output_emb_size`: dimension of the text embedding output by the top Transformer layer
 * `recall_num`: number of similar texts recalled for each text
-* `similar_text_pair`: evaluation set made up of similar text pairs, semantic_similar_pair.tsv
+* `similar_text_pair`: evaluation set made up of similar text pairs
 * `corpus_file`: recall corpus data

 You can also use the following bash script:
@@ -447,6 +452,71 @@ sh deploy.sh
 [0.959269642829895, 0.04725276678800583]
 ```

+### Paddle Serving Deployment
+
+For detailed Paddle Serving documentation, see [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md) and [Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md). First, convert the static graph model into the Serving format:
+
+```
+python export_to_serving.py \
+    --dirname "output" \
+    --model_filename "inference.get_pooled_embedding.pdmodel" \
+    --params_filename "inference.get_pooled_embedding.pdiparams" \
+    --server_path "./serving_server" \
+    --client_path "./serving_client" \
+    --fetch_alias_names "output_embedding"
+```
+
+Parameter descriptions:
+
+* `dirname`: path of the model files to be converted; both the Program structure file and the parameter files are stored in this directory.
+* `model_filename`: name of the file storing the Inference Program structure of the model to be converted. If set to None, `__model__` is used as the default file name.
+* `params_filename`: name of the file storing all model parameters. It must be specified if and only if all parameters are saved in a single binary file; if the parameters are stored in separate files, set it to None.
+* `server_path`: output path for the converted server-side model and configuration files. Defaults to serving_server.
+* `client_path`: output path for the converted client-side configuration files. Defaults to serving_client.
+* `fetch_alias_names`: aliases for the model outputs; outputs such as output_embedding can be renamed. Not set by default.
+* `feed_alias_names`: aliases for the model inputs; inputs such as input_ids can be renamed. Not set by default.
+
+You can also run the following bash script:
+
+```
+sh scripts/export_to_serving.sh
+```
+
+Then start the server:
+
+```
+python web_service.py
+```
+
+Next, start the client to call the server. First, edit the samples to be predicted in rpc_client.py:
+
+```
+list_data = [
+    "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据",
+    "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"
+]
+```
+
+Then run:
+
+```
+python rpc_client.py
+```
+
+The model output is:
+
+```
+{'0': '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据', '1': '试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比'}
+PipelineClient::predict pack_data time:1641450851.3752182
+PipelineClient::predict before time:1641450851.375738
+['output_embedding']
+(2, 256)
+[[ 0.07830612 -0.14036864  0.03433796 -0.14967982 -0.03386067  0.06630666
+   0.01357943  0.03531194  0.02411093  0.02000859  0.05724002 -0.08119463
+......
+```
+
+The client sent two texts and received two embedding vectors in return.
+
 ## Reference

 [1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020.
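
The body of `export_to_serving.py` is among the files hidden in this commit view. A minimal sketch of what such a conversion script can look like, assuming it wraps `paddle_serving_client.io.inference_model_to_serving` and that the installed paddle-serving-client accepts the feed/fetch alias keyword arguments documented in the README above:

```python
# Hypothetical sketch of export_to_serving.py, mirroring the flags documented
# above; verify the conversion API against your paddle-serving-client version.
import argparse

import paddle_serving_client.io as serving_io

parser = argparse.ArgumentParser()
parser.add_argument("--dirname", type=str, required=True)
parser.add_argument("--model_filename", type=str, default=None)
parser.add_argument("--params_filename", type=str, default=None)
parser.add_argument("--server_path", type=str, default="./serving_server")
parser.add_argument("--client_path", type=str, default="./serving_client")
parser.add_argument("--feed_alias_names", type=str, default=None)
parser.add_argument("--fetch_alias_names", type=str, default=None)
args = parser.parse_args()

# Convert the static graph model into paired server/client Serving configs.
feed_names, fetch_names = serving_io.inference_model_to_serving(
    dirname=args.dirname,
    serving_server=args.server_path,
    serving_client=args.client_path,
    model_filename=args.model_filename,
    params_filename=args.params_filename,
    feed_alias_names=args.feed_alias_names,
    fetch_alias_names=args.fetch_alias_names)
print("model feed_names : %s" % (feed_names, ))
print("model fetch_names : %s" % (fetch_names, ))
```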

applications/neural_search/recall/in_batch_negative/base_model.py

Lines changed: 4 additions & 4 deletions
@@ -28,8 +28,8 @@ def __init__(self, pretrained_model, dropout=None, output_emb_size=None):
         self.ptm = pretrained_model
         self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

-        # if output_emb_size is not None, then add Linear layer to reduce embedding_size,
-        # we recommend set output_emb_size = 256 considering the trade-off beteween
+        # if output_emb_size is not None, then add Linear layer to reduce embedding_size,
+        # we recommend set output_emb_size = 256 considering the trade-off between
         # recall performance and efficiency

         self.output_emb_size = output_emb_size
@@ -105,8 +105,8 @@ def __init__(self, pretrained_model, dropout=None, output_emb_size=None):
         self.ptm = pretrained_model
         self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

-        # if output_emb_size is not None, then add Linear layer to reduce embedding_size,
-        # we recommend set output_emb_size = 256 considering the trade-off beteween
+        # if output_emb_size is not None, then add Linear layer to reduce embedding_size,
+        # we recommend set output_emb_size = 256 considering the trade-off between
         # recall performance and efficiency

         self.output_emb_size = output_emb_size
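
The comment restored above describes an optional projection head. A hedged sketch of the pattern, with the 768 hidden size assumed for ernie-1.0 and the class name chosen for illustration:

```python
import paddle.nn as nn


class SemanticIndexSketch(nn.Layer):
    """Illustrative encoder wrapper; not the file's actual class."""

    def __init__(self, pretrained_model, dropout=None, output_emb_size=None):
        super().__init__()
        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
        self.output_emb_size = output_emb_size
        if output_emb_size is not None:
            # Project the 768-dim pooled output down to output_emb_size
            # (256 recommended) so the ANN index stores smaller vectors.
            self.emb_reduce_linear = nn.Linear(768, output_emb_size)

    def get_pooled_embedding(self, input_ids, token_type_ids=None):
        _, cls_embedding = self.ptm(input_ids, token_type_ids)
        cls_embedding = self.dropout(cls_embedding)
        if self.output_emb_size is not None:
            cls_embedding = self.emb_reduce_linear(cls_embedding)
        return cls_embedding
```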

applications/neural_search/recall/in_batch_negative/data.py

Lines changed: 1 addition & 3 deletions
@@ -13,9 +13,7 @@
 # limitations under the License.

 import os
-
 import paddle
-
 from paddlenlp.utils.log import logger


@@ -47,7 +45,7 @@ def convert_example(example,
                     pad_to_max_seq_len=False):
     """
     Builds model inputs from a sequence.
-
+
     A BERT sequence has the following format:

     - single sequence: ``[CLS] X [SEP]``
applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each building its own gRPC server and DAG.
+# When build_dag_each_worker=False, the framework sets max_workers=worker_num for the gRPC thread pool of the main thread.
+worker_num: 20
+# build_dag_each_worker: False, the framework builds one DAG inside the process; True, the framework builds multiple independent DAGs per process.
+build_dag_each_worker: false
+
+dag:
+    # op resource type: True for the thread model, False for the process model
+    is_thread_op: False
+    # profiling: True generates Timeline performance data, at some cost to performance; False disables it
+    tracer:
+        interval_s: 10
+# HTTP port. rpc_port and http_port must not both be empty. When rpc_port is available and http_port is empty, no http_port is generated automatically.
+http_port: 18082
+# RPC port. When rpc_port is empty and http_port is not, rpc_port is automatically set to http_port + 1.
+rpc_port: 8080
+op:
+    ernie:
+        # concurrency: thread-level concurrency when is_thread_op=True, otherwise process-level concurrency
+        concurrency: 1
+        # when the op config has no server_endpoints, the local service config is read from local_service_conf
+        local_service_conf:
+            # client type: brpc, grpc, or local_predictor. local_predictor runs prediction in-process without starting a Serving service.
+            client_type: local_predictor
+            # device_type: 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 1
+            # device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on the listed cards
+            devices: '2'
+            # list of results to fetch, keyed by the alias_name of fetch_var in client_config; if unset, all outputs are returned
+            fetch_list: ['output_embedding']
+            # model path
+            model_config: ../../serving_server/
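
With `http_port: 18082` set above, the pipeline is also reachable over HTTP in addition to the RPC client shown elsewhere in this commit; a hedged check, assuming Paddle Serving's usual `/<service_name>/prediction` pipeline route for the `ernie` service:

```python
# Query the pipeline's HTTP endpoint directly; the route follows the service
# name ("ernie") and the key/value payload convention of pipeline serving.
import json

import requests

url = "http://127.0.0.1:18082/ernie/prediction"
data = {"key": ["0"], "value": ["国有企业引入非国有资本对创新绩效的影响"]}
resp = requests.post(url, data=json.dumps(data))
print(resp.json())
```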

applications/neural_search/recall/in_batch_negative/deploy/python/predict.py

Lines changed: 18 additions & 12 deletions
@@ -230,11 +230,13 @@ def predict(self, data, tokenizer):
             self.autolog.times.start()

         examples = []
-        for idx,text in enumerate(data):
-            input_ids, segment_ids = convert_example(
-                {idx:text[0]}, tokenizer)
-            title_ids,title_segment_ids=convert_example({idx:text[1]},tokenizer)
-            examples.append((input_ids, segment_ids,title_ids,title_segment_ids))
+        for idx, text in enumerate(data):
+            input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer)
+            title_ids, title_segment_ids = convert_example({
+                idx: text[1]
+            }, tokenizer)
+            examples.append(
+                (input_ids, segment_ids, title_ids, title_segment_ids))

         batchify_fn = lambda samples, fn=Tuple(
             Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
@@ -246,7 +248,8 @@ def predict(self, data, tokenizer):
         if args.benchmark:
             self.autolog.times.stamp()

-        query_ids, query_segment_ids,title_ids, title_segment_ids = batchify_fn(examples)
+        query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(
+            examples)
         self.input_handles[0].copy_from_cpu(query_ids)
         self.input_handles[1].copy_from_cpu(query_segment_ids)
         self.predictor.run()
@@ -259,10 +262,13 @@ def predict(self, data, tokenizer):

         if args.benchmark:
             self.autolog.times.stamp()
-
+
         if args.benchmark:
             self.autolog.times.end(stamp=True)
-        result=[float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)]
+        result = [
+            float(1 - spatial.distance.cosine(arr1, arr2))
+            for arr1, arr2 in zip(query_logits, title_logits)
+        ]
         return result
@@ -277,10 +283,10 @@ def predict(self, data, tokenizer):
     tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')
     id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
     corpus_list = [{idx: text} for idx, text in id2corpus.items()]
-    res=predictor.extract_embedding(corpus_list, tokenizer)
+    res = predictor.extract_embedding(corpus_list, tokenizer)
     print(res.shape)
     print(res)
-    corpus_list=[['中西方语言与文化的差异','中西方文化差异以及语言体现中西方文化,差异,语言体现'],
-                ['中西方语言与文化的差异','飞桨致力于让深度学习技术的创新与应用更简单']]
-    res=predictor.predict(corpus_list,tokenizer)
+    corpus_list = [['中西方语言与文化的差异', '中西方文化差异以及语言体现中西方文化,差异,语言体现'],
+                   ['中西方语言与文化的差异', '飞桨致力于让深度学习技术的创新与应用更简单']]
+    res = predictor.predict(corpus_list, tokenizer)
     print(res)
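
The reformatted `result` expression in the third hunk computes cosine similarity pair by pair through SciPy. For reference, an equivalent vectorized form in plain NumPy (a sketch, not part of the commit):

```python
import numpy as np


def batch_cosine_similarity(query_embs, title_embs):
    """Row-wise cosine similarity between two (batch, dim) float arrays."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = title_embs / np.linalg.norm(title_embs, axis=1, keepdims=True)
    return (q * t).sum(axis=1).tolist()
```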
applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle_serving_server.pipeline import PipelineClient
+import numpy as np
+
+client = PipelineClient()
+client.connect(['127.0.0.1:8080'])
+
+list_data = [
+    "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据",
+    "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"
+]
+feed = {}
+for i, item in enumerate(list_data):
+    feed[str(i)] = item
+
+print(feed)
+ret = client.predict(feed_dict=feed)
+# print(ret)
+result = np.array(eval(ret.value[0]))
+print(ret.key)
+print(result.shape)
+print(result)
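
One note on the client above: `ret.value[0]` carries the embedding matrix as a stringified Python list, which the script parses back with `eval`. Assuming the same string format, `ast.literal_eval` is a safer drop-in:

```python
import ast

import numpy as np

# literal_eval accepts only Python literals, so a malformed or hostile
# payload raises an error instead of executing arbitrary code as eval can.
result = np.array(ast.literal_eval(ret.value[0]))
```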
applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import numpy as np
+import sys
+
+from paddle_serving_server.web_service import WebService, Op
+
+_LOGGER = logging.getLogger()
+
+
+def convert_example(example,
+                    tokenizer,
+                    max_seq_length=512,
+                    pad_to_max_seq_len=False):
+    result = []
+    for text in example:
+        encoded_inputs = tokenizer(
+            text=text,
+            max_seq_len=max_seq_length,
+            pad_to_max_seq_len=pad_to_max_seq_len)
+        input_ids = encoded_inputs["input_ids"]
+        token_type_ids = encoded_inputs["token_type_ids"]
+        result += [input_ids, token_type_ids]
+    return result
+
+
+class ErnieOp(Op):
+    def init_op(self):
+        import paddlenlp as ppnlp
+        self.tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(
+            'ernie-1.0')
+
+    def preprocess(self, input_dicts, data_id, log_id):
+        from paddlenlp.data import Stack, Tuple, Pad
+
+        (_, input_dict), = input_dicts.items()
+        print("input dict", input_dict)
+        batch_size = len(input_dict.keys())
+        examples = []
+        for i in range(batch_size):
+            input_ids, segment_ids = convert_example([input_dict[str(i)]],
+                                                     self.tokenizer)
+            examples.append((input_ids, segment_ids))
+        batchify_fn = lambda samples, fn=Tuple(
+            Pad(axis=0, pad_val=self.tokenizer.pad_token_id),  # input
+            Pad(axis=0, pad_val=self.tokenizer.pad_token_id),  # segment
+        ): fn(samples)
+        input_ids, segment_ids = batchify_fn(examples)
+        feed_dict = {}
+        feed_dict['input_ids'] = input_ids
+        feed_dict['token_type_ids'] = segment_ids
+        return feed_dict, False, None, ""
+
+    def postprocess(self, input_dicts, fetch_dict, data_id, log_id):
+        new_dict = {}
+        new_dict["output_embedding"] = str(fetch_dict["output_embedding"]
+                                           .tolist())
+        return new_dict, None, ""
+
+
+class ErnieService(WebService):
+    def get_pipeline_response(self, read_op):
+        ernie_op = ErnieOp(name="ernie", input_ops=[read_op])
+        return ernie_op
+
+
+ernie_service = ErnieService(name="ernie")
+ernie_service.prepare_pipeline_config("config_nlp.yml")
+ernie_service.run_service()
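
The `convert_example` helper in this file can be sanity-checked without starting the service; a small hedged usage sketch (the sample text is arbitrary, and `convert_example` is assumed to be importable or pasted alongside):

```python
import paddlenlp as ppnlp

# Tokenize one text the way ErnieOp.preprocess does and inspect the
# [input_ids, token_type_ids] pair that convert_example returns.
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')
input_ids, token_type_ids = convert_example(["中西方语言与文化的差异"], tokenizer)
print(input_ids)
print(token_type_ids)
```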

applications/neural_search/recall/in_batch_negative/evaluate.py

Lines changed: 6 additions & 3 deletions
@@ -22,9 +22,12 @@

 # yapf: disable
 parser = argparse.ArgumentParser()
-parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file")
-parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file")
-parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query")
+parser.add_argument("--similar_text_pair", type=str,
+                    default='', help="The full path of the similar pair file")
+parser.add_argument("--recall_result_file", type=str,
+                    default='', help="The full path of the recall result file")
+parser.add_argument("--recall_num", type=int, default=10,
+                    help="Number of most similar docs recalled from the corpus per query")


 args = parser.parse_args()
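
For orientation, the metric behind these flags is standard Recall@K: a query scores a hit when its labeled similar text appears among the top-K recalled candidates. A minimal sketch, with file parsing omitted and dictionary inputs assumed:

```python
def recall_at_k(similar_pairs, recalled, k):
    """similar_pairs: {query: positive_text}; recalled: {query: ranked texts}."""
    hits = sum(positive in recalled.get(query, [])[:k]
               for query, positive in similar_pairs.items())
    return hits / len(similar_pairs)
```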

applications/neural_search/recall/in_batch_negative/export_model.py

Lines changed: 4 additions & 2 deletions
@@ -26,8 +26,10 @@

 # yapf: disable
 parser = argparse.ArgumentParser()
-parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
-parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.")
+parser.add_argument("--params_path", type=str, required=True,
+                    default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
+parser.add_argument("--output_path", type=str, default='./output',
+                    help="The path of model parameter in static graph to be saved.")
 args = parser.parse_args()
 # yapf: enable
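
For context, the dynamic-to-static conversion that `export_model.py` drives centers on `paddle.jit.to_static` with variable-shape `InputSpec`s; a hedged sketch, where the trained `model` is assumed to be loaded from `--params_path` and the input specs are inferred from the exported file names seen earlier (`inference.get_pooled_embedding.pdmodel`):

```python
import os

import paddle

model.eval()
# Trace the encoder with batch size and sequence length left dynamic.
static_model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # token_type_ids
    ])
paddle.jit.save(static_model, os.path.join(args.output_path, "inference"))
```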
