Commit a3ac5a9

Update UIE model and fix NPTag preprocess bugs (#2138)
* Update UIE model and taskflow.md
* Update taskflow.md
* Update taskflow.md
* Update README.md
1 parent 1ab0e48 commit a3ac5a9

9 files changed (+103, -63 lines changed)
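docs/model_zoo/taskflow.md

Lines changed: 48 additions & 21 deletions

@@ -32,7 +32,7 @@ PaddleNLP provides **out-of-the-box** industrial-grade preset NLP task capabilities, no training required
 | [Part-of-speech tagging](#词性标注) | `Taskflow("pos_tagging")` |||||| Built on LAC, Baidu's cutting-edge lexical analysis tool |
 | [Named entity recognition](#命名实体识别) | `Taskflow("ner")` |||||| Most comprehensive coverage of Chinese entity labels |
 | [Dependency parsing](#依存句法分析) | `Taskflow("dependency_parsing")` |||| || DDParser, developed on the largest Chinese dependency treebank |
-| [Information extraction](#信息抽取) | `Taskflow("information_extraction")`|||| || Open-domain universal information extraction tool adaptable to many scenarios |
+| [Information extraction](#信息抽取) | `Taskflow("information_extraction")`|||||| Open-domain universal information extraction tool adaptable to many scenarios |
 | ["Jieyu" knowledge tagging](#解语知识标注) | `Taskflow("knowledge_mining")` |||||| Knowledge tagging tool covering all Chinese vocabulary |
 | [Text correction](#文本纠错) | `Taskflow("text_correction")` |||||| ERNIE-CSC, an end-to-end text correction model that incorporates pinyin features |
 | [Text similarity](#文本相似度) | `Taskflow("text_similarity")` |||| | | Trained on 22 million similar-sentence pairs from Baidu Zhidao |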
@@ -422,15 +422,15 @@ from paddlenlp import Taskflow
 >>> ie = Taskflow('information_extraction', schema=schema)
 >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint
 [{'时间': [{'end': 6,
-          'probability': 0.9907337794563702,
+          'probability': 0.9857378532924486,
           'start': 0,
           'text': '2月8日上午'}],
   '赛事名称': [{'end': 23,
-            'probability': 0.8944205558197353,
+            'probability': 0.8503089953268272,
             'start': 6,
             'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
   '选手': [{'end': 31,
-          'probability': 0.8914297225026147,
+          'probability': 0.8981548639781138,
           'start': 28,
           'text': '谷爱凌'}]}]
 ```
@@ -481,21 +481,48 @@ from paddlenlp import Taskflow

 Opinion extraction from reviews means extracting the evaluation aspects and opinion words contained in a text.

-For example, if the target is the evaluation aspects in the text and their corresponding opinion words, the schema is constructed as follows:
+For example, if the target is the evaluation aspects in the text together with their corresponding opinion words and sentiment polarity, the schema is constructed as follows:

 ```text
-{'评价维度': '观点词'}
+{'评价维度': ['观点词', '情感倾向[正向,负向]']}
 ```

-Opinion extraction uses `评价维度` and `观点词` as the schema by default.
-
 Prediction:

 ```python
->>> schema = {'评价维度': '观点词'} # Define the schema for opinion extraction
+>>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction
 >>> ie.set_schema(schema) # Reset schema
->>> ie('个人觉得管理太混乱了,票价太高了')
-[{'评价维度': [{'text': '管理', 'start': 4, 'end': 6, 'probability': 0.8902373594544031, 'relations': {'观点词': [{'text': '混乱', 'start': 7, 'end': 9, 'probability': 0.9993566520321409}]}}, {'text': '票价', 'start': 11, 'end': 13, 'probability': 0.9856116411308662, 'relations': {'观点词': [{'text': '高', 'start': 14, 'end': 15, 'probability': 0.995628420935013}]}}]}]
+>>> pprint(ie("地址不错,服务一般,设施陈旧")) # Better print results using pprint
+[{'评价维度': [{'end': 2,
+            'probability': 0.9888139270606509,
+            'relations': {'情感倾向[正向,负向]': [{'probability': 0.998228967796706,
+                                           'text': '正向'}],
+                          '观点词': [{'end': 4,
+                                   'probability': 0.9927847072459528,
+                                   'start': 2,
+                                   'text': '不错'}]},
+            'start': 0,
+            'text': '地址'},
+           {'end': 12,
+            'probability': 0.9588297379365116,
+            'relations': {'情感倾向[正向,负向]': [{'probability': 0.9949389795770394,
+                                           'text': '负向'}],
+                          '观点词': [{'end': 14,
+                                   'probability': 0.9286753967902683,
+                                   'start': 12,
+                                   'text': '陈旧'}]},
+            'start': 10,
+            'text': '设施'},
+           {'end': 7,
+            'probability': 0.9592857070501211,
+            'relations': {'情感倾向[正向,负向]': [{'probability': 0.9952498258302498,
+                                           'text': '负向'}],
+                          '观点词': [{'end': 9,
+                                   'probability': 0.9949359182521675,
+                                   'start': 7,
+                                   'text': '一般'}]},
+            'start': 5,
+            'text': '服务'}]}]
 ```


@@ -531,15 +558,15 @@ from paddlenlp import Taskflow
 >>> ie.set_schema(schema)
 >>> pprint(ie('李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。')) # Better print results using pprint
 [{'丈夫': [{'end': 2,
-          'probability': 0.993496447299993,
+          'probability': 0.989690572797457,
           'relations': {'妻子': [{'end': 16,
-                                'probability': 0.9994008822614759,
+                                'probability': 0.9987625986790256,
                                 'start': 13,
                                 'text': '武则天'}]},
           'start': 0,
           'text': '李治'}],
   '寺庙': [{'end': 12,
-          'probability': 0.998334669586864,
+          'probability': 0.9888581774497425,
           'start': 9,
           'text': '感业寺'}]}]
 ```
@@ -563,20 +590,20 @@ from paddlenlp import Taskflow
 >>> schema = ['时间', '选手', '赛事名称']
 >>> ie = Taskflow('information_extraction', schema=schema, model="uie-tiny")
 >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")
-[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.9939956659967066}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.8323544377549155}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.624098394612048}]}]
+[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.9492842181233527}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.7277186614493836}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.8751028059367947}]}]
 ```

 #### Custom training

-For simple extraction targets, you can use ```paddlenlp.Taskflow``` directly for zero-shot extraction. For specialized scenarios, we recommend [custom training](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/uie) (fine-tuning the model on a small amount of labeled data) to further improve results.
+For simple extraction targets, you can use ```paddlenlp.Taskflow``` directly for zero-shot extraction. For specialized scenarios, we recommend [custom training](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) (fine-tuning the model on a small amount of labeled data) to further improve results.

 We ran experiments on our self-built test sets for three vertical domains (internet, medical, and finance):

 <table>
-<tr><th row_span='2'><th colspan='2'>Internet<th colspan='2'>Medical<th colspan='2'>Finance
+<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet
 <tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
-<tr><td>uie-tiny<td>75.92<td>78.45<td>63.34<td>74.65<td>42.03<td>65.78
-<tr><td>uie-base<td>80.13<td>81.53<td>66.71<td>79.94<td>41.29<td>70.91
+<tr><td>uie-tiny<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
+<tr><td>uie-base<td>46.43<td>70.92<td>71.83<td>85.72<td>78.33<td>81.86
 </table>

 0-shot means predicting directly with ```paddlenlp.Taskflow``` without any training data; 5-shot means fine-tuning the model on 5 labeled examples.
@@ -826,8 +853,8 @@ from paddlenlp import Taskflow
 | `Taskflow("pos_tagging")` | `$HOME/.paddlenlp/taskflow/lac` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
 | `Taskflow("ner", mode="fast")` | `$HOME/.paddlenlp/taskflow/lac` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
 | `Taskflow("ner", mode="accurate")` | `$HOME/.paddlenlp/taskflow/wordtag` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
-| `Taskflow("information_extraction", model="uie-base")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-base` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/uie) |
-| `Taskflow("information_extraction", model="uie-tiny")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-tiny` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/uie) |
+| `Taskflow("information_extraction", model="uie-base")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-base` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
+| `Taskflow("information_extraction", model="uie-tiny")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-tiny` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
 | `Taskflow("text_correction", model="ernie-csc")` | `$HOME/.paddlenlp/taskflow/text_correction/ernie-csc` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_correction/ernie-csc) |
 | `Taskflow("dependency_parsing", model="ddparser")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
 | `Taskflow("dependency_parsing", model="ddparser-ernie-1.0")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser-ernie-1.0` | [Example](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
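
The nested result printed in the updated opinion-extraction example can be awkward to consume directly. As a minimal sketch, the helper below walks a result of that shape and flattens it into (aspect, opinion, sentiment) triples. The name `flatten_opinions` is hypothetical and not part of PaddleNLP, and the sample data is copied from the example output, trimmed to the `text` and `relations` fields that the helper reads.

```python
# Sample shaped like the Taskflow output in the updated taskflow.md example,
# trimmed to the fields used below (offsets and probabilities omitted).
result = [{'评价维度': [
    {'text': '地址',
     'relations': {'情感倾向[正向,负向]': [{'text': '正向'}],
                   '观点词': [{'text': '不错'}]}},
    {'text': '设施',
     'relations': {'情感倾向[正向,负向]': [{'text': '负向'}],
                   '观点词': [{'text': '陈旧'}]}},
    {'text': '服务',
     'relations': {'情感倾向[正向,负向]': [{'text': '负向'}],
                   '观点词': [{'text': '一般'}]}},
]}]


def flatten_opinions(results,
                     aspect_key='评价维度',
                     opinion_key='观点词',
                     sentiment_key='情感倾向[正向,负向]'):
    """Hypothetical helper: flatten nested results into triples."""
    triples = []
    for doc in results:
        for aspect in doc.get(aspect_key, []):
            relations = aspect.get('relations', {})
            opinions = [o['text'] for o in relations.get(opinion_key, [])]
            sentiments = [s['text'] for s in relations.get(sentiment_key, [])]
            # Keep the first opinion and sentiment, or None when absent.
            triples.append((aspect['text'],
                            opinions[0] if opinions else None,
                            sentiments[0] if sentiments else None))
    return triples


triples = flatten_opinions(result)
for aspect, opinion, sentiment in triples:
    print(aspect, opinion, sentiment)
```

The key names are passed as parameters because they must match whatever schema was given to `set_schema`.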

examples/text_to_knowledge/nptag/data.py

Lines changed: 1 addition & 2 deletions

@@ -59,7 +59,6 @@ def convert_example(example,
         tokens,
         return_length=True,
         is_split_into_words=True,
-        pad_to_max_seq_len=True,
         max_seq_len=max_seq_len)

     label_indices = list(
@@ -70,7 +69,7 @@ def convert_example(example,

     label_tokens = list(example["label"]) + ["[PAD]"] * (max_cls_len -
                                                          len(example["label"]))
-    labels = np.full([max_seq_len], fill_value=-100, dtype=np.int64)
+    labels = np.full([inputs["seq_len"]], fill_value=-100, dtype=np.int64)
     labels[label_indices] = tokenzier.convert_tokens_to_ids(label_tokens)
     return inputs["input_ids"], inputs["token_type_ids"], labels
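
The `data.py` change sizes the label array to each example's real token count instead of `max_seq_len`, which leaves padding to the batch collator. A minimal pure-Python sketch of that idea follows. `build_labels` and the token ids are hypothetical, and plain lists stand in for the `np.full` call in the diff.

```python
IGNORE_INDEX = -100  # fill value used in the diff for unlabeled positions


def build_labels(seq_len, label_indices, label_ids):
    """Hypothetical stand-in for the label construction in convert_example."""
    # Sized to this example's own length, not to a global max_seq_len.
    labels = [IGNORE_INDEX] * seq_len
    for position, token_id in zip(label_indices, label_ids):
        labels[position] = token_id
    return labels


# Two examples of different lengths now yield label lists of different
# lengths, so they can no longer be stacked without padding.
short = build_labels(seq_len=8, label_indices=[5, 6], label_ids=[101, 102])
longer = build_labels(seq_len=12, label_indices=[9, 10], label_ids=[201, 202])
print(len(short), len(longer))
```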

examples/text_to_knowledge/nptag/predict.py

Lines changed: 2 additions & 2 deletions

@@ -54,8 +54,8 @@ def do_predict(data,
     ]

     batchify_fn = lambda samples, fn=Tuple(
-        Stack(dtype='int64'), # input_ids
-        Stack(dtype='int64'), # token_type_ids
+        Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype='int64'), # input_ids
+        Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype='int64'), # token_type_ids
         Stack(dtype='int64'), # label_indices
     ): fn(samples)

examples/text_to_knowledge/nptag/train.py

Lines changed: 4 additions & 4 deletions

@@ -23,7 +23,7 @@
 import paddle.nn.functional as F
 from paddlenlp.utils.log import logger
 from paddlenlp.transformers import ErnieCtmNptagModel, ErnieCtmTokenizer, LinearDecayWithWarmup
-from paddlenlp.data import Stack, Tuple
+from paddlenlp.data import Pad, Stack, Tuple
 from paddlenlp.datasets import load_dataset

 from data import convert_example, create_dataloader, read_custom_data
@@ -108,9 +108,9 @@ def do_train(args):
         convert_example, tokenzier=tokenizer, max_seq_len=args.max_seq_len)

     batchify_fn = lambda samples, fn=Tuple(
-        Stack(dtype='int64'), # input_ids
-        Stack(dtype='int64'), # token_type_ids
-        Stack(dtype='int64'), # labels
+        Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype='int64'), # input_ids
+        Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype='int64'), # token_type_ids
+        Pad(axis=0, pad_val=-100, dtype='int64'), # labels
     ): fn(samples)

     train_data_loader = create_dataloader(
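
Replacing `Stack` with `Pad` in the collator means each batch is padded only to its own longest sample. The sketch below imitates that behavior with plain lists. `pad_batch` and `batchify` are hypothetical stand-ins, not the `paddlenlp.data.Pad` implementation, and the pad ids of 0 are assumed purely for illustration.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_token_id=0, pad_token_type_id=0, ignore_index=-100):
    """Collate (input_ids, token_type_ids, labels) samples field by field."""
    input_ids, token_type_ids, labels = zip(*samples)
    return (pad_batch(input_ids, pad_token_id),
            pad_batch(token_type_ids, pad_token_type_id),
            # Labels are padded with -100, matching the fill value in data.py.
            pad_batch(labels, ignore_index))


samples = [
    ([1, 2, 3], [0, 0, 0], [-100, 7, -100]),
    ([4, 5, 6, 7, 8], [0, 0, 0, 0, 0], [-100, -100, 9, -100, -100]),
]
batch_ids, batch_types, batch_labels = batchify(samples)
print(batch_ids)
print(batch_labels)
```

Padding labels with the same `-100` fill value keeps padded positions indistinguishable from unlabeled ones, which is presumably why the diff uses `pad_val=-100` for the third field.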

model_zoo/uie/README.md

Lines changed: 6 additions & 6 deletions

@@ -64,15 +64,15 @@ UIE can extract structured key-field information from natural language text,
 >>> ie = Taskflow('information_extraction', schema=schema)
 >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
 [{'时间': [{'end': 6,
-          'probability': 0.9907337794563702,
+          'probability': 0.9857378532924486,
           'start': 0,
           'text': '2月8日上午'}],
   '赛事名称': [{'end': 23,
-            'probability': 0.8944205558197353,
+            'probability': 0.8503089953268272,
             'start': 6,
             'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
   '选手': [{'end': 31,
-          'probability': 0.8914297225026147,
+          'probability': 0.8981548639781138,
           'start': 28,
           'text': '谷爱凌'}]}]
 ```
@@ -247,10 +247,10 @@ python evaluate.py \
 We ran experiments on our self-built test sets for three vertical domains (internet, medical, and finance):

 <table>
-<tr><th row_span='2'><th colspan='2'>Internet<th colspan='2'>Medical<th colspan='2'>Finance
+<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet
 <tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
-<tr><td>uie-tiny<td>75.92<td>78.45<td>63.34<td>74.65<td>42.03<td>65.78
-<tr><td>uie-base<td>80.13<td>81.53<td>66.71<td>79.94<td>41.29<td>70.91
+<tr><td>uie-tiny<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
+<tr><td>uie-base<td>46.43<td>70.92<td>71.83<td>85.72<td>78.33<td>81.86
 </table>

 0-shot means predicting directly with ```paddlenlp.Taskflow``` without any training data; 5-shot means fine-tuning the model on 5 labeled examples. The experiments show that UIE can be further improved in vertical-domain scenarios with a small amount of data (few-shot).
