Commit 6f953a9

Add UIE-M (#3192)
* Add UIE-M
* Update README.md
* Update taskflow.md
* Update README.md
* Update README.md
1 parent ab2bd21 commit 6f953a9

File tree

8 files changed

+304
-96
lines changed

docs/model_zoo/taskflow.md

Lines changed: 61 additions & 19 deletions
@@ -668,7 +668,7 @@ from paddlenlp import Taskflow
668668
```python
669669
>>> schema = [{'Person': ['Company', 'Position']}]
670670
>>> ie_en.set_schema(schema)
671-
>>> ie_en('In 1997, Steve was excited to become the CEO of Apple.')
671+
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
672672
[{'Person': [{'end': 14,
673673
'probability': 0.999631971804547,
674674
'relations': {'Company': [{'end': 53,
@@ -711,7 +711,7 @@ from paddlenlp import Taskflow
711711
[{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}]
712712
```
713713
714-
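- The English model **does not yet support event extraction**
714+
- The English model **does not yet support event extraction** in zero-shot mode; if you have an English event-extraction corpus, fine-tune a custom model on it.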
715715
716716
#### Review Opinion Extraction
717717
@@ -770,19 +770,19 @@ from paddlenlp import Taskflow
770770
English model usage example:
771771
772772
```python
773-
>>> schema = [{'Comment object': ['Opinion', 'Sentiment classification [negative, positive]']}]
773+
>>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]
774774
>>> ie_en.set_schema(schema)
775-
>>> ie_en("overall i 'm happy with my toy.")
776-
[{'Comment object': [{'end': 30,
777-
'probability': 0.9774399346859042,
778-
'relations': {'Opinion': [{'end': 18,
779-
'probability': 0.6168918705033555,
780-
'start': 13,
781-
'text': 'happy'}],
782-
'Sentiment classification [negative, positive]': [{'probability': 0.9999556545777182,
783-
'text': 'positive'}]},
784-
'start': 24,
785-
'text': 'my toy'}]}]
775+
>>> pprint(ie_en("The teacher is very nice."))
776+
[{'Aspect': [{'end': 11,
777+
'probability': 0.4301476415932193,
778+
'relations': {'Opinion': [{'end': 24,
779+
'probability': 0.9072940447883724,
780+
'start': 15,
781+
'text': 'very nice'}],
782+
'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685,
783+
'text': 'positive'}]},
784+
'start': 4,
785+
'text': 'teacher'}]}]
786786
```
787787
788788
#### Sentiment Classification
@@ -811,7 +811,7 @@ from paddlenlp import Taskflow
811811
English model usage example:
812812
813813
```python
814-
>>> schema = [{'Person': ['Company', 'Position']}]
814+
>>> schema = 'Sentiment classification [negative, positive]'
815815
>>> ie_en.set_schema(schema)
816816
>>> ie_en('I am sorry but this is the worst film I have ever seen in my life.')
817817
[{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}]
@@ -874,8 +874,10 @@ from paddlenlp import Taskflow
874874
| `uie-mini`| 6-layers, 384-hidden, 12-heads | Chinese |
875875
| `uie-micro`| 4-layers, 384-hidden, 12-heads | Chinese |
876876
| `uie-nano`| 4-layers, 312-hidden, 12-heads | Chinese |
877+
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Chinese, English |
878+
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | Chinese, English |
877879
878-
- `uie-nano` usage example:
880+
- `uie-nano` usage example:
879881
880882
```python
881883
>>> from paddlenlp import Taskflow
@@ -886,6 +888,41 @@ from paddlenlp import Taskflow
886888
[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}]
887889
```
888890

891+
- `uie-m-base` and `uie-m-large` support mixed Chinese-English extraction. Usage example:
892+
893+
```python
894+
>>> from pprint import pprint
895+
>>> from paddlenlp import Taskflow
896+
897+
>>> schema = ['Time', 'Player', 'Competition', 'Score']
898+
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en")
899+
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"]))
900+
[{'Competition': [{'end': 23,
901+
'probability': 0.9373889907291257,
902+
'start': 6,
903+
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
904+
'Player': [{'end': 31,
905+
'probability': 0.6981119555336441,
906+
'start': 28,
907+
'text': '谷爱凌'}],
908+
'Score': [{'end': 39,
909+
'probability': 0.9888507878270296,
910+
'start': 32,
911+
'text': '188.25分'}],
912+
'Time': [{'end': 6,
913+
'probability': 0.9784080036931151,
914+
'start': 0,
915+
'text': '2月8日上午'}]},
916+
{'Competition': [{'end': 35,
917+
'probability': 0.9851549932171295,
918+
'start': 18,
919+
'text': 'French Open Final'}],
920+
'Player': [{'end': 12,
921+
'probability': 0.9379371275888104,
922+
'start': 0,
923+
'text': 'Rafael Nadal'}]}]
924+
```
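The result above is a list with one dictionary per input text, mapping each schema label to a list of spans. As a minimal sketch of post-processing that structure, the hypothetical helper `best_spans` below (not part of PaddleNLP) keeps the highest-probability span text per label and drops spans under a caller-chosen cutoff. The sample data is copied from the output above with probabilities shortened, so the snippet runs without loading a model.

```python
# Sample copied from the mixed-language output above (probabilities rounded).
results = [
    {
        "Competition": [{"end": 23, "probability": 0.9374, "start": 6,
                         "text": "北京冬奥会自由式滑雪女子大跳台决赛"}],
        "Player": [{"end": 31, "probability": 0.6981, "start": 28, "text": "谷爱凌"}],
        "Score": [{"end": 39, "probability": 0.9889, "start": 32, "text": "188.25分"}],
        "Time": [{"end": 6, "probability": 0.9784, "start": 0, "text": "2月8日上午"}],
    },
    {
        "Competition": [{"end": 35, "probability": 0.9852, "start": 18,
                         "text": "French Open Final"}],
        "Player": [{"end": 12, "probability": 0.9379, "start": 0, "text": "Rafael Nadal"}],
    },
]


def best_spans(result, min_prob=0.0):
    """Hypothetical helper: keep the most probable span text for each label."""
    best = {}
    for label, spans in result.items():
        # Drop spans whose reported probability is below the cutoff.
        kept = [span for span in spans if span["probability"] >= min_prob]
        if kept:
            best[label] = max(kept, key=lambda span: span["probability"])["text"]
    return best


for result in results:
    print(best_spans(result, min_prob=0.7))
```

With `min_prob=0.7`, the `Player` span of the first text (probability about 0.70) is filtered out, while the second text keeps both labels.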
925+
889926
#### Custom Training
890927

891928
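For simple extraction targets, you can use `paddlenlp.Taskflow` directly for zero-shot extraction. For specialized scenarios, we recommend [custom training](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) (fine-tuning the model on a small amount of labeled data) to further improve results.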
@@ -895,19 +932,24 @@ from paddlenlp import Taskflow
895932
<table>
896933
<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet
897934
<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
898-
<tr><td>uie-base (12L768H)<td><b>46.43</b><td><b>70.92</b><td><b>71.83</b><td><b>85.72</b><td><b>78.33</b><td><b>81.86</b>
935+
<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86
899936
<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
900937
<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38
901938
<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22
902939
<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35
903940
</table>
941+
<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>
942+
<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13
943+
</table>
904944

905-
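0-shot means predicting directly with `paddlenlp.Taskflow` without any training data; 5-shot means fine-tuning the model on 5 labeled examples. **Experiments show that UIE can be further improved in vertical-domain scenarios with a small amount of data (few-shot).**
945+
0-shot means predicting directly with `paddlenlp.Taskflow` without any training data; 5-shot means fine-tuning the model on 5 labeled examples per category. **Experiments show that UIE can be further improved in vertical-domain scenarios with a small amount of data (few-shot).**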
906946

907947
#### Configurable Parameters
948+
949+
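* `schema`: defines the extraction targets; see the usage examples for each task in the out-of-the-box section for how to configure it.
950+
* `schema_lang`: sets the language of the schema. Defaults to `zh`; options are `zh` and `en`. Chinese and English schemas are constructed differently, so the schema language must be specified. This parameter only takes effect for the `uie-m-base` and `uie-m-large` models.
908951
* `batch_size`: batch size; adjust it to your hardware. Defaults to 1.
909952
* `model`: the model used for the task. Defaults to `uie-base`; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en`.
910-
* `schema`: defines the extraction targets; see the usage examples for each task in the out-of-the-box section for how to configure it.
911953
* `position_prob`: threshold on the model's predicted probability (between 0 and 1) for a span's start/end position; results below this threshold are dropped from the output. Defaults to 0.5. The final probability reported for a span is the product of its start-position and end-position probabilities.
912954
* `precision`: model precision. Defaults to `fp32`; options are `fp16` and `fp32`. `fp16` gives faster inference. If you choose `fp16`, first make sure the NVIDIA drivers and base software are installed correctly, **with CUDA>=11.2 and cuDNN>=8.1.1**; on first use, install the required dependencies as prompted (mainly, **make sure onnxruntime-gpu is installed**). Also make sure the GPU's CUDA Compute Capability is greater than 7.0; typical devices include V100, T4, A10, A100, and GTX 20-series and 30-series cards. For more on CUDA Compute Capability and supported precisions, see the NVIDIA documentation: [GPU hardware and supported precision matrix](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)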
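
The `position_prob` description can be illustrated with a small, self-contained sketch. It assumes per-token start and end probabilities are already available, and it pairs each surviving start with the nearest surviving end at or after it. That pairing rule and the function name `decode_spans` are assumptions made for illustration, not PaddleNLP's actual decoding code. The two points taken from the description are that the threshold is applied to positions, and that the span probability is the product of the start and end probabilities.

```python
def decode_spans(start_probs, end_probs, position_prob=0.5):
    """Illustrative decoder: threshold positions, then pair starts with ends."""
    # Keep only positions whose probability is not below the threshold.
    starts = [(i, p) for i, p in enumerate(start_probs) if p >= position_prob]
    ends = [(i, p) for i, p in enumerate(end_probs) if p >= position_prob]

    spans = []
    for start_idx, start_p in starts:
        # Assumed pairing rule: nearest surviving end at or after the start.
        for end_idx, end_p in ends:
            if end_idx >= start_idx:
                spans.append({
                    "start": start_idx,
                    "end": end_idx + 1,  # exclusive end, as in the outputs above
                    "probability": start_p * end_p,
                })
                break
    return spans


start_probs = [0.1, 0.9, 0.2, 0.3, 0.6]
end_probs = [0.0, 0.1, 0.8, 0.2, 0.7]
for span in decode_spans(start_probs, end_probs, position_prob=0.5):
    print(span)
```

Because the threshold applies to each position rather than to the product, a span can be reported with a final probability below `position_prob`. The second span here has probability 0.6 * 0.7, for example.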
913955
</div></details>

model_zoo/uie/README.md

Lines changed: 58 additions & 17 deletions
@@ -234,7 +234,7 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
234234
```python
235235
>>> schema = [{'Person': ['Company', 'Position']}]
236236
>>> ie_en.set_schema(schema)
237-
>>> ie_en('In 1997, Steve was excited to become the CEO of Apple.')
237+
>>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.'))
238238
[{'Person': [{'end': 14,
239239
'probability': 0.999631971804547,
240240
'relations': {'Company': [{'end': 53,
@@ -340,19 +340,19 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
340340
Usage example:
341341
342342
```python
343-
>>> schema = [{'Comment object': ['Opinion', 'Sentiment classification [negative, positive]']}]
343+
>>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]
344344
>>> ie_en.set_schema(schema)
345-
>>> ie_en("overall i 'm happy with my toy.")
346-
[{'Comment object': [{'end': 30,
347-
'probability': 0.9774399346859042,
348-
'relations': {'Opinion': [{'end': 18,
349-
'probability': 0.6168918705033555,
350-
'start': 13,
351-
'text': 'happy'}],
352-
'Sentiment classification [negative, positive]': [{'probability': 0.9999556545777182,
353-
'text': 'positive'}]},
354-
'start': 24,
355-
'text': 'my toy'}]}]
345+
>>> pprint(ie_en("The teacher is very nice."))
346+
[{'Aspect': [{'end': 11,
347+
'probability': 0.4301476415932193,
348+
'relations': {'Opinion': [{'end': 24,
349+
'probability': 0.9072940447883724,
350+
'start': 15,
351+
'text': 'very nice'}],
352+
'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685,
353+
'text': 'positive'}]},
354+
'start': 4,
355+
'text': 'teacher'}]}]
356356
```
357357
358358
<a name="情感分类"></a>
@@ -383,7 +383,7 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
383383
English model usage example:
384384
385385
```python
386-
>>> schema = [{'Person': ['Company', 'Position']}]
386+
>>> schema = 'Sentiment classification [negative, positive]'
387387
>>> ie_en.set_schema(schema)
388388
>>> ie_en('I am sorry but this is the worst film I have ever seen in my life.')
389389
[{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}]
@@ -450,9 +450,11 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
450450
| `uie-mini`| 6-layers, 384-hidden, 12-heads | Chinese |
451451
| `uie-micro`| 4-layers, 384-hidden, 12-heads | Chinese |
452452
| `uie-nano`| 4-layers, 312-hidden, 12-heads | Chinese |
453+
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Chinese, English |
454+
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | Chinese, English |
453455
454456
455-
- `uie-nano` usage example:
457+
- `uie-nano` usage example:
456458
457459
```python
458460
>>> from paddlenlp import Taskflow
@@ -463,6 +465,41 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
463465
[{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}]
464466
```
465467

468+
- `uie-m-base` and `uie-m-large` support mixed Chinese-English extraction. Usage example:
469+
470+
```python
471+
>>> from pprint import pprint
472+
>>> from paddlenlp import Taskflow
473+
474+
>>> schema = ['Time', 'Player', 'Competition', 'Score']
475+
>>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en")
476+
>>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"]))
477+
[{'Competition': [{'end': 23,
478+
'probability': 0.9373889907291257,
479+
'start': 6,
480+
'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
481+
'Player': [{'end': 31,
482+
'probability': 0.6981119555336441,
483+
'start': 28,
484+
'text': '谷爱凌'}],
485+
'Score': [{'end': 39,
486+
'probability': 0.9888507878270296,
487+
'start': 32,
488+
'text': '188.25分'}],
489+
'Time': [{'end': 6,
490+
'probability': 0.9784080036931151,
491+
'start': 0,
492+
'text': '2月8日上午'}]},
493+
{'Competition': [{'end': 35,
494+
'probability': 0.9851549932171295,
495+
'start': 18,
496+
'text': 'French Open Final'}],
497+
'Player': [{'end': 12,
498+
'probability': 0.9379371275888104,
499+
'start': 0,
500+
'text': 'Rafael Nadal'}]}]
501+
```
502+
466503
<a name="更多配置"></a>
467504

468505
#### 3.8 More Configuration
@@ -472,13 +509,15 @@ UIE is not limited to any industry or extraction target; below are some zero-shot industry examples
472509

473510
>>> ie = Taskflow('information_extraction',
474511
schema="",
512+
schema_lang="zh",
475513
batch_size=1,
476514
model='uie-base',
477515
position_prob=0.5,
478516
precision='fp32')
479517
```
480518

481519
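* `schema`: defines the extraction targets; see the usage examples for each task in the out-of-the-box section for how to configure it.
520+
* `schema_lang`: sets the language of the schema. Defaults to `zh`; options are `zh` and `en`. Chinese and English schemas are constructed differently, so the schema language must be specified. This parameter only takes effect for the `uie-m-base` and `uie-m-large` models.
482521
* `batch_size`: batch size; adjust it to your hardware. Defaults to 1.
483522
* `model`: the model used for the task. Defaults to `uie-base`; options are `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en`.
484523
* `position_prob`: threshold on the model's predicted probability (between 0 and 1) for a span's start/end position; results below this threshold are dropped from the output. Defaults to 0.5. The final probability reported for a span is the product of its start-position and end-position probabilities.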
@@ -726,14 +765,16 @@ python evaluate.py \
726765
<table>
727766
<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet
728767
<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot
729-
<tr><td>uie-base (12L768H)<td><b>46.43</b><td><b>70.92</b><td><b>71.83</b><td><b>85.72</b><td><b>78.33</b><td><b>81.86</b>
768+
<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86
730769
<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68
731770
<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38
732771
<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22
733772
<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35
773+
<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>
774+
<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13
734775
</table>
735776

736-
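0-shot means predicting directly with `paddlenlp.Taskflow` without any training data; 5-shot means fine-tuning the model on 5 labeled examples. **Experiments show that UIE can be further improved in vertical-domain scenarios with a small amount of data (few-shot).**
777+
0-shot means predicting directly with `paddlenlp.Taskflow` without any training data; 5-shot means fine-tuning the model on 5 labeled examples per category. **Experiments show that UIE can be further improved in vertical-domain scenarios with a small amount of data (few-shot).**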
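
To make the few-shot claim concrete, the sketch below recomputes the gain from 0-shot to 5-shot for every model and domain using the scores in the table above. The English domain keys and the helper name `few_shot_gains` are labels chosen here for illustration; the numbers are copied from the table as added in this commit.

```python
DOMAINS = ("finance", "medical", "internet")

# Each row: (0-shot, 5-shot) for finance, then medical, then internet.
SCORES = {
    "uie-base": (46.43, 70.92, 71.83, 85.72, 78.33, 81.86),
    "uie-medium": (41.11, 64.53, 65.40, 75.72, 78.32, 79.68),
    "uie-mini": (37.04, 64.65, 60.50, 78.36, 72.09, 76.38),
    "uie-micro": (37.53, 62.11, 57.04, 75.92, 66.00, 70.22),
    "uie-nano": (38.94, 66.83, 48.29, 76.74, 62.86, 72.35),
    "uie-m-large": (49.35, 74.55, 70.50, 92.66, 78.49, 83.02),
    "uie-m-base": (38.46, 74.31, 63.37, 87.32, 76.27, 80.13),
}


def few_shot_gains(scores):
    """Return the 5-shot minus 0-shot difference per model and domain."""
    gains = {}
    for model, row in scores.items():
        gains[model] = {
            domain: round(row[2 * i + 1] - row[2 * i], 2)
            for i, domain in enumerate(DOMAINS)
        }
    return gains


gains = few_shot_gains(SCORES)
for model, per_domain in gains.items():
    print(model, per_domain)
```

Every model improves in every domain, with the largest jumps in the finance column, where the 0-shot scores are lowest.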
737778

738779
<a name="模型部署"></a>
739780

model_zoo/uie/model.py

Lines changed: 4 additions & 5 deletions
@@ -28,11 +28,10 @@ def __init__(self, encoding_model):
2828
self.sigmoid = nn.Sigmoid()
2929

3030
def forward(self, input_ids, token_type_ids, pos_ids, att_mask):
31-
sequence_output, pooled_output = self.encoder(
32-
input_ids=input_ids,
33-
token_type_ids=token_type_ids,
34-
position_ids=pos_ids,
35-
attention_mask=att_mask)
31+
sequence_output, _ = self.encoder(input_ids=input_ids,
32+
token_type_ids=token_type_ids,
33+
position_ids=pos_ids,
34+
attention_mask=att_mask)
3635
start_logits = self.linear_start(sequence_output)
3736
start_logits = paddle.squeeze(start_logits, -1)
3837
start_prob = self.sigmoid(start_logits)
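
The visible part of `forward` projects each token's hidden vector to a single start logit, squeezes the trailing dimension, and applies a sigmoid, so every token gets an independent start probability. A dependency-free sketch of that computation for one sequence follows. The weight and bias values and the function name `start_probabilities` are made up for illustration, the batch dimension is omitted, and a projection output size of 1 is an assumption inferred from the `squeeze(..., -1)` call.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def start_probabilities(sequence_output, weight, bias):
    """Per-token start probability for one sequence (no batch dimension).

    sequence_output: list of token vectors, shape [seq_len][hidden]
    weight: projection vector of length hidden (assumed output size 1)
    bias: scalar bias of the projection
    """
    probs = []
    for token_vec in sequence_output:
        # Linear projection hidden -> 1; producing a scalar directly plays
        # the role of the squeeze on the last dimension.
        logit = sum(h * w for h, w in zip(token_vec, weight)) + bias
        probs.append(sigmoid(logit))
    return probs


# Toy input: two tokens with hidden size 2.
sequence_output = [[0.0, 0.0], [1.0, 2.0]]
print(start_probabilities(sequence_output, weight=[2.0, 1.0], bias=-1.0))
```

A sigmoid per token, rather than a softmax over the sequence, lets several tokens score high at once, which matches outputs that contain more than one span per label.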
