
Commit 0d9f117

Add ernie doc (#613)
* add Ernie-Doc modeling, tokenizer
* add ernie-doc multihead attention
* add encoder
* add ErnieDocModel, ErnieDocForSequenceClassification, ErnieDocForTokenClassification, ErnieDocForQuestionAnswering
* add ErnieDocTokenizer, BPETokenizer
* add data preprocess
* fix embedding norm
* free the previous useless memory to save gpu memory
* add shuffle sample for ClassifierIterator
* add best model save
* upload ernie_doc base model
* add tokenizer save and from pretrained
* decrease memory cost
* remove cast dtype
* use try_import for nltk, regex
* add init README.md
* add different tokenizer for different tasks
* add iflytek classifier task
* add layerwise decay optimizer
* AdamLW -> AdamWDL
* fix data iterator
* eval same samples
* add hyp dataset
* add hyp preprocessor
* temp
* fix the wrong acc calculation
* add ernie_doc metrics
* Revert "add hyp dataset" (this reverts commit c6c3353)
* Add hyp thucnews (#712)
  * add hyp dataset
  * add thucnews
  * add more comments
  * fix some thucnews dataset label path
  * add THUCNews, HYP dataset list
  * remove LABELS member

  Co-authored-by: smallv0221 <[email protected]>
* fix some print when evaling
* add init mrc script
* add chinese tokenize
* add MRCIterator
* add mrc loss
* add EM_AND_F1 for qa tasks
* add dropout argument for ernie-doc task
* optimize output of evaluation of mrc
* PreProcessor -> Preprocessor
* add model save for mrc
* fix mrc eval print
* add cls token index of sequence
* add c3 mrc dataset
* add MCQIterator
* add mcq task
* fix c3 bug
* fix MCQIterator
* add cail2019 scm dataset
* upgrade MCQIterator
* add semantic matching
* add simnet model based on ernie_doc
* change SemanticMatchingIterator
* fix semantic matching save model
* add label key for c3 dataset iterator
* Revert "add cail2019 scm dataset" (this reverts commit a7aefdc)
* Revert "add c3 mrc dataset" (this reverts commit d240639)
* Add C3, TriviaQa, CAIL2019-SCM dataset (#754)
  * add c3 mrc dataset
  * add triviaqa dataset
  * add cail2019 scm dataset
  * add dataset desc
  * update according to advices
  * add label key in c3 dataset
  * add triviaqa task; fix mcq, cail task bug
  * dev -> eval
  * fix mcq data iterator
  * fix _get_samples for mcq
  * fix c3 training
  * ErnieDocSimNet -> ErnieDocForTextMatching
  * optimize data.py
* add sequence labeling for ernie doc
* remove useless comments
* add README.md for ernie_doc
* add contents for ernie doc readme.md
* fix readme.md
* add ERNIE-DOC BPETokenizer
* remove save test model
* use paddle.metric.Accuracy instead of Acc
* fix README.md
* move adamwdl to paddlenlp.ops.optimizer
* add __rel_shift comments
* optimize comments
* add comments for tokenizer
* add comments for ErnieDocModel, ErnieDocForSequenceClassification, ErnieDocForTokenClassification, ErnieDocForQuestionAnswering
* add ernie_doc api doc reference
* fix dtype of semantic matching
* reset metric when finishing evaluation
* add adamwdl docs
* add adamwdl, ernie_doc navigation
* add more comments of adamwdl
* fix reference of ernie-doc in pretraining model
* add adamwdl formula
* fix gather_idxs

Co-authored-by: smallv0221 <[email protected]>
1 parent bc178ab commit 0d9f117

24 files changed: +5,183 −21 lines

docs/model_zoo/transformers.rst

Lines changed: 13 additions & 2 deletions

```diff
@@ -9,8 +9,8 @@ PaddleNLP为用户提供了常用的 ``BERT``、``ERNIE``、``ALBERT``、``RoBER
 Transformer预训练模型汇总
 ------------------------------------
 
-下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **68** 种预训练的参数权重供用户使用,
-其中包含了 **33** 种中文语言模型的预训练权重。
+下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **70** 种预训练的参数权重供用户使用,
+其中包含了 **34** 种中文语言模型的预训练权重。
 
 +--------------------+-------------------------------------+--------------+-----------------------------------------+
 | Model | Pretrained Weight | Language | Details of the model |
@@ -171,6 +171,14 @@ Transformer预训练模型汇总
 | | | | 16-heads, 336M parameters. |
 | | | | Trained on lower-cased English text. |
 +--------------------+-------------------------------------+--------------+-----------------------------------------+
+|ERNIE-DOC_ |``ernie-doc-base-zh`` | Chinese | 12-layer, 768-hidden, |
+| | | | 12-heads, 108M parameters. |
+| | | | Trained on Chinese text. |
+| +-------------------------------------+--------------+-----------------------------------------+
+| |``ernie-doc-base-en`` | English | 12-layer, 768-hidden, |
+| | | | 12-heads, 103M parameters. |
+| | | | Trained on lower-cased English text. |
++--------------------+-------------------------------------+--------------+-----------------------------------------+
 |ERNIE-GEN_ |``ernie-gen-base-en`` | English | 12-layer, 768-hidden, |
 | | | | 12-heads, 108M parameters. |
 | | | | Trained on lower-cased English text. |
@@ -332,6 +340,8 @@ Transformer预训练模型适用任务汇总
 +--------------------+-------------------------+----------------------+--------------------+-----------------+
 |ERNIE_ |||||
 +--------------------+-------------------------+----------------------+--------------------+-----------------+
+|ERNIE-DOC_ |||||
++--------------------+-------------------------+----------------------+--------------------+-----------------+
 |ERNIE-GEN_ |||||
 +--------------------+-------------------------+----------------------+--------------------+-----------------+
 |ERNIE-GRAM_ |||||
@@ -357,6 +367,7 @@ Transformer预训练模型适用任务汇总
 .. _DistilBert: https://arxiv.org/abs/1910.01108
 .. _ELECTRA: https://arxiv.org/abs/2003.10555
 .. _ERNIE: https://arxiv.org/abs/1904.09223
+.. _ERNIE-DOC: https://arxiv.org/abs/2012.15688
 .. _ERNIE-GEN: https://arxiv.org/abs/2001.11314
 .. _ERNIE-GRAM: https://arxiv.org/abs/2010.12148
 .. _GPT: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
```
Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+adamwdl
+====================================
+
+.. automodule:: paddlenlp.ops.optimizer.adamwdl
+   :members:
+   :no-undoc-members:
+   :show-inheritance:
```

docs/source/paddlenlp.ops.optimizer.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -12,3 +12,4 @@ optimizer
 
    paddlenlp.ops.optimizer.AdamwOptimizer
    paddlenlp.ops.optimizer.adamw
+   paddlenlp.ops.optimizer.adamwdl
```
Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+modeling
+==============================================
+
+.. automodule:: paddlenlp.transformers.ernie_doc.modeling
+   :members:
+   :no-undoc-members:
+   :show-inheritance:
```
Lines changed: 14 additions & 0 deletions

```diff
@@ -0,0 +1,14 @@
+ernie_doc
+======================================
+
+.. automodule:: paddlenlp.transformers.ernie_doc
+   :members:
+   :no-undoc-members:
+   :show-inheritance:
+
+
+.. toctree::
+   :maxdepth: 4
+
+   paddlenlp.transformers.ernie_doc.modeling
+   paddlenlp.transformers.ernie_doc.tokenizer
```
Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+tokenizer
+===============================================
+
+.. automodule:: paddlenlp.transformers.ernie_doc.tokenizer
+   :members:
+   :no-undoc-members:
+   :show-inheritance:
```

docs/source/paddlenlp.transformers.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -17,6 +17,7 @@ paddlenlp.transformers
    paddlenlp.transformers.electra
    paddlenlp.transformers.ernie
    paddlenlp.transformers.ernie_ctm
+   paddlenlp.transformers.ernie_doc
    paddlenlp.transformers.ernie_gen
    paddlenlp.transformers.ernie_gram
    paddlenlp.transformers.gpt
```
Lines changed: 210 additions & 0 deletions (new file, @@ -0,0 +1,210 @@)
# ERNIE-Doc

* [Model Introduction](#model-introduction)
* [Quick Start](#quick-start)
  * [Dependencies](#dependencies)
  * [Common Arguments](#common-arguments)
  * [Classification Tasks](#classification-tasks)
  * [Reading Comprehension Tasks](#reading-comprehension-tasks)
  * [Semantic Matching Task](#semantic-matching-task)
  * [Sequence Labeling Task](#sequence-labeling-task)
* [Acknowledgements](#acknowledgements)
* [References](#references)
## Model Introduction

[ERNIE-Doc](https://arxiv.org/abs/2012.15688) is a pre-training model for long documents proposed by Baidu NLP. On top of the recurrence transformer mechanism, it introduces a novel two-phase retrospective learning scheme and an enhanced recurrence mechanism, which enlarge the model's effective receptive field and strengthen its understanding of long documents.

This project is a PaddlePaddle dynamic-graph implementation of ERNIE-Doc, covering model training and evaluation. The layout of this example is:

```text
.
├── README.md                  # This document
├── data.py                    # Data preprocessing
├── metrics.py                 # Metrics for ERNIE-Doc downstream tasks
├── model.py                   # Downstream task models
├── optimization.py            # Optimization algorithms
├── run_classifier.py          # Classification tasks
├── run_mcq.py                 # Reading comprehension, multiple-choice questions
├── run_mrc.py                 # Extractive reading comprehension
├── run_semantic_matching.py   # Semantic matching task
└── run_sequence_labeling.py   # Sequence labeling task
```
## Quick Start

### Dependencies

- nltk
- beautifulsoup4

Install with: `pip install nltk beautifulsoup4` (the `re` module is part of the Python standard library and needs no installation).
### Common Arguments

- `model_name_or_path`: the pretrained model (and its tokenizer) to fine-tune. Currently supported: "ernie-doc-base-zh" and "ernie-doc-base-en". A local directory containing saved model files can also be given, e.g. "./checkpoint/model_xx/".
- `dataset`: the dataset to load for fine-tuning.
- `memory_length`: the number of tokens from the current segment that are carried over as memory features for the next sample.
- `max_seq_length`: the maximum sequence length; text beyond this length is split off into the next sample.
- `batch_size`: the number of samples per iteration **on each card**.
- `learning_rate`: the base learning rate, which is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `epochs`: the number of training epochs.
- `logging_steps`: the interval, in steps, between log prints.
- `save_steps`: the interval, in steps, between model saves and evaluations.
- `output_dir`: the directory where the model is saved.
- `device`: the device used for training; 'gpu' for GPU, 'xpu' for Baidu Kunlun cards, 'cpu' for CPU.
- `seed`: the random seed.
- `weight_decay`: the weight decay coefficient of AdamW.
- `warmup_proportion`: the learning rate warmup proportion.
- `layerwise_decay`: the per-layer decay rate for AdamW with layerwise decay.
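Layerwise decay scales each layer's learning rate by a power of the decay rate, so that lower layers are updated more conservatively than the top layers. A minimal sketch of the usual formulation follows; the exact layer indexing used by AdamWDL in `paddlenlp.ops.optimizer` may differ:

```python
def layerwise_lr(base_lr, layerwise_decay, depth, n_layers):
    """Learning rate for the layer at `depth` (0 = embeddings, n_layers = top).

    Each step away from the top layer multiplies the rate by
    `layerwise_decay`, giving lower layers smaller updates.
    """
    return base_lr * layerwise_decay ** (n_layers - depth)

# Example: 12 transformer layers, base lr 1e-4, decay rate 0.8.
rates = [layerwise_lr(1e-4, 0.8, d, 12) for d in range(13)]
# The top layer keeps the full base lr; the embedding layer gets 0.8**12 of it.
```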
Since the appropriate hyperparameters vary considerably across tasks and datasets, they are not listed one by one here; see the appendix of the [ERNIE-Doc](https://arxiv.org/abs/2012.15688) paper for the exact settings.
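The interaction between `max_seq_length` and `memory_length` described above can be sketched as follows: a long token sequence is cut into pieces of at most `max_seq_length`, and the trailing `memory_length` tokens of each piece serve as recurrence memory for the next. This is an illustrative sketch only, not the actual logic in `data.py`:

```python
def split_with_memory(tokens, max_seq_length, memory_length):
    """Cut `tokens` into segments of at most `max_seq_length`, pairing each
    segment with the last `memory_length` tokens of the previous one."""
    segments = []
    memory = []
    for start in range(0, len(tokens), max_seq_length):
        segment = tokens[start:start + max_seq_length]
        segments.append((memory, segment))
        memory = segment[-memory_length:]
    return segments

pairs = split_with_memory(list(range(10)), max_seq_length=4, memory_length=2)
# → [([], [0, 1, 2, 3]), ([2, 3], [4, 5, 6, 7]), ([6, 7], [8, 9])]
```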
### Classification Tasks

The classification task supports evaluation on several datasets; `imdb`, `iflytek`, `thucnews`, and `hyp` are currently supported (see [PaddleNLP text classification datasets](../../../docs/data_prepare/dataset_list.md) for descriptions). The dataset is selected with the `dataset` argument; the example below runs classification on `imdb`.

#### Single-card training

```shell
python run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en
```

#### Multi-card training

```shell
python -m paddle.distributed.launch --gpus "0,1" --log_dir imdb run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en
```

After fine-tuning on `imdb`, `iflytek`, `thucnews`, and `hyp`, the results on the dev sets are:

| Dataset  | Model             | Dev ACC |
|:--------:|:-----------------:|:-------:|
| IMDB     | ernie-doc-base-en | 0.9506  |
| THUCNews | ernie-doc-base-zh | 0.9854  |
| HYP      | ernie-doc-base-en | 0.7412  |
| IFLYTEK  | ernie-doc-base-zh | 0.6179  |
### Reading Comprehension Tasks

Reading comprehension covers both extractive question answering and multiple-choice questions.

- Extractive reading comprehension

Extractive reading comprehension currently supports the `dureader_robust`, `drcd`, and `cmrc2018` datasets. The dataset is selected with the `dataset` argument; the example below runs on `dureader_robust`.

#### Single-card training

```shell
python run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4
```

#### Multi-card training

```shell
python -m paddle.distributed.launch --gpus "0,1" --log_dir dureader_robust run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4
```

After fine-tuning on `dureader_robust`, `drcd`, and `cmrc2018`, the results on the dev sets are:

| Dataset         | Model             | Dev EM/F1     |
|:---------------:|:-----------------:|:-------------:|
| DuReader-robust | ernie-doc-base-zh | 0.7481/0.8637 |
| DRCD            | ernie-doc-base-zh | 0.8879/0.9392 |
| CMRC2018        | ernie-doc-base-zh | 0.7061/0.9004 |
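The EM/F1 numbers above are the standard extractive-QA metrics: exact match checks whether the predicted span equals a gold answer, while F1 credits partial overlap. A simplified character-level sketch (the EM_AND_F1 implementation in `metrics.py` may normalize answers differently):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the predicted span equals the gold span, else 0.0."""
    return float(prediction == gold)

def f1(prediction, gold):
    """Character-level F1 between a predicted and a gold answer span."""
    common = Counter(prediction) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A half-right prediction gets partial F1 credit but zero EM.
```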
- Multiple-choice questions

[C3](https://github.com/nlpdata/c3) is the first free-form multiple-choice Chinese machine reading comprehension dataset. Each sample provides a context (a passage or a dialogue), a question, and up to four answer options, from which the correct one must be chosen.

PaddleNLP provides the `C3` multiple-choice reading comprehension dataset; run the task with the commands below.

#### Single-card training

```shell
python run_mcq.py --batch_size 8
```

#### Multi-card training

```shell
python -m paddle.distributed.launch --gpus "0,1" --log_dir mcq run_mcq.py --batch_size 8
```

After fine-tuning on `C3`, the results on the dev and test sets are:

| Dataset | Model             | Dev/Test Acc  |
|:-------:|:-----------------:|:-------------:|
| C3      | ernie-doc-base-zh | 0.7573/0.7583 |
### Semantic Matching Task

The [CAIL2019 SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) dataset consists of legal documents published on China Judgments Online; each sample is a triplet of documents, written `(A, B, C)`, where each element is one document. The task is to decide whether similarity(A, B) is greater than similarity(A, C).
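Evaluation on such triplets reduces to one binary decision per sample. A minimal sketch, assuming some scoring function has already produced similarity values (a hypothetical stand-in for the output of ErnieDocForTextMatching):

```python
def scm_accuracy(triplet_scores):
    """Accuracy over (sim_ab, sim_ac, label) triplets, where label is True
    when B is the document more similar to A."""
    correct = sum((sim_ab > sim_ac) == label
                  for sim_ab, sim_ac, label in triplet_scores)
    return correct / len(triplet_scores)

scores = [(0.9, 0.4, True), (0.2, 0.7, False), (0.6, 0.8, True)]
# → 2 of 3 decisions match the labels, accuracy ≈ 0.667
```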
Run the task with the commands below.

#### Single-card training

```shell
python run_semantic_matching.py --batch_size 6 --learning_rate 2e-5
```

#### Multi-card training

```shell
python -m paddle.distributed.launch --gpus "0,1" --log_dir cail run_semantic_matching.py --batch_size 6 --learning_rate 2e-5
```

After fine-tuning on `CAIL2019-SCM`, the results on the dev and test sets are:

| Dataset      | Model             | Dev/Test Acc  |
|:------------:|:-----------------:|:-------------:|
| CAIL2019-SCM | ernie-doc-base-zh | 0.6420/0.6484 |
### Sequence Labeling Task

The MSRA-NER dataset, released by Microsoft Research Asia, targets the recognition of named entities in text, chiefly person, location, and organization names. An example:

```
不\002久\002前\002,\002中\002国\002共\002产\002党\002召\002开\002了\002举\002世\002瞩\002目\002的\002第\002十\002五\002次\002全\002国\002代\002表\002大\002会\002。 O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O\002O\002O\002O\002O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O
这\002次\002代\002表\002大\002会\002是\002在\002中\002国\002改\002革\002开\002放\002和\002社\002会\002主\002义\002现\002代\002化\002建\002设\002发\002展\002的\002关\002键\002时\002刻\002召\002开\002的\002历\002史\002性\002会\002议\002。 O\002O\002O\002O\002O\002O\002O\002O\002B-LOC\002I-LOC\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O
```

The MSRA-NER dataset bundled with PaddleNLP adjusts the file format: on each line, the text and the labels are separated by the special character "\t", and the characters within each field are separated by the special character "\002".
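Parsing that format is a matter of splitting on the two separators. A small illustrative sketch, not the project's actual loader:

```python
def parse_msra_line(line):
    """Split one MSRA-NER line into (characters, labels) using the
    "\t" field separator and the "\002" intra-field separator."""
    text, labels = line.rstrip("\n").split("\t")
    return text.split("\002"), labels.split("\002")

chars, tags = parse_msra_line("中\002国\tB-LOC\002I-LOC")
# → chars == ["中", "国"], tags == ["B-LOC", "I-LOC"]
```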
Run the sequence labeling task with the commands below.

#### Single-card training

```shell
python run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5
```

#### Multi-card training

```shell
python -m paddle.distributed.launch --gpus "0,1" --log_dir msra_ner run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5
```

After fine-tuning on `MSRA-NER`, the best results on the dev and test sets are:

| Dataset  | Model             | Precision/Recall/F1  |
|:--------:|:-----------------:|:--------------------:|
| MSRA-NER | ernie-doc-base-zh | 0.9288/0.9139/0.9213 |
## Acknowledgements

* Thanks to [Baidu NLP](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc) for the open-source ERNIE-Doc implementation and the pretrained models.

## References

* Siyu Ding, Junyuan Shang et al. "ERNIE-Doc: A Retrospective Long-Document Modeling Transformer". ACL, 2021.
