
Commit a5f8a3e

ZHUI and Zeyu Chen authored
[Pre-Training] Add tutorial for clue small 14g dataset (#1555)
* add tutorial for clue small 14g
* add pre-train weight to community
* fix typos
* fix typo
* add dataset link
* change name to ernie-1.0-cluecorpussmall

Co-authored-by: Zeyu Chen <[email protected]>
1 parent cf51c8a commit a5f8a3e

File tree

7 files changed: +234 −11 lines
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Detailed Introduction
These weights were trained on the CLUECorpusSmall 14G dataset with the [ERNIE-1.0 pre-training tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/ernie-1.0) provided by PaddleNLP.

The model architecture is identical to ernie-1.0, trained with the configuration `batch_size=512, max_steps=1000000`. The weights are used the same way as the original ernie-1.0 weights.

For the full pre-training pipeline, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/ernie-1.0/README.md

# Usage Examples

Example 1:
```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
model = ErnieForMaskedLM.from_pretrained('zhui/ernie-1.0-cluecorpussmall')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

outputs = model(masked_ids, token_type_ids=segment_ids)
prediction_scores = outputs
# index 3 is the position of [MASK] in `tokens`
prediction_index = paddle.argmax(prediction_scores[0, 3]).item()
predicted_token = tokenizer.convert_ids_to_tokens([prediction_index])[0]
print(tokens)
# ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
print(predicted_token)
# prints the token predicted for the [MASK] position
```

Example 2:
```python
import paddle
from paddlenlp.transformers import (AutoTokenizer, AutoModel,
                                    AutoModelForSequenceClassification,
                                    AutoModelForTokenClassification,
                                    AutoModelForQuestionAnswering)

tokenizer = AutoTokenizer.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
text = tokenizer('自然语言处理')

# Semantic representation
model = AutoModel.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = AutoModelForSequenceClassification.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
# Sequence labeling
model = AutoModelForTokenClassification.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
```
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
{
2+
"model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/model_config.json",
3+
"model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/model_state.pdparams",
4+
"tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/tokenizer_config.json",
5+
"vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/vocab.txt",
6+
}
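These entries map the community model's files to their hosted URLs; `from_pretrained('zhui/ernie-1.0-cluecorpussmall')` should resolve them automatically. For a quick manual check, any of the files can also be fetched directly — a standard-library sketch (the local filename is arbitrary):

```python
import urllib.request

# Illustrative sketch: fetch the vocab file listed above and peek at it.
url = ("https://bj.bcebos.com/paddlenlp/models/transformers/community/"
       "zhui/ernie-1.0-cluecorpussmall/vocab.txt")
urllib.request.urlretrieve(url, "vocab.txt")

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print(len(vocab))   # vocabulary size
print(vocab[:5])    # first few tokens
```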

examples/language_model/data_tools/README.md

Lines changed: 50 additions & 2 deletions
@@ -131,7 +131,7 @@ chinese words:
                        Optional. Whether the WWM (whole word masking) strategy is needed. In general, Bert/Ernie models need it; GPT does not.
  --cn_seg_func {lac,seg,jieba}
                        Words segment function for chinese words.
-                       Defaults to lac; jieba is faster.
+                       Defaults to jieba, which is faster; the lac model is more accurate but heavier to compute.
  --cn_splited          Is chinese corpus is splited in to words.
                        Optional; for pre-segmented text. If set, cn_seg_func has no effect.
                        e.g. the pre-segmented text "百度 手机助手 是 Android 手机 的 权威 资源平台"
@@ -148,7 +148,7 @@ common config:
  --workers WORKERS     Number of worker processes to launch
                        Number of processes used to convert text to token ids.
 ```
-Thorugh the script below, we obtain the processed pre-training data: the token ids `baike_sample_ids.npy` and the article index `baike_sample_idx.npz`.
+Through the script below, we obtain the processed pre-training data: the token ids `baike_sample_ids.npy` and the article index `baike_sample_idx.npz`.
 ```
 python -u create_pretraining_data.py \
     --model_name ernie-1.0 \
@@ -190,3 +190,51 @@ sh run_static.sh
## References

Note: Most of the data pipeline is adapted from [Megatron](https://github.com/NVIDIA/Megatron-LM); many thanks to its authors.


# Appendix

## CLUECorpusSmall Dataset Processing Tutorial
**Dataset overview**: Usable for language modeling, pre-training, generation tasks, and more. Over 14 GB of data in nearly 4,000 well-formed txt files totaling 5 billion characters, drawn mainly from the nlp_chinese_corpus project.
It contains the following sub-corpora (14 GB in total): the news corpus [news2016zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/6bac09db4e6d4857b6d680d34447457490cb2dbdd8b8462ea1780a407f38e12b?responseContentDisposition=attachment%3B%20filename%3Dnews2016zh_corpus.zip), the community-interaction corpus [webText2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/83da03f7b4974871a52348b41c16c7e3b34a26d5ca644f558df8435be4de51c3?responseContentDisposition=attachment%3B%20filename%3DwebText2019zh_corpus.zip), the Wikipedia corpus [wiki2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/d7a166408d8b4ffdaf4de9cfca09f6ee1e2340260f26440a92f78134d068b28f?responseContentDisposition=attachment%3B%20filename%3Dwiki2019zh_corpus.zip), and the comments corpus [comment2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/b66ddd445735408383c42322850ac4bb82faf9cc611447c2affb925443de7a6d?responseContentDisposition=attachment%3B%20filename%3Dcomment2019zh_corpus.zip).

**Dataset download**
You can download the dataset from the official GitHub page, https://github.com/CLUEbenchmark/CLUECorpus2020. For convenience, we also provide AI Studio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). After downloading the AI Studio version, verify the md5 checksums:
```shell
> md5sum ./*
8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip
4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip
fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip
157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip
```
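If `md5sum` is not available (e.g. on Windows), a small Python equivalent can be used — a minimal sketch that hashes each archive in chunks:

```python
import hashlib
from pathlib import Path

# Compute md5 checksums of the downloaded archives in chunks,
# so large files are not read into memory at once.
for path in sorted(Path(".").glob("*.zip")):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    print(md5.hexdigest(), path)
```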
Unzip the files:
```shell
unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus
unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus
unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus
unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus
```
Convert the txt files to jsonl format:
```
python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
```
We now have the dataset in jsonl format. Next, we turn it into training data; here we take ernie as the example.
```
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    --tokenizer_name ErnieTokenizer \
    --input_path clue_corpus_small_14g.jsonl \
    --split_sentences \
    --chinese \
    --cn_whole_word_segment \
    --cn_seg_func jieba \
    --output_prefix clue_corpus_small_14g_20220104 \
    --workers 48 \
    --log_interval 10000
```
The corpus holds roughly `15702702` documents. Word segmentation is the slow part, so the conversion takes about an hour and produces the training data in the current directory:
```
clue_corpus_small_14g_20220104_ids.npy
clue_corpus_small_14g_20220104_idx.npz
```
You can use this data for the pre-training task.
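Before launching training, the produced files can be sanity-checked with numpy. A minimal sketch; the arrays stored inside the `.npz` index are an internal detail of data_tools, so inspect `idx.files` rather than assuming names:

```python
import numpy as np

# Memory-map the token ids so the 14G corpus is not loaded into RAM.
ids = np.load("clue_corpus_small_14g_20220104_ids.npy", mmap_mode="r")
idx = np.load("clue_corpus_small_14g_20220104_idx.npz")

print(ids.dtype, ids.shape)  # all token ids, concatenated
print(idx.files)             # names of the index arrays stored alongside
```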

examples/language_model/data_tools/create_pretraining_data.py

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ def get_args():
     group.add_argument(
         '--cn_seg_func',
         type=str,
-        default='lac',
+        default='jieba',
         choices=['lac', 'seg', 'jieba'],
         help='Words segment function for chinese words.')
     group.add_argument(

examples/language_model/ernie-1.0/README.md

Lines changed: 35 additions & 3 deletions
@@ -40,12 +40,12 @@ python -u -m paddle.distributed.launch \
     --use_recompute false \
     --max_lr 0.0001 \
     --min_lr 0.00001 \
-    --max_steps 4000000 \
+    --max_steps 1000000 \
     --save_steps 50000 \
     --checkpoint_steps 5000 \
-    --decay_steps 3960000 \
+    --decay_steps 990000 \
     --weight_decay 0.01 \
-    --warmup_rate 0.0025 \
+    --warmup_rate 0.01 \
     --grad_clip 1.0 \
     --logging_freq 20\
     --num_workers 2 \
@@ -82,6 +82,32 @@
- In general, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`. You can enlarge `global_batch_size` via gradient accumulation; when `global_batch_size` is set to an integer multiple of this theoretical value, gradient accumulation is enabled by default (see the sketch below).
- To resume training after an interruption, simply relaunch: the program finds the latest checkpoint and resumes from it.
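To make the arithmetic concrete, a small sketch — illustrative Python only, not the trainer's actual flag handling:

```python
# Illustrative arithmetic; these names mirror the tip above,
# not the training script's internal variables.
micro_batch_size, sharding_degree, dp_degree = 8, 4, 2
theoretical = micro_batch_size * sharding_degree * dp_degree  # 64
global_batch_size = 512      # an integer multiple of the theoretical value
assert global_batch_size % theoretical == 0
accumulate_steps = global_batch_size // theoretical           # 8
print(theoretical, accumulate_steps)
```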

### CLUECorpusSmall dataset training results

For data preparation, see the appendix of [data_tools](../data_tools/); following that document yields the clue_corpus_small_14g training dataset.
We trained with this script using batch_size=512 and max_steps=1000000; the detailed training logs are at: https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/scalar?id=b0e19e554d68b9165a55901f0eb92812

Final training loss:

| Loss | Train | Validation |
|-|-|-|
| loss | 2.72 | 2.60 |
| lm_loss | 2.60 | 2.50 |
| sop_loss | 0.12 | 0.10 |

Training-set lm_loss is around 2.60; validation-set lm_loss is around 2.50.

Using the trained parameters, we fine-tuned on downstream tasks (the static-graph parameters must first be converted to dynamic graph; see the model parameter conversion section below). Fine-tuning results on several datasets:

| Dataset | Dev | Test |
|--|--|--|
| XNLI-CN | 0.79269 | 0.78339 |
| ChnSentiCorp | 0.94495 | 0.95496 |
| PeoplesDailyNer | 0.95128 | 0.94035 |
| CMRC2018 | 72.05/85.67 | - |

### Other
#### Model parameter conversion
This example trains with a static graph, but dynamic graph is now the primary way to use Paddle, so a script is provided to convert static-graph parameters to dynamic-graph parameters:
@@ -93,6 +119,12 @@ python converter/params_static_to_dygraph.py --model ernie-1.0 --path ./output/t
```
The converted parameters `ernie-1.0_converted.pdparams` appear in the current directory; you can also set the script's `--output_path` argument to choose an output path.
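A minimal sketch (not part of the example scripts) of consuming the converted file, assuming the converted parameter names line up with `ErnieForPretraining`, the class the converter script instantiates for ernie-1.0:

```python
import paddle
from paddlenlp.transformers import ErnieForPretraining

# Build the dynamic-graph model, then overwrite its parameters with the
# converted static-graph weights.
model = ErnieForPretraining.from_pretrained("ernie-1.0")
state_dict = paddle.load("ernie-1.0_converted.pdparams")
model.set_state_dict(state_dict)
model.eval()
```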

#### Contributing pre-trained parameters to PaddleNLP
PaddleNLP offers developers a [community](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community) module where users can upload models they have trained and open-source them for others to use.
Training on the CLUECorpusSmall dataset with the configuration given in this document produces the [zhui/ernie-1.0-cluecorpussmall](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community/zhui/ernie-1.0-cluecorpussmall) parameters; follow the link to use them.

For how to contribute a pre-trained model, see the [contribute pre-trained model weights](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst) tutorial.


### References
- [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf)

examples/language_model/ernie-1.0/converter/params_static_to_dygraph.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 import argparse
 import paddle
-from paddlenlp.transformers import AutoModel
+from paddlenlp.transformers import AutoModelForPretraining
 from paddlenlp.utils.log import logger

 paddle.set_device("cpu")
@@ -25,7 +25,7 @@ def init_dygraph_with_static(model, static_params_path):

 def main(args):
     logger.info("Loading model: %s" % args.model)
-    model = AutoModel.from_pretrained(args.model)
+    model = AutoModelForPretraining.from_pretrained(args.model)
     logger.info("Loading static params and trans paramters...")
     model_dict = init_dygraph_with_static(model, args.path)
     save_name = args.output_path

paddlenlp/transformers/ernie/modeling.py

Lines changed: 92 additions & 3 deletions
@@ -19,9 +19,14 @@
 from .. import PretrainedModel, register_base_model

 __all__ = [
-    'ErnieModel', 'ErniePretrainedModel', 'ErnieForSequenceClassification',
-    'ErnieForTokenClassification', 'ErnieForQuestionAnswering',
-    'ErnieForPretraining', 'ErniePretrainingCriterion'
+    'ErnieModel',
+    'ErniePretrainedModel',
+    'ErnieForSequenceClassification',
+    'ErnieForTokenClassification',
+    'ErnieForQuestionAnswering',
+    'ErnieForPretraining',
+    'ErniePretrainingCriterion',
+    'ErnieForMaskedLM',
 ]

@@ -770,3 +775,87 @@ def forward(self, prediction_scores, seq_relationship_score,
        next_sentence_loss = F.cross_entropy(
            seq_relationship_score, next_sentence_labels, reduction='none')
        return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss)


class ErnieOnlyMLMHead(nn.Layer):
    def __init__(self, hidden_size, vocab_size, activation, embedding_weights):
        super().__init__()
        self.predictions = ErnieLMPredictionHead(
            hidden_size=hidden_size,
            vocab_size=vocab_size,
            activation=activation,
            embedding_weights=embedding_weights)

    def forward(self, sequence_output, masked_positions=None):
        prediction_scores = self.predictions(sequence_output, masked_positions)
        return prediction_scores


class ErnieForMaskedLM(ErniePretrainedModel):
    """
    Ernie Model with a `masked language modeling` head on top.

    Args:
        ernie (:class:`ErnieModel`):
            An instance of :class:`ErnieModel`.

    """

    def __init__(self, ernie):
        super(ErnieForMaskedLM, self).__init__()
        self.ernie = ernie
        self.cls = ErnieOnlyMLMHead(
            self.ernie.config["hidden_size"],
            self.ernie.config["vocab_size"],
            self.ernie.config["hidden_act"],
            embedding_weights=self.ernie.embeddings.word_embeddings.weight)

        self.apply(self.init_weights)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        r"""

        Args:
            input_ids (Tensor):
                See :class:`ErnieModel`.
            token_type_ids (Tensor, optional):
                See :class:`ErnieModel`.
            position_ids (Tensor, optional):
                See :class:`ErnieModel`.
            attention_mask (Tensor, optional):
                See :class:`ErnieModel`.

        Returns:
            Tensor: Returns tensor `prediction_scores`, the scores of masked
            token prediction. Its data type is float32 and its shape is
            [batch_size, sequence_length, vocab_size].

        Example:
            .. code-block::

                import paddle
                from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

                tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
                model = ErnieForMaskedLM.from_pretrained('ernie-1.0')

                inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
                inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

                logits = model(**inputs)
                print(logits.shape)
                # [1, 17, 18000]

        """

        outputs = self.ernie(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        sequence_output = outputs[0]
        # Score every position against the full vocabulary; no masked_positions
        # filtering, so scores cover the whole sequence.
        prediction_scores = self.cls(sequence_output, masked_positions=None)
        return prediction_scores
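Design note: passing `embedding_weights=self.ernie.embeddings.word_embeddings.weight` ties the MLM decoder to the input word embeddings (the usual BERT-style weight tying), so the output projection reuses the embedding matrix. A short usage sketch beyond the docstring example, decoding the returned scores (the input sentence is arbitrary):

```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
model = ErnieForMaskedLM.from_pretrained('ernie-1.0')

inputs = tokenizer("自然语言处理")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

# prediction_scores: [batch_size, sequence_length, vocab_size]
prediction_scores = model(**inputs)
best_ids = paddle.argmax(prediction_scores[0], axis=-1).tolist()
print(tokenizer.convert_ids_to_tokens(best_ids))  # per-position argmax tokens
```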
