Commit aa63b68

Add clue benchmark classification (#1732)
* fix inference bug
* add clue benchmark classification
* add cmrc, chid, c3
* update readme
* add ernie-3.0-base
* update reamde data
* update readme
* update readme
* update readme result
* add predict script
* remove chid, c3, cmrc2018
* fix readme syntax error
* fix predict script
* fix readme
* remove loggng, update readme
* fix readme
* Fix readme
* Fix readme
* update readme
1 parent c6d7c7c commit aa63b68

6 files changed: 841 additions and 5 deletions


examples/benchmark/clue/README.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
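# CLUE Benchmark

[CLUE](https://www.cluebenchmarks.com/) has released a number of NLP evaluation benchmarks since its founding, including leaderboards for classification, reading comprehension, and natural language inference, and has had a far-reaching impact on both academia and industry. It is currently one of the most widely used evaluation benchmarks for Chinese. For details, see the [CLUE paper](https://arxiv.org/abs/2004.05986).

This project uses PaddlePaddle to evaluate leading open-source pre-trained models on the CLUE datasets, giving developers a reference for choosing a pre-trained model. Developers can also use this project to reproduce the reported results with a single command, or to prepare a submission for the CLUE competition.

## CLUE Evaluation Results

Fine-tuning Chinese pre-trained models gives the following results on the CLUE validation sets:

| Model                 | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL   |
| --------------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- |
| RoBERTa-wwm-ext-large | 76.20 | 59.50 | 62.10   | 84.02 | 79.15 | 90.79       | 82.03 |


All seven tasks (AFQMC, TNEWS, IFLYTEK, CMNLI, OCNLI, CLUEWSC2020, and CSL) are evaluated with accuracy.

**NOTE: the evaluation procedure is as follows**
1. Hyperparameters for every task above were tuned by grid search. The validation set was evaluated every 100 training steps, and the best validation result is the figure reported in the table.

2. Grid search ranges: batch_size: 16, 32, 64; learning rate: 1e-5, 2e-5, 3e-5, 5e-5.

3. Because results on CLUEWSC2020 are sensitive to batch_size, batch_size = 8 was added to the search for that task.
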
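The search space described in the notes above can be enumerated with a short script. This is a minimal sketch, not the project's actual tuning code: the helper name `grid_for_task` is hypothetical, and it only lists the (batch_size, learning_rate) pairs given in notes 2 and 3.

```python
from itertools import product

# Search ranges taken from the notes above.
BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 5e-5]


def grid_for_task(task_name):
    """Return every (batch_size, learning_rate) pair to try for a task.

    Hypothetical helper, for illustration only.
    """
    batch_sizes = list(BATCH_SIZES)
    # CLUEWSC2020 is sensitive to batch size, so batch_size = 8 is added.
    if task_name.upper() == "CLUEWSC2020":
        batch_sizes = [8] + batch_sizes
    return list(product(batch_sizes, LEARNING_RATES))


if __name__ == "__main__":
    for bs, lr in grid_for_task("TNEWS"):
        print(f"TNEWS batch_size={bs} learning_rate={lr}")
```

Under these ranges, most tasks have 12 combinations to run and CLUEWSC2020 has 16.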
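## Reproducing the Results with One Command

This section uses the TNEWS task as an example to show how to reproduce the evaluation results above with a single command.

### Launching a CLUE Task
To fine-tune on a CLUE task, using TNEWS as the example, run the following:

#### Single-GPU Training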
```shell
export CUDA_VISIBLE_DEVICES=0
export TASK_NAME=TNEWS
export LR=3e-5
export BS=16
export EPOCH=6
export MAX_SEQ_LEN=128
export MODEL_PATH=ernie-3.0-base

cd classification
python -u ./run_clue_classifier.py \
    --model_type ernie \
    --model_name_or_path ${MODEL_PATH} \
    --task_name ${TASK_NAME} \
    --max_seq_length ${MAX_SEQ_LEN} \
    --batch_size ${BS} \
    --learning_rate ${LR} \
    --num_train_epochs ${EPOCH} \
    --logging_steps 100 \
    --seed 42 \
    --save_steps 100 \
    --warmup_proportion 0.1 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-8 \
    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
    --device gpu
```

To run evaluation as well, pass `--do_eval True`. To evaluate a loaded checkpoint without training, set `--do_train` to False.

#### Multi-GPU Training

```shell

unset CUDA_VISIBLE_DEVICES
export TASK_NAME=TNEWS
export LR=3e-5
export BS=32
export EPOCH=6
export MAX_SEQ_LEN=128
export MODEL_PATH=ernie-3.0-base

cd classification
python -m paddle.distributed.launch --gpus "0,1" run_clue_classifier.py \
    --model_type ernie \
    --model_name_or_path ${MODEL_PATH} \
    --task_name ${TASK_NAME} \
    --max_seq_length ${MAX_SEQ_LEN} \
    --batch_size ${BS} \
    --learning_rate ${LR} \
    --num_train_epochs ${EPOCH} \
    --logging_steps 100 \
    --seed 42 \
    --save_steps 100 \
    --warmup_proportion 0.1 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-8 \
    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
    --device gpu
```
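The parameters are explained below:
- `model_type`: the type of pre-trained model used for fine-tuning, such as ernie or bert. Different model types may use different fine-tuning layers and tokenizers.
- `model_name_or_path`: the specific pre-trained model used for fine-tuning. It can be one of the pre-trained models provided by PaddleNLP; choose the Chinese pre-trained weights listed for the chosen `model_type` in the [Transformer pre-trained model summary](../../../docs/model_zoo/transformers.rst). The weights must match the configured model type: for example, if `model_type` is ernie, `model_name_or_path` must be an ernie model. CLUE tasks should also use Chinese pre-trained weights.
- `task_name`: the fine-tuning task. Currently supported: AFQMC, TNEWS, IFLYTEK, OCNLI, CMNLI, CSL, CLUEWSC2020.
- `max_seq_length`: the maximum sequence length; longer inputs are truncated.
- `batch_size`: the number of samples **per GPU** in each iteration.
- `learning_rate`: the base learning rate, which is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
- `num_train_epochs`: the number of training epochs.
- `logging_steps`: the interval, in steps, between log messages.
- `save_steps`: the interval, in steps, between model saves and evaluations.
- `output_dir`: the directory where models are saved.
- `device`: the device used for training: 'gpu' for GPU, 'xpu' for Baidu Kunlun cards, 'cpu' for CPU.

During fine-tuning, logs like the following are printed according to the `logging_steps` and `save_steps` settings:
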
```
global step 100/20010, epoch: 0, batch: 99, rank_id: 0, loss: 2.734340, lr: 0.0000014993, speed: 8.7969 step/s
eval loss: 2.720359, acc: 0.0827, eval done total : 25.712125062942505 s
global step 200/20010, epoch: 0, batch: 199, rank_id: 0, loss: 2.608563, lr: 0.0000029985, speed: 2.5921 step/s
eval loss: 2.652753, acc: 0.0945, eval done total : 25.64827537536621 s
global step 300/20010, epoch: 0, batch: 299, rank_id: 0, loss: 2.555283, lr: 0.0000044978, speed: 2.6032 step/s
eval loss: 2.572999, acc: 0.112, eval done total : 25.67190170288086 s
global step 400/20010, epoch: 0, batch: 399, rank_id: 0, loss: 2.631579, lr: 0.0000059970, speed: 2.6238 step/s
eval loss: 2.476962, acc: 0.1697, eval done total : 25.794789791107178 s
```
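
The `lr` values in this log are consistent with a linear warmup schedule. With `--learning_rate 3e-5`, `--warmup_proportion 0.1`, and 20010 total steps, warmup lasts 2001 steps, so the rate at step 100 is 3e-5 × 100 / 2001 ≈ 1.4993e-6. The sketch below reproduces that arithmetic. It assumes linear warmup followed by linear decay to zero; the training script is not shown here, so the decay phase in particular is an assumption, and `lr_at_step` is a hypothetical helper rather than a PaddleNLP function.

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_proportion=0.1):
    """Learning rate at a given global step.

    Assumed schedule: linear warmup from 0 to base_lr, then linear decay to 0.
    """
    warmup_steps = int(warmup_proportion * total_steps)
    if step < warmup_steps:
        # Warmup: scale the base rate by the fraction of warmup completed.
        return base_lr * step / warmup_steps
    # Decay (assumed): scale by the fraction of post-warmup steps remaining.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


if __name__ == "__main__":
    for step in (100, 200, 300, 400):
        print(f"global step {step}/20010, lr: {lr_at_step(step, 20010):.10f}")
```

For steps 100 through 400, the values printed by this sketch match the `lr` fields in the log above.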
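
## Participating in the CLUE Competition

For CLUE classification tasks, the script `classification/predict_clue_classifier.py` provided in this project can be used directly to run prediction for a single task and write the classification results to a file.

Taking TNEWS as an example, and assuming the TNEWS model is stored at `${TNEWS_MODEL}`, run the following script to obtain the model's predictions on the test set and write them to `${OUTPUT_DIR}/tnews_predict.json`:
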
```shell
cd classification
OUTPUT_DIR=results
mkdir -p ${OUTPUT_DIR}

python predict_clue_classifier.py \
    --model_type ernie \
    --task_name TNEWS \
    --model_name_or_path ${TNEWS_MODEL} \
    --output_dir ${OUTPUT_DIR}
```

Run the prediction script for each task, then collect the result files into a compressed archive and submit it to the CLUE website for evaluation.
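
Packaging the result files can be sketched as follows. This is a minimal illustration using Python's standard `zipfile` module. The archive name `submit.zip`, the helper `pack_predictions`, and the choice to include every `*_predict.json` file in the output directory are all assumptions, so check the CLUE website for the exact submission layout it expects.

```python
import zipfile
from pathlib import Path


def pack_predictions(output_dir, archive_name="submit.zip"):
    """Zip every *_predict.json file in output_dir (hypothetical helper).

    Returns the archive path and the list of file names that were added.
    """
    output_dir = Path(output_dir)
    files = sorted(output_dir.glob("*_predict.json"))
    archive_path = output_dir / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Store each file at the archive root, without the directory prefix.
            zf.write(path, arcname=path.name)
    return archive_path, [p.name for p in files]
```

Calling `pack_predictions("results")` after running the prediction script for each task would write `results/submit.zip`.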
Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
# import logging
import os
import sys
import random
import time
import math
import json
from functools import partial

import numpy as np
import paddle
from paddle.io import DataLoader
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.metric import Metric, Accuracy, Precision, Recall

from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer

METRIC_CLASSES = {
    "afqmc": Accuracy,
    "tnews": Accuracy,
    "iflytek": Accuracy,
    "ocnli": Accuracy,
    "cmnli": Accuracy,
    "cluewsc2020": Accuracy,
    "csl": Accuracy,
}

MODEL_CLASSES = {
    "bert": (BertForSequenceClassification, BertTokenizer),
    "ernie": (ErnieForSequenceClassification, ErnieTokenizer),
    "roberta": (RobertaForSequenceClassification, RobertaTokenizer),
}


def parse_args():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument(
        "--task_name",
        default=None,
        type=str,
        required=True,
        help="The name of the task to train selected in the list: " +
        ", ".join(METRIC_CLASSES.keys()), )
    parser.add_argument(
        "--model_type",
        default="ernie",
        type=str,
        help="Model type selected in the list: " +
        ", ".join(MODEL_CLASSES.keys()), )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
        help="Path to pre-trained model or shortcut name selected in the list: "
        + ", ".join(
            sum([
                list(classes[-1].pretrained_init_configuration.keys())
                for classes in MODEL_CLASSES.values()
            ], [])), )
    parser.add_argument(
        "--output_dir",
        default="tmp",
        type=str,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.", )

    parser.add_argument(
        "--batch_size",
        default=128,
        type=int,
        help="Batch size per GPU/CPU for prediction.", )

    parser.add_argument(
        "--device",
        default="gpu",
        type=str,
        help="The device used to run the model; it must be cpu/gpu/xpu.")
    args = parser.parse_args()
    return args


def convert_example(example,
                    tokenizer,
                    label_list,
                    max_seq_length=512,
                    is_test=False):
    """Convert a CLUE example into the features the model needs."""
    if not is_test:
        # `label_list == None` is for regression task
        label_dtype = "int64" if label_list else "float32"
        # Get the label
        label = example['label']
        label = np.array([label], dtype=label_dtype)
    # Convert raw text to feature
    if 'sentence' in example:
        example = tokenizer(example['sentence'], max_seq_len=max_seq_length)
    elif 'sentence1' in example:
        example = tokenizer(
            example['sentence1'],
            text_pair=example['sentence2'],
            max_seq_len=max_seq_length)
    elif 'keyword' in example:  # CSL
        sentence1 = " ".join(example['keyword'])
        example = tokenizer(
            sentence1, text_pair=example['abst'], max_seq_len=max_seq_length)
    elif 'target' in example:  # wsc
        text, query, pronoun, query_idx, pronoun_idx = example['text'], example[
            'target']['span1_text'], example['target']['span2_text'], example[
                'target']['span1_index'], example['target']['span2_index']
        text_list = list(text)
        assert text[pronoun_idx:(pronoun_idx + len(pronoun)
                                 )] == pronoun, "pronoun: {}".format(pronoun)
        assert text[query_idx:(query_idx + len(query)
                               )] == query, "query: {}".format(query)
        # Wrap the query span in "_" and the pronoun span in "[" and "]".
        # Each earlier insertion shifts the later indices, hence the offsets.
        if pronoun_idx > query_idx:
            text_list.insert(query_idx, "_")
            text_list.insert(query_idx + len(query) + 1, "_")
            text_list.insert(pronoun_idx + 2, "[")
            text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
        else:
            text_list.insert(pronoun_idx, "[")
            text_list.insert(pronoun_idx + len(pronoun) + 1, "]")
            text_list.insert(query_idx + 2, "_")
            text_list.insert(query_idx + len(query) + 2 + 1, "_")
        text = "".join(text_list)
        example = tokenizer(text, max_seq_len=max_seq_length)

    if not is_test:
        return example['input_ids'], example['token_type_ids'], label
    else:
        return example['input_ids'], example['token_type_ids']


def do_test(args):
    paddle.set_device(args.device)

    args.task_name = args.task_name.lower()
    # Looked up only to validate the task name; prediction computes no metric.
    metric_class = METRIC_CLASSES[args.task_name]
    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
    train_ds, test_ds = load_dataset(
        'clue', args.task_name, splits=('train', 'test'))
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)

    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        label_list=train_ds.label_list,
        max_seq_length=args.max_seq_length,
        is_test=True)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    ): fn(samples)

    test_ds = test_ds.map(trans_func, lazy=True)
    test_batch_sampler = paddle.io.BatchSampler(
        test_ds, batch_size=args.batch_size, shuffle=False)
    test_data_loader = DataLoader(
        dataset=test_ds,
        batch_sampler=test_batch_sampler,
        collate_fn=batchify_fn,
        num_workers=0,
        return_list=True)

    num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list)
    model = model_class.from_pretrained(
        args.model_name_or_path, num_classes=num_classes)
    # Switch to evaluation mode so that dropout is disabled during prediction.
    model.eval()

    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)
    if args.task_name == 'ocnli':
        args.task_name = 'ocnli_50k'
    output_path = os.path.join(args.output_dir,
                               args.task_name + "_predict.json")

    # Running id over the whole test set. Using the index within each batch
    # would restart the ids at 0 for every batch.
    example_id = 0
    # The context manager closes the output file once prediction finishes.
    with open(output_path, 'w') as f:
        for step, batch in enumerate(test_data_loader):
            input_ids, segment_ids = batch

            with paddle.no_grad():
                logits = model(input_ids, segment_ids)

            # Convert to plain Python ints before indexing the label list.
            preds = paddle.argmax(logits, axis=1).numpy().tolist()
            for pred in preds:
                j = json.dumps({
                    "id": example_id,
                    "label": train_ds.label_list[pred]
                })
                f.write(j + "\n")
                example_id += 1


if __name__ == "__main__":
    args = parse_args()
    do_test(args)
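
The CLUEWSC2020 branch of `convert_example` marks the two spans in the raw text before tokenization: the query span is wrapped in `_` and the pronoun span in `[` and `]`. The standalone sketch below repeats only that insertion logic so it can be run without PaddleNLP. The function name `mark_wsc_spans` and the sample sentences are made up for illustration.

```python
def mark_wsc_spans(text, query, query_idx, pronoun, pronoun_idx):
    """Wrap the query in "_" and the pronoun in "[" "]" (illustrative helper)."""
    chars = list(text)
    if pronoun_idx > query_idx:
        # Query comes first: its two markers shift the pronoun right by 2.
        chars.insert(query_idx, "_")
        chars.insert(query_idx + len(query) + 1, "_")
        chars.insert(pronoun_idx + 2, "[")
        chars.insert(pronoun_idx + len(pronoun) + 3, "]")
    else:
        # Pronoun comes first: its two markers shift the query right by 2.
        chars.insert(pronoun_idx, "[")
        chars.insert(pronoun_idx + len(pronoun) + 1, "]")
        chars.insert(query_idx + 2, "_")
        chars.insert(query_idx + len(query) + 3, "_")
    return "".join(chars)


if __name__ == "__main__":
    # Made-up sentence: query "小明" at index 0, pronoun "他" at index 3.
    print(mark_wsc_spans("小明说他饿了", "小明", 0, "他", 3))
    # Made-up sentence: pronoun "他" at index 0, query "小明" at index 2.
    print(mark_wsc_spans("他说小明饿了", "小明", 2, "他", 0))
```

The offsets of 2 and 3 account for the characters already inserted around the earlier span.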
