Commit 3021098

ZHUI and tianxin authored
[Trainer] Add clue benchmark script with trainer. (#1909)
* add clue for trainer. Co-authored-by: tianxin <[email protected]>
1 parent 7eb0c7a commit 3021098

6 files changed: +374 -29 lines changed

examples/benchmark/clue/README.md

Lines changed: 44 additions & 0 deletions
@@ -100,6 +100,50 @@ eval loss: 2.572999, acc: 0.112, eval done total : 25.67190170288086 s
 global step 400/20010, epoch: 0, batch: 399, rank_id: 0, loss: 2.631579, lr: 0.0000059970, speed: 2.6238 step/s
 eval loss: 2.476962, acc: 0.1697, eval done total : 25.794789791107178 s
 ```
+#### Launching a CLUE classification task with Trainer
+PaddleNLP provides a Trainer API, and this example adds the `run_clue_classifier_trainer.py` script that uses it. The script requires paddlenlp installed from source.
+```
+export CUDA_VISIBLE_DEVICES=0
+export TASK_NAME=TNEWS
+export LR=3e-5
+export BS=32
+export EPOCH=6
+export MAX_SEQ_LEN=128
+export MODEL_PATH=roberta-wwm-ext-large
+
+cd classification
+mkdir roberta-wwm-ext-large
+
+python -u ./run_clue_classifier_trainer.py \
+    --model_name_or_path ${MODEL_PATH} \
+    --dataset "clue ${TASK_NAME}" \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --per_device_train_batch_size ${BS} \
+    --per_device_eval_batch_size ${BS} \
+    --learning_rate ${LR} \
+    --num_train_epochs ${EPOCH} \
+    --logging_steps 100 \
+    --seed 42 \
+    --save_steps 100 \
+    --warmup_ratio 0.1 \
+    --weight_decay 0.01 \
+    --adam_epsilon 1e-8 \
+    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --device gpu \
+    --do_train \
+    --do_eval \
+    --metric_for_best_model "eval_accuracy" \
+    --load_best_model_at_end \
+    --save_total_limit 3
+```
+Most parameters have the same meaning as described above. The new ones are:
+- `dataset`: Same as `task_name` above, but written in lowercase here. It names the classification task to fine-tune on; currently supported are afqmc, tnews, iflytek, ocnli, cmnli, csl and cluewsc2020.
+- `per_device_train_batch_size`: Same as `batch_size` above. The number of samples **per device** in each training iteration.
+- `per_device_eval_batch_size`: Same as `batch_size` above. The number of samples **per device** in each evaluation iteration.
+- `warmup_ratio`: Same as `warmup_proportion` above. The fraction of total training steps used for warmup.
+- `metric_for_best_model`: The evaluation metric used to select the best model.
+- `load_best_model_at_end`: Whether to load the checkpoint with the best evaluation result when training ends.
+- `save_total_limit`: The maximum number of checkpoints to keep.
 
 ### Launching the CLUE reading comprehension task
 Taking the CLUE C<sup>3</sup> task as an example, a multi-GPU fine-tuning run is launched as follows:
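
The Trainer flags in the README hunk above can be related to a concrete run with a little arithmetic. The following is a minimal sketch, not PaddleNLP code: `parse_dataset_arg` mirrors how the new script splits the `--dataset` string, while `warmup_steps`, its round-up rule, and the example size of 1000 training samples are assumptions made purely for illustration.

```python
import math


def parse_dataset_arg(dataset):
    """Split a --dataset value such as "clue tnews" into (dataset, config name)."""
    parts = dataset.strip().split(" ")
    name = None if len(parts) <= 1 else parts[1]
    return parts[0], name


def warmup_steps(num_examples, per_device_batch_size, num_devices, epochs,
                 warmup_ratio):
    """Estimate warmup steps. Rounding up is an assumption, not a documented rule."""
    # The per-device batch size is multiplied by the device count.
    global_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / global_batch)
    total_steps = steps_per_epoch * epochs
    return math.ceil(total_steps * warmup_ratio)


print(parse_dataset_arg("clue tnews"))     # → ('clue', 'tnews')
print(warmup_steps(1000, 32, 1, 6, 0.1))   # → 20
```

With two devices the global batch doubles, so under the same assumptions the run needs about half as many steps and half as many warmup steps.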
Lines changed: 325 additions & 0 deletions
@@ -0,0 +1,325 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+from functools import partial
+from typing import Optional
+from dataclasses import dataclass, field
+
+import numpy as np
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from paddle.metric import Accuracy
+from paddlenlp.data import DataCollatorWithPadding
+from paddlenlp.datasets import load_dataset
+from paddlenlp.trainer import (
+    PdArgumentParser,
+    TrainingArguments,
+    Trainer, )
+from paddlenlp.trainer.trainer_utils import get_last_checkpoint
+from paddlenlp.transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification, )
+from paddlenlp.utils.log import logger
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    Using `PdArgumentParser` we can turn this class into argparse arguments to be able to
+    specify them on the command line.
+    """
+
+    dataset: str = field(
+        default=None,
+        metadata={
+            "help": "The name of the dataset to use (via the datasets library)."
+        })
+
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help":
+            "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        }, )
+    do_lower_case: bool = field(
+        default=False,
+        metadata={
+            "help":
+            "Whether to lower case the input text. Should be True for uncased models and False for cased models."
+        }, )
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(metadata={
+        "help":
+        "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
+    })
+    config_name: Optional[str] = field(
+        default=None,
+        metadata={
+            "help":
+            "Pretrained config name or path if not the same as model_name"
+        })
+    tokenizer_name: Optional[str] = field(
+        default=None,
+        metadata={
+            "help":
+            "Pretrained tokenizer name or path if not the same as model_name"
+        })
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={
+            "help":
+            "Path to directory to store the pretrained models downloaded from huggingface.co"
+        }, )
+    export_model_dir: Optional[str] = field(
+        default=None,
+        metadata={
+            "help":
+            "Path to directory to store the exported inference model."
+        }, )
+
+
+# Data pre-process function for clue benchmark dataset
+def convert_clue(example,
+                 label_list,
+                 tokenizer=None,
+                 max_seq_length=512,
+                 **kwargs):
+    """convert a clue example into necessary features"""
+    is_test = False
+    if 'label' not in example.keys():
+        is_test = True
+
+    if not is_test:
+        # `label_list == None` is for regression task
+        label_dtype = "int64" if label_list else "float32"
+        # print("label_list", label_list)
+        # Get the label
+        # example['label'] = np.array(example["label"], dtype="int64")
+        example['label'] = int(example[
+            "label"]) if label_dtype != "float32" else float(example["label"])
+        label = example['label']
+    # Convert raw text to feature
+    if 'keyword' in example:  # CSL
+        sentence1 = " ".join(example['keyword'])
+        example = {
+            'sentence1': sentence1,
+            'sentence2': example['abst'],
+            'label': example['label']
+        }
+    elif 'target' in example:  # wsc
+        text, query, pronoun, query_idx, pronoun_idx = example['text'], example[
+            'target']['span1_text'], example['target']['span2_text'], example[
+                'target']['span1_index'], example['target']['span2_index']
+        text_list = list(text)
+        assert text[pronoun_idx:(pronoun_idx + len(pronoun)
+                                 )] == pronoun, "pronoun: {}".format(pronoun)
+        assert text[query_idx:(query_idx + len(query)
+                               )] == query, "query: {}".format(query)
+        if pronoun_idx > query_idx:
+            text_list.insert(query_idx, "_")
+            text_list.insert(query_idx + len(query) + 1, "_")
+            text_list.insert(pronoun_idx + 2, "[")
+            text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
+        else:
+            text_list.insert(pronoun_idx, "[")
+            text_list.insert(pronoun_idx + len(pronoun) + 1, "]")
+            text_list.insert(query_idx + 2, "_")
+            text_list.insert(query_idx + len(query) + 2 + 1, "_")
+        text = "".join(text_list)
+        example['sentence'] = text
+
+    if tokenizer is None:
+        return example
+    if 'sentence' in example:
+        example = tokenizer(example['sentence'], max_seq_len=max_seq_length)
+    elif 'sentence1' in example:
+        example = tokenizer(
+            example['sentence1'],
+            text_pair=example['sentence2'],
+            max_seq_len=max_seq_length)
+
+    if not is_test:
+        return {
+            "input_ids": example['input_ids'],
+            "token_type_ids": example['token_type_ids'],
+            "labels": label
+        }
+    else:
+        return {
+            "input_ids": example['input_ids'],
+            "token_type_ids": example['token_type_ids']
+        }
+
+
+def clue_trans_fn(example, tokenizer, args):
+    return convert_clue(
+        example,
+        tokenizer=tokenizer,
+        label_list=args.label_list,
+        max_seq_length=args.max_seq_length)
+
+
+def main():
+    parser = PdArgumentParser(
+        (ModelArguments, DataTrainingArguments, TrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    # Log model and data config
+    training_args.print_config(model_args, "Model")
+    training_args.print_config(data_args, "Data")
+
+    paddle.set_device(training_args.device)
+
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, "
+        +
+        f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(
+            training_args.output_dir
+    ) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(
+                os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome.")
+        elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )
+
+    data_args.dataset = data_args.dataset.strip()
+
+    dataset_config = data_args.dataset.split(" ")
+    print(dataset_config)
+    raw_datasets = load_dataset(
+        dataset_config[0],
+        name=None if len(dataset_config) <= 1 else dataset_config[1],
+        splits=('train', 'dev'))
+
+    data_args.label_list = getattr(raw_datasets['train'], "label_list", None)
+    num_classes = 1 if raw_datasets["train"].label_list is None else len(
+        raw_datasets['train'].label_list)
+
+    # Define tokenizer, model, loss function.
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
+    model = AutoModelForSequenceClassification.from_pretrained(
+        model_args.model_name_or_path, num_classes=num_classes)
+    criterion = nn.loss.CrossEntropyLoss(
+    ) if data_args.label_list else nn.loss.MSELoss()
+
+    # Define dataset pre-process function
+    trans_fn = partial(clue_trans_fn, tokenizer=tokenizer, args=data_args)
+
+    # Define data collator
+    data_collator = DataCollatorWithPadding(tokenizer)
+
+    # Dataset pre-process
+    if training_args.do_train:
+        train_dataset = raw_datasets["train"].map(trans_fn)
+    if training_args.do_eval:
+        eval_dataset = raw_datasets["dev"].map(trans_fn)
+    if training_args.do_predict:
+        test_dataset = raw_datasets["test"].map(trans_fn)
+
+    # Define the metrics of tasks.
+    def compute_metrics(p):
+        preds = p.predictions[0] if isinstance(p.predictions,
+                                               tuple) else p.predictions
+
+        preds = paddle.to_tensor(preds)
+        label = paddle.to_tensor(p.label_ids)
+
+        probs = F.softmax(preds, axis=1)
+        metric = Accuracy()
+        metric.reset()
+        result = metric.compute(preds, label)
+        metric.update(result)
+        accu = metric.accumulate()
+        metric.reset()
+        return {"accuracy": accu}
+
+    trainer = Trainer(
+        model=model,
+        criterion=criterion,
+        args=training_args,
+        data_collator=data_collator,
+        train_dataset=train_dataset if training_args.do_train else None,
+        eval_dataset=eval_dataset if training_args.do_eval else None,
+        tokenizer=tokenizer,
+        compute_metrics=compute_metrics, )
+
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    elif last_checkpoint is not None:
+        checkpoint = last_checkpoint
+
+    # Training
+    if training_args.do_train:
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
+        metrics = train_result.metrics
+        trainer.save_model()  # Saves the tokenizer too for easy upload
+        trainer.log_metrics("train", metrics)
+        trainer.save_metrics("train", metrics)
+        trainer.save_state()
+
+    # Evaluate and test the model
+    if training_args.do_eval:
+        eval_metrics = trainer.evaluate()
+        trainer.log_metrics("eval", eval_metrics)
+
+    if training_args.do_predict:
+        test_ret = trainer.predict(test_dataset)
+        trainer.log_metrics("test", test_ret.metrics)
+        if test_ret.label_ids is None:
+            paddle.save(
+                test_ret.predictions,
+                os.path.join(training_args.output_dir, "test_results.pdtensor"),
+            )
+
+    # export inference model
+    if training_args.do_export:
+        input_spec = [
+            paddle.static.InputSpec(
+                shape=[None, None], dtype="int64"),  # input_ids
+            paddle.static.InputSpec(
+                shape=[None, None], dtype="int64")  # segment_ids
+        ]
+        trainer.export_model(
+            input_spec=input_spec,
+            load_best_model=True,
+            output_dir=model_args.export_model_dir)
+
+
+if __name__ == "__main__":
+    main()
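
The WSC branch of `convert_clue` marks the query span with `_` and the pronoun span with `[` `]` before tokenization. Its index arithmetic is easy to misread, so here is a minimal standalone sketch of the same insertion logic. The helper name `mark_wsc_spans` and the English sample sentences are hypothetical and used only for illustration; the real data is Chinese and comes from the `target` field of each example.

```python
def mark_wsc_spans(text, query, query_idx, pronoun, pronoun_idx):
    """Wrap the query span in "_" and the pronoun span in "[" and "]"."""
    assert text[query_idx:query_idx + len(query)] == query
    assert text[pronoun_idx:pronoun_idx + len(pronoun)] == pronoun
    chars = list(text)
    if pronoun_idx > query_idx:
        chars.insert(query_idx, "_")
        chars.insert(query_idx + len(query) + 1, "_")
        # The two "_" markers shift everything after the query by 2.
        chars.insert(pronoun_idx + 2, "[")
        chars.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
    else:
        chars.insert(pronoun_idx, "[")
        chars.insert(pronoun_idx + len(pronoun) + 1, "]")
        # The "[" and "]" markers shift everything after the pronoun by 2.
        chars.insert(query_idx + 2, "_")
        chars.insert(query_idx + len(query) + 2 + 1, "_")
    return "".join(chars)


print(mark_wsc_spans("Tom said he was late", "Tom", 0, "he", 9))
# → _Tom_ said [he] was late
print(mark_wsc_spans("He said Tom was late", "Tom", 8, "He", 0))
# → [He] said _Tom_ was late
```

Whichever span comes first is marked first, so the second span's offsets only need a fixed shift of 2 characters.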

examples/language_model/ernie-1.0/finetune/question_answering.py

Lines changed: 0 additions & 6 deletions
@@ -16,14 +16,8 @@
 import time
 import json
 import os
-import sys
-from functools import partial
 
-import numpy as np
 import paddle
-import paddlenlp as ppnlp
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.utils.log import logger
 from paddlenlp.trainer import Trainer
 from paddlenlp.trainer.trainer_utils import PredictionOutput
 