PaddlePaddle
diff --git a/examples/benchmark/clue/README.md b/examples/benchmark/clue/README.md
Lines changed: 141 additions & 0 deletions
diff --git a/examples/benchmark/clue/classification/predict_clue_classifier.py b/examples/benchmark/clue/classification/predict_clue_classifier.py
Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,141 @@
+# CLUE Benchmark
+
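+[CLUE](https://www.cluebenchmarks.com/) has released a number of NLP evaluation benchmarks since it was founded, including leaderboards for classification, reading comprehension, and natural language inference. It has had a far-reaching influence in academia and industry and is currently one of the most widely used benchmarks for Chinese language understanding. See the [CLUE paper](https://arxiv.org/abs/2004.05986) for details.
+
+This project uses PaddlePaddle to evaluate leading open-source pre-trained models on the CLUE datasets, giving developers a reference for choosing a pre-trained model. Developers can also use it to reproduce the reported results with a single command and to prepare submissions for the CLUE competition.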
+
+## CLUE Evaluation Results
+
+Fine-tuning Chinese pre-trained models gives the following results on the CLUE validation sets:
+
+| Model                 | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL   |
+| --------------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- |
+| RoBERTa-wwm-ext-large | 76.20 | 59.50 | 62.10   | 84.02 | 79.15 | 90.79       | 82.03 |
+
+
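+All seven tasks (AFQMC, TNEWS, IFLYTEK, CMNLI, OCNLI, CLUEWSC2020, and CSL) use Accuracy as the evaluation metric.
+
+**NOTE: the evaluation procedure is as follows**
+1. Hyperparameters for all tasks above were tuned with Grid Search. The validation set is evaluated every 100 training steps, and the best validation result is the figure reported in the table.
+
+2. Grid Search ranges: batch_size: 16, 32, 64; learning rate: 1e-5, 2e-5, 3e-5, 5e-5.
+
+3. Because results on CLUEWSC2020 are sensitive to batch_size, batch_size = 8 was added to the search for CLUEWSC2020.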
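
The search space described in the notes above can be enumerated as in the sketch below. The function name `grid_for` is hypothetical and is not part of this project; only the value ranges come from the notes.

```python
from itertools import product

# Search ranges quoted in the notes above.
BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 5e-5]


def grid_for(task_name):
    """Return the (batch_size, learning_rate) pairs searched for a task."""
    batch_sizes = list(BATCH_SIZES)
    if task_name.upper() == "CLUEWSC2020":
        # CLUEWSC2020 is sensitive to batch size, so 8 is searched as well.
        batch_sizes = [8] + batch_sizes
    return list(product(batch_sizes, LEARNING_RATES))


if __name__ == "__main__":
    for task in ("TNEWS", "CLUEWSC2020"):
        print(task, len(grid_for(task)), "configurations")
```

Under these ranges, most tasks have 12 configurations and CLUEWSC2020 has 16. Each pair would be passed to the training script as `--batch_size` and `--learning_rate`.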
+
+
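+## Reproducing the Results with One Command
+
+This section uses the TNEWS task as an example to show how to reproduce the evaluation results reported here.
+
+### Launching a CLUE Task
+Taking the TNEWS task as an example, a CLUE fine-tuning run is launched as follows:
+
+#### Single-GPU Training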
+```shell
+export CUDA_VISIBLE_DEVICES=0
+export TASK_NAME=TNEWS
+export LR=3e-5
+export BS=16
+export EPOCH=6
+export MAX_SEQ_LEN=128
+export MODEL_PATH=ernie-3.0-base
+
+cd classification
+python -u ./run_clue_classifier.py \
+    --model_type ernie  \
+    --model_name_or_path ${MODEL_PATH} \
+    --task_name ${TASK_NAME} \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --batch_size ${BS}   \
+    --learning_rate ${LR} \
+    --num_train_epochs ${EPOCH} \
+    --logging_steps 100 \
+    --seed 42  \
+    --save_steps  100 \
+    --warmup_proportion 0.1 \
+    --weight_decay 0.01 \
+    --adam_epsilon 1e-8 \
+    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --device gpu
+```
+
+In addition, pass `--do_eval True` to run evaluation. To evaluate a loaded checkpoint without training, set `--do_train` to False.
+
+#### Multi-GPU Training
+
+```shell
+
+unset CUDA_VISIBLE_DEVICES
+export TASK_NAME=TNEWS
+export LR=3e-5
+export BS=32
+export EPOCH=6
+export MAX_SEQ_LEN=128
+export MODEL_PATH=ernie-3.0-base
+
+cd classification
+python -m paddle.distributed.launch --gpus "0,1" run_clue_classifier.py \
+    --model_type ernie  \
+    --model_name_or_path ${MODEL_PATH} \
+    --task_name ${TASK_NAME} \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --batch_size ${BS}   \
+    --learning_rate ${LR} \
+    --num_train_epochs ${EPOCH} \
+    --logging_steps 100 \
+    --seed 42  \
+    --save_steps  100 \
+    --warmup_proportion 0.1 \
+    --weight_decay 0.01 \
+    --adam_epsilon 1e-8 \
+    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --device gpu
+```
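+The parameters are:
+- `model_type`: the type of pre-trained model used for fine-tuning, e.g. ernie or bert. Different model types may use different fine-tuning layers and tokenizers.
+- `model_name_or_path`: the specific pre-trained model used for fine-tuning. It can be a pre-trained model provided by PaddleNLP; choose the Chinese pre-trained weights listed for the selected `model_type` in the [Transformer pre-trained model summary](../../../docs/model_zoo/transformers.rst). The weights must match the configured model type: for example, if `model_type` is ernie, `model_name_or_path` must be an ernie model. CLUE tasks should use Chinese pre-trained weights.
+- `task_name`: the fine-tuning task. AFQMC, TNEWS, IFLYTEK, OCNLI, CMNLI, CSL, and CLUEWSC2020 are currently supported.
+- `max_seq_length`: the maximum sequence length; longer inputs are truncated.
+- `batch_size`: the number of samples **per card** in each iteration.
+- `learning_rate`: the base learning rate. It is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
+- `num_train_epochs`: the number of training epochs.
+- `logging_steps`: the interval, in steps, between log messages.
+- `save_steps`: the interval, in steps, between model saves and evaluations.
+- `output_dir`: the directory where the model is saved.
+- `device`: the device used for training: 'gpu' for GPU, 'xpu' for Baidu Kunlun cards, 'cpu' for CPU.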
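
Because `batch_size` is counted per card, the number of samples consumed per step grows with the number of cards. The sketch below works through that arithmetic. The helper names are hypothetical, it assumes the last partial batch of an epoch is kept, and the TNEWS training-set size of 53,360 is an assumption that is not stated in this README.

```python
import math


def global_batch_size(per_card_batch_size, num_cards):
    """Samples consumed per optimizer step across all cards."""
    return per_card_batch_size * num_cards


def total_train_steps(num_examples, per_card_batch_size, num_cards, num_epochs):
    """Steps per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(
        num_examples / global_batch_size(per_card_batch_size, num_cards))
    return steps_per_epoch * num_epochs


if __name__ == "__main__":
    # Single-card settings from the example above: BS=16, EPOCH=6.
    print(total_train_steps(53360, 16, 1, 6))
    # Two-card settings from the example above: BS=32 per card, EPOCH=6.
    print(global_batch_size(32, 2), total_train_steps(53360, 32, 2, 6))
```

With the assumed dataset size, the single-card settings give 20010 steps, which agrees with the `100/20010` counter in the sample training log.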
+
+During fine-tuning, logs like the following are printed according to the `logging_steps` and `save_steps` settings:
+
+```
+global step 100/20010, epoch: 0, batch: 99, rank_id: 0, loss: 2.734340, lr: 0.0000014993, speed: 8.7969 step/s
+eval loss: 2.720359, acc: 0.0827, eval done total : 25.712125062942505 s
+global step 200/20010, epoch: 0, batch: 199, rank_id: 0, loss: 2.608563, lr: 0.0000029985, speed: 2.5921 step/s
+eval loss: 2.652753, acc: 0.0945, eval done total : 25.64827537536621 s
+global step 300/20010, epoch: 0, batch: 299, rank_id: 0, loss: 2.555283, lr: 0.0000044978, speed: 2.6032 step/s
+eval loss: 2.572999, acc: 0.112, eval done total : 25.67190170288086 s
+global step 400/20010, epoch: 0, batch: 399, rank_id: 0, loss: 2.631579, lr: 0.0000059970, speed: 2.6238 step/s
+eval loss: 2.476962, acc: 0.1697, eval done total : 25.794789791107178 s
+```
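
The `lr` values in this log are consistent with a linear warmup over the first `warmup_proportion` of training. The sketch below is a plain-Python model of such a schedule, written under the assumption that the training script uses linear warmup followed by linear decay; the function name `lr_at_step` is hypothetical and the actual scheduler lives in `run_clue_classifier.py`, which is not shown here.

```python
def lr_at_step(step, base_lr, total_steps, warmup_proportion):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    warmup_steps = int(warmup_proportion * total_steps)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


if __name__ == "__main__":
    # Values from the single-card example: LR=3e-5, warmup_proportion=0.1,
    # and the total of 20010 steps shown in the log.
    for step in (100, 200, 300, 400):
        lr = lr_at_step(step, base_lr=3e-5, total_steps=20010,
                        warmup_proportion=0.1)
        print(f"step {step}: lr {lr:.10f}")
```

For steps 100 to 400 this prints 0.0000014993, 0.0000029985, 0.0000044978, and 0.0000059970, matching the logged values.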
+
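+## Participating in the CLUE Competition
+
+For the CLUE classification tasks, the script `classification/predict_clue_classifier.py` provided in this project can be used directly to run prediction on a single task and write the classification results to a file.
+
+Taking TNEWS as an example, and assuming the TNEWS model is stored at `${TNEWS_MODEL}`, the following script produces the model's predictions on the test set and writes them to `${OUTPUT_DIR}/tnews_predict.json`: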
+
+```shell
+cd classification
+OUTPUT_DIR=results
+mkdir -p ${OUTPUT_DIR}
+
+python predict_clue_classifier.py \
+    --model_type ernie \
+    --task_name TNEWS \
+    --model_name_or_path ${TNEWS_MODEL}  \
+    --output_dir ${OUTPUT_DIR}
+```
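
Judging from `predict_clue_classifier.py`, each line of the output file is a JSON object with an `id` and a `label`. The sketch below is a small sanity check for such a file. The helper `check_predictions` is hypothetical, and the requirement that ids run consecutively from 0 is an assumption about the submission format rather than something stated in this README.

```python
import json


def check_predictions(path, allowed_labels=None):
    """Return the number of records, raising ValueError on a malformed file."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumed format: ids are 0, 1, 2, ... in file order.
            if record.get("id") != count:
                raise ValueError(f"expected id {count}, got {record.get('id')}")
            if allowed_labels is not None and record.get("label") not in allowed_labels:
                raise ValueError(f"unexpected label {record.get('label')!r}")
            count += 1
    return count
```

Calling `check_predictions("results/tnews_predict.json")` before packaging a submission would catch repeated or missing ids early.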
+
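+Run the prediction script for each task, collect the result files, compress them into one archive, and submit it to the CLUE website for evaluation.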
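
Collecting and compressing the result files could look like the sketch below. It assumes all prediction files sit in one directory and end in `_predict.json`, as the prediction script names them. The function name `pack_predictions` and the flat archive layout are assumptions; the layout the CLUE website expects should be checked there.

```python
import glob
import os
import zipfile


def pack_predictions(output_dir, archive_path):
    """Zip every *_predict.json file in output_dir and return the names added."""
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        pattern = os.path.join(output_dir, "*_predict.json")
        for path in sorted(glob.glob(pattern)):
            name = os.path.basename(path)
            # Store files at the top level of the archive (assumed layout).
            zf.write(path, arcname=name)
            names.append(name)
    return names
```

For example, `pack_predictions("results", "submit.zip")` would bundle every prediction file written to `results`.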
@@ -0,0 +1,223 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import json
+import os
+from functools import partial
+
+import numpy as np
+import paddle
+from paddle.io import DataLoader
+from paddle.metric import Accuracy
+
+from paddlenlp.datasets import load_dataset
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
+from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer
+from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer
+
+METRIC_CLASSES = {
+    "afqmc": Accuracy,
+    "tnews": Accuracy,
+    "iflytek": Accuracy,
+    "ocnli": Accuracy,
+    "cmnli": Accuracy,
+    "cluewsc2020": Accuracy,
+    "csl": Accuracy,
+}
+
+MODEL_CLASSES = {
+    "bert": (BertForSequenceClassification, BertTokenizer),
+    "ernie": (ErnieForSequenceClassification, ErnieTokenizer),
+    "roberta": (RobertaForSequenceClassification, RobertaTokenizer),
+}
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+
+    # Required parameters
+    parser.add_argument(
+        "--task_name",
+        default=None,
+        type=str,
+        required=True,
+        help="The name of the task to predict, selected from the list: " +
+        ", ".join(METRIC_CLASSES.keys()), )
+    parser.add_argument(
+        "--model_type",
+        default="ernie",
+        type=str,
+        help="Model type selected in the list: " +
+        ", ".join(MODEL_CLASSES.keys()), )
+    parser.add_argument(
+        "--model_name_or_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to pre-trained model or shortcut name selected in the list: "
+        + ", ".join(
+            sum([
+                list(classes[-1].pretrained_init_configuration.keys())
+                for classes in MODEL_CLASSES.values()
+            ], [])), )
+    parser.add_argument(
+        "--output_dir",
+        default="tmp",
+        type=str,
+        help="The output directory where the model predictions will be written.",
+    )
+
+    parser.add_argument(
+        "--max_seq_length",
+        default=128,
+        type=int,
+        help="The maximum total input sequence length after tokenization. Sequences longer "
+        "than this will be truncated, sequences shorter will be padded.", )
+
+    parser.add_argument(
+        "--batch_size",
+        default=128,
+        type=int,
+        help="Batch size per GPU/CPU for prediction.", )
+
+    parser.add_argument(
+        "--device",
+        default="gpu",
+        type=str,
+        help="The device to run prediction on; it must be cpu/gpu/xpu.")
+    args = parser.parse_args()
+    return args
+
+
+def convert_example(example,
+                    tokenizer,
+                    label_list,
+                    max_seq_length=512,
+                    is_test=False):
+    """Convert a CLUE example into the features the model needs."""
+    if not is_test:
+        # `label_list == None` is for regression task
+        label_dtype = "int64" if label_list else "float32"
+        # Get the label
+        label = example['label']
+        label = np.array([label], dtype=label_dtype)
+    # Convert raw text to feature
+    if 'sentence' in example:
+        example = tokenizer(example['sentence'], max_seq_len=max_seq_length)
+    elif 'sentence1' in example:
+        example = tokenizer(
+            example['sentence1'],
+            text_pair=example['sentence2'],
+            max_seq_len=max_seq_length)
+    elif 'keyword' in example:  # CSL
+        sentence1 = " ".join(example['keyword'])
+        example = tokenizer(
+            sentence1, text_pair=example['abst'], max_seq_len=max_seq_length)
+    elif 'target' in example:  # wsc
+        text, query, pronoun, query_idx, pronoun_idx = example['text'], example[
+            'target']['span1_text'], example['target']['span2_text'], example[
+                'target']['span1_index'], example['target']['span2_index']
+        text_list = list(text)
+        assert text[pronoun_idx:(pronoun_idx + len(pronoun)
+                                 )] == pronoun, "pronoun: {}".format(pronoun)
+        assert text[query_idx:(query_idx + len(query)
+                               )] == query, "query: {}".format(query)
+        if pronoun_idx > query_idx:
+            text_list.insert(query_idx, "_")
+            text_list.insert(query_idx + len(query) + 1, "_")
+            text_list.insert(pronoun_idx + 2, "[")
+            text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]")
+        else:
+            text_list.insert(pronoun_idx, "[")
+            text_list.insert(pronoun_idx + len(pronoun) + 1, "]")
+            text_list.insert(query_idx + 2, "_")
+            text_list.insert(query_idx + len(query) + 2 + 1, "_")
+        text = "".join(text_list)
+        example = tokenizer(text, max_seq_len=max_seq_length)
+
+    if not is_test:
+        return example['input_ids'], example['token_type_ids'], label
+    else:
+        return example['input_ids'], example['token_type_ids']
+
+
+def do_test(args):
+    paddle.set_device(args.device)
+
+    args.task_name = args.task_name.lower()
+    metric_class = METRIC_CLASSES[args.task_name]
+    args.model_type = args.model_type.lower()
+    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    train_ds, test_ds = load_dataset(
+        'clue', args.task_name, splits=('train', 'test'))
+    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
+
+    trans_func = partial(
+        convert_example,
+        tokenizer=tokenizer,
+        label_list=train_ds.label_list,
+        max_seq_length=args.max_seq_length,
+        is_test=True)
+
+    batchify_fn = lambda samples, fn=Tuple(
+        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
+        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
+    ): fn(samples)
+
+    test_ds = test_ds.map(trans_func, lazy=True)
+    test_batch_sampler = paddle.io.BatchSampler(
+        test_ds, batch_size=args.batch_size, shuffle=False)
+    test_data_loader = DataLoader(
+        dataset=test_ds,
+        batch_sampler=test_batch_sampler,
+        collate_fn=batchify_fn,
+        num_workers=0,
+        return_list=True)
+
+    num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list)
+    model = model_class.from_pretrained(
+        args.model_name_or_path, num_classes=num_classes)
+    # Switch to evaluation mode so that dropout is disabled during prediction.
+    model.eval()
+
+    os.makedirs(args.output_dir, exist_ok=True)
+    if args.task_name == 'ocnli':
+        args.task_name = 'ocnli_50k'
+    output_path = os.path.join(args.output_dir,
+                               args.task_name + "_predict.json")
+
+    # Running id across the whole test set, not the position inside a batch.
+    example_id = 0
+    with open(output_path, 'w') as f:
+        for batch in test_data_loader:
+            input_ids, segment_ids = batch
+
+            with paddle.no_grad():
+                logits = model(input_ids, segment_ids)
+
+            # Convert to plain ints so they can index the label list.
+            preds = paddle.argmax(logits, axis=1).numpy().tolist()
+            for pred in preds:
+                j = json.dumps({
+                    "id": example_id,
+                    "label": train_ds.label_list[pred]
+                })
+                f.write(j + "\n")
+                example_id += 1
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    do_test(args)
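
For the CLUEWSC2020 branch of `convert_example`, the query span is wrapped in underscores and the pronoun span in square brackets before tokenization. The sketch below reproduces that marking in isolation. It is an alternative formulation that inserts from the rightmost span first instead of adjusting offsets by hand; the function name `mark_spans` is hypothetical, and it assumes the two spans do not overlap.

```python
def mark_spans(text, query, query_idx, pronoun, pronoun_idx):
    """Wrap the query span in underscores and the pronoun span in brackets."""
    assert text[query_idx:query_idx + len(query)] == query
    assert text[pronoun_idx:pronoun_idx + len(pronoun)] == pronoun
    spans = [
        (query_idx, query_idx + len(query), "_", "_"),
        (pronoun_idx, pronoun_idx + len(pronoun), "[", "]"),
    ]
    chars = list(text)
    # Insert from the rightmost span first so earlier offsets stay valid.
    for start, end, left, right in sorted(spans, reverse=True):
        chars.insert(end, right)
        chars.insert(start, left)
    return "".join(chars)


if __name__ == "__main__":
    print(mark_spans("Tom said he was tired", "Tom", 0, "he", 9))
    print(mark_spans("He thanked Tom", "Tom", 11, "He", 0))
```

The two calls print `_Tom_ said [he] was tired` and `[He] thanked _Tom_`, covering both orderings of query and pronoun.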