PaddlePaddle
diff --git a/‎examples/language_model/chinesebert/README.md‎
Lines changed: 172 additions & 0 deletions b/‎examples/language_model/chinesebert/README.md‎
diff --git a/‎examples/language_model/chinesebert/cmrc_eval.sh‎
Lines changed: 1 addition & 0 deletions b/‎examples/language_model/chinesebert/cmrc_eval.sh‎
diff --git a/‎examples/language_model/chinesebert/cmrc_evaluate.py‎
Lines changed: 190 additions & 0 deletions b/‎examples/language_model/chinesebert/cmrc_evaluate.py‎
@@ -0,0 +1,172 @@
+# ChineseBert with PaddleNLP
+
+[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/pdf/2106.16038.pdf)
+
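+**Abstract:**
+Recent pretraining models for Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from different fonts of a Chinese character and can capture character semantics from visual features; the pinyin embedding characterizes the pronunciation of Chinese characters and handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The model achieves new SOTA performance on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, and sentence pair matching, and competitive performance on named entity recognition.
+
+This project is an open-source implementation of ChineseBert on Paddle 2.x.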
+
+## **Data preparation**
+The ChnSentiCorp, cmrc2018, and XNLI datasets are used.
+Some are already provided by Paddle; the others can be found at https://github.com/27182812/ChineseBERT_paddle,
+under the data directory.
+
+
+## **Model pretraining**
+For the pretraining procedure, see the [Electra README](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/electra/README.md)
+
+## **Fine-tuning**
+
+### Running fine-tuning
+
+#### **Fine-tuning with the pretrained models provided by Paddle**
+
+#### 1. ChnSentiCorp
+Using the ChnSentiCorp dataset as an example
+
+#### (1) Fine-tuning:
+```shell
+# run training
+python train_chn.py \
+--data_path './data/ChnSentiCorp' \
+--device 'gpu' \
+--epochs 10 \
+--max_seq_length 512 \
+--batch_size 8 \
+--learning_rate 2e-5 \
+--weight_decay 0.0001 \
+--warmup_proportion 0.1 \
+--seed 2333 \
+--save_dir 'outputs/chn' | tee outputs/train_chn.log
+```
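+The parameters are as follows:
+- `data_path`: path to the fine-tuning data.
+- `device`: the device type to use. Defaults to GPU; can be set to CPU, GPU, or XPU. For multi-GPU training, set it to GPU and list the GPU ids to use in the CUDA_VISIBLE_DEVICES environment variable.
+- `epochs`: number of training epochs.
+- `max_seq_length`: maximum sequence length; longer inputs are truncated.
+- `batch_size`: number of samples **per card** in each iteration.
+- `learning_rate`: base learning rate; it is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
+- `weight_decay`: weight decay coefficient used by the optimizer.
+- `warmup_proportion`: proportion of training steps used for learning rate warmup.
+- `seed`: random seed.
+- `save_dir`: directory where the model is saved.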
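The way `learning_rate` and `warmup_proportion` combine can be illustrated with a small sketch. It assumes linear warmup followed by linear decay, a common choice for BERT fine-tuning; the scheduler actually used by `train_chn.py` is not shown in this diff, and the function name `lr_multiplier` and the step count are hypothetical.

```python
def lr_multiplier(step, total_steps, warmup_proportion=0.1):
    """Scheduler factor in [0, 1] (assumed: linear warmup, then linear decay)."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # ramp up from 0 to 1 during the warmup phase
        return step / warmup_steps
    # decay linearly from 1 to 0 over the remaining steps
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


base_lr = 2e-5       # --learning_rate from the command above
total_steps = 1000   # hypothetical number of optimizer steps

for step in (0, 50, 100, 550, 1000):
    # current learning rate = base learning rate * scheduler factor
    print(step, base_lr * lr_multiplier(step, total_steps))
```

Under these assumptions the learning rate reaches `2e-5` only at the end of warmup (step 100 here) and is below it everywhere else.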
+
+#### (2) Evaluation
+
+Accuracy on the dev and test sets is 95.8 and 96.08 respectively, which meets the accuracy target from the paper.
+
+#### 2. XNLI
+
+#### (1) Training
+
+```bash
+python train_xnli.py \
+--data_path './data/XNLI' \
+--device 'gpu' \
+--epochs 5 \
+--max_seq_len 256 \
+--batch_size 16 \
+--learning_rate 1.3e-5 \
+--weight_decay 0.001 \
+--warmup_proportion 0.1 \
+--seed 2333 \
+--save_dir outputs/xnli | tee outputs/train_xnli.log
+```
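+The parameters are as follows:
+- `data_path`: path to the fine-tuning data.
+- `device`: the device type to use. Defaults to GPU; can be set to CPU, GPU, or XPU. For multi-GPU training, set it to GPU and list the GPU ids to use in the CUDA_VISIBLE_DEVICES environment variable.
+- `epochs`: number of training epochs.
+- `max_seq_len`: maximum sequence length; longer inputs are truncated.
+- `batch_size`: number of samples **per card** in each iteration.
+- `learning_rate`: base learning rate; it is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
+- `weight_decay`: weight decay coefficient used by the optimizer.
+- `warmup_proportion`: proportion of training steps used for learning rate warmup.
+- `seed`: random seed.
+- `save_dir`: directory where the model is saved.
+
+#### (2) Evaluation
+
+The best accuracy on the test set is 81.657, which meets the accuracy target from the paper.
+
+#### 3. cmrc2018
+
+#### (1) Training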
+
+```shell
+# start training
+python train_cmrc2018.py \
+    --data_dir "data/cmrc2018" \
+    --model_name_or_path ChineseBERT-large \
+    --max_seq_length 512 \
+    --train_batch_size 8 \
+    --gradient_accumulation_steps 8 \
+    --eval_batch_size 16 \
+    --learning_rate 4e-5 \
+    --max_grad_norm 1.0 \
+    --num_train_epochs 3 \
+    --logging_steps 2 \
+    --save_steps 20 \
+    --warmup_radio 0.1 \
+    --weight_decay 0.01 \
+    --output_dir outputs/cmrc2018 \
+    --seed 1111 \
+    --num_workers 0 \
+    --use_amp
+```
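+The parameters are as follows:
+- `data_dir`: path to the fine-tuning data.
+- `model_name_or_path`: model name or path; two sizes are supported, ChineseBERT-base and ChineseBERT-large.
+- `max_seq_length`: maximum sequence length; longer inputs are truncated.
+- `train_batch_size`: number of samples **per card** in each training iteration.
+- `gradient_accumulation_steps`: number of gradient accumulation steps.
+- `eval_batch_size`: number of samples **per card** in each evaluation iteration.
+- `learning_rate`: base learning rate; it is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
+- `max_grad_norm`: maximum gradient norm for gradient clipping.
+- `num_train_epochs`: number of training epochs.
+- `logging_steps`: interval, in steps, between log messages.
+- `warmup_radio`: proportion of training steps used for learning rate warmup.
+- `weight_decay`: weight decay coefficient used by the optimizer.
+- `output_dir`: directory where the model is saved.
+- `seed`: random seed.
+- `num_workers`: number of worker processes running concurrently.
+- `use_amp`: whether to use mixed precision.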
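Since `train_batch_size` is per card and gradients are accumulated over `gradient_accumulation_steps` mini-batches, the command above would correspond to 8 × 8 = 64 samples per optimizer update on a single card. The sketch below illustrates this with a toy one-parameter model; it assumes the usual convention that each mini-batch gradient is scaled by `1 / gradient_accumulation_steps`, which this diff does not confirm for `train_cmrc2018.py`. The helper names `effective_batch_size` and `grad_mse` are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_cards=1):
    """Samples contributing to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_cards


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# toy data: 64 samples of y = 3x, evaluated at w = 1
xs = [float(i % 7) for i in range(64)]
ys = [3.0 * x for x in xs]
w = 1.0

# gradient computed on one large batch of 64 samples
full_grad = grad_mse(w, xs, ys)

# the same gradient accumulated from 8 mini-batches of 8 samples each
steps, batch = 8, 8
accumulated = 0.0
for k in range(steps):
    chunk = slice(k * batch, (k + 1) * batch)
    accumulated += grad_mse(w, xs[chunk], ys[chunk]) / steps

print(effective_batch_size(8, 8))            # samples per update on one card
print(abs(full_grad - accumulated) < 1e-9)   # the two gradients agree
```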
+
+During training the model is evaluated on the dev set; the best result is shown below:
+
+```python
+
+{
+    AVERAGE = 82.791
+    F1 = 91.055
+    EM = 74.526
+    TOTAL = 3219
+    SKIP = 0
+}
+
+```
+
+#### (2) Run eval_cmrc.py to generate predicted answers for the test set
+
+```bash
+python eval_cmrc.py --model_name_or_path outputs/step-340 --n_best_size 35 --max_answer_length 65
+```
+
+Here, model_name_or_path is the path to the model.
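The README does not describe `--n_best_size` and `--max_answer_length`. In SQuAD-style answer extraction, flags with these names usually limit how many start/end candidates are considered and how long an answer span may be. The sketch below shows that usual meaning only; the internals of `eval_cmrc.py` are not part of this diff, and the function `best_span` and the toy logits are hypothetical.

```python
def best_span(start_logits, end_logits, n_best_size=35, max_answer_length=65):
    """Return (score, start, end) of the best valid span, or None if there is none."""

    def top_indices(logits):
        # indices of the n_best_size largest logits, best first
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for s in top_indices(start_logits):
        for e in top_indices(end_logits):
            # skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


start_logits = [0.1, 2.0, 0.3, 0.2, 0.0]
end_logits = [0.0, 0.1, 0.5, 3.0, 0.2]

# a tight length limit rules out the highest-scoring pair (1, 3)
print(best_span(start_logits, end_logits, n_best_size=2, max_answer_length=2))
# a looser limit lets it through
print(best_span(start_logits, end_logits, n_best_size=2, max_answer_length=3))
```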
+
+#### (3) Submit to CLUE
+
+EM on the test set is 78.55, which meets the accuracy target from the paper.
+
+
+## Reference
+
+```bibtex
+@article{sun2021chinesebert,
+  title={ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information},
+  author={Sun, Zijun and Li, Xiaoya and Sun, Xiaofei and Meng, Yuxian and Ao, Xiang and He, Qing and Wu, Fei and Li, Jiwei},
+  journal={arXiv preprint arXiv:2106.16038},
+  year={2021}
+}
+
+```
@@ -0,0 +1 @@
+python eval.py --model_name_or_path outputs/cmrc2018/step-140 --n_best_size 35 --max_answer_length 65
@@ -0,0 +1,190 @@
+#encoding=utf8
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''
+Evaluation script for CMRC 2018
+version: v5 - special
+Note: 
+v5 - special: Evaluate on SQuAD-style CMRC 2018 Datasets
+v5: formatted output, add usage description
+v4: fixed segmentation issues
+'''
+
+import argparse
+import json
+import re
+import sys
+from collections import OrderedDict
+import nltk
+
+
+# split mixed text: each Chinese character or punctuation mark becomes one token; other runs are tokenized by nltk
+def mixed_segmentation(in_str, rm_punc=False):
+    in_str = str(in_str).lower().strip()
+    segs_out = []
+    temp_str = ""
+    sp_char = [
+        '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', '，', '。', '：',
+        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',
+        '）', '－', '～', '『', '』'
+    ]
+    for char in in_str:
+        if rm_punc and char in sp_char:
+            continue
+        if re.search(r'[\u4e00-\u9fa5]', char) or char in sp_char:
+            if temp_str != "":
+                ss = nltk.word_tokenize(temp_str)
+                segs_out.extend(ss)
+                temp_str = ""
+            segs_out.append(char)
+        else:
+            temp_str += char
+
+    # handling last part
+    if temp_str != "":
+        ss = nltk.word_tokenize(temp_str)
+        segs_out.extend(ss)
+
+    return segs_out
+
+
+# remove punctuation
+def remove_punctuation(in_str):
+    in_str = str(in_str).lower().strip()
+    sp_char = [
+        '-', ':', '_', '*', '^', '/', '\\', '~', '`', '+', '=', '，', '。', '：',
+        '？', '！', '“', '”', '；', '’', '《', '》', '……', '·', '、', '「', '」', '（',
+        '）', '－', '～', '『', '』'
+    ]
+    out_segs = []
+    for char in in_str:
+        if char in sp_char:
+            continue
+        else:
+            out_segs.append(char)
+    return ''.join(out_segs)
+
+
+# find the longest common substring (contiguous run of tokens) of s1 and s2
+def find_lcs(s1, s2):
+    m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)]
+    mmax = 0
+    p = 0
+    for i in range(len(s1)):
+        for j in range(len(s2)):
+            if s1[i] == s2[j]:
+                m[i + 1][j + 1] = m[i][j] + 1
+                if m[i + 1][j + 1] > mmax:
+                    mmax = m[i + 1][j + 1]
+                    p = i + 1
+    return s1[p - mmax:p], mmax
+
+
+# compute F1 and EM over all questions; unanswered questions count toward the total and score zero
+def evaluate(ground_truth_file, prediction_file):
+    f1 = 0
+    em = 0
+    total_count = 0
+    skip_count = 0
+    for instance in ground_truth_file["data"]:
+        # context_id   = instance['context_id'].strip()
+        # context_text = instance['context_text'].strip()
+        for para in instance["paragraphs"]:
+            for qas in para['qas']:
+                total_count += 1
+                query_id = qas['id'].strip()
+                query_text = qas['question'].strip()
+                answers = [x["text"] for x in qas['answers']]
+
+                if query_id not in prediction_file:
+                    sys.stderr.write('Unanswered question: {}\n'.format(
+                        query_id))
+                    skip_count += 1
+                    continue
+
+                prediction = str(prediction_file[query_id])
+                f1 += calc_f1_score(answers, prediction)
+                em += calc_em_score(answers, prediction)
+
+    f1_score = 100.0 * f1 / total_count
+    em_score = 100.0 * em / total_count
+    return f1_score, em_score, total_count, skip_count
+
+
+def calc_f1_score(answers, prediction):
+    f1_scores = []
+    for ans in answers:
+        ans_segs = mixed_segmentation(ans, rm_punc=True)
+        prediction_segs = mixed_segmentation(prediction, rm_punc=True)
+        lcs, lcs_len = find_lcs(ans_segs, prediction_segs)
+        if lcs_len == 0:
+            f1_scores.append(0)
+            continue
+        precision = 1.0 * lcs_len / len(prediction_segs)
+        recall = 1.0 * lcs_len / len(ans_segs)
+        f1 = (2 * precision * recall) / (precision + recall)
+        f1_scores.append(f1)
+    return max(f1_scores)
+
+
+def calc_em_score(answers, prediction):
+    em = 0
+    for ans in answers:
+        ans_ = remove_punctuation(ans)
+        prediction_ = remove_punctuation(prediction)
+        if ans_ == prediction_:
+            em = 1
+            break
+    return em
+
+
+def get_result(ground_truth_file, prediction_file):
+    ground_truth_file = json.load(open(ground_truth_file, 'rb'))
+    prediction_file = json.load(open(prediction_file, 'rb'))
+    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
+    AVG = (EM + F1) * 0.5
+    output_result = OrderedDict()
+    output_result['AVERAGE'] = '%.3f' % AVG
+    output_result['F1'] = '%.3f' % F1
+    output_result['EM'] = '%.3f' % EM
+    output_result['TOTAL'] = TOTAL
+    output_result['SKIP'] = SKIP
+    print(json.dumps(output_result))
+    return output_result
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description='Evaluation Script for CMRC 2018')
+    parser.add_argument(
+        '--dataset_file',
+        default="cmrc2018_public/dev.json",
+        help='Official dataset file')
+    parser.add_argument(
+        '--prediction_file',
+        default="all_predictions.json",
+        help='Your prediction File')
+    args = parser.parse_args()
+    ground_truth_file = json.load(open(args.dataset_file, 'rb'))
+    prediction_file = json.load(open(args.prediction_file, 'rb'))
+    F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file)
+    AVG = (EM + F1) * 0.5
+    output_result = OrderedDict()
+    output_result['AVERAGE'] = '%.3f' % AVG
+    output_result['F1'] = '%.3f' % F1
+    output_result['EM'] = '%.3f' % EM
+    output_result['TOTAL'] = TOTAL
+    output_result['SKIP'] = SKIP
+    output_result['FILE'] = args.prediction_file
+    print(json.dumps(output_result))
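A small worked example may help when reading `calc_f1_score` and `calc_em_score`. The sketch below repeats the `find_lcs` logic from the script and, to stay self-contained without `nltk`, assumes purely Chinese input with no punctuation so that every character is one token. The helpers `char_f1` and `exact_match` are hypothetical simplifications, not functions from `cmrc_evaluate.py`.

```python
def find_lcs(s1, s2):
    """Longest common contiguous run of tokens (same algorithm as the script)."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    mmax, p = 0, 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                m[i + 1][j + 1] = m[i][j] + 1
                if m[i + 1][j + 1] > mmax:
                    mmax = m[i + 1][j + 1]
                    p = i + 1
    return s1[p - mmax:p], mmax


def char_f1(answer, prediction):
    """F1 from the longest common substring, one token per character (assumption)."""
    ans_segs, pred_segs = list(answer), list(prediction)
    _, lcs_len = find_lcs(ans_segs, pred_segs)
    if lcs_len == 0:
        return 0.0
    precision = lcs_len / len(pred_segs)
    recall = lcs_len / len(ans_segs)
    return 2 * precision * recall / (precision + recall)


def exact_match(answer, prediction):
    """1 if the strings are identical, else 0 (punctuation removal omitted here)."""
    return int(answer == prediction)


answer, prediction = "北京大学", "在北京大学"
# LCS is "北京大学" (4 tokens): precision 4/5, recall 4/4, F1 = 8/9
print(char_f1(answer, prediction), exact_match(answer, prediction))
```

Because the match must be contiguous, a prediction that breaks the answer with an extra character in the middle is credited only for its longest unbroken piece.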