
Commit 5c58459

tianxin and Zeyu Chen authored
Add Question matching baseline (#874)
* finish QM baseline
* update
* update
* update
* add README.md
* Update README.md
* implement rdrop as paddlenlp api
* add rdrop.py into paddlenlp.losses module

Co-authored-by: Zeyu Chen <[email protected]>
1 parent 7e098f1 commit 5c58459

File tree

9 files changed: +673 -1 lines changed

Lines changed: 126 additions & 0 deletions (README.md)
@@ -0,0 +1,126 @@
# Qianyan (千言) Question Matching Robustness Evaluation Baseline

We build a baseline solution and evaluation results for the [Qianyan Question Matching Robustness Evaluation competition]() on top of the pretrained model ERNIE-Gram.

## Evaluation Results

This project trains single-tower point-wise matching models on three Chinese pretrained models: ERNIE-1.0, Bert-base-chinese, and ERNIE-Gram. The ERNIE-Gram based model clearly outperforms the other two.

On top of the ERNIE-Gram model we also evaluate the recent regularization strategy [R-Drop](https://arxiv.org/abs/2106.14448). Its core idea is to run the same training sample through the network several times and add a regularization loss that constrains the resulting outputs to agree; a minimal sketch of this loss is given after the results table below.

| Model | rdrop_coef | dev acc | test-A acc | test-B acc |
| ---- | ---- | ----- | -------- | ------- |
| ernie-1.0-base | 0.0 | 86.96 | 76.20 | 77.50 |
| bert-base-chinese | 0.0 | 86.93 | 76.90 | 77.60 |
| ernie-gram-zh | 0.0 | 87.66 | **80.80** | **81.20** |
| ernie-gram-zh | 0.1 | 87.91 | 80.20 | 80.80 |
| ernie-gram-zh | 0.2 | 87.47 | 80.10 | 81.00 |

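In this commit the regularizer is exposed as `paddlenlp.losses.RDropLoss` (added as `rdrop.py` under the `paddlenlp.losses` module) and is invoked from `model.py` below. As a rough sketch of the idea only, not the library implementation, the extra loss is the symmetric KL divergence between the class distributions produced by two dropout-perturbed forward passes over the same batch:

```python
import paddle
import paddle.nn.functional as F


def rdrop_kl_sketch(logits_1, logits_2):
    """Illustrative R-Drop regularizer: symmetric KL between the predictive
    distributions of two forward passes of the same batch (dropout makes the
    two passes differ). The baseline itself calls paddlenlp.losses.RDropLoss."""
    p = F.softmax(logits_1, axis=-1)
    q = F.softmax(logits_2, axis=-1)
    # KL(p || q) + KL(q || p), averaged over the batch.
    kl_pq = paddle.sum(p * (paddle.log(p + 1e-12) - paddle.log(q + 1e-12)), axis=-1)
    kl_qp = paddle.sum(q * (paddle.log(q + 1e-12) - paddle.log(p + 1e-12)), axis=-1)
    return paddle.mean(kl_pq + kl_qp) / 2
```
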
## Quick Start

### Code Structure

The main code of this project is organized as follows:

```
question_matching/
├── model.py   # network definition of the matching model
├── data.py    # data reading and conversion logic for training samples
├── predict.py # prediction script; outputs 0/1 predictions for the test set
└── train.py   # model training and evaluation
```

### Data Preparation

This project uses the union of the training sets of the three datasets provided by the competition (LCQMC, BQ, and OPPO) as its training set, and the union of their dev sets as its dev set.

Run the following commands to generate the training and dev sets used in this project. During the competition you are free to explore other combinations of training and dev data; you do not have to follow the baseline exactly.

```shell
cat ./data/train/LCQMC/train ./data/train/BQ/train ./data/train/OPPO/train > train.txt
cat ./data/train/LCQMC/dev ./data/train/BQ/dev ./data/train/OPPO/dev > dev.txt
```

The training set has 3 tab-separated columns: text_a \t text_b \t label. Sample rows:

```text
喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1
我手机丢了,我想换个手机 我想买个新手机,求推荐 1
大家觉得她好看吗 大家觉得跑男好看吗? 0
求秋色之空漫画全集 求秋色之空全集漫画 1
晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0
```

The dev set uses the same format as the training set. Sample rows:

```
开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 1
谁知道她是网络美女吗? 爱情这杯酒谁喝都会醉是什么歌 0
男孩喝女孩的尿的故事 怎样才知道是生男孩还是女孩 0
这种图片是用什么软件制作的? 这种图片制作是用什么软件呢? 1
```

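After generating `train.txt` and `dev.txt` you may want to sanity-check them. The snippet below is an optional helper, not part of this commit; it assumes the three-column tab-separated format described above and reports the number of usable pairs and the label distribution:

```python
from collections import Counter


def inspect_split(path):
    """Count well-formed 3-column (text_a, text_b, label) rows and the label distribution."""
    labels = Counter()
    skipped = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                skipped += 1  # malformed rows are also silently dropped by data.py
                continue
            labels[fields[2]] += 1
    print(f"{path}: {sum(labels.values())} pairs, labels={dict(labels)}, skipped={skipped}")


inspect_split("train.txt")
inspect_split("dev.txt")
```
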
### Model Training

Run the following command to reproduce the ERNIE-Gram baseline model of this project:

```shell
unset CUDA_VISIBLE_DEVICES
python -u -m paddle.distributed.launch --gpus "0,1,2,3" train.py \
       --train_set train.txt \
       --dev_set dev.txt \
       --device gpu \
       --eval_step 100 \
       --save_dir ./checkpoints \
       --train_batch_size 32 \
       --learning_rate 2E-5 \
       --rdrop_coef 0.0
```

Configurable parameters:

* `train_set`: the training set file.
* `dev_set`: the dev set file.
* `rdrop_coef`: optional; the coefficient of the R-Drop KL regularization loss. Defaults to 0.0, i.e. R-Drop is disabled. A sketch of how it enters the training loss follows this list.
* `train_batch_size`: optional; the batch size. Adjust it to your GPU memory and lower it if you run out of memory. Defaults to 32.
* `learning_rate`: optional; the peak learning rate for fine-tuning. Defaults to 5e-5.
* `weight_decay`: optional; the weight-decay coefficient used to control regularization strength and prevent overfitting. Defaults to 0.0.
* `epochs`: the number of training epochs. Defaults to 3.
* `warmup_proption`: optional; the proportion of steps used for learning-rate warmup. With 0.1, the learning rate grows linearly from 0 to `learning_rate` over the first 10% of training steps and then slowly decays. Defaults to 0.0.
* `init_from_ckpt`: optional; a path to model parameters used to warm-start training. Defaults to None.
* `seed`: optional; the random seed. Defaults to 1000.
* `device`: the device to train on, either cpu or gpu. When training on gpu, the `gpus` argument selects the GPU card ids.

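`train.py` itself is not reproduced in this excerpt. As a hedged sketch of how `rdrop_coef`, `learning_rate`, `weight_decay` and the warmup proportion typically fit together (function and variable names below are illustrative, not the exact `train.py` code):

```python
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import LinearDecayWithWarmup


def build_optimizer(model, learning_rate, num_training_steps,
                    warmup_proportion, weight_decay):
    # Warm up over the first `warmup_proportion` of steps, then decay linearly.
    lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps,
                                         warmup_proportion)
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        parameters=model.parameters(),
        weight_decay=weight_decay)
    return optimizer, lr_scheduler


def train_step(model, batch, optimizer, lr_scheduler, rdrop_coef):
    input_ids, token_type_ids, labels = batch
    # The QuestionMatching model (model.py) returns logits plus an R-Drop KL term.
    logits, kl_loss = model(input_ids=input_ids, token_type_ids=token_type_ids)
    ce_loss = F.cross_entropy(input=logits, label=labels)
    # rdrop_coef == 0.0 disables the regularizer (kl_loss is then just 0.0).
    loss = ce_loss + rdrop_coef * kl_loss if rdrop_coef > 0 else ce_loss
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.clear_grad()
    return float(loss)
```
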
The program trains and evaluates automatically, and saves checkpoints to the specified `save_dir` during training.

After each evaluation on the dev set, the current model is saved only if its dev metric beats the best dev metric seen so far; otherwise it is skipped. As a result, when training finishes, the checkpoint with the largest step number under the save directory is the one with the best dev metric, and that is normally the model used for prediction. A minimal sketch of this save-if-better logic follows the directory listing below.

For example:
```text
checkpoints/
├── model_10000
│   ├── model_state.pdparams
│   ├── tokenizer_config.json
│   └── vocab.txt
└── ...
```

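The save-if-better behaviour described above can be summarised by the following sketch (illustrative only; not the exact `train.py` code):

```python
import os

import paddle


def maybe_save_best(model, tokenizer, dev_accu, best_accu, save_dir, global_step):
    """Save a checkpoint only when the dev metric improves, so the checkpoint
    with the largest step number under `save_dir` is the best model."""
    if dev_accu <= best_accu:
        return best_accu
    save_path = os.path.join(save_dir, "model_%d" % global_step)
    os.makedirs(save_path, exist_ok=True)
    paddle.save(model.state_dict(),
                os.path.join(save_path, "model_state.pdparams"))
    tokenizer.save_pretrained(save_path)  # writes tokenizer_config.json and vocab.txt
    return dev_accu
```
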
**NOTE:**
* To resume model training, set `init_from_ckpt`, e.g. `init_from_ckpt=checkpoints/model_100/model_state.pdparams`.

### Prediction

After training, the checkpoint with the best dev metric is stored under the specified checkpoints path. Run the following command to generate predictions:
```shell
unset CUDA_VISIBLE_DEVICES
python -u \
    predict.py \
    --device gpu \
    --params_path "./checkpoints/model_10000/model_state.pdparams" \
    --batch_size 128 \
    --input_file "${test_set}" \
    --result_file "predict_result"
```

Example of the prediction output:
```text
0
1
0
1
```

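Before submitting, it can be worth verifying that the result file contains exactly one 0/1 label per test example. The helper below is optional and not part of this commit; the file names passed to it are placeholders:

```python
def check_result(test_file, result_file):
    """Sanity-check the prediction file against the test set before submission."""
    with open(test_file, encoding="utf-8") as f:
        num_test = sum(1 for _ in f)
    with open(result_file, encoding="utf-8") as f:
        preds = [line.strip() for line in f]
    assert len(preds) == num_test, f"{len(preds)} predictions for {num_test} test rows"
    assert set(preds) <= {"0", "1"}, "every prediction must be 0 or 1"
    print(f"OK: {num_test} predictions")


# Replace "test.tsv" with the path of the competition test set you predicted on.
check_result("test.tsv", "predict_result")
```
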
### Submission

Submit the prediction result file for evaluation on the competition platform.

## Reference
[1] Liang, Xiaobo, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. “R-Drop: Regularized Dropout for Neural Networks.” ArXiv:2106.14448 [Cs], June 28, 2021. http://arxiv.org/abs/2106.14448.
Lines changed: 74 additions & 0 deletions (data.py)
@@ -0,0 +1,74 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import paddle
import numpy as np

from paddlenlp.datasets import MapDataset


def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    """Wraps a dataset into a DataLoader, using a distributed sampler for training."""
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)


def read_text_pair(data_path, is_test=False):
    """Reads tab-separated text pairs; malformed lines are skipped."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.rstrip().split("\t")
            if not is_test:
                if len(data) != 3:
                    continue
                yield {'query1': data[0], 'query2': data[1], 'label': data[2]}
            else:
                if len(data) != 2:
                    continue
                yield {'query1': data[0], 'query2': data[1]}


def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """Tokenizes a query pair into a single point-wise input: input_ids, token_type_ids (and label)."""
    query, title = example["query1"], example["query2"]

    encoded_inputs = tokenizer(
        text=query, text_pair=title, max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
Lines changed: 56 additions & 0 deletions (model.py)
@@ -0,0 +1,56 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

import paddlenlp as ppnlp


class QuestionMatching(nn.Layer):
    def __init__(self, pretrained_model, dropout=None, rdrop_coef=0.0):
        super().__init__()
        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

        # num_labels = 2 (similar or dissimilar)
        self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2)
        self.rdrop_coef = rdrop_coef
        self.rdrop_loss = ppnlp.losses.RDropLoss()

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None,
                do_evaluate=False):

        _, cls_embedding1 = self.ptm(input_ids, token_type_ids, position_ids,
                                     attention_mask)
        cls_embedding1 = self.dropout(cls_embedding1)
        logits1 = self.classifier(cls_embedding1)

        # For more information about R-Drop please refer to this paper: https://arxiv.org/abs/2106.14448
        # For the original implementation please refer to this code: https://github.com/dropreg/R-Drop
        if self.rdrop_coef > 0 and not do_evaluate:
            # Second forward pass with a different dropout mask; the KL between
            # the two sets of logits is the R-Drop regularization term.
            _, cls_embedding2 = self.ptm(input_ids, token_type_ids, position_ids,
                                         attention_mask)
            cls_embedding2 = self.dropout(cls_embedding2)
            logits2 = self.classifier(cls_embedding2)
            kl_loss = self.rdrop_loss(logits1, logits2)
        else:
            kl_loss = 0.0

        return logits1, kl_loss
Lines changed: 120 additions & 0 deletions (predict.py)
@@ -0,0 +1,120 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from functools import partial
import argparse
import sys
import os
import random
import time

import numpy as np
import paddle
import paddle.nn.functional as F
import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad

from data import create_dataloader, read_text_pair, convert_example
from model import QuestionMatching

# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--input_file", type=str, required=True, help="The full path of input file")
parser.add_argument("--result_file", type=str, required=True, help="The result file name")
parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
parser.add_argument("--max_seq_length", default=256, type=int, help="The maximum total input sequence length after tokenization. "
    "Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
args = parser.parse_args()
# yapf: enable


def predict(model, data_loader):
    """
    Predicts the data labels.

    Args:
        model (obj:`QuestionMatching`): A model to calculate whether the question pair is semantically similar or not.
        data_loader (obj:`DataLoader`): The processed data ids of text pairs: [input_ids, token_type_ids]
    Returns:
        batch_logits (obj:`np.ndarray`): The predicted logits for each text pair.
    """
    batch_logits = []

    model.eval()

    with paddle.no_grad():
        for batch_data in data_loader:
            input_ids, token_type_ids = batch_data

            input_ids = paddle.to_tensor(input_ids)
            token_type_ids = paddle.to_tensor(token_type_ids)

            batch_logit, _ = model(
                input_ids=input_ids, token_type_ids=token_type_ids)

            batch_logits.append(batch_logit.numpy())

    batch_logits = np.concatenate(batch_logits, axis=0)

    return batch_logits


if __name__ == "__main__":
    paddle.set_device(args.device)

    pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained(
        'ernie-gram-zh')
    tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained(
        'ernie-gram-zh')

    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        is_test=True)

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment_ids
    ): [data for data in fn(samples)]

    test_ds = load_dataset(
        read_text_pair, data_path=args.input_file, is_test=True, lazy=False)

    test_data_loader = create_dataloader(
        test_ds,
        mode='predict',
        batch_size=args.batch_size,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)

    model = QuestionMatching(pretrained_model)

    if args.params_path and os.path.isfile(args.params_path):
        state_dict = paddle.load(args.params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.params_path)
    else:
        raise ValueError(
            "Please set --params_path with correct pretrained model file")

    y_probs = predict(model, test_data_loader)
    y_preds = np.argmax(y_probs, axis=1)

    with open(args.result_file, 'w', encoding="utf-8") as f:
        for y_pred in y_preds:
            f.write(str(y_pred) + "\n")

0 commit comments
