
Commit 70649b1

Beacontownfc and gongenlei authored

Add model Rembert (#1701)

* add rembert
* add rembert
* Update tokenizer.py
* update rembert
* modify
* modify according to gongel
* Update tokenizer.py
* Update tokenizer.py
* Update modeling.py
* fix bug

Co-authored-by: gongenlei <[email protected]>

1 parent 3021098 commit 70649b1

File tree

8 files changed: +1871 -0 lines changed
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# RemBert with PaddleNLP

[RemBERT: Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821v1.pdf)

**Model overview:**

The authors find that decoupling the input and output word embeddings gives language models greater modeling flexibility, allowing the parameters of a multilingual model's input embeddings to be allocated far more efficiently. By reallocating the input embedding parameters to the Transformer layers, the model achieves better performance on natural language understanding tasks during fine-tuning than models with the same number of parameters. The authors also find that enlarging the output embeddings improves the model's performance, and the benefit persists through fine-tuning even though the output embeddings are discarded once pre-training ends. Their analysis shows that larger output embeddings prevent the model from overfitting to the pre-training dataset and give it stronger generalization on other NLP datasets. Using these findings, they train more capable models without adding parameters at the fine-tuning stage.
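To make the decoupling concrete, here is a minimal sketch in Paddle. The dimensions and layer names are illustrative assumptions, not RemBERT's actual configuration; it only shows the idea of a small input embedding projected up to the encoder width, plus a separate, larger output embedding that can be dropped after pre-training.

```python
# Illustrative only: RemBERT-style decoupled embeddings with made-up sizes.
import paddle
import paddle.nn as nn

vocab_size, input_dim, hidden_dim, output_dim = 1000, 128, 512, 768

input_embedding = nn.Embedding(vocab_size, input_dim)   # small input embedding
input_proj = nn.Linear(input_dim, hidden_dim)           # project up to encoder width
output_proj = nn.Linear(hidden_dim, output_dim)         # project to a larger output space
output_embedding = nn.Linear(output_dim, vocab_size)    # dropped after pre-training

tokens = paddle.randint(0, vocab_size, [2, 8])          # [batch, seq_len]
hidden = input_proj(input_embedding(tokens))            # -> [2, 8, hidden_dim]
# ... the Transformer encoder layers would run here ...
logits = output_embedding(output_proj(hidden))          # -> [2, 8, vocab_size]
```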
## Quick Start

### Fine-tuning on Downstream Tasks

#### Datasets

Download the XTREME-XNLI dataset:

Training set: [download link](https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip)

Test set: [download link](https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip)

The training set is located at `XNLI-MT-1.0/multinli/multinli.train.en.tsv` and the test set at `XNLI-1.0/xnli.test.tsv`.

Download the XTREME-PAWS-X dataset:

[Download link](https://storage.googleapis.com/paws/pawsx/x-final.tar.gz)

For each language, the training, dev, and test sets are the `tsv` files whose names begin with `train`, `dev`, and `test`. After extracting all the language archives, merge the test sets of every language into a single file, since this task is evaluated across multiple languages; a helper sketch follows below.
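As one way to do the merge, here is a minimal sketch. The extraction layout `x-final/<lang>/test_2k.tsv` and the merged output path `data/test_2k.tsv` are assumptions (the processor in this commit reads `test_2k.tsv` from its `data_dir`); adjust both to your setup.

```python
# merge_pawsx_tests.py — a minimal sketch; paths are assumptions, adjust to your layout.
import glob
import os

SRC_PATTERN = "x-final/*/test_2k.tsv"  # assumed PAWS-X extraction layout
DST = "data/test_2k.tsv"               # assumed data_dir file read by the processor

os.makedirs(os.path.dirname(DST), exist_ok=True)
with open(DST, "w", encoding="utf-8") as out:
    for i, path in enumerate(sorted(glob.glob(SRC_PATTERN))):
        with open(path, encoding="utf-8") as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # keep a single header row; it is skipped at load time
            for line in f:
                out.write(line)
```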
#### 1. XTREME-XNLI

Taking the XTREME-XNLI dataset as an example, run the following command to train RemBert and evaluate its accuracy on XTREME-XNLI:

```shell
python -m paddle.distributed.launch examples/language_model/rembert/main.py \
    --model_type rembert \
    --data_dir data/ \
    --output_dir output/ \
    --device gpu \
    --learning_rate 1e-5 \
    --num_train_epochs 3 \
    --train_batch_size 16 \
    --do_train \
    --do_eval \
    --task xnli \
    --eval_step 500
```
The parameters are as follows:

- `model_type`: the model type; currently `rembert` is supported.
- `data_dir`: the dataset path.
- `train_batch_size`: the number of samples **per card** per iteration.
- `learning_rate`: the base learning rate; it is multiplied by the value produced by the learning rate scheduler to obtain the current learning rate.
- `output_dir`: the directory where the model is saved.
- `device`: the device type. Defaults to GPU; can be set to CPU, GPU, or XPU. For multi-GPU training, set it to GPU and set the environment variable CUDA_VISIBLE_DEVICES to the GPU ids to use.
- `num_train_epochs`: the number of training epochs.
- `do_train`: whether to run training.
- `do_eval`: whether to run evaluation.
- `task`: the task to train on.
- `eval_step`: evaluate the model every this many training steps.
When training finishes, the model is evaluated, and you will see a result like the following:

```bash
Accuracy 0.8089
```
#### 2. XTREME-PAWS-X

To train on this dataset, use the following command:

```shell
python -m paddle.distributed.launch examples/language_model/rembert/main.py \
    --model_type rembert \
    --data_dir data/ \
    --output_dir output/ \
    --device gpu \
    --learning_rate 8e-6 \
    --num_train_epochs 3 \
    --train_batch_size 16 \
    --do_train \
    --do_eval \
    --task paws \
    --eval_step 500
```
When training finishes, the model is evaluated on the merged multilingual test set, and you will see a result like the following:

```bash
Accuracy 0.8778
```
# Reference

```bibtex
@article{chung2020rethinking,
  title={Rethinking embedding coupling in pre-trained language models},
  author={Chung, Hyung Won and Fevry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
  journal={arXiv preprint arXiv:2010.12821},
  year={2020}
}
```
Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import csv
import os

from paddle.io import Dataset
from paddlenlp.transformers import RemBertTokenizer

tokenization = RemBertTokenizer.from_pretrained('rembert')


class InputExample(object):
    """Stores a single train/dev/test example."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class MrpcProcessor(object):
    """Loads the dataset and converts each example's text to ids."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev_2k.tsv")), "dev")

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test_2k.tsv")), "test")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:  # skip the header row
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization(line[1])['input_ids']
            text_b = tokenization(line[2])['input_ids']
            label = int(line[3])
            examples.append(
                InputExample(
                    guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding='utf-8') as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines


class XNLIProcessor(object):
    """Loads the dataset and converts each example's text to ids."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "multinli.train.en.tsv")),
            "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "xnli.test.tsv")), "test")

    def get_labels(self):
        return ["neutral", "entailment", "contradictory"]

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:  # skip the header row
                continue
            guid = "%s-%s" % (set_type, i)
            if set_type == 'train':
                # Train file columns: premise, hypothesis, label.
                text_a = ' '.join(line[0].strip().split(' '))
                text_b = ' '.join(line[1].strip().split(' '))
                text_a = tokenization(text_a)['input_ids']
                text_b = tokenization(text_b)['input_ids']
                label = self.get_labels().index(line[2].strip())
                examples.append(
                    InputExample(
                        guid=guid, text_a=text_a, text_b=text_b, label=label))
            else:
                # Dev/test file columns: label in column 1, sentences in 6 and 7.
                text_a = ' '.join(line[6].strip().split(' '))
                text_b = ' '.join(line[7].strip().split(' '))
                if line[1] == 'contradiction':
                    # The eval files say 'contradiction'; normalize it to the
                    # 'contradictory' label used by the train file.
                    line[1] = 'contradictory'
                label = self.get_labels().index(line[1].strip())
                text_a = tokenization(text_a)['input_ids']
                text_b = tokenization(text_b)['input_ids']
                examples.append(
                    InputExample(
                        guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding='utf-8') as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines


class DataGenerator(Dataset):
    """Data generator is used to feed features into dataloader."""

    def __init__(self, features):
        super(DataGenerator, self).__init__()
        self.features = features

    def __getitem__(self, item):
        text_a = self.features[item].text_a
        text_b = self.features[item].text_b
        # Segment ids: 0 for the first sentence, 1 for the second.
        text_a_token_type_ids = [0] * len(text_a)
        text_b_token_type_ids = [1] * len(text_b)
        label = [self.features[item].label]

        return dict(
            text_a=text_a,
            text_b=text_b,
            text_a_token_type_ids=text_a_token_type_ids,
            text_b_token_type_ids=text_b_token_type_ids,
            label=label)

    def __len__(self):
        return len(self.features)
```
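For orientation, here is a hypothetical sketch of how these classes could be wired into a `paddle.io.DataLoader`. It is not the repo's `main.py`: the way `text_a`/`text_b` are joined and padded below is an assumption for illustration.

```python
# Hypothetical usage sketch (not the repo's main.py).
import numpy as np
from paddle.io import DataLoader

def collate(batch, pad_id=0):
    """Concatenate each example's two tokenized sentences and pad to max length.
    How main.py actually joins the pair is an assumption here."""
    input_ids, token_type_ids, labels = [], [], []
    for ex in batch:
        input_ids.append(ex['text_a'] + ex['text_b'])
        token_type_ids.append(
            ex['text_a_token_type_ids'] + ex['text_b_token_type_ids'])
        labels.append(ex['label'][0])
    max_len = max(len(ids) for ids in input_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in input_ids]
    token_type_ids = [t + [0] * (max_len - len(t)) for t in token_type_ids]
    return (np.array(input_ids, dtype='int64'),
            np.array(token_type_ids, dtype='int64'),
            np.array(labels, dtype='int64'))

processor = XNLIProcessor()
features = DataGenerator(processor.get_train_examples('data/'))
loader = DataLoader(features, batch_size=16, shuffle=True, collate_fn=collate)
```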
