Commit ab2bd21

bug_fix (#3184)
1 parent 47a2ea5 commit ab2bd21

17 files changed: +231 -227 lines changed

applications/text_classification/hierarchical/README.md

Lines changed: 12 additions & 4 deletions
@@ -47,6 +47,7 @@
 wget https://paddlenlp.bj.bcebos.com/datasets/baidu_extract_2020.tar.gz
 tar -zxvf baidu_extract_2020.tar.gz
 mv baidu_extract_2020 data
+rm baidu_extract_2020.tar.gz
 ```
 
 <div align="center">
@@ -194,6 +195,7 @@ data/
 Training runs on CPU or GPU and defaults to GPU; to train on CPU, just change the device argument to `--device "cpu"`
 ```shell
 python train.py \
+    --dataset_dir "data" \
     --device "gpu" \
     --max_seq_length 128 \
     --model_name "ernie-3.0-medium-zh" \
@@ -205,6 +207,7 @@ python train.py \
 When training in a CPU environment, you can set the `nproc_per_node` argument for multi-core training:
 ```shell
 python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \
+    --dataset_dir "data" \
     --device "gpu" \
     --max_seq_length 128 \
     --model_name "ernie-3.0-medium-zh" \
@@ -217,6 +220,7 @@ python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py
 ```shell
 unset CUDA_VISIBLE_DEVICES
 python -m paddle.distributed.launch --gpus "0" train.py \
+    --dataset_dir "data" \
     --device "gpu" \
     --max_seq_length 128 \
     --model_name "ernie-3.0-medium-zh" \
@@ -260,13 +264,13 @@ checkpoint/
 **NOTE:**
 * To resume training, set `--init_from_ckpt checkpoint/model_state.pdparams`.
 * To train an English text classification task, just switch the pretrained model via `model_name`; "ernie-2.0-base-en" is recommended for English tasks, and more options are listed under [Transformer pretrained models](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer).
-
+* For text classification in languages other than Chinese and English, the multilingual pretrained models "ernie-m-base" and "ernie-m-large" are recommended. Deployment of multilingual text classification models is not supported yet; this feature is under active development.
 #### 2.4.2 Training evaluation and model optimization
 
 The trained model can be evaluated per category with the [model analysis module](./analysis), which also outputs mispredicted samples (bad cases). It runs on GPU by default; on CPU, change the argument to `--device "cpu"`:
 
 ```shell
-python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_path "./bad_case.txt"
+python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_path "./bad_case.txt" --dataset_dir "data" --params_path "./checkpoint"
 ```
 
 Example output:
@@ -307,7 +311,7 @@ Prediction Label Text
 After training finishes, provide the data to be predicted (data.txt) and the label reference list (label.txt) and run prediction with the trained model. GPU is used by default; on CPU, change the argument to `--device "cpu"`
 
 ```shell
-python predict.py --device "gpu" --max_seq_length 128 --batch_size 32
+python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_dir "data"
 ```
 
 Configurable parameters:
@@ -361,10 +365,14 @@ pip install paddleslim==2.2.2
 ```shell
 python prune.py \
     --device "gpu" \
+    --dataset_dir "data" \
+    --output_dir "prune" \
     --per_device_train_batch_size 32 \
     --per_device_eval_batch_size 32 \
     --num_train_epochs 10 \
     --max_seq_length 128 \
+    --logging_steps 5 \
+    --save_steps 100 \
     --width_mult_list '3/4' '2/3' '1/2'
 ```
 
@@ -376,7 +384,7 @@ python prune.py \
 * `per_device_eval_batch_size`: batch size for evaluation on the dev set; adjust it to the available GPU memory and lower it if you run out of memory; defaults to 32.
 * `learning_rate`: maximum learning rate for training; defaults to 3e-5.
 * `num_train_epochs`: number of training epochs; 100 is a reasonable choice when using early stopping; defaults to 10.
-* `logging_steps`: number of steps between log prints during training; defaults to 5.
+* `logging_steps`: number of steps between log prints during training; defaults to 100.
 * `save_steps`: number of steps between model checkpoint saves during training; defaults to 100.
 * `seed`: random seed; defaults to 3.
 * `width_mult_list`: list of retained width ratios for pruning (multi-head), i.e. the fraction of the `q`, `k`, `v` and `ffn` weight width in self-attention to keep; each ratio multiplied by the width (the number of heads) should be an integer (see the sketch after this list); defaults to None.
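As a quick illustration of the integer constraint on `width_mult_list`, the sketch below checks candidate ratios against a head count. It is only a sketch: the 12-head figure is an assumption for ernie-3.0-medium-zh and should be read from the model config in practice.

```python
# Sketch: check that each width_mult ratio keeps an integer number of attention heads.
# num_heads = 12 is an assumption for ernie-3.0-medium-zh; verify against the model config.
from fractions import Fraction

num_heads = 12
for ratio in ["3/4", "2/3", "1/2"]:
    kept = Fraction(ratio) * num_heads
    assert kept.denominator == 1, f"{ratio} x {num_heads} heads is not an integer"
    print(f"{ratio} -> keeps {int(kept)} of {num_heads} heads")
```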

applications/text_classification/hierarchical/analysis/evaluate.py

Lines changed: 4 additions & 6 deletions
@@ -142,9 +142,8 @@ def evaluate():
     probs = []
     labels = []
     for batch in train_data_loader:
-        input_ids, token_type_ids, label = batch['input_ids'], batch[
-            'token_type_ids'], batch['labels']
-        logits = model(input_ids, token_type_ids)
+        label = batch.pop("labels")
+        logits = model(**batch)
         labels.extend(label.numpy())
         probs.extend(F.sigmoid(logits).numpy())
     probs = np.array(probs)
@@ -158,9 +157,8 @@ def evaluate():
     probs = []
     labels = []
     for batch in dev_data_loader:
-        input_ids, token_type_ids, label = batch['input_ids'], batch[
-            'token_type_ids'], batch['labels']
-        logits = model(input_ids, token_type_ids)
+        label = batch.pop("labels")
+        logits = model(**batch)
         labels.extend(label.numpy())
         probs.extend(F.sigmoid(logits).numpy())
     probs = np.array(probs)
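The pattern adopted above (and repeated in train.py and utils.py below) swaps per-key unpacking for `batch.pop("labels")` followed by `model(**batch)`. Below is a minimal sketch of the idiom, assuming each collated batch is a dict keyed by the tokenizer's output names (as `DataCollatorWithPadding` returns), so models whose tokenizers do not emit `token_type_ids` (the ernie-m choices added in train.py appear to be the motivating case) still work. `run_eval_step` is a hypothetical helper, not code from this commit.

```python
import paddle.nn.functional as F


def run_eval_step(model, criterion, batch):
    # batch maps tokenizer output names to tensors,
    # e.g. {"input_ids": ..., "token_type_ids": ..., "labels": ...}.
    labels = batch.pop("labels")  # take the targets out of the input dict
    logits = model(**batch)       # forward only the inputs that actually exist
    loss = criterion(logits, labels)
    probs = F.sigmoid(logits)     # multi-label head: sigmoid per class, not softmax
    return loss, probs, labels
```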

applications/text_classification/hierarchical/predict.py

Lines changed: 39 additions & 48 deletions
@@ -14,15 +14,19 @@
 
 import os
 import argparse
-
+import functools
 import numpy as np
 
 import paddle
 import paddle.nn.functional as F
 from paddlenlp.utils.log import logger
-from paddlenlp.data import Tuple, Pad
+from paddle.io import DataLoader, BatchSampler
+from paddlenlp.data import DataCollatorWithPadding
+from paddlenlp.datasets import load_dataset
 from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
 
+from utils import preprocess_function, read_local_dataset
+
 # yapf: disable
 parser = argparse.ArgumentParser()
 parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.")
@@ -37,42 +41,47 @@
 
 
 @paddle.no_grad()
-def predict(data, label_list):
+def predict():
     """
-    Predicts the data labels.
-    Args:
-
-        data (obj:`List`): The processed data whose each element is one sequence.
-        label_map(obj:`List`): The label id (key) to label str (value) map.
-
+    Predicts the data labels.
     """
     paddle.set_device(args.device)
     model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
     tokenizer = AutoTokenizer.from_pretrained(args.params_path)
 
-    examples = []
-    for text in data:
-        result = tokenizer(text=text, max_seq_len=args.max_seq_length)
-        examples.append((result['input_ids'], result['token_type_ids']))
+    label_list = []
+    label_path = os.path.join(args.dataset_dir, args.label_file)
+    with open(label_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            label_list.append(line.strip())
+
+    data_ds = load_dataset(read_local_dataset,
+                           path=os.path.join(args.dataset_dir, args.data_file),
+                           is_test=True,
+                           lazy=False)
+
+    trans_func = functools.partial(preprocess_function,
+                                   tokenizer=tokenizer,
+                                   max_seq_length=args.max_seq_length,
+                                   label_nums=len(label_list),
+                                   is_test=True)
 
-    # Seperates data into some batches.
-    batches = [
-        examples[i:i + args.batch_size]
-        for i in range(0, len(examples), args.batch_size)
-    ]
+    data_ds = data_ds.map(trans_func)
 
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
-        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
-    ): fn(samples)
+    # batchify dataset
+    collate_fn = DataCollatorWithPadding(tokenizer)
+    data_batch_sampler = BatchSampler(data_ds,
+                                      batch_size=args.batch_size,
+                                      shuffle=False)
+
+    data_data_loader = DataLoader(dataset=data_ds,
+                                  batch_sampler=data_batch_sampler,
+                                  collate_fn=collate_fn)
 
     results = []
     model.eval()
-    for batch in batches:
-        input_ids, token_type_ids = batchify_fn(batch)
-        input_ids = paddle.to_tensor(input_ids)
-        token_type_ids = paddle.to_tensor(token_type_ids)
-        logits = model(input_ids, token_type_ids)
+    for batch in data_data_loader:
+        logits = model(**batch)
         probs = F.sigmoid(logits).numpy()
         for prob in probs:
             labels = []
@@ -81,9 +90,9 @@ def predict(data, label_list):
                     labels.append(label_list[i])
             results.append(labels)
 
-    for text, labels in zip(data, results):
+    for t, labels in zip(data_ds.data, results):
         hierarchical_labels = {}
-        logger.info("text: {}".format(text))
+        logger.info("text: {}".format(t["sentence"]))
         logger.info("prediction result: {}".format(",".join(labels)))
         for label in labels:
             for i, l in enumerate(label.split('##')):
@@ -100,22 +109,4 @@ def predict(data, label_list):
 
 if __name__ == "__main__":
 
-    data_dir = os.path.join(args.dataset_dir, args.data_file)
-    label_dir = os.path.join(args.dataset_dir, args.label_file)
-
-    data = []
-    label_list = []
-
-    with open(data_dir, 'r', encoding='utf-8') as f:
-        lines = f.readlines()
-        for i, line in enumerate(lines):
-            data.append(line.strip())
-    f.close()
-
-    with open(label_dir, 'r', encoding='utf-8') as f:
-        lines = f.readlines()
-        for i, line in enumerate(lines):
-            label_list.append(line.strip())
-    f.close()
-
-    predict(data, label_list)
+    predict()
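For readers unfamiliar with the label scheme, the loop at the end of `predict()` splits each predicted label on `'##'` to report it per hierarchy level. Here is a hedged sketch of that convention with invented label names (the grouping code itself lies outside the hunks shown above):

```python
# Invented multi-label prediction for one sentence; real labels come from label.txt.
prediction = ["news##sports", "news##sports##football", "opinion"]

hierarchical_labels = {}
for label in prediction:
    for depth, name in enumerate(label.split("##")):
        hierarchical_labels.setdefault(depth, set()).add(name)

for depth in sorted(hierarchical_labels):
    print(f"level {depth}: {sorted(hierarchical_labels[depth])}")
# level 0: ['news', 'opinion']
# level 1: ['sports']
# level 2: ['football']
```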

applications/text_classification/hierarchical/train.py

Lines changed: 3 additions & 5 deletions
@@ -40,7 +40,7 @@
 parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.")
 parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
 parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.",
-                    choices=["ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en"])
+                    choices=["ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large"])
 parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
 parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.")
 parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs to perform.")
@@ -178,10 +178,8 @@ def train():
 
         for step, batch in enumerate(train_data_loader, start=1):
 
-            input_ids, token_type_ids, labels = batch['input_ids'], batch[
-                'token_type_ids'], batch['labels']
-
-            logits = model(input_ids, token_type_ids)
+            labels = batch.pop("labels")
+            logits = model(**batch)
             loss = criterion(logits, labels)
 
             probs = F.sigmoid(logits)

applications/text_classification/hierarchical/utils.py

Lines changed: 23 additions & 14 deletions
@@ -34,9 +34,8 @@ def evaluate(model, criterion, metric, data_loader):
     metric.reset()
     losses = []
     for batch in data_loader:
-        input_ids, token_type_ids, labels = batch['input_ids'], batch[
-            'token_type_ids'], batch['labels']
-        logits = model(input_ids, token_type_ids)
+        labels = batch.pop("labels")
+        logits = model(**batch)
         loss = criterion(logits, labels)
         probs = F.sigmoid(logits)
         losses.append(loss.numpy())
@@ -51,7 +50,11 @@ def evaluate(model, criterion, metric, data_loader):
     return micro_f1_score, macro_f1_score
 
 
-def preprocess_function(examples, tokenizer, max_seq_length, label_nums):
+def preprocess_function(examples,
+                        tokenizer,
+                        max_seq_length,
+                        label_nums,
+                        is_test=False):
     """
     Builds model inputs from a sequence for sequence classification tasks
     by concatenating and adding special tokens.
@@ -68,21 +71,27 @@ def preprocess_function(examples, tokenizer, max_seq_length, label_nums):
     """
     result = tokenizer(text=examples["sentence"], max_seq_len=max_seq_length)
     # One-Hot label
-    result["labels"] = [
-        float(1) if i in examples["label"] else float(0)
-        for i in range(label_nums)
-    ]
+    if not is_test:
+        result["labels"] = [
+            float(1) if i in examples["label"] else float(0)
+            for i in range(label_nums)
+        ]
     return result
 
 
-def read_local_dataset(path, label_list):
+def read_local_dataset(path, label_list=None, is_test=False):
     """
     Read dataset
     """
     with open(path, 'r', encoding='utf-8') as f:
         for line in f:
-            items = line.strip().split('\t')
-            sentence = ''.join(items[:-1])
-            label = items[-1]
-            labels = [label_list[l] for l in label.split(',')]
-            yield {'sentence': sentence, 'label': labels}
+            if is_test:
+                items = line.strip().split('\t')
+                sentence = ''.join(items)
+                yield {'sentence': sentence}
+            else:
+                items = line.strip().split('\t')
+                sentence = ''.join(items[:-1])
+                label = items[-1]
+                labels = [label_list[l] for l in label.split(',')]
+                yield {'sentence': sentence, 'label': labels}
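To make the new `is_test` path concrete, here is a hedged example of what `read_local_dataset` yields for a labeled line versus an unlabeled one. The sentences, label names, and temporary files are invented; `label_list` maps label strings to integer ids, which is what the function indexes with.

```python
import tempfile

from utils import read_local_dataset  # assumes this application directory is importable

label_list = {"news": 0, "news##sports": 1}

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("a short match report\tnews,news##sports\n")  # text \t comma-separated labels
    train_path = f.name

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("a short match report\n")  # raw text only
    test_path = f.name

print(list(read_local_dataset(train_path, label_list=label_list)))
# [{'sentence': 'a short match report', 'label': [0, 1]}]
print(list(read_local_dataset(test_path, is_test=True)))
# [{'sentence': 'a short match report'}]
```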
