Commit 5064b11

update text classification (#4279)
* update_text_classification
* update_according_to_the_comments
1 parent df0bd27 commit 5064b11

34 files changed (+225 -467 lines)

applications/text_classification/hierarchical/README.md

Lines changed: 8 additions & 23 deletions
@@ -65,7 +65,7 @@ rm baidu_extract_2020.tar.gz
 
 - python >= 3.6
 - paddlepaddle >= 2.3
-- paddlenlp >= 2.4
+- paddlenlp >= 2.4.8
 - scikit-learn >= 1.0.2
 
 **Install PaddlePaddle:**
@@ -183,7 +183,7 @@ data/
 
 #### 2.4.1 Fine-tune the pretrained model
 
-Training runs on CPU or GPU and defaults to GPU; to train on CPU, just change the device argument to `--device "cpu"`.
+Training runs on CPU or GPU and defaults to GPU; to train on CPU, just change the device argument to `--device cpu`, and `--device gpu:0` can be used to pick a specific GPU card.
 ```shell
 python train.py \
     --dataset_dir "data" \
@@ -195,18 +195,6 @@ python train.py \
     --epochs 100
 ```
 
-If training in a CPU environment, you can set the `nproc_per_node` argument to train on multiple cores:
-```shell
-python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \
-    --dataset_dir "data" \
-    --device "cpu" \
-    --max_seq_length 128 \
-    --model_name "ernie-3.0-medium-zh" \
-    --batch_size 32 \
-    --early_stop \
-    --epochs 100
-```
-
 In a GPU environment, you can set the `gpus` argument for single- or multi-card training. For multi-card training, list several GPU card numbers, e.g. --gpus "0,1". If the machine has only one GPU, its card number defaults to 0; use the `nvidia-smi` command to check GPU usage.
 
 ```shell
@@ -248,12 +236,12 @@ python -m paddle.distributed.launch --gpus "0" train.py \
 
 ```text
 checkpoint/
-├── model_config.json
-├── model_state.pdparams
-├── tokenizer_config.json
-└── vocab.txt
+├── config.json             # model config file (model_config.json before paddlenlp 2.4.5)
+├── model_state.pdparams    # model weights file
+├── tokenizer_config.json   # tokenizer config file
+├── vocab.txt
+└── ...
 ```
-
 **NOTE:**
 * To resume model training, set `--init_from_ckpt checkpoint/model_state.pdparams`.
 * To train an English text classification task, just switch the pretrained model via `model_name`; "ernie-2.0-base-en" and "ernie-2.0-large-en" are recommended for English tasks.
@@ -276,19 +264,16 @@ python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32
 
 ```text
 [2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model-------
-[2022-08-11 03:10:14,059] [ INFO] - Train dataset size: 11958
 [2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498
 [2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19%
 [2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22
 [2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26
 [2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93
 [2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 471(3.9%) | precision: 99.57 | recall: 98.94 | F1 score 99.25
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44
 [2022-08-11 03:10:14,256] [ INFO] - ----------------------------
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 98(0.8%) | precision: 100.00 | recall: 100.00 | F1 score 100.00
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00
 ...
 ```
@@ -444,7 +429,7 @@ prune/
 
 - For offline deployment, see the [offline deployment guide](deploy/predictor/README.md).
 
-- For online serving, see the [Paddle Serving deployment guide](deploy/paddle_serving/README.md) (Paddle Serving supports X86, Arm CPU, NVIDIA GPU, Kunlun/Ascend, and other hardware) or the [Triton deployment guide](deploy/triton_serving/README.md).
+- For online serving, see the [PaddleNLP SimpleServing deployment guide](deploy/simple_serving/README.md) or the [Triton deployment guide](deploy/triton_serving/README.md).
 
 <a name="模型效果"></a>
 
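A note on the checkpoint hunk above: the NOTE bullet resumes training by pointing `--init_from_ckpt` at the weights file inside the saved directory. A minimal sketch of what that amounts to, assuming the checkpoint layout shown above (the data path, model name, and label handling here are illustrative, not code from this commit):

```python
# Minimal sketch: resume fine-tuning from a saved checkpoint's weights file.
# Paths, model name, and label handling are illustrative assumptions.
import paddle

from paddlenlp.transformers import AutoModelForSequenceClassification

label_list = open("data/label.txt", encoding="utf-8").read().splitlines()
model = AutoModelForSequenceClassification.from_pretrained(
    "ernie-3.0-medium-zh", num_classes=len(label_list)
)
# --init_from_ckpt points at the weights file, not the directory:
state_dict = paddle.load("checkpoint/model_state.pdparams")
model.set_state_dict(state_dict)  # restore fine-tuned weights, then keep training
```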
applications/text_classification/hierarchical/analysis/README.md

Lines changed: 2 additions & 5 deletions
@@ -73,11 +73,10 @@ python evaluate.py \
 Configurable parameters:
 
 * `device`: device to run on; one of cpu, gpu, xpu, npu; defaults to "gpu".
-* `dataset_dir`: required; local dataset directory, which should contain the train.txt, dev.txt, and label.txt files; defaults to None.
+* `dataset_dir`: required; local dataset directory, which should contain the dev.txt and label.txt files; defaults to None.
 * `params_path`: directory of the saved trained model; defaults to "../checkpoint/".
 * `max_seq_length`: maximum sequence length used by the tokenizer, at most 2048 for ERNIE models. Choose it based on text length, typically 128, 256, or 512; lower it if GPU memory runs out; defaults to 128.
 * `batch_size`: batch size; adjust to the available GPU memory and lower it if memory runs out; defaults to 32.
-* `train_file`: training set file name in the local dataset; defaults to "train.txt".
 * `dev_file`: dev set file name in the local dataset; defaults to "dev.txt".
 * `label_file`: label set file name in the local dataset; defaults to "label.txt".
 * `bad_case_path`: path for saving mispredicted dev-set samples; defaults to "/bad_case.txt".
@@ -87,19 +86,17 @@
 
 ```text
 [2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model-------
-[2022-08-11 03:10:14,059] [ INFO] - Train dataset size: 11958
+
 [2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498
 [2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19%
 [2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22
 [2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26
 [2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93
 [2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 471(3.9%) | precision: 99.57 | recall: 98.94 | F1 score 99.25
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44
 [2022-08-11 03:10:14,256] [ INFO] - ----------------------------
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 98(0.8%) | precision: 100.00 | recall: 100.00 | F1 score 100.00
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00
 ...
 ```

applications/text_classification/hierarchical/analysis/aug.py

Lines changed: 7 additions & 2 deletions
@@ -12,10 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import os
 import argparse
+
 import paddle
-from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap
+
+from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -49,6 +50,8 @@ def aug():
         s, l = line.strip().split("\t")
 
         augs = aug.augment(s)
+        if not isinstance(augs[0], str):
+            augs = augs[0]
         for a in augs:
             f2.write(a + "\t" + l + "\n")
     f1.close(), f2.close()
@@ -67,6 +70,8 @@ def aug():
         for i in range(args.create_n):
             i = count % len(aug)
             augs = aug[i].augment(s)
+            if not isinstance(augs[0], str):
+                augs = augs[0]
             count += 1
             for a in augs:
                 f2.write(a + "\t" + l + "\n")
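Both hunks guard against a change in the `paddlenlp.dataaug` return shape: older versions returned a flat list of augmented strings, while newer versions return one inner list per input sentence. A standalone sketch of the same normalization, assuming the `WordSubstitute` synonym strategy and an illustrative input sentence:

```python
# Minimal sketch: normalize dataaug output across paddlenlp versions.
# Older dataaug returned ["aug1", "aug2"]; newer versions return [["aug1", "aug2"]].
from paddlenlp.dataaug import WordSubstitute

aug = WordSubstitute("synonym", create_n=2, aug_percent=0.1)
augs = aug.augment("今天天气很好")  # illustrative input sentence
if not isinstance(augs[0], str):
    augs = augs[0]  # unwrap the per-sentence inner list for a single input
for a in augs:
    print(a)
```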

applications/text_classification/hierarchical/analysis/dirty.py

Lines changed: 7 additions & 17 deletions
@@ -12,25 +12,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import re
-import json
+import argparse
 import functools
-import random
-import time
 import os
-import argparse
+import random
 
 import numpy as np
-
 import paddle
-import paddle.nn.functional as F
-from paddle.metric import Accuracy
-from paddle.io import DataLoader, BatchSampler, DistributedBatchSampler
+from paddle.io import BatchSampler, DataLoader
+from trustai.interpretation import RepresenterPointModel
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer, LinearDecayWithWarmup
-from paddlenlp.utils.log import logger
-from trustai.interpretation import RepresenterPointModel
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -117,11 +111,7 @@ def run():
     set_seed(args.seed)
     paddle.set_device(args.device)
     # Define model & tokenizer
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
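The deleted five-line check hard-coded `model_config.json`, which no longer exists once checkpoints save `config.json` (paddlenlp >= 2.4.5), so valid checkpoints failed the guard. Checking only that the directory exists and letting `from_pretrained` resolve the file layout avoids pinning file names. A standalone sketch of the pattern (the `./checkpoint` path is an illustrative assumption):

```python
# Minimal sketch: let from_pretrained resolve the checkpoint layout instead of
# hard-coding file names (config.json vs. the older model_config.json).
import os

from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

params_path = "./checkpoint"  # illustrative path
if os.path.exists(params_path):
    model = AutoModelForSequenceClassification.from_pretrained(params_path)
    tokenizer = AutoTokenizer.from_pretrained(params_path)
else:
    raise ValueError("The {} should exist.".format(params_path))
```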

applications/text_classification/hierarchical/analysis/evaluate.py

Lines changed: 7 additions & 39 deletions
@@ -12,16 +12,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import argparse
 import functools
 import os
-import argparse
 
 import numpy as np
-from sklearn.metrics import accuracy_score, classification_report, f1_score
-
 import paddle
-from paddle.io import DataLoader, BatchSampler
 import paddle.nn.functional as F
+from paddle.io import BatchSampler, DataLoader
+from sklearn.metrics import accuracy_score, classification_report, f1_score
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
 from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
@@ -30,11 +30,10 @@
 # yapf: disable
 parser = argparse.ArgumentParser()
 parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.")
-parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt")
+parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt")
 parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.")
 parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
 parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.")
-parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name")
 parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name")
 parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name")
 parser.add_argument("--bad_case_file", type=str, default="./bad_case.txt", help="Bad case saving file path")
@@ -78,19 +77,15 @@ def evaluate():
     Evaluate the model performance
     """
     paddle.set_device(args.device)
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    # Define model & tokenizer
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
         raise ValueError("The {} should exist.".format(args.params_path))
 
     # load and preprocess dataset
     label_path = os.path.join(args.dataset_dir, args.label_file)
-    train_path = os.path.join(args.dataset_dir, args.train_file)
     dev_path = os.path.join(args.dataset_dir, args.dev_file)
 
     label_list = {}
@@ -107,35 +102,18 @@ def evaluate():
             if ll not in label_map_dict[ii]:
                 iii = len(label_map_dict[ii])
                 label_map_dict[ii][ll] = iii
-    train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False)
     dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False)
     trans_func = functools.partial(
         preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list)
     )
-    train_ds = train_ds.map(trans_func)
     dev_ds = dev_ds.map(trans_func)
 
     # batchify dataset
     collate_fn = DataCollatorWithPadding(tokenizer)
-    train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False)
-    train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn)
     dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False)
     dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn)
 
     model.eval()
-
-    probs = []
-    labels = []
-    for batch in train_data_loader:
-        label = batch.pop("labels")
-        logits = model(**batch)
-        labels.extend(label.numpy())
-        probs.extend(F.sigmoid(logits).numpy())
-    probs = np.array(probs)
-    labels = np.array(labels)
-    preds = probs > 0.5
-    report_train = classification_report(labels, preds, digits=4, output_dict=True)
-
     probs = []
     labels = []
     for batch in dev_data_loader:
@@ -166,7 +144,6 @@ def evaluate():
                 preds_dict[ii][-1][label_map_dict[ii][sub_l]] = 1
 
     logger.info("-----Evaluate model-------")
-    logger.info("Train dataset size: {}".format(len(train_ds)))
     logger.info("Dev dataset size: {}".format(len(dev_ds)))
     logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100))
     logger.info(
@@ -195,15 +172,6 @@ def evaluate():
 
     for i in label_map:
         logger.info("Class name: {}".format(label_map[i]))
-        logger.info(
-            "Evaluation examples in train dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
-                report_train[str(i)]["support"],
-                100 * report_train[str(i)]["support"] / len(train_ds),
-                report_train[str(i)]["precision"] * 100,
-                report_train[str(i)]["recall"] * 100,
-                report_train[str(i)]["f1-score"] * 100,
-            )
-        )
         logger.info(
             "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
                 report[str(i)]["support"],
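The deleted code duplicated, for the train split, the prediction-and-report pattern the script still runs on the dev split: sigmoid probabilities thresholded at 0.5, then `classification_report` per class. A self-contained sketch of that pattern with random stand-in arrays, so it runs without a checkpoint (shapes and seed are illustrative):

```python
# Minimal sketch of the multi-label evaluation pattern evaluate.py keeps for
# the dev split: sigmoid outputs thresholded at 0.5, then a per-class report.
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
probs = rng.random((1498, 5))                       # stand-in for F.sigmoid(model(**batch))
labels = (rng.random((1498, 5)) > 0.8).astype(int)  # stand-in multi-hot dev labels

preds = probs > 0.5  # independent per-label decision (multi-label)
report = classification_report(labels, preds, digits=4, output_dict=True)
# report[str(i)]["support"] etc. feed the per-class log lines shown above.
print(report["micro avg"]["f1-score"])
```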

applications/text_classification/hierarchical/analysis/sent_interpret.py

Lines changed: 8 additions & 12 deletions
@@ -12,20 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import argparse
 import functools
-import random
 import os
-import argparse
-import numpy as np
+import random
 
+import numpy as np
 import paddle
-import paddle.nn.functional as F
-from paddle.io import DataLoader, BatchSampler
+from paddle.io import BatchSampler, DataLoader
+from trustai.interpretation import FeatureSimilarityModel
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer, LinearDecayWithWarmup
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
 from paddlenlp.utils.log import logger
-from trustai.interpretation import FeatureSimilarityModel
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -98,11 +98,7 @@ def find_positive_influence_data():
     paddle.set_device(args.device)
 
     # Define model & tokenizer
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
