Commit 5064b11

update text classification (#4279)
* update_text_classification
* update_according_to_the_comments
1 parent df0bd27 commit 5064b11

34 files changed (+225 -467 lines)

applications/text_classification/hierarchical/README.md

Lines changed: 8 additions & 23 deletions
@@ -65,7 +65,7 @@ rm baidu_extract_2020.tar.gz
 
 - python >= 3.6
 - paddlepaddle >= 2.3
-- paddlenlp >= 2.4
+- paddlenlp >= 2.4.8
 - scikit-learn >= 1.0.2
 
 **Install PaddlePaddle:**
@@ -183,7 +183,7 @@ data/
 
 #### 2.4.1 Fine-tune the pretrained model
 
-Training runs on CPU or GPU and defaults to GPU; to train on CPU, just change the device argument to `--device "cpu"`.
+Training runs on CPU or GPU and defaults to GPU; to train on CPU, just change the device argument to `--device cpu`, and `--device gpu:0` can be used to pick a specific GPU card.
 ```shell
 python train.py \
     --dataset_dir "data" \
@@ -195,18 +195,6 @@ python train.py \
     --epochs 100
 ```
 
-If training in a CPU environment, you can set the `nproc_per_node` argument to train on multiple cores:
-```shell
-python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" train.py \
-    --dataset_dir "data" \
-    --device "cpu" \
-    --max_seq_length 128 \
-    --model_name "ernie-3.0-medium-zh" \
-    --batch_size 32 \
-    --early_stop \
-    --epochs 100
-```
-
 In a GPU environment, you can set the `gpus` argument for single- or multi-card training. For multi-card training, list several GPU card numbers, e.g. --gpus "0,1". If the machine has only one GPU, its card number defaults to 0; use the `nvidia-smi` command to check GPU usage.
 
 ```shell
@@ -248,12 +236,12 @@ python -m paddle.distributed.launch --gpus "0" train.py \
 
 ```text
 checkpoint/
-├── model_config.json
-├── model_state.pdparams
-├── tokenizer_config.json
-└── vocab.txt
+├── config.json             # model config file (model_config.json before paddlenlp 2.4.5)
+├── model_state.pdparams    # model weights file
+├── tokenizer_config.json   # tokenizer config file
+├── vocab.txt
+└── ...
 ```
-
 **NOTE:**
 * To resume model training, set `--init_from_ckpt checkpoint/model_state.pdparams`.
 * To train an English text classification task, just switch the pretrained model via `model_name`; "ernie-2.0-base-en" and "ernie-2.0-large-en" are recommended for English tasks.
@@ -276,19 +264,16 @@ python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32
 
 ```text
 [2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model-------
-[2022-08-11 03:10:14,059] [ INFO] - Train dataset size: 11958
 [2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498
 [2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19%
 [2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22
 [2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26
 [2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93
 [2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 471(3.9%) | precision: 99.57 | recall: 98.94 | F1 score 99.25
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44
 [2022-08-11 03:10:14,256] [ INFO] - ----------------------------
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 98(0.8%) | precision: 100.00 | recall: 100.00 | F1 score 100.00
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00
 ...
 ```
@@ -444,7 +429,7 @@ prune/
 
 - For offline deployment, see the [offline deployment guide](deploy/predictor/README.md).
 
-- For online serving, see the [Paddle Serving deployment guide](deploy/paddle_serving/README.md) (Paddle Serving supports X86, Arm CPU, NVIDIA GPU, Kunlun/Ascend, and other hardware) or the [Triton deployment guide](deploy/triton_serving/README.md).
+- For online serving, see the [PaddleNLP SimpleServing deployment guide](deploy/simple_serving/README.md) or the [Triton deployment guide](deploy/triton_serving/README.md).
 
 <a name="模型效果"></a>
 
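A note on the checkpoint hunk above: the NOTE bullet resumes training by pointing `--init_from_ckpt` at the weights file inside the saved directory. A minimal sketch of what that amounts to, assuming the checkpoint layout shown above (the data path, model name, and label handling here are illustrative, not code from this commit):

```python
# Minimal sketch: resume fine-tuning from a saved checkpoint's weights file.
# Paths, model name, and label handling are illustrative assumptions.
import paddle

from paddlenlp.transformers import AutoModelForSequenceClassification

label_list = open("data/label.txt", encoding="utf-8").read().splitlines()
model = AutoModelForSequenceClassification.from_pretrained(
    "ernie-3.0-medium-zh", num_classes=len(label_list)
)
# --init_from_ckpt points at the weights file, not the directory:
state_dict = paddle.load("checkpoint/model_state.pdparams")
model.set_state_dict(state_dict)  # restore fine-tuned weights, then keep training
```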
applications/text_classification/hierarchical/analysis/README.md

Lines changed: 2 additions & 5 deletions
@@ -73,11 +73,10 @@ python evaluate.py \
 Configurable parameters:
 
 * `device`: device to run on; one of cpu, gpu, xpu, npu; defaults to "gpu".
-* `dataset_dir`: required; local dataset directory, which should contain the train.txt, dev.txt, and label.txt files; defaults to None.
+* `dataset_dir`: required; local dataset directory, which should contain the dev.txt and label.txt files; defaults to None.
 * `params_path`: directory of the saved trained model; defaults to "../checkpoint/".
 * `max_seq_length`: maximum sequence length used by the tokenizer, at most 2048 for ERNIE models. Choose it based on text length, typically 128, 256, or 512; lower it if GPU memory runs out; defaults to 128.
 * `batch_size`: batch size; adjust to the available GPU memory and lower it if memory runs out; defaults to 32.
-* `train_file`: training set file name in the local dataset; defaults to "train.txt".
 * `dev_file`: dev set file name in the local dataset; defaults to "dev.txt".
 * `label_file`: label set file name in the local dataset; defaults to "label.txt".
 * `bad_case_path`: path for saving mispredicted dev-set samples; defaults to "/bad_case.txt".
@@ -87,19 +86,17 @@
 
 ```text
 [2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model-------
-[2022-08-11 03:10:14,059] [ INFO] - Train dataset size: 11958
+
 [2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498
 [2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19%
 [2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22
 [2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26
 [2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93
 [2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 471(3.9%) | precision: 99.57 | recall: 98.94 | F1 score 99.25
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44
 [2022-08-11 03:10:14,256] [ INFO] - ----------------------------
 [2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见
-[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in train dataset: 98(0.8%) | precision: 100.00 | recall: 100.00 | F1 score 100.00
 [2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00
 ...
 ```

applications/text_classification/hierarchical/analysis/aug.py

Lines changed: 7 additions & 2 deletions
@@ -12,10 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import os
 import argparse
+
 import paddle
-from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap
+
+from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -49,6 +50,8 @@ def aug():
         s, l = line.strip().split("\t")
 
         augs = aug.augment(s)
+        if not isinstance(augs[0], str):
+            augs = augs[0]
         for a in augs:
             f2.write(a + "\t" + l + "\n")
     f1.close(), f2.close()
@@ -67,6 +70,8 @@ def aug():
         for i in range(args.create_n):
             i = count % len(aug)
             augs = aug[i].augment(s)
+            if not isinstance(augs[0], str):
+                augs = augs[0]
             count += 1
             for a in augs:
                 f2.write(a + "\t" + l + "\n")
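Both hunks guard against a change in the `paddlenlp.dataaug` return shape: older versions returned a flat list of augmented strings, while newer versions return one inner list per input sentence. A standalone sketch of the same normalization, assuming the `WordSubstitute` synonym strategy and an illustrative input sentence:

```python
# Minimal sketch: normalize dataaug output across paddlenlp versions.
# Older dataaug returned ["aug1", "aug2"]; newer versions return [["aug1", "aug2"]].
from paddlenlp.dataaug import WordSubstitute

aug = WordSubstitute("synonym", create_n=2, aug_percent=0.1)
augs = aug.augment("今天天气很好")  # illustrative input sentence
if not isinstance(augs[0], str):
    augs = augs[0]  # unwrap the per-sentence inner list for a single input
for a in augs:
    print(a)
```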

applications/text_classification/hierarchical/analysis/dirty.py

Lines changed: 7 additions & 17 deletions
@@ -12,25 +12,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import re
-import json
+import argparse
 import functools
-import random
-import time
 import os
-import argparse
+import random
 
 import numpy as np
-
 import paddle
-import paddle.nn.functional as F
-from paddle.metric import Accuracy
-from paddle.io import DataLoader, BatchSampler, DistributedBatchSampler
+from paddle.io import BatchSampler, DataLoader
+from trustai.interpretation import RepresenterPointModel
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer, LinearDecayWithWarmup
-from paddlenlp.utils.log import logger
-from trustai.interpretation import RepresenterPointModel
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -117,11 +111,7 @@ def run():
     set_seed(args.seed)
     paddle.set_device(args.device)
     # Define model & tokenizer
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
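The deleted five-line check hard-coded `model_config.json`, which no longer exists once checkpoints save `config.json` (paddlenlp >= 2.4.5), so valid checkpoints failed the guard. Checking only that the directory exists and letting `from_pretrained` resolve the file layout avoids pinning file names. A standalone sketch of the pattern (the `./checkpoint` path is an illustrative assumption):

```python
# Minimal sketch: let from_pretrained resolve the checkpoint layout instead of
# hard-coding file names (config.json vs. the older model_config.json).
import os

from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

params_path = "./checkpoint"  # illustrative path
if os.path.exists(params_path):
    model = AutoModelForSequenceClassification.from_pretrained(params_path)
    tokenizer = AutoTokenizer.from_pretrained(params_path)
else:
    raise ValueError("The {} should exist.".format(params_path))
```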

applications/text_classification/hierarchical/analysis/evaluate.py

Lines changed: 7 additions & 39 deletions
@@ -12,16 +12,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import argparse
 import functools
 import os
-import argparse
 
 import numpy as np
-from sklearn.metrics import accuracy_score, classification_report, f1_score
-
 import paddle
-from paddle.io import DataLoader, BatchSampler
 import paddle.nn.functional as F
+from paddle.io import BatchSampler, DataLoader
+from sklearn.metrics import accuracy_score, classification_report, f1_score
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
 from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
@@ -30,11 +30,10 @@
 # yapf: disable
 parser = argparse.ArgumentParser()
 parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.")
-parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt")
+parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt")
 parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.")
 parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
 parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.")
-parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name")
 parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name")
 parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name")
 parser.add_argument("--bad_case_file", type=str, default="./bad_case.txt", help="Bad case saving file path")
@@ -78,19 +77,15 @@ def evaluate():
     Evaluate the model performance
     """
     paddle.set_device(args.device)
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    # Define model & tokenizer
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
         raise ValueError("The {} should exist.".format(args.params_path))
 
     # load and preprocess dataset
     label_path = os.path.join(args.dataset_dir, args.label_file)
-    train_path = os.path.join(args.dataset_dir, args.train_file)
     dev_path = os.path.join(args.dataset_dir, args.dev_file)
 
     label_list = {}
@@ -107,35 +102,18 @@ def evaluate():
             if ll not in label_map_dict[ii]:
                 iii = len(label_map_dict[ii])
                 label_map_dict[ii][ll] = iii
-    train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False)
     dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False)
     trans_func = functools.partial(
         preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list)
     )
-    train_ds = train_ds.map(trans_func)
     dev_ds = dev_ds.map(trans_func)
 
     # batchify dataset
     collate_fn = DataCollatorWithPadding(tokenizer)
-    train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False)
-    train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn)
     dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False)
     dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn)
 
     model.eval()
-
-    probs = []
-    labels = []
-    for batch in train_data_loader:
-        label = batch.pop("labels")
-        logits = model(**batch)
-        labels.extend(label.numpy())
-        probs.extend(F.sigmoid(logits).numpy())
-    probs = np.array(probs)
-    labels = np.array(labels)
-    preds = probs > 0.5
-    report_train = classification_report(labels, preds, digits=4, output_dict=True)
-
     probs = []
     labels = []
     for batch in dev_data_loader:
@@ -166,7 +144,6 @@ def evaluate():
                 preds_dict[ii][-1][label_map_dict[ii][sub_l]] = 1
 
     logger.info("-----Evaluate model-------")
-    logger.info("Train dataset size: {}".format(len(train_ds)))
     logger.info("Dev dataset size: {}".format(len(dev_ds)))
     logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100))
     logger.info(
@@ -195,15 +172,6 @@ def evaluate():
 
     for i in label_map:
         logger.info("Class name: {}".format(label_map[i]))
-        logger.info(
-            "Evaluation examples in train dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
-                report_train[str(i)]["support"],
-                100 * report_train[str(i)]["support"] / len(train_ds),
-                report_train[str(i)]["precision"] * 100,
-                report_train[str(i)]["recall"] * 100,
-                report_train[str(i)]["f1-score"] * 100,
-            )
-        )
         logger.info(
             "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format(
                 report[str(i)]["support"],
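The deleted code duplicated, for the train split, the prediction-and-report pattern the script still runs on the dev split: sigmoid probabilities thresholded at 0.5, then `classification_report` per class. A self-contained sketch of that pattern with random stand-in arrays, so it runs without a checkpoint (shapes and seed are illustrative):

```python
# Minimal sketch of the multi-label evaluation pattern evaluate.py keeps for
# the dev split: sigmoid outputs thresholded at 0.5, then a per-class report.
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
probs = rng.random((1498, 5))                       # stand-in for F.sigmoid(model(**batch))
labels = (rng.random((1498, 5)) > 0.8).astype(int)  # stand-in multi-hot dev labels

preds = probs > 0.5  # independent per-label decision (multi-label)
report = classification_report(labels, preds, digits=4, output_dict=True)
# report[str(i)]["support"] etc. feed the per-class log lines shown above.
print(report["micro avg"]["f1-score"])
```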

applications/text_classification/hierarchical/analysis/sent_interpret.py

Lines changed: 8 additions & 12 deletions
@@ -12,20 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import argparse
 import functools
-import random
 import os
-import argparse
-import numpy as np
+import random
 
+import numpy as np
 import paddle
-import paddle.nn.functional as F
-from paddle.io import DataLoader, BatchSampler
+from paddle.io import BatchSampler, DataLoader
+from trustai.interpretation import FeatureSimilarityModel
+
 from paddlenlp.data import DataCollatorWithPadding
 from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer, LinearDecayWithWarmup
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
 from paddlenlp.utils.log import logger
-from trustai.interpretation import FeatureSimilarityModel
 
 # yapf: disable
 parser = argparse.ArgumentParser()
@@ -98,11 +98,7 @@ def find_positive_influence_data():
     paddle.set_device(args.device)
 
     # Define model & tokenizer
-    if (
-        os.path.exists(os.path.join(args.params_path, "model_state.pdparams"))
-        and os.path.exists(os.path.join(args.params_path, "model_config.json"))
-        and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json"))
-    ):
+    if os.path.exists(args.params_path):
         model = AutoModelForSequenceClassification.from_pretrained(args.params_path)
         tokenizer = AutoTokenizer.from_pretrained(args.params_path)
     else:
