After training with the Bert demo, every call to predict triggers build_classification_dataset, which is extremely slow #21

I adapted the demo and trained on about 8,000 records, which took roughly 22 hours (8-core CPU).
```
    # Initialize the classifier for training
    # m = BertClassifier(output_dir='./Bert', num_classes=12,
    #                 model_type='bert', model_name='./bert-base-chinese', num_epochs=10)

    # Train on the cleaned data
    # m.train(data, test_size=0.001)
```


This produced the ./Bert directory, which contains the expected model files, config.json, and so on.

```
ls -lh Bert/
total 400164
drwxr-xr-x. 2 root root      4096 Jun  3 14:11 best_model
-rw-r--r--. 1 root root      1389 Jun  1 16:01 config.json
-rw-r--r--. 1 root root       519 Jun  1 16:01 eval_results.txt
-rw-r--r--. 1 root root       150 May 30 19:49 label_vocab.json
-rw-r--r--. 1 root root      2745 Jun  1 16:01 model_args.json
-rw-r--r--. 1 root root 409175661 Jun  1 16:01 pytorch_model.bin
-rw-r--r--. 1 root root       112 Jun  1 16:01 special_tokens_map.json
-rw-r--r--. 1 root root       347 Jun  1 16:01 tokenizer_config.json
-rw-r--r--. 1 root root    439390 Jun  1 16:01 tokenizer.json
-rw-r--r--. 1 root root      3311 Jun  1 16:01 training_args.bin
-rw-r--r--. 1 root root      2924 Jun  1 16:01 training_progress_scores.csv
-rw-r--r--. 1 root root    109540 Jun  1 16:01 vocab.txt
```

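I then ran validation on about 2,000 records. However, every time load_model is called, build_classification_dataset is triggered and takes about 3 hours, and build_classification_dataset then keeps repeating in a loop.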
```
    # Changed the model initialization to load the already trained model
    m = BertClassifier(output_dir='./Bert/', model_name='./Bert/', num_classes=12)
    m.load_model()
    # Predict
    with open('test.txt', 'r', encoding='utf-8') as f:
        texts = [line.strip() for line in f.readlines()]
    predict_labels, predict_probas = m.predict(texts)
```
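To narrow down where the time goes, one option is to load the model once and push the texts through in small, timed chunks. The sketch below is a library-agnostic harness, not part of pytextclassifier: `predict_in_chunks`, `chunk_size`, and `fake_predict` are hypothetical names, and `predict_fn` is only assumed to behave like `m.predict` in the snippet above (takes a list of strings, returns a labels/probabilities pair). It also drops blank lines, which `line.strip()` would otherwise pass on as empty strings.

```python
import time
from typing import Callable, List, Sequence, Tuple


def predict_in_chunks(
    predict_fn: Callable[[List[str]], Tuple[Sequence, Sequence]],
    texts: List[str],
    chunk_size: int = 64,
) -> Tuple[list, list, List[float]]:
    """Call predict_fn on fixed-size chunks and record the seconds spent per chunk."""
    labels, probas, timings = [], [], []
    # Drop blank lines so empty strings are never sent to the model.
    cleaned = [t for t in texts if t.strip()]
    for start in range(0, len(cleaned), chunk_size):
        chunk = cleaned[start:start + chunk_size]
        t0 = time.perf_counter()
        chunk_labels, chunk_probas = predict_fn(chunk)
        timings.append(time.perf_counter() - t0)
        labels.extend(chunk_labels)
        probas.extend(chunk_probas)
    return labels, probas, timings


# Hypothetical stand-in for m.predict so the sketch runs without a trained model.
def fake_predict(batch: List[str]) -> Tuple[List[int], List[float]]:
    return [len(s) % 12 for s in batch], [1.0 for _ in batch]


demo_labels, demo_probas, demo_timings = predict_in_chunks(
    fake_predict, ["a", "", "bb", "ccc", "  "], chunk_size=2
)
print(demo_labels, len(demo_timings))
```

With the real classifier, `predict_fn` would be `m.predict`, called after a single `m.load_model()`. If even one small chunk is slow, that would suggest the cost sits in the per-call feature conversion rather than in the number of records. Since the title reports that each `predict` call triggers build_classification_dataset, chunking is expected to produce one "Converting to features" log line per chunk.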


The output looks like this:
```
2025-06-02 00:52:33.136 | DEBUG    | pytextclassifier.bert_classification_model:__init__:323 - Device: cpu
2025-06-02 00:52:36.140 | DEBUG    | pytextclassifier.bert_classification_model:__init__:323 - Device: cpu
2025-06-02 00:52:39.372 | DEBUG    | pytextclassifier.bert_classfication_utils:build_classification_dataset:309 -  Converting to features started. Cache is not used.
100%|██████████| 1980/1980 [xx:00<xx:00,  xxit/s]
2025-06-02 04:00:12.312 | DEBUG    | pytextclassifier.bert_classfication_utils:build_classification_dataset:309 -  Converting to features started. Cache is not used.
100%|██████████| 1980/1980 [xx:00<xx:00,  xxit/s]
....
(and so on, repeating in a loop)
```
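The timestamps in this log are enough to check the "about 3 hours" figure. As a small sketch (the helper name `conversion_intervals` is hypothetical, and the timestamp format is assumed from the two lines shown above), the gap between consecutive "Converting to features started" lines can be computed like this:

```python
from datetime import datetime, timedelta
from typing import Iterable, List

MARKER = "Converting to features started"


def conversion_intervals(log_lines: Iterable[str]) -> List[timedelta]:
    """Return the time between consecutive log lines that contain MARKER."""
    stamps = []
    for line in log_lines:
        if MARKER in line:
            # The timestamp is the text before the first " | " separator.
            stamp_text = line.split(" | ", 1)[0].strip()
            stamps.append(datetime.strptime(stamp_text, "%Y-%m-%d %H:%M:%S.%f"))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# The two conversion lines copied from the log above.
sample_log = [
    "2025-06-02 00:52:39.372 | DEBUG    | pytextclassifier.bert_classfication_utils:"
    "build_classification_dataset:309 -  Converting to features started. Cache is not used.",
    "2025-06-02 04:00:12.312 | DEBUG    | pytextclassifier.bert_classfication_utils:"
    "build_classification_dataset:309 -  Converting to features started. Cache is not used.",
]

intervals = conversion_intervals(sample_log)
print(intervals[0])  # 3:07:32.940000
```

For the 1,980 records in the progress bar, that interval works out to roughly 5.7 seconds per record for a single pass.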

I'm not sure what I did wrong. I'm a beginner, so please bear with me. Thanks.
