en_core_web_trf doesn't train when using the same custom .spacy that works with en_core_web_lg #13278
Unanswered
Meiling-Sun asked this question in Help: Coding & Implementations
Replies: 0 comments
I used my custom data to create train.spacy. It contains 50 docs, but each doc has more than 60,000 tokens, because the annotation is at the document level. I trained the en_core_web_lg model with this train.spacy successfully, but when I use the same train.spacy file with the en_core_web_trf model, the model doesn't seem to do anything, even though no error is raised. Is there a maximum number of tokens per doc for en_core_web_trf, or what else could cause this behavior? The output looks as follows, and there is no model-best output.
[2024-01-26 19:01:59,830] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory:
/scratch/global_1/msun/output_gpu_acc_chunk10
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2024-01-26 19:02:02,738] [INFO] Set up nlp object from config
[2024-01-26 19:02:02,748] [DEBUG] Loading corpus from path: ../spacy_files/half_half/test.spacy
[2024-01-26 19:02:02,748] [DEBUG] Loading corpus from path: ../spacy_files/half_half/train.spacy
[2024-01-26 19:02:02,749] [INFO] Pipeline: ['transformer', 'ner']
[2024-01-26 19:02:02,749] [INFO] Resuming training for: ['ner', 'transformer']
[2024-01-26 19:02:02,753] [INFO] Created vocabulary
[2024-01-26 19:02:02,754] [INFO] Finished initializing nlp object
[2024-01-26 19:02:02,754] [INFO] Initialized pipeline components: []
✔ Initialized pipeline
============================= Training pipeline =============================
[2024-01-26 19:02:02,763] [DEBUG] Loading corpus from path: ../spacy_files/half_half/test.spacy
[2024-01-26 19:02:02,764] [DEBUG] Loading corpus from path: ../spacy_files/half_half/train.spacy
[2024-01-26 19:02:02,813] [DEBUG] Removed existing output directory: /scratch/global_1/msun/output_gpu_acc_chunk10/model-last
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
✔ Saved pipeline to output directory
/scratch/global_1/msun/output_gpu_acc_chunk10/model-last
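For what it's worth, the per-doc token and entity counts can be double-checked directly from the DocBin. A minimal sketch using spaCy's DocBin API (the file name is the one from the question):

```python
import spacy
from spacy.tokens import DocBin

# Load the training DocBin and report tokens/entities per doc.
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("train.spacy")
for i, doc in enumerate(doc_bin.get_docs(nlp.vocab)):
    print(f"doc {i}: {len(doc)} tokens, {len(doc.ents)} entities")
```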
The spacy debug data results are as follows:
============================ Data file validation ============================
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: en
Training pipeline: transformer, ner
50 training docs
52 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (50)
============================== Vocab & Vectors ==============================
ℹ 503866 total word(s) in the data (40884 unique)
ℹ No word vectors present in the package
========================== Named Entity Recognition ==========================
ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries
================================== Summary ==================================
✔ 7 checks passed
✘ 1 error
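One workaround often suggested for very long docs with transformer pipelines is to split each doc into shorter windows before writing the DocBin, so the transformer sees shorter training examples. A minimal sketch; the window size of 500 tokens and the output file name train_chunked.spacy are assumptions, and note that entities crossing a window boundary are dropped by the slice:

```python
import spacy
from spacy.tokens import DocBin

# Split every doc in train.spacy into shorter windows before training.
# WINDOW and the output file name are assumptions; entities that cross
# a window boundary are dropped when the span is copied out.
WINDOW = 500
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("train.spacy")
chunked = DocBin()
for doc in doc_bin.get_docs(nlp.vocab):
    for start in range(0, len(doc), WINDOW):
        span = doc[start : start + WINDOW]
        chunked.add(span.as_doc())
chunked.to_disk("train_chunked.spacy")
```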