Training custom NER model but cannot get into the training loop #12413

@Abe410

Hi

I am trying to train a custom NER model using the steps below.

I have the training data: text examples with entity labels and the start and end character indices of each entity.
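For reference, here is a minimal sketch of the data layout the conversion loop assumes, plus a quick sanity check on the offsets. The `text` and `entities` keys match the loop in this post; the sample sentence and the `check_offsets` helper are hypothetical and only for illustration.

```python
# Hypothetical sample in the layout the conversion loop expects:
# each example has a "text" string and a list of (start, end, label) tuples.
training_data = [
    {
        "text": "Apple hired Tim Cook in 1998.",
        "entities": [(0, 5, "ORG"), (12, 20, "PERSON")],
    },
]


def check_offsets(examples):
    """Return (example_index, entity, reason) for each suspicious annotation."""
    problems = []
    for i, example in enumerate(examples):
        text = example["text"]
        prev_end = -1
        for start, end, label in sorted(example["entities"]):
            entity = (start, end, label)
            # Offsets must describe a non-empty slice inside the text.
            if not (0 <= start < end <= len(text)):
                problems.append((i, entity, "out of range"))
                continue
            # Entities sorted by start must not begin before the previous one ends.
            if start < prev_end:
                problems.append((i, entity, "overlaps previous entity"))
            prev_end = max(prev_end, end)
    return problems


print(check_offsets(training_data))
```

An empty list means every annotation lies inside its text and no two annotations overlap. It does not guarantee that the offsets line up with token boundaries.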

Now I run the following code:

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
doc_bin = DocBin()

for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        # char_span returns None if the offsets cannot be aligned to tokens
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    # drop overlapping spans, since doc.ents does not allow overlaps
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

!python -m spacy init fill-config base_config.cfg config.cfg

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy 

The output that I should be getting is:

ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-07-01 18:31:37,021] [INFO] Set up nlp object from config
[2022-07-01 18:31:37,041] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-07-01 18:31:37,047] [INFO] Created vocabulary
[2022-07-01 18:31:40,116] [INFO] Added vectors: en_core_web_lg
[2022-07-01 18:31:43,239] [INFO] Finished initializing nlp object
[2022-07-01 18:31:45,876] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.49    0.64    0.39    0.00
  7     200        501.32   3113.23   78.43   78.12   78.74    0.78
✔ Saved pipeline to output directory
model-last

But what I get is

ℹ Saving to output directory: .
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-03-14 02:40:38,422] [INFO] Set up nlp object from config
[2023-03-14 02:40:38,441] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-03-14 02:40:38,445] [INFO] Created vocabulary
[2023-03-14 02:42:09,609] [INFO] Added vectors: en_core_web_lg

After that, the Jupyter notebook cell stops executing. What could be causing this? I do not get any error message.

The only changes I made to the config file were setting the batch size to 80 and the number of training epochs to 300.
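Since the cell ends without any message, one way to learn more is to launch the same command from Python and inspect how the process exited. This is only a sketch using the standard `subprocess` module: `run_and_report` is a hypothetical helper, and whether the training process is actually being killed is an assumption, not something the logs above confirm.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (status, stdout, stderr) describing how it exited."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        status = "exited normally"
    elif proc.returncode < 0:
        # On POSIX, a negative return code means the process was ended by a signal.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = f"signal {-proc.returncode}"
        status = f"killed by {name}"
    else:
        status = f"failed with exit code {proc.returncode}"
    return status, proc.stdout, proc.stderr


# Harmless demonstration command; for the real run, replace it with the train command:
# [sys.executable, "-m", "spacy", "train", "config.cfg", "--output", "./",
#  "--paths.train", "./train.spacy", "--paths.dev", "./train.spacy"]
status, out, err = run_and_report([sys.executable, "-c", "print('ok')"])
print(status)
```

A status such as `killed by SIGKILL` would suggest the operating system stopped the process, while a positive exit code would mean the command itself failed and `stderr` should contain the reason.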

Any help?

Your Environment

- spaCy version: 3.5.1
- Platform: Linux-4.14.304-226.531.amzn2.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.6
- Pipelines: en_core_web_sm (3.5.0), en_core_web_lg (3.5.0)
