Closed
Labels
feat / ner (Feature: Named Entity Recognizer), training (Training and updating models)
Description
Hi,
I am trying to create a custom NER model using the steps below.
I have the training data: text examples with their tags and the start and end indices of each tag.
I then run the following code:
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

nlp = spacy.blank("en")  # create a blank English pipeline
doc_bin = DocBin()

# training_data: list of dicts with "text" and "entities" [(start, end, label), ...]
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        # with "contract", char_span returns None when no token lies fully inside the range
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    # drop overlapping spans, since doc.ents cannot contain overlaps
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
!python -m spacy init fill-config base_config.cfg config.cfg
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy
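For reference, the training data described above is assumed to be a list of dicts, each with a `text` string and an `entities` list of `(start, end, label)` character offsets. The following is a minimal sketch for sanity-checking those offsets before building the `DocBin`. The helper `find_bad_offsets` and the sample records are hypothetical and not part of spaCy.

```python
def find_bad_offsets(training_data):
    """Return (example_index, entity) pairs whose offsets fall outside the text."""
    bad = []
    for i, example in enumerate(training_data):
        text = example["text"]
        for start, end, label in example["entities"]:
            # A valid span is non-empty and lies entirely within the text.
            if not (0 <= start < end <= len(text)):
                bad.append((i, (start, end, label)))
    return bad


# Hypothetical sample records in the same shape as the training data.
sample = [
    {"text": "Apple is based in Cupertino", "entities": [(0, 5, "ORG"), (18, 27, "GPE")]},
    {"text": "Short text", "entities": [(6, 20, "MISC")]},
]
print(find_bad_offsets(sample))
```

This only checks that offsets are in range. It does not check token alignment or overlaps, which `char_span` and `filter_spans` handle in the code above.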
The output that I should be getting is:
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2022-07-01 18:31:37,021] [INFO] Set up nlp object from config
[2022-07-01 18:31:37,041] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-07-01 18:31:37,047] [INFO] Created vocabulary
[2022-07-01 18:31:40,116] [INFO] Added vectors: en_core_web_lg
[2022-07-01 18:31:43,239] [INFO] Finished initializing nlp object
[2022-07-01 18:31:45,876] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.49    0.64    0.39    0.00
  7     200        501.32   3113.23   78.43   78.12   78.74    0.78
✔ Saved pipeline to output directory
model-last
But what I get is:
ℹ Saving to output directory: .
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2023-03-14 02:40:38,422] [INFO] Set up nlp object from config
[2023-03-14 02:40:38,441] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-03-14 02:40:38,445] [INFO] Created vocabulary
[2023-03-14 02:42:09,609] [INFO] Added vectors: en_core_web_lg
After that, the Jupyter notebook cell stops executing, with no error message or any other output. What could be the cause?
The only changes I made to the config file were setting the batch size to 80 and the number of training epochs to 300.
Any help would be appreciated.
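One way to find out whether the training process exits silently or is killed would be to launch it through `subprocess` and inspect the return code. The sketch below uses only the standard library. `run_and_report` is a hypothetical helper, and the command shown is a harmless stand-in rather than the real training call.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, echo its combined output line by line, and describe how it exited."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so nothing is lost
        text=True,
    )
    for line in proc.stdout:
        print(line, end="")
    returncode = proc.wait()
    if returncode < 0:
        # On POSIX, a negative return code means the process was killed by that signal.
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


# Stand-in command; swap in the real training command to diagnose it.
print(run_and_report([sys.executable, "-c", "print('ok')"]))
```

To apply this to the run above, the command list could be `[sys.executable, "-m", "spacy", "train", "config.cfg", "--output", "./", "--paths.train", "./train.spacy", "--paths.dev", "./train.spacy"]`. A result such as "killed by signal SIGKILL" would be consistent with the process running out of memory, though that is only an assumption and not a confirmed cause here.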
Your Environment
- spaCy version: 3.5.1
- Platform: Linux-4.14.304-226.531.amzn2.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.6
- Pipelines: en_core_web_sm (3.5.0), en_core_web_lg (3.5.0)