Training NER with custom tokenizer #13061
I am trying to implement a custom NER model for parsing academic references. I need it to detect authors, the article title, and fields like the volume (58) and year. I have a large dataset that was previously used to train a different model, but I'm having trouble converting it into a form that satisfies spaCy. As far as I can see, the main problem is spaCy's tokenization rules, which don't split tokens on the punctuation (periods, parentheses, colons and so on) that my annotations expect.
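A quick way to see where such annotations and spaCy's tokenization disagree is to convert the character offsets to BILUO tags with `offsets_to_biluo_tags`; tokens inside misaligned entities come back as `-`. A minimal sketch, with a made-up reference string and offsets:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# Made-up reference string and character-offset annotations
text = "Smith, J. (1999). A study of things. Journal of Stuff, 58, 1-10."
entities = [(0, 9, "AUTHOR"), (11, 15, "YEAR"), (55, 57, "VOLUME")]

doc = nlp(text)
tags = offsets_to_biluo_tags(doc, entities)

# A "-" tag marks a token inside a span that doesn't match token boundaries
for token, tag in zip(doc, tags):
    print(f"{token.text!r:>12} {tag}")
```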
I've saved a model with a custom tokenizer as follows:

```python
from pathlib import Path

import spacy
from spacy.symbols import ORTH

model = None
output_dir = Path('ner')
n_iter = 100

# Load an existing model or start from a blank English pipeline
if model is not None:
    nlp = spacy.load(model)
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')
    print("Created blank 'en' model")

# Set up the NER component in the pipeline
if 'ner' not in nlp.pipe_names:
    nlp.add_pipe('ner', last=True)
ner = nlp.get_pipe('ner')

# Also split a trailing period off as a suffix
suffixes = nlp.Defaults.suffixes + [r"\."]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

# Disable URL matching and split on punctuation inside tokens
nlp.tokenizer.url_match = None
infixes = [r':', r'\-', r'–', r';', r'\(', r'\)', r'/', r'\.', r'\n', r',']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case('://', [{ORTH: ':'}, {ORTH: '/'}, {ORTH: '/'}])

nlp.to_disk('custom_tokenizer_core_en_web_sm.spacy')
```
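Before training, it can be worth reloading the saved pipeline and checking that the custom rules survived serialization and split a reference string the way the annotations expect. A small sanity check with a made-up string:

```python
import spacy

# Reload the pipeline saved above; the tokenizer rules are serialized with it
nlp = spacy.load('custom_tokenizer_core_en_web_sm.spacy')

# Made-up reference fragment: check that punctuation and URLs split as expected
doc = nlp("Journal of Stuff, 58(2); 1999. http://example.com/x")
print([t.text for t in doc])
```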
Then I run the training. Initially I used the Python API for this, which is how I discovered the misalignment problems. After fixing them in the dataset, I start training, and at a random moment I get an error. I've definitely added all the labels and filtered the examples so that entity spans don't start or end with whitespace. How do I debug this further?
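For reference, this is roughly what such a filter can look like over (start, end, label) offsets; the data format here is an assumption, not anything spaCy-specific:

```python
def strip_whitespace_boundaries(text, entities):
    """Shrink (start, end, label) spans so they don't begin or end on whitespace."""
    cleaned = []
    for start, end, label in entities:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            cleaned.append((start, end, label))
    return cleaned

# Made-up example: the span includes a trailing space after the comma
text = "Journal of Stuff, 58(2)"
entities = [(0, 18, "JOURNAL")]
print(strip_whitespace_boundaries(text, entities))  # [(0, 17, 'JOURNAL')]
```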
The example above converts fine, so maybe there's a problem further down in your file. You may have a whitespace token in the first column, or a tag the converter can't parse. You can split the file into smaller segments to narrow down where the problem is, as in the sketch below. And it's possible that you'll run into problems with certain labels.
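One way to do that (assuming an IOB-style file with blank lines between sentences; the file name, chunk size, and converter are placeholders) is to convert each chunk separately and note which ones fail:

```python
import subprocess
import tempfile
from pathlib import Path

SOURCE = Path("train.iob")  # placeholder input file
CHUNK_SIZE = 100            # sentences per chunk

# IOB-style files separate sentences with blank lines
sentences = SOURCE.read_text(encoding="utf8").split("\n\n")

with tempfile.TemporaryDirectory() as tmp:
    for i in range(0, len(sentences), CHUNK_SIZE):
        chunk = Path(tmp) / f"chunk_{i}.iob"
        chunk.write_text("\n\n".join(sentences[i:i + CHUNK_SIZE]), encoding="utf8")
        result = subprocess.run(
            ["python", "-m", "spacy", "convert", str(chunk), tmp, "--converter", "iob"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"sentences {i}-{i + CHUNK_SIZE}: conversion failed\n{result.stderr}")
```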
Instead of customizing the tokenizer, you may be able to work with spaCy's default tokenization instead.
The `parser` and `ner` models can run into this issue if there's not much training data, or in this case I think it's also due to updating on individual examples rather than larger batches.

Punctuation shouldn't matter (I see that this error message is a bit out-of-date and still refers to some spaCy v2 features), but whitespace does matter: it is hard-coded in the `ner` component that entity spans can't start or end with whitespace.

I strongly recommend using `spacy train` instead of a minimal hand-written training loop. A hand-written loop is useful pedagogically to understand how training works, but you can easily run into problems once you move away from toy examples to real data. So…