Training NER with custom tokenizer #13061
I am trying to implement a custom NER model for parsing academic references. I need it to detect authors, the article title, and fields like the volume (58) and year. I have a large dataset that was previously used to train a different model, but I'm having trouble converting it into a form that satisfies spaCy. As far as I can see, the main problem is spaCy's tokenization rules, which don't split tokens on the punctuation (periods, parentheses, colons and so on) that my annotations expect.
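A quick way to see where such annotations and spaCy's tokenization disagree is to convert the character offsets to BILUO tags with `offsets_to_biluo_tags`; tokens inside misaligned entities come back as `-`. A minimal sketch, with a made-up reference string and offsets:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# Made-up reference string and character-offset annotations
text = "Smith, J. (1999). A study of things. Journal of Stuff, 58, 1-10."
entities = [(0, 9, "AUTHOR"), (11, 15, "YEAR"), (55, 57, "VOLUME")]

doc = nlp(text)
tags = offsets_to_biluo_tags(doc, entities)

# A "-" tag marks a token inside a span that doesn't match token boundaries
for token, tag in zip(doc, tags):
    print(f"{token.text!r:>12} {tag}")
```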
I've saved a model with a custom tokenizer as follows:

```python
from pathlib import Path

import spacy
from spacy.symbols import ORTH

model = None
output_dir = Path('ner')
n_iter = 100

# Load an existing model or start from a blank English pipeline
if model is not None:
    nlp = spacy.load(model)
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank('en')
    print("Created blank 'en' model")

# Set up the NER component in the pipeline
if 'ner' not in nlp.pipe_names:
    nlp.add_pipe('ner', last=True)
ner = nlp.get_pipe('ner')

# Also split a trailing period off as a suffix
suffixes = nlp.Defaults.suffixes + [r"\."]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

# Disable URL matching and split on punctuation inside tokens
nlp.tokenizer.url_match = None
infixes = [r':', r'\-', r'–', r';', r'\(', r'\)', r'/', r'\.', r'\n', r',']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case('://', [{ORTH: ':'}, {ORTH: '/'}, {ORTH: '/'}])

nlp.to_disk('custom_tokenizer_core_en_web_sm.spacy')
```
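Before training, it can be worth reloading the saved pipeline and checking that the custom rules survived serialization and split a reference string the way the annotations expect. A small sanity check with a made-up string:

```python
import spacy

# Reload the pipeline saved above; the tokenizer rules are serialized with it
nlp = spacy.load('custom_tokenizer_core_en_web_sm.spacy')

# Made-up reference fragment: check that punctuation and URLs split as expected
doc = nlp("Journal of Stuff, 58(2); 1999. http://example.com/x")
print([t.text for t in doc])
```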
Then I run the training. Initially I used the Python API for this, which is how I discovered the misalignment problems. After fixing them in the dataset, I start training, and at a random moment I get an error. I've definitely added all the labels and filtered the examples so that entity spans don't start or end with whitespace. How do I debug this further?
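For reference, this is roughly what such a filter can look like over (start, end, label) offsets; the data format here is an assumption, not anything spaCy-specific:

```python
def strip_whitespace_boundaries(text, entities):
    """Shrink (start, end, label) spans so they don't begin or end on whitespace."""
    cleaned = []
    for start, end, label in entities:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            cleaned.append((start, end, label))
    return cleaned

# Made-up example: the span includes a trailing space after the comma
text = "Journal of Stuff, 58(2)"
entities = [(0, 18, "JOURNAL")]
print(strip_whitespace_boundaries(text, entities))  # [(0, 17, 'JOURNAL')]
```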
The example above converts fine, so maybe there's a problem further down in your file. You may have a whitespace token in the first column, or a tag the converter can't parse. You can split the file into smaller segments to narrow down where the problem is, as in the sketch below. And it's possible that you'll run into problems with certain labels.
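One way to do that (assuming an IOB-style file with blank lines between sentences; the file name, chunk size, and converter are placeholders) is to convert each chunk separately and note which ones fail:

```python
import subprocess
import tempfile
from pathlib import Path

SOURCE = Path("train.iob")  # placeholder input file
CHUNK_SIZE = 100            # sentences per chunk

# IOB-style files separate sentences with blank lines
sentences = SOURCE.read_text(encoding="utf8").split("\n\n")

with tempfile.TemporaryDirectory() as tmp:
    for i in range(0, len(sentences), CHUNK_SIZE):
        chunk = Path(tmp) / f"chunk_{i}.iob"
        chunk.write_text("\n\n".join(sentences[i:i + CHUNK_SIZE]), encoding="utf8")
        result = subprocess.run(
            ["python", "-m", "spacy", "convert", str(chunk), tmp, "--converter", "iob"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(f"sentences {i}-{i + CHUNK_SIZE}: conversion failed\n{result.stderr}")
```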
Instead of customizing the tokenizer, you may be able to work with spaCy's default tokenization instead.
The `parser` and `ner` models can run into this issue if there's not much training data, or in this case I think it's also due to updating on individual examples rather than larger batches.

Punctuation shouldn't matter (I see that this error message is a bit out-of-date and still refers to some spaCy v2 features), but whitespace does matter: it is hard-coded in the `ner` component that entity spans can't start or end with whitespace.

I strongly recommend using `spacy train` instead of a minimal hand-written training loop. A hand-written loop is useful pedagogically to understand how training works, but you can easily run into problems once you move away from toy examples to real data. So…