NER training not doing well with custom labels? #6240

justinbhopper · 2020-10-11T20:15:25Z

justinbhopper
Oct 11, 2020

I'm at a loss as to why this training data is failing to properly train a blank model. As far as I can tell, my code follows the typical training approach.

Some of my tests are even the exact same sentence as found in the training data!

Code

import spacy
import random
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")

ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

ner.add_label("CONDITION")
ner.add_label("PERSON")

optimizer = nlp.begin_training()

TRAIN_DATA = [
    ("Johnny says he has acute depression and mild schizophrenia.", { "entities": [(0, 6, "PERSON"), (26, 36, "CONDITION"), (46, 59, "CONDITION")] }),
    ("Patient diagnosed with severe schizophrenia.", { "entities": [(31, 44, "CONDITION")] }),
    ("I think he has schizophrenia.", { "entities": [(16, 29, "CONDITION")] }),
    ("Patient was diagnosed with schizophrenia.", { "entities": [(28, 41, "CONDITION")] }),
    ("John was diagnosed with depression.", { "entities": [(0, 4, "PERSON"), (25, 35, "CONDITION")] }),
    ("Schizophrenia is a condition that is hard to treat.", { "entities": [(0, 14, "CONDITION")] }),
    ("Schizophrenia is a medical term.", { "entities": [(0, 14, "CONDITION")] }),
    ("My mother had schizophrenia when she turned 50.", { "entities": [(4, 10, "PERSON"), (15, 28, "CONDITION")] }),
    ("My aunt has schizophrenia.", { "entities": [(4, 8, "PERSON"), (13, 26, "CONDITION")] }),
    ("I have schizophrenia.", { "entities": [(8, 21, "CONDITION")] }),
    ("I don't have schizophrenia.", { "entities": [(14, 27, "CONDITION")] }),
    ("I'm sorry but you have schizophrenia.", { "entities": [(24, 37, "CONDITION")] }),
]

pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*unaffected_pipes):
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35)

# Test some examples
TESTS = [
  "John loves New York.",
  "I think he has schizophrenia.",
  "John has schizophrenia.",
  "Mary has schizophrenia.",
  "Johnny has schizophrenia.",
  "Johnny has mild schizophrenia.",
  "My mother has schizophrenia.",
  "John was diagnosed with schizophrenia.",
  "Schizophrenia is a condition that is hard to treat."
]

for test in TESTS:
    print("> " + test)
    doc = nlp(test)

    for ent in doc.ents:
        print('  "' + ent.text + ''" {" + ent.label_ + "}")

Results

Note how it only recognizes PERSON, but not CONDITION.

> John loves New York.
  "John {PERSON}
> I think he has schizophrenia.
> John has schizophrenia.
  "John {PERSON}
> Mary has schizophrenia.
> Johnny has schizophrenia.
  "Johnny {PERSON}
> Johnny has mild schizophrenia.
  "Johnny {PERSON}
> My mother has schizophrenia.
> John was diagnosed with schizophrenia.
  "John {PERSON}
> Schizophrenia is a condition that is hard to treat.

Python Version Used: 3.8
spaCy Version Used: 2.2.3

Answered by svlandeg

Oct 11, 2020


justinbhopper · 2020-10-11T21:28:15Z

justinbhopper
Oct 11, 2020
Author

I figured out my problem. ~~I thought by only giving training data consisting mostly of "schizophrenia", that it would help the NER create a model that, although overfitted, could recognize those same examples.~~

~~Turns out the lack of variety in different condition examples actually hurts the model.~~

Edit: Actually turns out my indexes were off the whole time. The code below fixes the issue because it automates the positioning calculations.

Fixed code snippet

Using the above code but replacing TRAIN_DATA with generated data (the `training` list built below) that contains different CONDITION entries fixed the problem:

CONDITIONS = ['schizophrenia', 'depression', 'high blood pressure', 'scoliosis']

TEMPLATES = [
    ("Johnny says he has mild %s.", [(0, 6, "PERSON")]),
    ("Patient diagnosed with severe %s.", []),
    ("I think he has %s.", []),
    ("Patient was diagnosed with %s.", []),
    ("John was diagnosed with %s.", [(0, 4, "PERSON")]),
    ("John has mild %s.", [(0, 4, "PERSON")]),
    ("%s is a condition that is hard to treat.", []),
    ("%s is a medical term.", []),
    ("My mother had %s when she turned 50.", [(4, 10, "PERSON")]),
    ("My aunt has %s.", [(4, 8, "PERSON")]),
    ("I have %s.", []),
    ("I don't have %s.", []),
    ("I'm sorry but you have %s.", []),
]

pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

training = []

for condition in CONDITIONS:
    for template in TEMPLATES:
        text = template[0]
        entities = template[1]
        index = text.index("%s")
        result = text.replace("%s", condition)
        training.append((result, { "entities": entities + [(index, index + len(condition), "CONDITION")] }))

New Results

> John loves New York.
  "John {PERSON}
  "York {CONDITION}
> I think he has schizophrenia.
  "schizophrenia {CONDITION}
> John has schizophrenia.
  "John {PERSON}
  "schizophrenia {CONDITION}
> Mary has schizophrenia.
  "schizophrenia {CONDITION}
> Mary has depression.
  "depression {CONDITION}
> My aunt has high blood pressure.
  "high blood pressure {CONDITION}
> Johnny has schizophrenia.
  "Johnny {PERSON}
  "schizophrenia {CONDITION}
> Johnny has mild schizophrenia.
  "Johnny {PERSON}
  "schizophrenia {CONDITION}
> Johnny has mild scoliosis.
  "Johnny {PERSON}
  "scoliosis {CONDITION}
> My mother has schizophrenia.
  "schizophrenia {CONDITION}
> John was diagnosed with schizophrenia.
  "John {PERSON}
  "schizophrenia {CONDITION}
> Schizophrenia is a condition that is hard to treat.
  "Schizophrenia {CONDITION}


svlandeg · 2020-10-11T21:33:36Z

svlandeg
Oct 11, 2020

@justinbhopper : could you check whether you got any warnings? I tried running your code, and got

...\language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "I think he has schizophrenia." with entities "[(16, 29, 'CONDITION')]". Use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)

Which I think is caused by

("I think he has schizophrenia.", {"entities": [(16, 29, "CONDITION")]}),

starting at char index 15, not 16. If you happen to have the indices wrong for all CONDITION entities, those will be skipped, and your model won't be aware of them.
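One rough way to spot this kind of misalignment without running spaCy is to slice the text with each offset pair and look at what comes back. The sketch below is plain Python; `find_suspect_spans` is a hypothetical helper name, and its boundary checks are only an approximation of spaCy's tokenizer-based alignment, not a replacement for it.

```python
def find_suspect_spans(text, entities):
    """Return (start, end, label, sliced_text) for spans that look misaligned.

    A span is flagged when it is empty or out of range, starts or ends on
    whitespace, or cuts through a word (alphanumeric characters sit on both
    sides of one of its boundaries).
    """
    suspects = []
    for start, end, label in entities:
        sliced = text[start:end]
        out_of_range = start < 0 or end > len(text) or start >= end
        ragged = sliced != sliced.strip()
        cuts_left = (0 < start < len(text)
                     and text[start - 1].isalnum() and text[start].isalnum())
        cuts_right = (0 < end < len(text)
                      and text[end - 1].isalnum() and text[end].isalnum())
        if out_of_range or ragged or cuts_left or cuts_right:
            suspects.append((start, end, label, sliced))
    return suspects


text = "I think he has schizophrenia."
# Offsets from the original post: the slice starts one character too late.
print(find_suspect_spans(text, [(16, 29, "CONDITION")]))
# Corrected offsets: nothing is flagged.
print(find_suspect_spans(text, [(15, 28, "CONDITION")]))
```

The first call reports the slice `chizophrenia.`, which makes the off-by-one shift easy to see; the second returns an empty list.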


justinbhopper · 2020-10-11T21:42:28Z

justinbhopper
Oct 11, 2020
Author

@svlandeg Wow you're right, my indexes have been off by 1 position the whole time.

So my "fix" above actually worked because it uses string.index() to compute the correct position in the string, rather than because of the extra variety in the conditions.
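As a minimal sketch of that idea, the offsets can be derived from the phrase itself instead of being counted by hand. `entity_span` is a hypothetical helper, not part of spaCy, and it only handles the first occurrence of the phrase.

```python
def entity_span(text, phrase, label):
    """Build a (start, end, label) tuple by locating `phrase` in `text`.

    str.index raises ValueError when the phrase is absent, which surfaces
    typos in the training data right away. Only the first occurrence is used.
    """
    start = text.index(phrase)
    return (start, start + len(phrase), label)


text = "I think he has schizophrenia."
span = entity_span(text, "schizophrenia", "CONDITION")
print(span)                   # (15, 28, 'CONDITION')
print(text[span[0]:span[1]])  # schizophrenia
```

Note that the computed span is (15, 28), one position earlier than the hand-written (16, 29) in the original training data.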


svlandeg · 2020-10-11T21:44:27Z

svlandeg
Oct 11, 2020

Haha ;-) Either way - I hope this helps you get going again ;-)


justinbhopper · 2020-10-11T21:55:09Z

justinbhopper
Oct 11, 2020
Author

@svlandeg Is there any obvious reason that the misalignment UserWarning doesn't print out for me? I'm new to Python so maybe it's something with my settings?


svlandeg · 2020-10-11T22:57:00Z

svlandeg
Oct 11, 2020

Hm, good question. They should print by default. You can disable them specifically, but it looks like you didn't change any default settings. How are you executing your Python script?
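For reference, whether a `UserWarning` is shown is controlled by Python's standard `warnings` module. The sketch below uses a made-up warning message from a stand-in function (`noisy` is hypothetical, not spaCy code) to show a warning being forced to appear every time and then being silenced.

```python
import warnings


def noisy():
    # Stand-in for library code that emits a UserWarning (message is made up).
    warnings.warn("[demo] some entities could not be aligned", UserWarning)
    return "done"


# Record warnings so the effect of each filter can be inspected.
with warnings.catch_warnings(record=True) as shown:
    warnings.simplefilter("always")  # report every occurrence
    noisy()
    noisy()

with warnings.catch_warnings(record=True) as hidden:
    warnings.simplefilter("ignore", UserWarning)  # suppress this category
    noisy()

print(len(shown), len(hidden))  # 2 0
```

If a script never installs an "ignore" filter like the second one and is not run with a `-W ignore` option, a missing warning usually means the library did not emit it in the first place.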


adrianeboyd · 2020-10-12T06:02:22Z

adrianeboyd
Oct 12, 2020

The warning wasn't added until v2.3.
