NER training not doing well with custom labels? #6240
I'm at a loss as to why this training data is failing to properly train a blank model. As far as I can tell, my code follows the typical training approach. Some of my tests are even the exact same sentence as found in the training data!
Code
import spacy
import random
from spacy.util import minibatch, compounding
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("CONDITION")
ner.add_label("PERSON")
optimizer = nlp.begin_training()
TRAIN_DATA = [
("Johnny says he has acute depression and mild schizophrenia.", { "entities": [(0, 6, "PERSON"), (26, 36, "CONDITION"), (46, 59, "CONDITION")] }),
("Patient diagnosed with severe schizophrenia.", { "entities": [(31, 44, "CONDITION")] }),
("I think he has schizophrenia.", { "entities": [(16, 29, "CONDITION")] }),
("Patient was diagnosed with schizophrenia.", { "entities": [(28, 41, "CONDITION")] }),
("John was diagnosed with depression.", { "entities": [(0, 4, "PERSON"), (25, 35, "CONDITION")] }),
("Schizophrenia is a condition that is hard to treat.", { "entities": [(0, 14, "CONDITION")] }),
("Schizophrenia is a medical term.", { "entities": [(0, 14, "CONDITION")] }),
("My mother had schizophrenia when she turned 50.", { "entities": [(4, 10, "PERSON"), (15, 28, "CONDITION")] }),
("My aunt has schizophrenia.", { "entities": [(4, 8, "PERSON"), (13, 26, "CONDITION")] }),
("I have schizophrenia.", { "entities": [(8, 21, "CONDITION")] }),
("I don't have schizophrenia.", { "entities": [(14, 27, "CONDITION")] }),
("I'm sorry but you have schizophrenia.", { "entities": [(24, 37, "CONDITION")] }),
]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*unaffected_pipes):
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35)
# Test some examples
TESTS = [
"John loves New York.",
"I think he has schizophrenia.",
"John has schizophrenia.",
"Mary has schizophrenia.",
"Johnny has schizophrenia.",
"Johnny has mild schizophrenia.",
"My mother has schizophrenia.",
"John was diagnosed with schizophrenia.",
"Schizophrenia is a condition that is hard to treat."
]
for test in TESTS:
    print("> " + test)
    doc = nlp(test)
    for ent in doc.ents:
        print(' "' + ent.text + '" {' + ent.label_ + "}")
Results
Note how it only recognizes PERSON, but not CONDITION.
Replies: 7 comments
-
I figured out my problem.
Edit: Actually, it turns out my indexes were off the whole time. The code below fixes the issue because it automates the position calculations.
Fixed code snippet
Using the above code, but replacing the training data with:
CONDITIONS = ['schizophrenia', 'depression', 'high blood pressure', 'scoliosis']
TEMPLATES = [
("Johnny says he has mild %s.", [(0, 6, "PERSON")]),
("Patient diagnosed with severe %s.", []),
("I think he has %s.", []),
("Patient was diagnosed with %s.", []),
("John was diagnosed with %s.", [(0, 4, "PERSON")]),
("John has mild %s.", [(0, 4, "PERSON")]),
("%s is a condition that is hard to treat.", []),
("%s is a medical term.", []),
("My mother had %s when she turned 50.", [(4, 10, "PERSON")]),
("My aunt has %s.", [(4, 8, "PERSON")]),
("I have %s.", []),
("I don't have %s.", []),
("I'm sorry but you have %s.", []),
]
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
training = []
for condition in CONDITIONS:
    for template in TEMPLATES:
        text = template[0]
        entities = template[1]
        index = text.index("%s")
        result = text.replace("%s", condition)
        training.append((result, { "entities": entities + [(index, index + len(condition), "CONDITION")] }))
New Results
-
@justinbhopper: could you check whether you got any warnings? I tried running your code and got a misalignment warning, which I think is caused by an entity starting at char index 15, not 16. If you happen to have the indices wrong for all CONDITION entities, those will be skipped, and your m…
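One way to catch this kind of off-by-one before training is to slice every annotated span out of its text and look at what comes back. Below is a minimal sketch in plain Python. It makes no spaCy calls; `find_suspect_spans` is a hypothetical helper name, and the whitespace and word-boundary heuristic is an assumption for illustration, not the alignment check spaCy itself performs.

```python
def find_suspect_spans(train_data):
    """Return (text, start, end, label, snippet) for spans that look misaligned."""
    suspects = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            snippet = text[start:end]
            # Offsets outside the text, or empty spans, are always suspect.
            if start < 0 or end > len(text) or start >= end:
                suspects.append((text, start, end, label, snippet))
                continue
            # Leading or trailing whitespace suggests an offset is off by one.
            ragged = snippet != snippet.strip()
            # An alphanumeric character on both sides of a boundary means the
            # span starts or ends in the middle of a word.
            cuts_left = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
            cuts_right = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
            if ragged or cuts_left or cuts_right:
                suspects.append((text, start, end, label, snippet))
    return suspects


sample = [("I think he has schizophrenia.", {"entities": [(16, 29, "CONDITION")]})]
for text, start, end, label, snippet in find_suspect_spans(sample):
    print(f"{label} ({start}, {end}) -> {snippet!r}")
# prints: CONDITION (16, 29) -> 'chizophrenia.'
```

Running this over the whole TRAIN_DATA list from the question should make it quick to see which spans need adjusting.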
-
@svlandeg Wow, you're right, my indexes have been off by 1 position the whole time. So my "fix" above actually worked because it uses automatically calculated positions.
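The same idea (computing offsets rather than typing them by hand) also works on ordinary sentences, not only templates. Here is a rough sketch, assuming you already have a list of known terms per label. `annotate` and `terms_by_label` are hypothetical names, the output mimics the `(text, {"entities": [...]})` tuples used in the question, and overlapping matches are not resolved, so treat it as a starting point only.

```python
import re


def annotate(text, terms_by_label):
    """Build a (text, {"entities": [...]}) tuple by searching for known terms."""
    entities = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # \b keeps matches on word boundaries; re.escape protects any
            # regex metacharacters inside the term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                entities.append((match.start(), match.end(), label))
    entities.sort()  # order spans by start offset
    return (text, {"entities": entities})


terms = {"PERSON": ["John"], "CONDITION": ["depression", "schizophrenia"]}
text, annotations = annotate("John was diagnosed with depression.", terms)
for start, end, label in annotations["entities"]:
    print(label, (start, end), repr(text[start:end]))
```

Because the offsets come straight from the match objects, `text[start:end]` always equals the matched term, which removes the hand-counting step that caused the problem here.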
-
Haha ;-) Either way - I hope this helps you get going again ;-)
-
@svlandeg Is there any obvious reason that the misalignment UserWarning doesn't print out for me? I'm new to Python, so maybe it's something with my settings?
-
Hm, good question. They should print by default. You can disable them specifically, but it looks like you didn't change any default settings. How are you executing your Python script?
-
The warning wasn't added until v2.3.
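If you are not sure which case applies to you, a quick check of the installed version can help. This is a small sketch, assuming Python 3.8+ (for `importlib.metadata`); `installed_version` and `major_minor` are hypothetical helper names, and the only spaCy-specific fact used is the v2.3 threshold mentioned above.

```python
import re
import warnings
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Extract the first two numeric components, e.g. '2.2.4' -> (2, 2)."""
    numbers = re.findall(r"\d+", version)
    return tuple(int(n) for n in numbers[:2])


# Show every UserWarning each time it is raised, not just the first occurrence.
warnings.simplefilter("always", UserWarning)

version = installed_version("spacy")
if version is None:
    print("spaCy is not installed in this environment")
elif major_minor(version) < (2, 3):
    print(f"spaCy {version}: older than v2.3, so no misalignment warning is expected")
else:
    print(f"spaCy {version}: v2.3 or later")
```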