Spacy Custom Name Entity Recognition (NER) 'catastrophic forgetting' issue #6846

aravind-812 · 2021-01-28T15:11:46Z

aravind-812
Jan 28, 2021

The model is unable to remember the previous labels on which it was trained.
I know that this is 'catastrophic forgetting', but no example or blog post seems to help with this issue.
The most common response is to point to this blog post, but it is pretty old now and is not helping.

Here is my code:


    from __future__ import unicode_literals, print_function
    import json
    labeled_data = []
    with open(r"/content/emails_labeled.jsonl", "r") as read_file:
        for line in read_file:
            data = json.loads(line)
            labeled_data.append(data)
    
    TRAIN_DATA = []
    for entry in labeled_data:
        entities = []
        for e in entry['labels']:
            entities.append((e[0], e[1],e[2]))
        spacy_entry = (entry['text'], {"entities": entities})
        TRAIN_DATA.append(spacy_entry)       
    import plac
    import random
    import warnings
    from pathlib import Path
    import spacy
    from spacy.util import minibatch, compounding
    
    
    # new entity label
    LABEL = "OIL"
    
    # training data
    # Note: If you're using an existing model, make sure to mix in examples of
    # other entity types that spaCy correctly recognized before. Otherwise, your
    # model might learn the new type, but "forget" what it previously knew.
    # https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
    
    @plac.annotations(
        model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
        new_model_name=("New model name for model meta.", "option", "nm", str),
        output_dir=("Optional output directory", "option", "o", Path),
        n_iter=("Number of training iterations", "option", "n", int),
    )
    def main(model='/content/LinkModelOutput', new_model_name="Oil21", output_dir='/content/Last', n_iter=30):
        """Set up the pipeline and entity recognizer, and train the new entity."""
        random.seed(0)
        if model is not None:
            nlp = spacy.load(model)  # load existing spaCy model
            print("Loaded model '%s'" % model)
        else:
            nlp = spacy.blank("en")  # create blank Language class
            print("Created blank 'en' model")
        # Add entity recognizer to model if it's not in the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if "ner" not in nlp.pipe_names:
            ner = nlp.create_pipe("ner")
            nlp.add_pipe(ner)
        # otherwise, get it, so we can add labels to it
        else:
            ner = nlp.get_pipe("ner")
    
        ner.add_label(LABEL)  # add new entity label to entity recognizer
        # Adding extraneous labels shouldn't mess anything up
        #ner.add_label("VEGETABLE")
        if model is None:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.resume_training()
        move_names = list(ner.move_names)
        # get names of other pipes to disable them during training
        pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
        other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
        # only train NER
        with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
            # show warnings for misaligned entity spans once
            warnings.filterwarnings("once", category=UserWarning, module='spacy')
    
            sizes = compounding(1.0, 4.0, 1.001)
            # batch up the examples using spaCy's minibatch
            for itn in range(n_iter):
                random.shuffle(TRAIN_DATA)
                batches = minibatch(TRAIN_DATA, size=sizes)
                losses = {}
                for batch in batches:
                    texts, annotations = zip(*batch)
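                    # nlp.update accepts raw texts and annotation dicts; calling
                    # update on the NER pipe directly expects Doc/GoldParse objects
                    nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)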
                print("Losses", losses)
    
        # test the trained model
        test_text = "Here is Hindustan petroleum's oil reserves coup in Australia. Details can be found at https://www.textfixer.com/tools/remove-line-breaks.php?"
        doc = nlp(test_text)
        print("Entities in '%s'" % test_text)
        for ent in doc.ents:
            print(ent.label_, ent.text)
    
        # save model to output directory
        if output_dir is not None:
            output_dir = Path(output_dir)
            if not output_dir.exists():
                output_dir.mkdir()
            nlp.meta["name"] = new_model_name  # rename model
            nlp.to_disk(output_dir)
            print("Saved model to", output_dir)
    
            # test the saved model
            print("Loading from", output_dir)
            nlp2 = spacy.load(output_dir)
            # Check the classes have loaded back consistently
            assert nlp2.get_pipe("ner").move_names == move_names
            doc2 = nlp2(test_text)
            for ent in doc2.ents:
                print(ent.label_, ent.text)
    
    
    if __name__ == "__main__":
        plac.call(main)

The data was annotated with doccano.
Here is a look at the data:

    {"id": 174, "text": "service\tmarathon petroleum reduces service postings marathon petroleum co said it reduced the contract price it will pay for all grades of service oil one dlr a barrel effective today the decrease brings marathon s posted price for both west texas intermediate and west texas sour to dlrs a bbl the south louisiana sweet grade of service was reduced to dlrs a bbl the company last changed its service postings on jan reuter", "meta": {}, "annotation_approver": null, "labels": [[61, 70, "OIL"], [147, 150, "OIL"]]}
    {"id": 175, "text": "mutual funds\tmunsingwear inc mun th qtr jan loss shr loss cts vs loss seven cts net loss vs loss revs mln vs mln year shr profit cts vs profit cts net profit vs profit revs mln vs mln avg shrs vs note per shr adjusted for for stock split july and for split may reuter", "meta": {}, "annotation_approver": null, "labels": []}
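
Before training on records like these, it can help to check that every character span points at the text it is supposed to label. The following is a minimal sketch, assuming the `text`/`labels` layout shown above; the helper names `to_spacy_example` and `labeled_substrings` are made up for illustration, and the record is a shortened copy of record 174.

```python
def to_spacy_example(record):
    """Convert one doccano-style record to a (text, {"entities": [...]}) tuple."""
    text = record["text"]
    entities = []
    for start, end, label in record["labels"]:
        # Reject spans that fall outside the text instead of training on them.
        if not (0 <= start < end <= len(text)):
            raise ValueError("span (%d, %d) is outside the text" % (start, end))
        entities.append((start, end, label))
    return text, {"entities": entities}


def labeled_substrings(example):
    """Return (label, substring) pairs so the spans can be checked by eye."""
    text, annotations = example
    return [(label, text[start:end]) for start, end, label in annotations["entities"]]


# Shortened copy of record 174 from the data above (first label only).
record = {
    "text": "service\tmarathon petroleum reduces service postings marathon petroleum co said",
    "labels": [[61, 70, "OIL"]],
}
example = to_spacy_example(record)
print(labeled_substrings(example))
```

Printing the substrings for a handful of records is a quick way to catch offsets that drifted during export or preprocessing.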

How to reproduce the behaviour

Your Environment

Info about spaCy

spaCy version: 2.3.5
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9

Answered by svlandeg

svlandeg · 2021-01-28T17:11:14Z

svlandeg
Jan 28, 2021

Can you explain in more detail which model you're starting from, which entities it already is able to predict, which type of (new and old) entities you're feeding into the model for retraining, and which problems in accuracy you're seeing?

The code above is not really a minimal reproducible snippet. The way you've pasted it, it looks like TRAIN_DATA is defined several times:

    TRAIN_DATA = []
    for entry in labeled_data:
        ...
        TRAIN_DATA.append(spacy_entry)

    ...

    LABEL = "OIL"
    TRAIN_DATA = [
        (
            "Horses are too tall and they pretend to care about your feelings",
            {"entities": [(0, 6, LABEL)]},
        ),
        ...
    ]

Note that in the second definition of TRAIN_DATA, it looks like you're training the model to recognize that the word "Horses" is of type OIL, which is probably not what you want ;-)

To avoid the catastrophic forgetting problem, you need to create realistic training examples for all the entity types that you want the ML algorithm to learn / not forget, and feed those in when you're retraining the model. Can you verify whether that is indeed what you're doing?
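
One way to get training examples that cover the old entity types is to let the existing model label some text itself and mix those "revision" examples in with the new annotations. The sketch below shows only the data-mixing step in plain Python, under several assumptions: `predict_entities` is a hypothetical callable wrapping the already-trained model, `fake_old_model` is a stand-in for it, and the `LINK` label is invented for the example.

```python
import random


def spans_overlap(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def add_predicted_entities(example, predict_entities):
    """Add the old model's predictions to a newly annotated example.

    Predictions that overlap a manual annotation are dropped, so the
    manual labels always win.
    """
    text, annotations = example
    gold = list(annotations["entities"])
    for span in predict_entities(text):
        if not any(spans_overlap(span, g) for g in gold):
            gold.append(span)
    return text, {"entities": sorted(gold)}


def build_mixed_training_data(new_examples, revision_texts, predict_entities, seed=0):
    """Combine new-label examples with revision examples labeled by the old model."""
    mixed = [add_predicted_entities(ex, predict_entities) for ex in new_examples]
    for text in revision_texts:
        mixed.append((text, {"entities": sorted(predict_entities(text))}))
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical stand-in for the existing model. With spaCy it could be:
#   lambda text: [(e.start_char, e.end_char, e.label_) for e in old_nlp(text).ents]
def fake_old_model(text):
    start = text.find("reddit.com")
    return [(start, start + len("reddit.com"), "LINK")] if start >= 0 else []


new_examples = [("oil prices on reddit.com", {"entities": [(0, 3, "OIL")]})]
revision_texts = ["see reddit.com for details"]
mixed = build_mixed_training_data(new_examples, revision_texts, fake_old_model)
for text, annotations in mixed:
    print(text, annotations["entities"])
```

The old model's predictions are also merged into the newly annotated texts. Otherwise a text that contains an old-type entity but only carries the new label would teach the model that the old span is not an entity.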

0 replies

svlandeg · 2021-01-28T17:15:48Z

svlandeg
Jan 28, 2021

Also, just FYI, if you're mainly interested in recognizing a list of common synonyms, you might also consider a more rule-based approach, e.g. https://spacy.io/usage/rule-based-matching#phrasematcher. This might be particularly useful for words that are not ambiguous / not dependent on context.
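
To illustrate the idea behind matching a fixed list of terms, here is a plain-Python sketch. It does not use spaCy's PhraseMatcher API; it only shows a case-insensitive term lookup with word boundaries. The term list and the `build_term_matcher` helper are assumptions made up for the example.

```python
import re


def build_term_matcher(terms, label):
    """Compile a case-insensitive matcher for a fixed list of phrases."""
    # Longest terms first so that "crude oil" wins over "oil".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(t) for t in ordered),
        re.IGNORECASE,
    )

    def match(text):
        # Return character spans in the same (start, end, label) form as the training data.
        return [(m.start(), m.end(), label) for m in pattern.finditer(text)]

    return match


# Hypothetical term list; a real one would come from domain knowledge.
find_oil = build_term_matcher(["oil", "crude oil", "petroleum"], "OIL")
text = "Marathon Petroleum cut its crude oil postings"
for start, end, label in find_oil(text):
    print(label, text[start:end])
```

Note that this also tags "Petroleum" inside the company name, which is the kind of context-dependent case where a list lookup falls short and a statistical model may still be needed.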

0 replies

aravind-812 · 2021-01-28T17:45:37Z

aravind-812
Jan 28, 2021
Author

@svlandeg I initially trained a model to recognize links (Reddit, Stack Overflow and Twitter links), and after that I wanted to add another entity type, in this case petroleum products.
Now I also want the model to recognize petroleum products, so I labeled them with a tag ('OIL') and trained on them.
But the model is unable to recognize them.

As far as the HORSE labeling is concerned, it was the default training data provided by spaCy.
I commented it out anyway while training.

3 replies

svlandeg Jan 29, 2021

As far as the HORSE labeling is concerned, it was the default training data provided by spaCy.
I commented it out anyway while training.

Ok, but these details are important, of course, and I can only go by the code you've pasted. Can you share a minimal code snippet that runs as is, and that showcases the problem you're running into?

svlandeg Jan 29, 2021

I initially trained a model to recognize links (Reddit, Stack Overflow and Twitter links)

Just out of curiosity - why are you training an NER on this challenge? Couldn't you recognize links with some sort of pattern matching / regular expression? The same link isn't going to change in meaning depending on context, right? So I don't think NER is suited for your task, but I may be misunderstanding something.
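
As a rough sketch of the pattern-matching route for links, a regular expression can produce the same kind of character spans that an NER model would. The pattern below is deliberately simple and is an assumption for illustration, not a complete URL grammar; the `LINK` label and the `find_links` helper are also invented. The sample sentence is taken from the test text in the original code.

```python
import re

# Simple pattern: http(s) scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")
# Punctuation that usually belongs to the sentence rather than the URL.
TRAILING_PUNCTUATION = ".,;:!?)"


def find_links(text):
    """Return (start, end, "LINK") spans for http(s) URLs found in text."""
    spans = []
    for m in URL_PATTERN.finditer(text):
        url = m.group().rstrip(TRAILING_PUNCTUATION)
        spans.append((m.start(), m.start() + len(url), "LINK"))
    return spans


text = "Details can be found at https://www.textfixer.com/tools/remove-line-breaks.php?"
for start, end, label in find_links(text):
    print(label, text[start:end])
```

Because the result does not depend on training, links found this way cannot be "forgotten" when the statistical model is later updated with other entity types.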

aravind-812 Jan 29, 2021
Author

@svlandeg
I just took links as an example.
To give you another example:
I have annotated and trained a bunch of entities (tech companies) and I still observe catastrophic forgetting.
The only solution that gets referred to is this blog post (https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting?ref=Welcome.AI),
and the syntax in it is outdated.

svlandeg · 2021-01-28T18:12:03Z

svlandeg
Jan 28, 2021

FYI - I've transferred this issue to the discussion forum, which is better suited for usage questions and community discussions!

0 replies

shrinidhin · 2021-12-14T05:10:43Z

shrinidhin
Dec 14, 2021

Hi! I have a similar use case that I am trying to implement. I am trying to build an entity recognizer with a vast set of entities of a specific type. It is quite possible that I will not have training data to cover all entity labels. Is it possible to include them in a vocab of some sort in spaCy's NER?

2 replies

polm Dec 20, 2021

Hey, you should probably open a new discussion since this one is pretty old.

I'm not entirely clear what you're asking, but if by vocab you mean a list of words, maybe you should have a look at the docs on rule based matchers.

shrinidhin Dec 20, 2021

Hey @polm, thanks for your reply! I believe I am looking for something similar to a list of words, but I will open a new discussion and describe in detail what exactly I'm looking to implement.
