Huge losses while training spaCy custom model #12280
I tried to create a new spaCy model and train it with a custom dataset, but I received huge losses while training. How can I fix the issue and get it trained correctly? The training data has more than 1,000 samples.

Example training data:

Here is my code:
Hey, thanks for the question! Let me focus on the example data first:

```python
data = [
    ("AGS", {"entities": [(0, 3, "CUST")]}),
    ("YML SERVICOS LTD", {"entities": [(0, 16, "CUST")]}),
    ("BORG GROUP", {"entities": [(0, 10, "CUST")]}),
    ("GRABCRANEX", {"entities": [(0, 10, "CUST")]}),
    ("GREEN SHIP", {"entities": [(0, 10, "CUST")]}),
]
```

If I understand correctly, the texts here are the names of the entities themselves, not documents. When training a named entity recognizer, the goal is usually to find entity names within texts, so the methodology is to annotate texts that contain the entities rather than to only expose the model to the entities in isolation. What you have here is essentially just a list of entities.

As for the value of the loss: do I understand correctly that this is the same question as https://stackoverflow.com/questions/75459001/spacy-train-the-existing-model-with-custom-data? The loss there seems to be around `1.6e-08`:

```python
assert 1.6 * 10**-8 == 1.6e-08
```

You can ask Python to show the actual number with:

```python
print(f"{1.6e-08:.10f}")
```

The output should be `0.0000000160`. If I'm reading it right, the loss is actually very low, not huge.
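To make the annotation point concrete, here is a minimal sketch of what NER training data usually looks like in spaCy v3: full sentences with `(start, end, label)` character offsets for the entities inside them, converted to a `DocBin` on disk for `spacy train`. The example sentences and the `train.spacy` file name below are invented for illustration, not taken from your data:

```python
import spacy
from spacy.tokens import DocBin

# Annotated *texts*, not bare entity names: each entry is a document
# plus character offsets for the entities it contains.
# These sentences are made up for illustration.
train_data = [
    ("The cargo was shipped by YML SERVICOS LTD last week.",
     {"entities": [(25, 41, "CUST")]}),
    ("BORG GROUP confirmed the order on Monday.",
     {"entities": [(0, 10, "CUST")]}),
]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        # char_span returns None when the offsets don't line up with
        # token boundaries; skipping those spans avoids a crash here.
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
```

You can then point `[paths.train]` in your training config at `train.spacy` and run `python -m spacy train config.cfg`. Trained on data like this, the model sees entities in context and can learn to find them in new text.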