Transformer stuck at zero loss #12211
-
Hello, I am working on an NER project and I am trying to create a pipeline with both a transformer and an NER component in it. The code is fairly simple so far, but during training the transformer loss stays at 0 and never moves, while I can clearly see that the NER component is "learning".
The output is as follows:
I have dug around the forums and other resources, but unfortunately I can't work out why the NER component is not listening to the transformer, or vice versa. Any help would be greatly appreciated.
P.S. Amazing platform and product, keep up the good work!
P.S. 2: I realize that using a training loop is not the recommended approach in spaCy v3, but my circumstances require it.
-
When you add an NER component to your pipeline without a config, it uses the default config. The default config uses an embedded CNN tok2vec rather than a listener, so it has no way to interact with the Transformer.
I would recommend using the quickstart widget in the training docs to generate a config with a transformer and examining it for an example of working settings. Even if you must use a training loop, you can copy the config settings into your code, as in the sketch below.
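For illustration, here is a minimal sketch of what those copied settings might look like, assuming spacy-transformers is installed and using `roberta-base` purely as an example model; the exact values should come from the config you generate:

```python
import spacy

nlp = spacy.blank("en")

# The transformer component's own default config is fine here; it is
# registered by spacy-transformers and defaults to roberta-base.
nlp.add_pipe("transformer")

# Override the NER model so its tok2vec is a TransformerListener, which
# reads the shared transformer output and backpropagates into it.
nlp.add_pipe(
    "ner",
    config={
        "model": {
            "@architectures": "spacy.TransitionBasedParser.v2",
            "state_type": "ner",
            "extra_state_tokens": False,
            "hidden_width": 64,
            "maxout_pieces": 2,
            "use_upper": False,
            "nO": None,
            "tok2vec": {
                "@architectures": "spacy-transformers.TransformerListener.v1",
                "grad_factor": 1.0,
                "pooling": {"@layers": "reduce_mean.v1"},
                "upstream": "*",
            },
        }
    },
)
```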
Can you elaborate on why you can't use a config file?
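For completeness, a rough sketch of how a custom loop over that pipeline might look; the training data is a made-up placeholder and batching is left out for brevity:

```python
from spacy.training import Example

# Placeholder data purely for illustration.
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup.", {"entities": [(0, 5, "ORG")]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() infers the NER labels from the examples and sets up the models.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(10):
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
    # With the listener in place, both the "transformer" and "ner" losses
    # should now move during training.
    print(epoch, losses)
```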