Spancat not being initialized before training #11636

NixBiks · 2022-10-12T16:03:47Z

I have a config file like this:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["sentencizer", "domain_entity_ruler", "spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.sentencizer]
factory = "sentencizer"
overwrite = false
punct_chars = ["!", ".", "\n", "?", "\r", "\r\n"]
scorer = {"@scorers":"spacy.senter_scorer.v1"}

[components.domain_entity_ruler]
factory = "domain_entity_ruler"
include_labels = ["AMOUNT"]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[components.spancat.suggester]
@misc = "sentence_suggester.v1"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["sentencizer"]
annotating_components = ["sentencizer"]
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.0
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
sents_f = null
sents_p = null
sents_r = null
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.components]

[initialize.tokenizer]

[initialize.before_init]
@callbacks = "customize_tokenizer"
tokenizers = ["abbreviations_tokenizer", "financial_tokenizer"]

I also have a script like this to convert my data to spaCy training data:

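import random

import spacy
import typer
from spacy.tokens import DocBin, SpanGroup
from spacy.util import load_config

# `models` and `report_to_text` are project-specific helpers that are not shown here.
app = typer.Typer()


@app.command()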
def convert_for_spancat(
    input_file: str = typer.Option(..., "--input"),
    train_file: str = typer.Option(..., "--train"),
    dev_file: str = typer.Option(..., "--dev"),
    config_file: str = typer.Option(..., "--config"),
    split: float = typer.Option(..., "--eval-split"),
):
    """Prepare data for spancat."""
    random.seed(42)
    input_data = models.FilePayload.parse_file(input_file).__root__

    config = load_config(config_file)
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config, disable=["spancat"])  # disabling since spancat is not initialized
    train = DocBin()
    dev = DocBin()
    n = len(input_data)
    split_index = int(n * split)
    random.shuffle(input_data)
    span_key = config["components"]["spancat"]["spans_key"]

    for index, example in enumerate(input_data):
        full_text = report_to_text(title=example.title, html=example.html)
        doc = nlp(full_text)
        group = SpanGroup(doc, name=span_key, spans=[])
        for span in doc.sents:
            if example.deal_value is not None and example.deal_value in span.text:
                group.append(span)

        if len(group) == 0 and example.deal_value is not None:
            typer.echo(f"Warning: no spans found for {example.title}")
        doc.spans[span_key] = group
        if index < split_index:
            train.add(doc)
        else:
            dev.add(doc)

    train.to_disk(train_file)
    dev.to_disk(dev_file)

But when I run the following command

spacy train configs/sent_classifier.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --gpu-id -1

then I get the following error

ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

I've been searching around for a solution, but as far as I can tell my training and dev data are saved correctly, i.e. doc.spans[span_key] = group is set for all documents in the DocBins.
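One way to check what actually ended up in the saved data is to load the docs back from the DocBin and count the labels of the spans stored under the spans key. The sketch below is only an illustration: it uses a blank English pipeline and an in-memory DocBin instead of the real corpus/train.spacy, and the helper name count_span_labels is hypothetical.

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin


def count_span_labels(doc_bin, vocab, spans_key="sc"):
    """Count the labels of all spans stored under spans_key in a DocBin."""
    counts = Counter()
    for doc in doc_bin.get_docs(vocab):
        for span in doc.spans.get(spans_key, []):
            counts[span.label_] += 1
    return counts


nlp = spacy.blank("en")
doc = nlp("The deal was worth 5 million dollars. Nothing else happened.")
# A span added without a label, as in the convert script above.
doc.spans["sc"] = [doc[0:8]]

doc_bin = DocBin(docs=[doc])
label_counts = count_span_labels(doc_bin, nlp.vocab)
print(label_counts)
```

Spans that were created without a label are counted under the empty string, so the counter makes it easy to see whether the saved spans carry any label at all. For files on disk, DocBin().from_disk(path) can be used in place of the in-memory DocBin.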

NixBiks · 2022-10-12T16:09:55Z

Ahaaa - I had to add a label to the span that I add to the SpanGroup, i.e. span.label_ = MY_LABEL. But now I get really strange output from training:

=================================== train ===================================
Running command: spacy train configs/sent_classifier.cfg --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --gpu-id -1
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-10-12 18:06:48,653] [INFO] Set up nlp object from config
[2022-10-12 18:06:48,659] [INFO] Pipeline: ['sentencizer', 'domain_entity_ruler', 'spancat']
[2022-10-12 18:06:48,825] [INFO] Created vocabulary
[2022-10-12 18:06:48,826] [INFO] Finished initializing nlp object
[2022-10-12 18:06:50,673] [INFO] Initialized pipeline components: ['domain_entity_ruler', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['sentencizer', 'domain_entity_ruler', 'spancat']
ℹ Frozen components: ['sentencizer']
ℹ Set annotations on update for: ['sentencizer']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          0.00       29.73       19.30       64.71    0.30
  0     200          0.00       29.73       19.30       64.71    0.30
  0     400          0.00       29.73       19.30       64.71    0.30
  0     600          0.00       29.73       19.30       64.71    0.30
  0     800          0.00       29.73       19.30       64.71    0.30
  0    1000          0.00       29.73       19.30       64.71    0.30
  1    1200          0.00       29.73       19.30       64.71    0.30
  1    1400          0.00       29.73       19.30       64.71    0.30
  1    1600          0.00       29.73       19.30       64.71    0.30
Epoch 2:   0%|          | 0/200 [00:00<?, ?it/s]
✔ Saved pipeline to output directory
training/model-last

FYI, my convert script now looks like this:

@app.command()
def convert_for_spancat(
    input_file: str = typer.Option(..., "--input"),
    train_file: str = typer.Option(..., "--train"),
    dev_file: str = typer.Option(..., "--dev"),
    config_file: str = typer.Option(..., "--config"),
    split: float = typer.Option(..., "--eval-split"),
):
    """Prepare data for spancat."""
    random.seed(42)
    input_data = models.FilePayload.parse_file(input_file).__root__

    config = load_config(config_file)
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config, disable=["spancat"])  # disabling since spancat is not initialized
    nlp.initialize()
    train = DocBin()
    dev = DocBin()
    n = len(input_data)
    split_index = int(n * split)
    random.shuffle(input_data)
    span_key = config["components"]["spancat"]["spans_key"]

    for index, example in enumerate(input_data):
        full_text = report_to_text(title=example.title, html=example.html)
        doc = nlp(full_text)
        group = SpanGroup(doc, name=span_key, spans=[])
        for span in doc.sents:
            if example.deal_value is not None and example.deal_value in span.text:
                span.label_ = "HAS_DEAL_VALUE"
                group.append(span)

        if len(group) == 0 and example.deal_value is not None:
            typer.echo(f"Warning: no spans found for {example.title}")
        doc.spans[span_key] = group
        if index < split_index:
            train.add(doc)
        else:
            dev.add(doc)

    train.to_disk(train_file)
    dev.to_disk(dev_file)
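The effect of labelling the spans can be checked in isolation. The following is a minimal sketch, assuming a blank English pipeline with the default spancat settings (not the config above, so no custom suggester or entity ruler) and a single made-up example; it only shows that spancat picks up its labels from the labelled spans in the reference docs when the pipeline is initialized.

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example

nlp = spacy.blank("en")
# Default spancat settings apart from the spans key used in the thread.
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})

# Reference doc with one labelled span covering the whole sentence.
reference = nlp.make_doc("The deal was worth 5 million dollars.")
reference.spans["sc"] = [Span(reference, 0, 8, label="HAS_DEAL_VALUE")]
example = Example(nlp.make_doc(reference.text), reference)

# During initialization, spancat collects labels from the reference spans.
nlp.initialize(get_examples=lambda: [example])
print(spancat.labels)
```

This says nothing about whether the model then learns anything; it only confirms that the labels are found at initialization time.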

1 reply

adrianeboyd Oct 20, 2022

Hmm, this kind of training output makes it look like the spancat evaluation is reflecting an earlier rule-based component rather than any labels added by spancat itself, which doesn't seem to be learning or predicting anything.

What does domain_entity_ruler do?

Are you sure that the sentence suggester is working as intended? I think you might get this output because there are no spans being suggested, so there's nothing for the spancat model to work with.
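To check this, the suggester can be called directly on a few docs to see how many spans it returns. The custom sentence_suggester.v1 from the config is not shown in this thread, so the sketch below uses a hypothetical stand-in registered as demo_sentence_suggester.v1 that suggests one span per sentence; the registry name, function names, and example texts are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional

import spacy
from spacy.tokens import Doc
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@spacy.registry.misc("demo_sentence_suggester.v1")  # hypothetical name
def build_sentence_suggester() -> Callable[..., Ragged]:
    def suggest_sentences(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans: List[List[int]] = []
        lengths: List[int] = []
        for doc in docs:
            # Without sentence boundaries there is nothing to suggest.
            sents = list(doc.sents) if doc.has_annotation("SENT_START") else []
            spans.extend([sent.start, sent.end] for sent in sents)
            lengths.append(len(sents))
        if spans:
            data = ops.asarray(spans, dtype="i")
        else:
            data = ops.xp.zeros((0, 2), dtype="i")
        # One (start, end) row per suggested span, plus the span count per doc.
        return Ragged(data, ops.asarray1i(lengths))

    return suggest_sentences


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
docs = [
    nlp("The deal was worth 5 million dollars. Nothing else happened."),
    nlp("No value here."),
]
suggester = build_sentence_suggester()
suggested = suggester(docs)
print(suggested.lengths)  # number of suggested spans per doc
print(suggested.dataXd)   # (start, end) token offsets of each span
```

If the per-doc counts are all zero for docs that have not been through the sentencizer, that would be consistent with the spancat model having no candidate spans to score.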
