Can't train SpanCat #12519

delucca · 2023-04-11T14:15:58Z

delucca
Apr 11, 2023

Hey everyone!

I'm trying to train a SpanCat but not being able to do so. I've properly created the .spacy file with the data but when I start the training I get this error:

ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

Here is a summary of my setup:

I've a list of spans in the following format:

train_data = [
  [
    'SAMPLE TEXT 1',
    [80, 90, 'LABEL']
  ],
  [
    'SAMPLE TEXT 2',
    [80, 85, 'LABEL 2'],
  ]
  ...
]

I've created the following function to parse those into the .spacy file:

nlp = spacy.blank('en')
docs = []


for text, spans in train_data:
    doc = nlp.make_doc(text)
    spans = []

    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode="strict")
        spans.append(span)

    doc.spans["sc"] = spacy.tokens.SpanGroup(doc, name="sc", spans=spans)
    docs.append(doc)

doc_bin = spacy.tokens.DocBin(docs=docs)
doc_bin.to_disk('./data/spacy/train.spacy')

I'm executing the train with the following command on my Jupyter notebook:

!python -m spacy train ./data/spacy/config.cfg --output ./data/spacy/training --paths.train ./data/spacy/train.spacy --paths.dev ../data/spacy/test.spacy --gpu-id 0

This is my config file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]


[components.transformer]
factory = "transformer"

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 10
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "RAdam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.0001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Can anyone give me a hand? I'm not sure what I should do.

Answered by delucca

Apr 11, 2023

Just figured out the issue, for some reason the function I was using to create the training data was not generating in the format I was expecting

View full answer

delucca · 2023-04-11T16:14:58Z

delucca
Apr 11, 2023
Author

Just figured out the issue, for some reason the function I was using to create the training data was not generating in the format I was expecting

2 replies

wendysphillips Sep 24, 2023

Could you please explain more about the function not generating the format you were expecting? What did you need to fix in the above code to have it generate the format you wanted? Thanks!

delucca Sep 25, 2023
Author

I don't remember exactly (since it's been a few months), but I remember that the format I was expecting was not being correctly generated by the function. In the end I gave up using SpanCat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Can't train SpanCat #12519

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Can't train SpanCat #12519

Uh oh!

Uh oh!

delucca Apr 11, 2023

Replies: 1 comment · 2 replies

Uh oh!

delucca Apr 11, 2023 Author

Uh oh!

wendysphillips Sep 24, 2023

Uh oh!

delucca Sep 25, 2023 Author

delucca
Apr 11, 2023

Replies: 1 comment 2 replies

delucca
Apr 11, 2023
Author

delucca Sep 25, 2023
Author