How to train the trainable lemmatizer #10806

spatiebalk · 2022-05-16T12:49:26Z

Hi,

I'd like to train the trainable lemmatizer from spacy 3.3. I have added the correct lemma labels to the dataset by assigning the lemma value for all tokens like this:

token.lemma_ = correct_lemma
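For readers following along, here is a minimal sketch of how lemma annotations can end up in a .spacy file. The words, the lemmas and the train.spacy file name are made-up illustrations, not the original poster's data. It builds a Doc with gold lemmas and stores it in a DocBin, which keeps the LEMMA attribute by default.

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

# Hypothetical toy data: one sentence with its gold lemmas.
words = ["The", "cats", "were", "running"]
lemmas = ["the", "cat", "be", "run"]

# The Doc constructor accepts the lemmas directly.
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

# DocBin serializes LEMMA along with its other default attributes.
doc_bin = DocBin(docs=[doc])
# doc_bin.to_disk("train.spacy")  # hypothetical output path

# Round trip in memory to confirm that the lemmas survive serialization.
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))[0]
restored_lemmas = [token.lemma_ for token in restored]
print(restored_lemmas)
```

Assigning token.lemma_ on docs produced by a tokenizer-only pipeline and adding those docs to the DocBin should be equivalent.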

I have not initialized the lemmatizer's labels yet. I wanted to generate the lemmatizer labels JSON file with init labels, as described here, but that results in this error, which feels circular:

ValueError: [E143] Labels for component 'trainable_lemmatizer' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's 'initialize' method.

And I haven't been able to find out how to provide a representative batch of examples either.

Here is the config file I'm using:
config_trainable_lemmatizer.txt

So the question is, what am I doing wrong and how can I initialize the trainable lemmatizer component?

Answered by adrianeboyd

adrianeboyd · 2022-05-16T15:08:13Z

Be aware that spacy.read_labels.v1 fails silently unless you add require = true, so it could be something as simple as an incorrect path.

Alternatively, you can skip this step and just pass your training corpus to spacy train, which also initializes the labels from train.spacy.

If you want to use nlp.initialize, then it looks like this:

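import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")
# Read the training examples from the .spacy file
examples = list(Corpus("/path/to/train.spacy")(nlp))
# nlp.initialize expects a callable that returns the examples
nlp.initialize(lambda: examples)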

It is on our to-do list to improve all the Component.initialize docs because they don't really show how to do it properly.
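A self-contained variant of the nlp.initialize approach, as a sketch only: it assumes the trainable_lemmatizer component has been added to a blank pipeline and uses in-memory toy examples instead of a train.spacy file. The sentence and lemmas are invented for illustration. The example is repeated three times so that every edit tree reaches the component's default min_tree_freq of 3.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("trainable_lemmatizer")


def make_example():
    # Hypothetical toy sentence with gold lemmas in the reference doc.
    doc = nlp.make_doc("The cats were running")
    return Example.from_dict(doc, {"lemmas": ["the", "cat", "be", "run"]})


# Three copies, so each (form, lemma) edit tree is seen three times.
examples = [make_example() for _ in range(3)]

# nlp.initialize takes a callable that returns the examples.
nlp.initialize(lambda: examples)

# After initialization the component should have one label per edit tree.
num_labels = len(lemmatizer.labels)
print(num_labels)
```

With a real corpus, the list of examples would come from Corpus("/path/to/train.spacy")(nlp) instead.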

6 replies

adrianeboyd May 17, 2022

You don't have to use init labels or initialize the labels in advance. This feature is only there to save time if you're training repeatedly and initializing the labels is a slow step in the process.

My general advice would be to remove the read_labels part for the edit tree lemmatizer from your config (or start with a new config from spacy init config -p trainable_lemmatizer) and to use spacy train. This should just work.

We don't really recommend using your own code loop like above to train the model. spacy train manages a lot of the fiddly details for you and should be an easier place to start.

It could be that something is going wrong with how the lemmas are saved in your .spacy file and that's why the lemmatizer isn't being initialized? It might be helpful to test just with your .spacy files and the trainable lemmatizer on its own before trying to train a larger pipeline.
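One way to test whether the lemmas made it into a .spacy file is to count the tokens that carry a lemma. The helper below, lemma_coverage, is a hypothetical name introduced for this sketch, and the demo data is built in memory so the snippet runs without any files.

```python
import spacy
from spacy.tokens import Doc, DocBin


def lemma_coverage(doc_bin, vocab):
    """Return (tokens_with_lemma, total_tokens) for all docs in a DocBin."""
    with_lemma = 0
    total = 0
    for doc in doc_bin.get_docs(vocab):
        for token in doc:
            total += 1
            if token.lemma_:  # empty string means no lemma was stored
                with_lemma += 1
    return with_lemma, total


nlp = spacy.blank("en")

# Toy data: one doc with lemmas, one without.
annotated = Doc(nlp.vocab, words=["cats", "ran"], lemmas=["cat", "run"])
unannotated = Doc(nlp.vocab, words=["dogs", "bark"])
demo_bin = DocBin(docs=[annotated, unannotated])

counts = lemma_coverage(demo_bin, nlp.vocab)
print(counts)

# For a real file (path is a placeholder):
# counts = lemma_coverage(DocBin().from_disk("/path/to/train.spacy"), nlp.vocab)
```

If the first number is zero for your training file, the lemmatizer has nothing to build labels from.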

In the future, please format code examples using fenced code blocks (three backticks on a separate line before and after the code), which makes it easier to read and to copy/paste for testing.

spatiebalk May 17, 2022
Author

Thank you! It worked :)

jd12006 May 19, 2022

A comment that may help others coming across this issue: I was also getting this error when running !python -m spacy debug data. I was testing a config created using spacy init config -p trainable_lemmatizer, so the config was fine. As suggested by @adrianeboyd above, the issue was my training data. I had used nlp = spacy.blank("en") to tokenise when creating my .spacy files, so the docs contained no lemmas. When I recreated them using nlp = spacy.load("en_core_web_lg"), spacy debug data ran without error. I guess the trainable_lemmatizer pulls its labels from the lemmas on the tokens... Others switching from the rule-based lemmatizer, which (I think) takes labels from the tagger & attribute_ruler components, might want to know this!
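The behaviour described in this comment can be checked directly. The sketch below uses an invented sentence and shows that a blank pipeline only tokenizes, so its docs carry no lemma annotation until lemmas are assigned or a pipeline with a lemmatizer is run.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no lemmatizer
doc = nlp("The cats were running")

# No component has set any lemmas yet.
has_lemmas_before = doc.has_annotation("LEMMA")

# Assign gold lemmas by hand (toy values for illustration).
for token, lemma in zip(doc, ["the", "cat", "be", "run"]):
    token.lemma_ = lemma

has_lemmas_after = doc.has_annotation("LEMMA")
print(has_lemmas_before, has_lemmas_after)
```

Docs saved in the first state give the trainable lemmatizer no gold lemmas to learn from.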

Norky101 Jul 18, 2023

After reading @adrianeboyd's reply, I made a fresh config file and ran the train command, which gave me this:

Command:

 python -m spacy train 'config.cfg' --output 'model\' --code 'sentence.py' --code 'matcher.py'

('sentence.py' and 'matcher.py' are custom components that have worked on a different recent config file.)

error:

ValueError: [E143] Labels for component 'trainable_lemmatizer' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

From my reading of the spaCy docs, the label data should be collected when train is run, which is why specifying labels before training is optional.
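One possible cause worth ruling out, offered as an assumption and not as a diagnosis of this particular case: the config below sets min_tree_freq = 3, and as far as I understand the component, edit trees seen fewer than that many times in the training data are not turned into labels. On a very small corpus that could leave the component with no labels at all. The helper try_initialize is a hypothetical name for this sketch, and the sentence is toy data.

```python
import spacy
from spacy.training import Example


def try_initialize(min_tree_freq):
    """Return None if initialization succeeds, else the error message."""
    nlp = spacy.blank("en")
    nlp.add_pipe("trainable_lemmatizer", config={"min_tree_freq": min_tree_freq})
    # A single toy example: every edit tree occurs exactly once.
    doc = nlp.make_doc("The cats were running")
    example = Example.from_dict(doc, {"lemmas": ["the", "cat", "be", "run"]})
    try:
        nlp.initialize(lambda: [example])
    except ValueError as err:
        return str(err)
    return None


# With a threshold of 1, each tree seen once is enough to become a label.
result = try_initialize(1)
print(result)
```

Calling try_initialize(3) on the same single example is expected to return an error message instead, since no tree reaches the threshold. If your corpus is small, lowering min_tree_freq in the config is a cheap thing to try.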

config.cfg file:

[paths]
train = "output.spacy"
dev = "output.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner","parser","trainable_lemmatizer"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
top_k = 1

[components.trainable_lemmatizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.trainable_lemmatizer.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 0.33
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
dep_uas = 0.17
dep_las = 0.17
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
lemma_acc = 0.33

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Using:
Python 3.11
Spacy 3.6
Models 3.6
Pycharm

rmitsch Jul 19, 2023
Maintainer

@Norky101 If you still have this problem, please open a new issue. Thanks!
