Training the textcat component: what else to include in the training config? #11394
I'm training the `textcat` component. Our pipeline involves several other components that run before and after it in a fixed order, and a few weeks ago I had posted a question about this setup. I'm still unsure about a couple of points.

First, when I prepare my corpus for training, do the DocBins need the annotations produced by the rest of the pipeline, or only the gold `textcat` labels? My configuration to train `textcat` currently lists just that component. As I'm training only `textcat`, that seemed sufficient, but my colleague has said that the rest of the pipeline is needed as well, because of how spaCy trains the component. With the configuration above I've successfully completed a training run, and my trained component appears to work. Last, what attributes can the `textcat` model actually make use of, for example entity annotations set by `ner`?

Further to the above, I've since updated my training configuration file to include the full pipeline, starting with a weights section that covers every component. After filling in the rest of the config, I'm still not sure what is actually required: as I'm only interested in training the `textcat` component, does everything else need to be listed at all?
Replies: 1 comment 3 replies
It is basically correct that the components in `nlp.pipeline` should only be the ones you are interested in training. However, there is a wrinkle to this.

When you train a statistical model, it needs a source of features. In spaCy pipelines that's going to be a tok2vec or Transformer (with one exception, covered below). When you train a model, it's usually better to train the feature source with it at the same time, so in your case it would make sense to train a tok2vec and a textcat together. This does have a side effect: because the tok2vec has changed, any components you weren't training at the same time no longer work with it, since it is now speaking a different language. One way to work around that is to train everything together, but it's easier, and you often get the same performance, to just include multiple tok2vecs in a pipeline. See the docs on sharing embedding layers for more information about that.

Also, you absolutely do not need to run your training data through the whole pipeline before preparing your DocBins, since the annotations you'd be applying will basically be ignored.

The one exception to statistical pipelines needing a tok2vec: textcat can run with just a bag-of-words architecture, in which case you don't need a tok2vec. That architecture is very fast but typically has relatively low accuracy.

Also, when in doubt, I strongly recommend trying the default settings from the training quickstart. The defaults are pretty good!
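Two of these points are easy to show in code. First, on preparing your corpus: below is a minimal sketch (the texts, labels, and file path are made up for illustration) of building a training DocBin with nothing but a blank tokenizer and the gold `textcat` labels, without running any other components:

```python
import spacy
from spacy.tokens import DocBin

# A blank pipeline only tokenizes; no tagger/parser/ner has to run here.
nlp = spacy.blank("en")

# Hypothetical (text, label) pairs standing in for a real corpus.
train_data = [
    ("The service was excellent.", "POSITIVE"),
    ("I will never order from them again.", "NEGATIVE"),
]

doc_bin = DocBin()
for text, label in train_data:
    doc = nlp.make_doc(text)
    # Gold category scores: every label appears on every doc,
    # with the correct one set to 1.0.
    doc.cats = {"POSITIVE": 0.0, "NEGATIVE": 0.0}
    doc.cats[label] = 1.0
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")  # hypothetical output path
```

Second, on the bag-of-words exception: a sketch (assuming a recent spaCy v3.x with the `spacy.TextCatBOW.v2` architecture) of a pipeline whose only trainable component is a bag-of-words textcat, with no tok2vec at all:

```python
import spacy

nlp = spacy.blank("en")
# Override the default ensemble model with the bag-of-words architecture,
# which does not need a tok2vec or transformer as a feature source.
nlp.add_pipe(
    "textcat",
    config={
        "model": {
            "@architectures": "spacy.TextCatBOW.v2",
            "exclusive_classes": True,
            "ngram_size": 1,
            "no_output_layer": False,
        }
    },
)
```

The same override can of course live in the `[components.textcat.model]` block of a training config instead of being passed to `nlp.add_pipe`.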
If you're using textcat with a tok2vec, you can customize the attributes the tok2vec uses; see the embedding layer architectures in the docs. You can use any of the token attributes, but note that entity-related attributes are among those that exist on the tokens yet won't actually be set when the tok2vec runs normally. There have also been questions about using NER features in textcat before, see #10470, but basically it probably won't be very effective.
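To make the attributes point concrete: the token attributes are selected through the `attrs` setting of the embedding layer, e.g. `MultiHashEmbed`. Below is a sketch (assuming a recent spaCy v3.x; the widths, row counts, and attribute list are example values, not a recommendation) of a tok2vec configured with explicit attributes:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "tok2vec",
    config={
        "model": {
            "@architectures": "spacy.Tok2Vec.v2",
            "embed": {
                "@architectures": "spacy.MultiHashEmbed.v2",
                "width": 96,
                # Token attributes used as features. These lexical attributes
                # are available straight from the tokenizer; entity attributes
                # would not be filled in at this point, so they aren't useful here.
                "attrs": ["NORM", "PREFIX", "SUFFIX", "SHAPE"],
                "rows": [5000, 2500, 2500, 2500],
                "include_static_vectors": False,
            },
            "encode": {
                "@architectures": "spacy.MaxoutWindowEncoder.v2",
                "width": 96,
                "depth": 4,
                "window_size": 1,
                "maxout_pieces": 3,
            },
        }
    },
)
```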