Unexpected textcat cat scores #12453
Unanswered
python3Berg asked this question in Help: Model Advice
Replies: 1 comment
Hey python3Berg, I'm not exactly sure yet what the issue could be, but the first main thing that stood out to me in the config is the …
I've been hitting the team up with tons of questions, so let me first say thanks for all of the help and feedback.
Working with spaCy 3.4 on a local machine (CPU only), I am using a fairly simple configuration to train a text classifier, very similar to the example published at https://github.com/explosion/projects/tree/v3/pipelines/textcat_demo. I have trained using both textcat and textcat_multilabel (I was told that textcat_multilabel uses an architecture closer to spaCy v2.3's).
The data universe is a series of sections and sub-sections from legal documents. In production, I use a recursive process where top-level sections are classified and then, depending on the assigned class, lower-level sections are classified recursively. There are about 80 different classes, and the training data has between 20 and 100 examples of each, generally without much deviation in the text. Roughly, the process has the shape sketched below.
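(A minimal illustrative sketch, not my production code; the Section structure, the class names, and the 0.8 cut-off are stand-ins.)

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative stand-ins; the real pipeline has ~80 classes.
RECURSIVE_CLASSES = {"DEFINITIONS", "COVENANTS"}

@dataclass
class Section:
    text: str
    subsections: list[Section] = field(default_factory=list)

def classify_tree(section: Section, nlp, threshold: float = 0.8) -> dict:
    """Classify a section, then recurse into subsections for classes that have them."""
    doc = nlp(section.text)
    labels = [label for label, score in doc.cats.items() if score >= threshold]
    node = {"labels": labels, "subsections": []}
    # Only descend when the assigned class calls for sub-classification.
    if any(label in RECURSIVE_CLASSES for label in labels):
        node["subsections"] = [
            classify_tree(sub, nlp, threshold) for sub in section.subsections
        ]
    return node
```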
In spaCy 2.3, I got generally excellent results by selecting from doc.cats any label scored above 80% confidence. These models were trained with a manual loop against a spacy.blank() instance, along the lines of the snippet below.
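(Simplified from memory; the labels, data, and hyperparameters here are placeholders, and this is the v2.x API, not v3.)

```python
import random
import spacy
from spacy.util import minibatch

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")  # spaCy v2.x API
nlp.add_pipe(textcat)
for label in ("INDEMNITY", "TERMINATION"):  # placeholder labels
    textcat.add_label(label)

# Placeholder training data in the v2 (text, annotations) format.
train_data = [
    ("The indemnifying party shall hold harmless ...",
     {"cats": {"INDEMNITY": 1.0, "TERMINATION": 0.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
```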
In spaCy 3.4, using the ensemble textcat model, even with the confidence threshold set at 99.5%, I am getting a huge number of false positives. These include positives where there is almost nothing similar to the training examples except for a couple of key words. That said, I am still getting very good results on the true positives, with some slippage for those just under the 99.5% hurdle.
I have no intuition for what might be happening. It's almost as if the models are returning cats based on relative rather than absolute similarity. It is so far from my expectations and past results that I assume I must be doing something wrong. I've trained with different encoding depths, different attributes, and anything else I can think of. I've considered using static vectors, but this seems like overkill, especially since v2.3 gave me such excellent results without them.
Any guidance on how I should evolve my approach is very welcome. Config file attached below. Thanks!
```ini
[paths]
train = ""
dev = ""
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@Tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.7
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 256
rows = [2000,2000,1000,1000,1000,1000]
attrs = ["ORTH","LOWER","PREFIX","SUFFIX","SHAPE","ID"]
include_static_vectors = false
[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
window_size = 1
maxout_pieces = 3
depth = 8
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.4
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 100
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = []
[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 4
stop = 50
compound = 1.05
t = 0.0
[training.logger]
@Loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_per_type = null
cats_score = 0.0
cats_score_desc = null
cats_micro_p = 0.0
cats_micro_r = 0.0
cats_micro_f = 1.0
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null
[initialize.before_init]
@callbacks = "custom_tokenizer"
[initialize.components]
[initialize.components.textcat_multilabel]
[initialize.tokenizer]
```
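For completeness, this is roughly how I launch training (paths and file names below are placeholders; --code is only needed because of the custom_tokenizer callback registered there):

```bash
python -m spacy train config.cfg --output ./output \
    --code functions.py \
    --paths.train ./train.spacy --paths.dev ./dev.spacy
```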