Unexpected textcat cat scores #12453
Unanswered
python3Berg asked this question in Help: Model Advice
Replies: 1 comment
Hey python3Berg, I'm not exactly sure yet what the issue could be, but the first main thing that stood out to me in the config is the …
I've been hitting the team up with tons of questions, so let me first say thanks for all of the help and feedback.
Working with spaCy 3.4 on a local machine (CPU only), I am using a fairly simple configuration to train a text classifier, very similar to the example published at https://github.com/explosion/projects/tree/v3/pipelines/textcat_demo. I have trained using both textcat and textcat_multilabel (I was told that textcat_multilabel uses an architecture closer to spaCy v2.3's).
The data universe is a series of sections and sub-sections from legal documents. In production, I use a recursive process where top-level sections are classified and then, depending on the assigned class, lower-level sections are classified recursively. There are about 80 different classes, and the training data has between 20 and 100 examples of each, generally without much deviation in the text. Roughly, the process has the shape sketched below.
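(A minimal illustrative sketch, not my production code; the Section structure, the class names, and the 0.8 cut-off are stand-ins.)

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative stand-ins; the real pipeline has ~80 classes.
RECURSIVE_CLASSES = {"DEFINITIONS", "COVENANTS"}

@dataclass
class Section:
    text: str
    subsections: list[Section] = field(default_factory=list)

def classify_tree(section: Section, nlp, threshold: float = 0.8) -> dict:
    """Classify a section, then recurse into subsections for classes that have them."""
    doc = nlp(section.text)
    labels = [label for label, score in doc.cats.items() if score >= threshold]
    node = {"labels": labels, "subsections": []}
    # Only descend when the assigned class calls for sub-classification.
    if any(label in RECURSIVE_CLASSES for label in labels):
        node["subsections"] = [
            classify_tree(sub, nlp, threshold) for sub in section.subsections
        ]
    return node
```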
In spaCy 2.3, I got generally excellent results by selecting from doc.cats any label scored above 80% confidence. These models were trained with a manual loop against a spacy.blank() instance, along the lines of the snippet below.
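(Simplified from memory; the labels, data, and hyperparameters here are placeholders, and this is the v2.x API, not v3.)

```python
import random
import spacy
from spacy.util import minibatch

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")  # spaCy v2.x API
nlp.add_pipe(textcat)
for label in ("INDEMNITY", "TERMINATION"):  # placeholder labels
    textcat.add_label(label)

# Placeholder training data in the v2 (text, annotations) format.
train_data = [
    ("The indemnifying party shall hold harmless ...",
     {"cats": {"INDEMNITY": 1.0, "TERMINATION": 0.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
```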
In spaCy 3.4, using the ensemble textcat model, even with the confidence threshold set at 99.5%, I am getting a huge number of false positives. These include positives where there is almost nothing similar to the training examples except for a couple of key words. That said, I am still getting very good results on the true positives, with some slippage for those just under the 99.5% hurdle.
I have no intuition for what might be happening. It's almost as if the models are returning cats based on relative rather than absolute similarity. It is so far from my expectations and past results that I assume I must be doing something wrong. I've trained with different encoding depths, different attributes, and anything else I can think of. I've considered using static vectors, but this seems like overkill, especially since v2.3 gave me such excellent results without them.
Any guidance on how I should evolve my approach is very welcome. Config file attached below. Thanks!
```ini
[paths]
train = ""
dev = ""
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@Tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.7
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 256
rows = [2000,2000,1000,1000,1000,1000]
attrs = ["ORTH","LOWER","PREFIX","SUFFIX","SHAPE","ID"]
include_static_vectors = false
[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
window_size = 1
maxout_pieces = 3
depth = 8
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.4
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 100
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
annotating_components = []
[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 4
stop = 50
compound = 1.05
t = 0.0
[training.logger]
@Loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_per_type = null
cats_score = 0.0
cats_score_desc = null
cats_micro_p = 0.0
cats_micro_r = 0.0
cats_micro_f = 1.0
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null
[initialize.before_init]
@callbacks = "custom_tokenizer"
[initialize.components]
[initialize.components.textcat_multilabel]
[initialize.tokenizer]
```
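For completeness, this is roughly how I launch training (paths and file names below are placeholders; --code is only needed because of the custom_tokenizer callback registered there):

```bash
python -m spacy train config.cfg --output ./output \
    --code functions.py \
    --paths.train ./train.spacy --paths.dev ./dev.spacy
```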