NER and Textcat with shared tok2vec with only one pipeline updating vectors #12382
-
Using spaCy 3.4 to train several domain-specific models, each containing an `ner` and `textcat` pipeline. After much help from this board, I decided to have separate configs for the two pipelines, with only `ner` using `tok2vec`, given its greater assumed impact. I want to explore next steps for combining the two pipelines. The discrete text chunks passed to each are often related, yet still different: the text most often used for `ner` is usually broken down into sections that are classified further. Given this, I've been advised that a shared `tok2vec` might lose some accuracy given the different text.

Is it possible for `textcat` to use the `Tok2VecListener` updated during `ner` training without making updates? Can I bias the `tok2vec` to support `ner`, yet still use it to improve `textcat`, even if `textcat` will not update vectors? Is this as simple as making `tok2vec` a frozen component during `textcat` training? (I train `ner` and `textcat` separately, so lots of freezing going on.) Thank you.

NER Config (section headers only):

```ini
[paths]
[system]
[nlp]
[components]
[components.ner]
[components.ner.model]
[components.ner.model.tok2vec]
[components.textcat]
[components.textcat.model]
[components.textcat.model.linear_model]
[components.textcat.model.tok2vec]
[components.tok2vec]
[components.tok2vec.model]
[components.tok2vec.model.embed]
[components.tok2vec.model.encode]
[corpora]
[corpora.dev]
[corpora.train]
[training]
[training.batcher]
[training.batcher.size]
[training.logger]
[training.optimizer]
[training.score_weights]
[pretraining]
[initialize]
[initialize.before_init]
[initialize.components]
[initialize.tokenizer]
```

Textcat Config (section headers only):

```ini
[system]
[nlp]
[components]
[components.textcat]
[components.textcat.model]
[components.textcat.model.linear_model]
[components.textcat.model.tok2vec]
[components.textcat.model.tok2vec.embed]
[components.textcat.model.tok2vec.encode]
[corpora]
[corpora.dev]
[corpora.train]
[training]
[training.batcher]
[training.batcher.size]
[training.logger]
[training.optimizer]
[training.score_weights]
[pretraining]
[initialize]
[initialize.before_init]
[initialize.components]
[initialize.components.textcat]
[initialize.tokenizer]
```
Replies: 1 comment
-
Hey, thanks for your question!
Please make sure to correctly format your posts, as it makes it easier for us to read them.
You're correct; by freezing the `tok2vec` component while training the `textcat`, the `tok2vec` will not get updated. To implement it the way you described, you need to make two training runs:

1. Train the `tok2vec` together with the `ner`, without the `textcat`.
2. Source the `tok2vec` and `ner` from the trained model, put them in `frozen_components`, and train the `textcat` on top.

You can read more about freezing components here and about sourcing components from trained pipelines here.
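As a rough sketch of the second run, the config could source the frozen components along these lines (the path `./ner_model` and the exact section layout are placeholders for illustration, not your actual config; adjust to your setup):

```ini
[nlp]
pipeline = ["tok2vec","ner","textcat"]

# Source the already-trained components from the first run
# (hypothetical path; point this at your trained NER pipeline)
[components.tok2vec]
source = "./ner_model"

[components.ner]
source = "./ner_model"

# Train a fresh textcat on top
[components.textcat]
factory = "textcat"

[training]
# Keep the sourced components fixed while training textcat
frozen_components = ["tok2vec","ner"]
# If textcat listens to the shared tok2vec, the frozen tok2vec
# still needs to run and set annotations during training
annotating_components = ["tok2vec"]
```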
Hope that helps!