Can I extract and use classifier component from a trained transformer pipeline? #12275

Mayazure · 2023-02-13T16:32:53Z

Mayazure
Feb 13, 2023

I've trained a pipeline which contains a transformer and a textcat_multilabel. The config is attached below.

We know that a pooling method is applied to the transformer output (in my case it is "reduce_mean.v1"). So I assume that the textcat_multilabel will take an input of a tensor with shape (1, 768).

Also, I figured out I can extract the pipeline component by doing nlp.components[1], which takes a Doc as input.

My question is ... is it possible, that I make a tensor with shape (1, 768) and input to the textcat_multilabel component directly to get the classification result? And how can I implement it?

Thank!

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","textcat_multilabel"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"...(customized scorer)"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "...(my pretrained roberta model)"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "psjm7_logger.v1"
log_path = ${paths.log}

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Answered by kadarakos

Feb 14, 2023

Hey there,

Thanks for the question! Let me first clarify one thing: the

pooling = {"@layers":"reduce_mean.v1"}

in the config refers to pooling the word-piece representations of the transformer together to align with the tokens produced by the Tokenizer.

The pooling within the textcat uses the reduce_sum and it happens right here https://github.com/explosion/spaCy/blob/master/spacy/ml/models/textcat.py#L123. But you are right, for each Doc the textcat does pool whole document into a single vector before feeding it forward to the output layer.

You are also correct in that you can retrieve the components from the pipeline. I tend to use nlp.get_pipe("textcat") instead of the nlp.components[i] …

View full answer

kadarakos · 2023-02-14T15:21:39Z

kadarakos
Feb 14, 2023

Hey there,

Thanks for the question! Let me first clarify one thing: the

pooling = {"@layers":"reduce_mean.v1"}

in the config refers to pooling the word-piece representations of the transformer together to align with the tokens produced by the Tokenizer.

The pooling within the textcat uses the reduce_sum and it happens right here https://github.com/explosion/spaCy/blob/master/spacy/ml/models/textcat.py#L123. But you are right, for each Doc the textcat does pool whole document into a single vector before feeding it forward to the output layer.

You are also correct in that you can retrieve the components from the pipeline. I tend to use nlp.get_pipe("textcat") instead of the nlp.components[i] to make sure its easy for me to understand later.

You can also access the underlying thinc model inside the textcat component like:

textcat = nlp.get_pipe("textcat")
textcat_model = textcat.model

However, the "spacy.TextCatEnsemble.v2" that you are using runs self-attention over the transformer output before feeding into the reduce_sum. If you would just run reduce_sum your self on top of the transformer you would bypass a crucial operation inside textcat:

        cnn_model = (
            tok2vec
            >> list2ragged()
            >> attention_layer
            >> reduce_sum()
            >> residual(maxout_layer >> norm_layer >> Dropout(0.0))
        )

I hope I made it a bit more clear.

2 replies

Mayazure Feb 16, 2023
Author

@kadarakos Thanks for correcting me with the reduce_mean pooling. That's very helpful!

Just to confirm that, for each Doc, the transformer's output contains a last_hidden_state which is the embedding matrix for all tokens and a pooled_output which is a single embedding vector for the first token. The TextCatEnsemble will then take the embedding matrix for all tokens and then apply reduce_sum to get a single vector for classification, while the pooled_output is not used. Is that correct?

kadarakos Feb 16, 2023

The "spacy-transformers.TransformerListener.v1" is actually taking care of a couple of things here with the trf2arrays layer: https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/architectures.py#L42.

That's correct the last_hidden_state is a (batch_size, sequence_length, hidden_size) array, however, the sequence_length here refers to the number of word-pieces produced by the tokenizer. The trf2arrays aligns each word-piece with the tokens produced by the Tokenizer and uses the reduce_mean to pool the wordpieces together to produce a token-vector. This is where these steps happen: https://github.com/explosion/spacy-transformers/blob/8a9c87706d4598418942900895bdcd5793a9954e/spacy_transformers/layers/trfs2arrays.py#L45. So this is where the reduce_mean happens to go from word-pieces to tokens.

The TextCatEnsemble takes the output of the trf2arrays which is of type List[Floats2d] where each Floats2d is a number of tokens by number dimensions matrix. This is the same format produced by the tok2vec component and so textcat handles the transformer or tok2vec outputs the same.

What the textcat ensemble does is to first run a self-attention layer on the transformer/tok2vec output and then it runs the reduce_sum. Here the reduce_sum is used to go from token-vectors to a single vector per document. After this it also runs a residual maxout layer: https://github.com/explosion/spaCy/blob/master/spacy/ml/models/textcat.py#L119

I hope i helped with the double reduce:

From word-piece vectors to tokens vectors.
From tokens vectors to a single document vector.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Can I extract and use classifier component from a trained transformer pipeline? #12275

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Can I extract and use classifier component from a trained transformer pipeline? #12275

Uh oh!

Mayazure Feb 13, 2023

Replies: 1 comment · 2 replies

Uh oh!

kadarakos Feb 14, 2023

Uh oh!

Mayazure Feb 16, 2023 Author

Uh oh!

kadarakos Feb 16, 2023

Mayazure
Feb 13, 2023

Replies: 1 comment 2 replies

kadarakos
Feb 14, 2023

Mayazure Feb 16, 2023
Author