Finetuning transformer into TextCat #9599
-
I'm having a lot of trouble finetuning/pretraining a tok2vec or transformer layer in a pipeline with a text categorizer. I have a few variations on the pipeline configured and I've encountered different errors in different places. I hope to work through a few issues in this thread. (I'll post the first and edit more in as I finish writing them up, or if/when appropriate.)

TypeError: 'FullTransformerBatch' object is not iterable

When pretraining a transformer, I get an error about transformer batches not being iterable. I assume this indicates something wrong with my configuration, but I've seen the error associated with known bugs, so I wonder if it's a spacy/transformers issue. Either way, I have no idea what this error is about. Does anyone see what is wrong below, or know where else to look for mistakes? (A rough sketch of the shape of config I mean follows the collapsed sections.)

Traceback
Command and config excerpts
Info about spaCy
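For context while those excerpts stay collapsed, here is a rough, partial sketch of the shape of config I mean: a transformer feeding a textcat through a listener, with the [pretraining] section pointed at the transformer component. The model name, architectures, and values below are illustrative defaults, not my actual settings.

```ini
[nlp]
lang = "en"
pipeline = ["transformer","textcat"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"

[components.textcat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

# Pointing pretraining at the transformer component is where the
# FullTransformerBatch error shows up for me.
[pretraining]
component = "transformer"
layer = ""
corpus = "corpora.pretrain"

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4
```

The command is the usual `python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text raw.jsonl` (paths here are placeholders).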
-
The answer is that spaCy doesn't support pretraining/finetuning transformers right now, isn't it?

"The impact of spacy pretrain varies, but it will usually be worth trying if you’re not using a transformer model"

Darn. I suppose I'll re-pretrain my CNN tok2vec component and start a different thread (with an appropriate title) for the errors I had in that vein...

If anyone has thoughts on why this isn't supported, I'd be interested to hear them. My domain is highly specific and full of jargon, which I think makes it worth finetuning a language model even with a transformer, but I'm open to reasons I'm wrong about that.
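In case it's useful to anyone landing here with the same plan, this is roughly the [pretraining] block I'm falling back to for the CNN tok2vec. It's a sketch based on the defaults that `python -m spacy init fill-config base.cfg config.cfg --pretraining` fills in, with placeholder paths, so treat the values as starting points rather than anything tuned for my data:

```ini
[paths]
raw_text = "assets/raw_text.jsonl"

[pretraining]
max_epochs = 1000
dropout = 0.2
# Pretrain the CNN tok2vec component rather than a transformer.
component = "tok2vec"
layer = ""
corpus = "corpora.pretrain"

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
```

Then `python -m spacy pretrain config.cfg ./pretrain_output` runs the character-objective pretraining, and the resulting weights file from the output directory gets passed into training via `init_tok2vec` in the `[initialize]` block.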