Problem statement
Hi,
I have been using the Flair library for three years now and have trained several different model types on my custom datasets (RelationClassifier, RelationExtractor, SequenceTagger, TextClassifier, MultitaskModel). This year I performed layer-wise probing on all of them (following the early-exit research at https://arxiv.org/abs/2004.12993 and https://aclanthology.org/2021.eacl-main.8/) and realised that most of the models do not need all 12 encoder layers (or 28 for ModernBERT) to match the performance of the fully trained 12-layer model. As an optimisation, I have therefore been retraining all models with the embeddings' (TransformerDocumentEmbeddings or TransformerWordEmbeddings) "layers" parameter set to a value such as "-4" or "-6", so that the embeddings come from one of the lower layers instead of the last layer (the default "-1").
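To make the indexing concrete, here is a small hypothetical helper (not part of Flair) illustrating my understanding of the "layers" convention: the value indexes into the transformer's hidden_states, where index 0 is the embedding output and index i (1-based) is the output of encoder layer i, so hidden_states has num_layers + 1 entries.

```python
def resolved_layer(layers_value: str, num_encoder_layers: int) -> int:
    """Map a Flair-style "layers" value (e.g. "-4") to the 1-based
    encoder layer whose output is used. Index 0 would be the
    embedding output, before any encoder layer."""
    index = int(layers_value)
    if index < 0:
        # hidden_states has num_encoder_layers + 1 entries
        index += num_encoder_layers + 1
    return index

# On a 12-layer model, "-1" resolves to layer 12 (the last one),
# while "-4" resolves to layer 9, so layers 10-12 are never needed.
```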
Nevertheless, it seems that Flair's embedding objects still compute all transformer layers during inference, even when the "layers" parameter points to a lower layer. So my question is: is this intentional, and if so, why? I see no reason for layers that are never used (or even trained) to remain in the transformer and waste computation, so this looks like a bug to me.
To work around this, after training a model I had to cut the unused layers out of the checkpoint's state_dict and update the embeddings config accordingly.
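A minimal sketch of that pruning step, assuming HuggingFace-style state_dict keys of the form "encoder.layer.&lt;i&gt;.&lt;param&gt;" (exact key names vary by architecture, and the helper name is my own):

```python
def prune_state_dict(state_dict, num_encoder_layers, layer_index):
    """Drop weights of encoder layers above the one whose embeddings
    are used. E.g. layers="-4" on a 12-layer model means the output of
    encoder layer 9 (embeddings are hidden_states[0]), so the first
    num_encoder_layers + layer_index + 1 = 9 layers are kept."""
    keep = num_encoder_layers + layer_index + 1
    pruned = {}
    for key, tensor in state_dict.items():
        parts = key.split(".")
        if len(parts) > 2 and parts[0] == "encoder" and parts[1] == "layer":
            if int(parts[2]) >= keep:
                continue  # discard weights of unused upper layers
        pruned[key] = tensor
    return pruned
```

After pruning, the config (e.g. num_hidden_layers) has to be updated to match, otherwise loading the checkpoint fails with missing keys.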
Solution
If this is indeed a bug, I propose fixing it by removing the unused layers inside the embeddings object once the "layers" parameter has determined which encoder layer the embeddings are taken from. If the current behaviour is intentional, could you please explain why? Thanks!
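One possible shape for such a fix, sketched under the assumption of a BERT-like module layout where the encoder layers live in a ModuleList at model.encoder.layer (the attribute path differs per architecture, and this helper is hypothetical, not existing Flair API):

```python
def truncate_encoder(hf_model, keep_layers):
    """Drop encoder layers above keep_layers in place, so they are
    neither stored nor computed. Assumes a BERT-like layout; other
    architectures expose their layer stack under different names."""
    hf_model.encoder.layer = hf_model.encoder.layer[:keep_layers]
    # Keep the config consistent so save/load round-trips cleanly.
    hf_model.config.num_hidden_layers = keep_layers
    return hf_model
```

Slicing an nn.ModuleList returns a ModuleList, so the forward pass simply iterates over fewer layers afterwards; the trimmed layers also disappear from the state_dict automatically.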
Additional Context
No response