Transformer / Train on resources with different annotations #6377
Replies: 6 comments
-
With spaCy 3, it should be easier to source components from different models that you trained independently; cf. the docs at https://nightly.spacy.io/usage/processing-pipelines#sourced-components. Would that be an option for you? The only disadvantage to that approach is that you won't be able to share the same transformer across all your components. There's also another tricky part if you have [...]. Hope that works for you!
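As a sketch of what sourcing looks like in a spaCy 3 config (`sv_ner_model` is a hypothetical path to an independently trained pipeline, not something from this thread):

```ini
[components.ner]
source = "sv_ner_model"
```

Equivalently in Python: `nlp.add_pipe("ner", source=spacy.load("sv_ner_model"))`.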
-
Hi Sofie, and thanks for your answer! I had thought about this possibility, but discarded it because two transformer models would be necessary, since the transformer weights appear to be modified during training. Would it be possible to specify different training data for the different components of the pipeline while still training them jointly with one trf model? (This would be my preferred solution :) If that is not possible: [...]
-
Ok, fair enough ;-)
Hm, no, that won't be trivial. You typically train your [...]. I guess you could implement your own custom training loop though: create the pipeline with just one transformer, have all components listen to that one, and then call the [...].
Hm. In theory, you can prevent pipeline components from training/updating by putting them in the [...].
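The list being referred to is presumably `frozen_components` in the `[training]` block of the config (an assumption on my part, since the reference above was truncated). A minimal sketch, assuming the pipeline contains a component named `transformer`:

```ini
[training]
frozen_components = ["transformer"]
```

Frozen components are run for their predictions but are not updated during training.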
-
Ah, actually, you can also just ensure that none of your components update the transformer by setting [...].
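The truncated setting here is likely the listener's `grad_factor` (an assumption; the original code reference was lost). In spacy-transformers, setting `grad_factor = 0.0` on a component's `TransformerListener` stops that component's gradients from flowing back into the shared transformer. A sketch of what such a block could look like for an `ner` component:

```ini
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 0.0
```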
-
Thanks a lot! The first model is training right now. I also tried to set [...]. I will now train the two models with their respective components, and [...]. The transformer loss is [...]. I will report back with the results I get. Thanks for your help!
-
I have now done some experiments on this. The results are not extremely bad, but they still suffer significantly from freezing the transformer. Unfortunately it is worse for the dependency parser. There [...]. Additionally, learning the NER component on top of the frozen transformer does not work too well. I have now automatically annotated the UD treebank with NER labels. This actually works quite well and is already much better than the experiments with the frozen transformer. However, this doesn't seem to work: even though I have [...], when I initialize a new NER component with [...]. Is there some way to freeze the transformer weights while continuing training from the existing NER component weights (or to initialize the component with already existing weights)?
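A hedged sketch of a config that might achieve this (not verified against this exact setup; `sv_ner_model` is a hypothetical path to the pipeline that already contains the trained `ner` component): source the existing component into the new pipeline and freeze the transformer:

```ini
[components.ner]
source = "sv_ner_model"

[training]
frozen_components = ["transformer"]
```

Sourcing loads the component with its trained weights instead of initializing it fresh, while `frozen_components` keeps the transformer from being updated.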
-
Hi, I'm trying to train a Swedish transformer model using spacy-nightly (`3.0.0rc2`), using a configuration file. There are two resources I want to use:
A UD treebank that includes `parser`, `tagger` and `morphologizer` annotations for Swedish, and another corpus (SUC 3.0) that has a tag set incompatible with UD and only auto-generated `parser` annotations, but in turn features `ner` annotation. In short:

- UD treebank: `parser`, `tagger` and `morphologizer` annotations
- SUC 3.0: `ner` annotation

I would like to combine these two resources and train the full pipeline of `parser`, `tagger`, `morphologizer` and `ner` with a transformer model. I tried to just merge the `DocBin` instances containing the respective annotations, hoping the empty annotations would be ignored by the respective components. However, that did not seem to be the case: in the end, all predicted POS tags were empty; I assume the model learned the empty annotations from the SUC part.
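For reference, the naive merge described above can be sketched like this (with hypothetical in-memory stand-ins for the two corpora; in practice the `DocBin` instances would be loaded from `.spacy` files with `DocBin().from_disk(...)`):

```python
import spacy
from spacy.tokens import DocBin

# Blank Swedish pipeline, used only to create example Docs.
nlp = spacy.blank("sv")

# Stand-in for the UD corpus (would carry tagger/parser annotations).
ud = DocBin(store_user_data=True)
ud.add(nlp("Det här är en mening ."))

# Stand-in for the SUC corpus (would carry NER annotations).
suc = DocBin(store_user_data=True)
suc.add(nlp("En annan mening ."))

# Naively merging the two corpora into one training set:
merged = DocBin(store_user_data=True)
merged.merge(ud)
merged.merge(suc)
print(len(merged))  # → 2
```

The catch, as described above, is that docs from one corpus carry empty annotations for the other corpus's components, and those empty values are treated as training signal rather than ignored.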
How can I combine these different resources in one model/package?