update_model #9846

vistamou · 2021-12-10T11:31:25Z

Hello,

I'm trying to train a model for a new language, starting from an existing model for a similar language (Macedonian, mk_core_news_md) and then continuing training on gold data for language X.
My config looks like this:

[nlp]
lang = "mk"
pipeline = ["tok2vec","tagger","morphologizer"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.morphologizer]
source = "mk_core_news_md"
component = "morphologizer"


[components.morphologizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

But during training the morphologizer score doesn't change (MORPH_ACC stays at 32.91):

=========================== Initializing pipeline ===========================
[2021-12-10 13:24:18,377] [INFO] Set up nlp object from config
[2021-12-10 13:24:18,387] [INFO] Pipeline: ['tok2vec', 'tagger', 'morphologizer']
[2021-12-10 13:24:18,387] [INFO] Resuming training for: ['morphologizer']
[2021-12-10 13:24:18,396] [INFO] Created vocabulary
[2021-12-10 13:24:19,314] [INFO] Added vectors: mk_core_news_md
[2021-12-10 13:24:19,396] [INFO] Finished initializing nlp object
[2021-12-10 13:24:21,979] [INFO] Initialized pipeline components: ['tok2vec', 'tagger']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  TAG_ACC  POS_ACC  MORPH_ACC  SCORE 
---  ------  ------------  -----------  -------------  -------  -------  ---------  ------
  0       0          0.00        90.94          26.22    38.05    24.94      32.91    0.33
  0     200        210.55      7278.80         576.95    84.80    34.28      32.91    0.59
  1     400        334.91      3770.62         197.31    89.04    32.64      32.91    0.61

What could be wrong?
Thanks a lot!!

adrianeboyd · 2021-12-10T14:11:38Z

When sourcing components, leave out all the other settings, so just:

[components.morphologizer]
source = "mk_core_news_md"

[components.tok2vec]
source = "mk_core_news_md"

Since the mk_core_news_md tok2vec is shared with the parser and here it looks like you're trying to add a new tagger with a separate tok2vec, it might make more sense to replace the listener in the morphologizer instead, so the morphologizer can be trained independently of the other new components:

[components.morphologizer]
source = "mk_core_news_md"
replace_listeners = ["model.tok2vec"]

Our Macedonian training data doesn't contain morph values (just pos for the morphologizer), so in effect it's predicting an empty morph tag for every token, which is why the evaluation doesn't change. The model only contains POS in its labels. It depends on what you're trying to do, but I'm not sure trying to extend the morphologizer while also training a new tagger is the right thing to do for your task? If you have morph values, it would probably make more sense to train from scratch instead of extending this model.

0 replies

vistamou · 2021-12-12T16:06:05Z

Thanks for the answer!! I followed your suggestion, but I got this error:

KeyError: "[E944] Can't copy pipeline component 'tok2vec' from source 'mk_core_news_md': not found in pipeline. Available components: morphologizer, parser, senter, attribute_ruler, lemmatizer, ner"

Aha, I see. Even so, I can still train POS prediction based on that information. The idea is actually to get a baseline for the target language based on a similar language.

8 replies

vistamou Dec 13, 2021
Author

Actually, I'm not allowed to share any data, but I'll try to explain better:

I have a corpus for language X, and I want to build a baseline annotated version of it using a model for a similar language Y.
Then (or, even better, at the same time) I want to improve on this by initializing from the mk model and training on the gold-annotated corpus for X, so that the model updates/learns from the gold annotations.

adrianeboyd Dec 13, 2021

Sorry, I still don't really understand what you're trying to do with the morphologizer in particular. If your training data doesn't have the same type of annotation (POS + empty morph), then it probably doesn't make sense to try to fine-tune it instead of training from scratch.

vistamou Dec 13, 2021
Author

My training data does have POS and morph information (as I said, I have a gold set). The format looks like this:

# text = From the AP comes this story :

1 From from ADP IN _ 3 case 3:case _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _
4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _
6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _
7 : : PUNCT : _ 4 punct 4:punct _

Of course training from scratch will be better, but I want to test the case of starting from previous knowledge (the mk model) and updating that knowledge with gold data. So the question is, again: what am I doing wrong in the config file such that the model is not updated by the gold data I feed it?

adrianeboyd Dec 13, 2021

The morphologizer model is an extremely simple tagger underneath and it treats POS=NOUN as a completely independent/separate tag from Number=Sing|POS=NOUN. The morphologizer from mk_core_news_md is only initialized with POS= labels, so it doesn't work well to train it on data with more features than the original corpus.
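To make the label mismatch concrete, here is a minimal pure-Python sketch. It is not spaCy's actual implementation: it assumes, as a simplification, that each morphologizer label is one opaque string built from the sorted features plus a POS= entry. The helper name make_label and the POS-only label set are hypothetical; the (UPOS, FEATS) pairs are taken from the sample sentence in this thread.

```python
def make_label(pos, feats):
    """Build one opaque label string from a UPOS tag and a CoNLL-U FEATS value.

    Simplifying assumption: the features and the POS are joined into a single
    sorted string, and the tagger treats that whole string as one class.
    """
    parts = [] if feats in ("_", "") else feats.split("|")
    parts.append(f"POS={pos}")
    return "|".join(sorted(parts))


# Hypothetical label set of a model that was trained on POS-only data.
known_labels = {
    make_label(pos, "_")
    for pos in ("ADP", "DET", "NOUN", "PROPN", "PUNCT", "VERB")
}

# (UPOS, FEATS) pairs from the sample sentence in this thread.
gold = [
    ("ADP", "_"),
    ("DET", "Definite=Def|PronType=Art"),
    ("PROPN", "Number=Sing"),
    ("NOUN", "Number=Sing"),
    ("PUNCT", "_"),
]

gold_labels = [make_label(pos, feats) for pos, feats in gold]

# Labels the POS-only model has never seen as classes.
unseen = [label for label in gold_labels if label not in known_labels]
print(unseen)
```

Under this assumption, every gold token that carries features maps to a class the POS-only model does not have, even though its POS tag on its own is known.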

You would only want to extend an existing model with the exact same set of POS/feats it's already been trained with, so I think training from scratch is probably a better choice if you want the morph information. If you don't want the morph information, remove it from your training data before training and you should be able to update/extend just from the POS tags.
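If you go the POS-only route, removing the morph information could look roughly like the sketch below. It assumes the training data is standard tab-separated CoNLL-U with 10 columns, where FEATS is the 6th; the function name strip_feats and the shortened two-token sample (adapted from the sentence in this thread) are illustrative only.

```python
def strip_feats(conllu_text):
    """Return CoNLL-U text with the FEATS column (6th field) set to '_'.

    Comment lines (starting with '#') and blank lines are passed through
    unchanged. Lines that do not have exactly 10 tab-separated columns are
    left as they are.
    """
    out = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            cols[5] = "_"  # FEATS column: drop all morphological features
        out.append("\t".join(cols))
    return "\n".join(out) + "\n"


# Shortened, illustrative sample (heads adjusted for the two-token sentence).
sample = (
    "# text = this story\n"
    "1\tthis\tthis\tDET\tDT\tNumber=Sing|PronType=Dem\t2\tdet\t2:det\t_\n"
    "2\tstory\tstory\tNOUN\tNN\tNumber=Sing\t0\troot\t0:root\t_\n"
    "\n"
)

cleaned = strip_feats(sample)
print(cleaned)
```

The cleaned file keeps UPOS and all other columns intact, so it can go through the same conversion step as the original data (for example spacy convert) before training.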

vistamou Dec 14, 2021
Author

I see, ok! Thanks a lot !! :)
