Dependency Parser Training #9470

kanayer · 2021-10-15T02:11:09Z

Hi everyone, I've been working on Korean language support and I have a couple of questions about implementation and datasets. Would anyone be so kind as to answer the following questions?

What are the benefits of training the dependency parser model together with the POS tagger model? Is it better to have one model (e.g. Pointer Generator + CRF) for the lemmatization and POS tagging components and a separate one (e.g. statistical transition-based) for the dependency parser component? Or is it better to share a single transformer between these components?
Do dependency parsing datasets have to include POS tag info? And do both the POS tagger and the parser have to be trained on the same dataset?
The spaCy tokenizer for Korean is morpheme-based and my lemmatization and POS tagging dataset is also morpheme-based, but the dependency relations dataset is word-based. Would that create a problem?

Answered by polm

Oct 15, 2021

polm · 2021-10-15T03:52:09Z

What are the benefits of training the dependency parser model together with the POS tagger model? Is it better to have one model (e.g. Pointer Generator + CRF) for the lemmatization and POS tagging components and a separate one (e.g. statistical transition-based) for the dependency parser component? Or is it better to share a single transformer between these components?

Often the basic features that are relevant for POS prediction are also relevant for the dependency parse. For example, amod usually attaches an adjective to the noun it modifies. There's no guarantee that sharing is optimal, but we also don't have some other architectures (like pointer generators and CRFs) in spaCy.

Do dependency parsing datasets have to include POS tag info? And do both the POS tagger and the parser have to be trained on the same dataset?

I think we can train these on separate datasets, but I'm not sure I've ever seen a dependency parsing dataset without POS data. Do you have one like that?

The spaCy tokenizer for Korean is morpheme-based and my lemmatization and POS tagging dataset is also morpheme-based, but the dependency relations dataset is word-based. Would that create a problem?

I think you would need to convert the word-based annotations to morpheme-based ones. I know something similar happened with some Japanese datasets, which had dependency annotations at the bunsetsu level ("bunsetsu" is roughly word + particles/endings) that were converted to token level. It was possible to automate that because bunsetsu-internal structure was basically always unambiguous and predictable. Does that sound feasible for Korean?

16 replies

adrianeboyd Oct 18, 2021

And to clarify: I created the example above by hand. But from looking at the mecab output, I think it would be pretty straightforward to do it automatically. The hardest part would be choosing the mapping from mecab tags to UD POS/deprel, since there may not be perfect choices in the UD scheme.
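
To make the mapping problem concrete, here is a minimal sketch of a tag lookup. The table is a hypothetical, incomplete illustration and not a mapping anyone in this thread proposed: the tag names follow the Sejong-style tags that mecab-ko emits, and several of the target choices (particles to ADP in particular) are judgment calls. Verbal endings such as EP and EF are deliberately left out and fall back to X, since they are exactly the kind of tag with no obvious UD counterpart.

```python
# Hypothetical, incomplete mapping from mecab-ko (Sejong-style) tags to UD UPOS.
# Every right-hand side is an assumption made for illustration only.
MECAB_TO_UPOS = {
    "NNG": "NOUN",    # common noun
    "NNP": "PROPN",   # proper noun
    "NP": "PRON",     # pronoun
    "NR": "NUM",      # numeral
    "VV": "VERB",     # verb stem
    "VA": "ADJ",      # adjective stem
    "MM": "DET",      # determiner
    "MAG": "ADV",     # general adverb
    "IC": "INTJ",     # interjection
    "JKS": "ADP",     # subject particle (judgment call)
    "JKO": "ADP",     # object particle (judgment call)
    "JX": "ADP",      # auxiliary particle such as 는 (judgment call)
    "JC": "CCONJ",    # conjunctive particle
    "SF": "PUNCT",    # sentence-final punctuation
}


def to_upos(tag, default="X"):
    """Look up a UPOS tag, falling back to `default` for unmapped tags."""
    return MECAB_TO_UPOS.get(tag, default)
```

Counting how often the fallback fires on real mecab output would show which tags still need a decision.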

kanayer Oct 19, 2021
Author

Oh, I see! Thank you. I have noticed that the UD_Korean_GSD treebank has a dataset in the following format (image attached below): "eojeol" (word) based tokenization in the "form" column, morpheme-based analysis in the lemma and POS tag columns, and "eojeol"-based analysis in the deprel column. Do you think such a dataset can be used to train the transformer model? I'm asking because we could combine our two datasets into the same format. Is it possible for the morphological analyzer to learn that "잡스는" consists of 잡스/NNP + 는/JX (tags will be converted according to the UD scheme) while "잡스는" itself has nsubj as its deprel?

polm Oct 19, 2021

I imagine that you would use the lemma column to pull out the morphemes and replace the eojeol with the morphemes and their tags. That would be a static conversion you run before actually using the training data, and you would automatically create a dependency arc for the particle (like 는), using the left-side token as the head and case (or whatever is appropriate) as the label.
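
A rough sketch of that static conversion, under several assumptions that are not confirmed in this thread: the lemma and tag columns join morphemes with "+", every extra morpheme attaches to the first morpheme of its eojeol, particles (tags starting with J) get the label case, and everything else gets the placeholder label dep. The function name and row layout are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int       # 1-based position in the converted sentence
    form: str
    tag: str
    head: int     # id of the head token, 0 for the root
    deprel: str


def split_eojeol_rows(rows, sep="+"):
    """Expand eojeol-level rows into morpheme-level tokens.

    Each row is (form, lemma, xpos, head, deprel); rows are numbered from 1
    in the order given, and `head` refers to that numbering.
    """
    # First pass: split each eojeol and record where its first morpheme
    # lands in the new numbering.
    pieces = []
    first_id = {0: 0}  # old eojeol id -> new id of its first morpheme
    next_id = 1
    for old_id, (form, lemma, xpos, head, deprel) in enumerate(rows, start=1):
        morphs, tags = lemma.split(sep), xpos.split(sep)
        if len(morphs) != len(tags):
            # Columns disagree: keep the eojeol whole rather than guess.
            morphs, tags = [form], [xpos]
        pieces.append((morphs, tags, head, deprel))
        first_id[old_id] = next_id
        next_id += len(morphs)

    # Second pass: emit tokens with remapped heads.
    tokens = []
    for old_id, (morphs, tags, head, deprel) in enumerate(pieces, start=1):
        anchor = first_id[old_id]
        for offset, (morph, tag) in enumerate(zip(morphs, tags)):
            if offset == 0:
                # The first morpheme inherits the eojeol's head and label.
                tokens.append(Token(anchor, morph, tag, first_id[head], deprel))
            else:
                # Placeholder labels: "case" for particles, "dep" otherwise.
                label = "case" if tag.startswith("J") else "dep"
                tokens.append(Token(anchor + offset, morph, tag, anchor, label))
    return tokens
```

For a hand-made row such as ("잡스는", "잡스+는", "NNP+JX", 2, "nsubj"), this yields 잡스 carrying nsubj and 는 attached to 잡스 as case. Labels for verbal endings would need a real decision instead of the dep placeholder.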

MeCab, the morphological analyzer, uses a dictionary, so it can't learn lemmas. To the extent it has a model, it just learns weights for the dictionary terms. I'm not sure exactly how it handles eojeol, or if it's capable of handling unknown words (unks) correctly; you'd have to look at that. (In Japanese/the original distribution, it can't generate dictionary entries except for unks, and they don't include useful data in lemmas/POS. I'm not sure if eojeol entries are being dynamically constructed or if they're all being pre-registered in a static dictionary.)

kanayer Oct 19, 2021
Author

I see, thanks for your explanation. You once mentioned that it is important to have the same segmentation level for different tasks in the pipeline. I wonder if the static conversion you mentioned above can be avoided by training separate tagger and parser models. For example, the tagger would take the eojeol-based input 잡스는 and analyze it into 잡스/NNP + 는/JX, and the parser model would take the same eojeol-based input 잡스는 together with an eojeol-based POS tag (not for the separate morphemes but for the whole eojeol itself) and arcs. Would that create a clash between the POS tags in the pipeline?

adrianeboyd Oct 19, 2021

A Doc only supports one tokenization, so this would only be possible with two separate pipelines.

(I always feel like I have to add a disclaimer that it's not technically impossible to retokenize in the middle of the pipeline, but you probably don't want to do it. Two separate pipelines or finding a solution with a shared tokenization will be much easier.)
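
If two separate pipelines were used, their outputs would still need to be related to each other somehow. One possible way, sketched here in plain Python rather than with any spaCy API, is to align the two tokenizations by character offsets. This assumes both tokenizations consist of surface substrings of the same text; the helper names are hypothetical.

```python
def char_spans(text, tokens):
    """Return (start, end) character offsets for each token, scanning left to right."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if tok is missing
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def align(text, coarse, fine):
    """Map each coarse token index to the indices of the fine tokens it contains."""
    coarse_spans = char_spans(text, coarse)
    fine_spans = char_spans(text, fine)
    mapping = {i: [] for i in range(len(coarse))}
    for j, (f_start, f_end) in enumerate(fine_spans):
        for i, (c_start, c_end) in enumerate(coarse_spans):
            if c_start <= f_start and f_end <= c_end:
                mapping[i].append(j)
                break
    return mapping
```

With eojeol tokens as `coarse` and morpheme tokens as `fine`, the mapping tells you which morpheme-level tags belong to which eojeol-level arc. A shared tokenization avoids this bookkeeping entirely.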
