Training a trainable Lemmatizer with partial data #12282
-
I want to train a lemmatizer using the new edit-tree lemmatizer.
-
@jmyerston, we can try and work on this together, since it's your model I'm currently testing this on :)
Regarding your question: where did you see that CoNLL-U was the format? As far as I know, the format is the regular spaCy data format.
You could iterate over the tokens and replace `token.lemma_` with the lemma from your list whenever the token's text matches an entry in the list.
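A minimal sketch of that post-processing step, assuming a two-column `token,lemma` CSV as described in the quoted message below (the helper names here are hypothetical):

```python
import csv

def load_lookup(path):
    # Build a {token: lemma} mapping from a two-column CSV
    # (token in the first column, lemma in the second -- an assumption
    # based on the CSV layout described in this thread).
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f)}

def apply_gold_lemmas(doc, lookup):
    # Overwrite the silver lemma wherever the token text has a gold entry;
    # tokens without a match keep the lemma the model predicted.
    for token in doc:
        if token.text in lookup:
            token.lemma_ = lookup[token.text]
    return doc
```

`doc` here would be a spaCy `Doc` produced by your existing pipeline; the function only touches tokens that appear in the gold list.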
On Thu, Feb 16, 2023, 20:46 Jacobo Myerston wrote:
Hi, I have a similar question, but slightly different. I also want to improve a trainable lemmatizer for a low-resource language for which a long gold-standard list of tokens and lemmas already exists in CSV format, with one column for the token and the other for the lemma. I assumed that this list needs to be converted to CoNLL-U and then to the spaCy data format using the command line.
So, I have been creating CoNLL-U files with empty fields for anything that is not a token id, a token, or a lemma, but I often run into issues when the CoNLL-U file is not well-formed.
Is there any other file format that could be used to train the trainable lemmatizer? Would BILUO work?
-
Thanks, I'll try retraining with the updated models. I think we can pool our resources and assemble larger datasets.
BTW, in your recent update, did you also update the transformer to be compatible with spaCy 3.5 and the latest spacy-transformers?
On Fri, Mar 17, 2023, 00:48 Jacobo Myerston wrote:
Hello,
I have updated the Ancient Greek models with about 30,000 new lemmata. This was done by training the lemmatizer on both the PROIEL and Perseus corpora. The way to permanently improve lemmatization is to provide more training data, and the only datasets we have for now are the ones I just mentioned.
The main format used to train those models is CoNLL-U, which is later converted into the spaCy format. You can check the project files in greCy for more details.
The models are on the Hugging Face Hub.
Hey Premshay,
When training a pipeline with the goal of training only the edit-tree lemmatizer, we only need `lemma` annotations; no `pos` or `tag` information is required. The edit-tree lemmatizer can also learn from partially annotated data, which means that you can train on your partial gold data if you'd like to try.
You can most definitely merge your gold annotations with the silver data produced by your other model. You just need to create a `DocBin` as usual and add the `token.lemma_` information. If a token is assigned the empty lemma `""`, then it is skipped during training.
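A sketch of building such a `DocBin` with partial lemma annotations (the blank English pipeline, example words, and output path are assumptions for illustration):

```python
import spacy
from spacy.tokens import Doc, DocBin

# A blank pipeline just to get a Vocab; the language is a placeholder here.
nlp = spacy.blank("en")

words = ["The", "mice", "ran"]
doc = Doc(nlp.vocab, words=words)

# Set gold lemmas only where you have them; tokens left with the
# empty lemma "" are skipped by the edit-tree lemmatizer during training.
doc[1].lemma_ = "mouse"

# LEMMA is among DocBin's default serialized attributes.
db = DocBin(docs=[doc])
db.to_disk("train.spacy")  # hypothetical output path
```

Tokens whose lemma you don't know are simply left unset, so gold and silver annotations can live side by side in the same training corpus.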