Training a trainable Lemmatizer with partial data #12282
-
I want to train a lemmatizer using the new edit-tree lemmatizer.
-
@jmyerston, we can try and work on this together, since it's your model I'm currently testing this on :)
Regarding your question: where did you see that CoNLL-U was the format? As far as I know, the format is the regular spaCy data format.
You could iterate over the tokens and replace `token.lemma_` with the lemma from your list whenever the token's text matches an entry in the list.
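A minimal sketch of that post-processing step, assuming a two-column `token,lemma` CSV as described in the quoted message below (the helper names here are hypothetical):

```python
import csv

def load_lookup(path):
    # Build a {token: lemma} mapping from a two-column CSV
    # (token in the first column, lemma in the second -- an assumption
    # based on the CSV layout described in this thread).
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f)}

def apply_gold_lemmas(doc, lookup):
    # Overwrite the silver lemma wherever the token text has a gold entry;
    # tokens without a match keep the lemma the model predicted.
    for token in doc:
        if token.text in lookup:
            token.lemma_ = lookup[token.text]
    return doc
```

`doc` here would be a spaCy `Doc` produced by your existing pipeline; the function only touches tokens that appear in the gold list.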
On Thu, Feb 16, 2023, 20:46 Jacobo Myerston wrote:
Hi, I have a similar question, but slightly different. I also want to improve a trainable lemmatizer for a low-resource language for which a long gold-standard list of tokens and lemmas already exists in CSV format, with one column for the token and the other for the lemma. I assumed that this list needs to be converted to CoNLL-U and then to the spaCy data format using the command line.
So, I have been creating CoNLL-U files with empty fields for anything that is not a token id, a token, or a lemma, but I often run into issues when the CoNLL-U file is not well-formed.
Is there any other file format that could be used to train the trainable lemmatizer? Would BILUO work?
-
Thanks, I'll try retraining with the updated models. I think we can pool our resources and assemble larger datasets.
BTW, in your recent update, did you also update the transformer to be compatible with spaCy 3.5 and the latest spacy-transformers?
On Fri, Mar 17, 2023, 00:48 Jacobo Myerston wrote:
Hello,
I have updated the Ancient Greek models with about 30,000 new lemmata. This was done by training the lemmatizer on both the PROIEL and Perseus corpora. The way to permanently improve lemmatization is to provide more training data, and the only datasets we have for now are the ones I just mentioned.
The main format used to train those models is CoNLL-U, which is later converted into the spaCy format. You can check the project files in greCy for more details.
The models are on the Hugging Face Hub.
Hey Premshay,
When training a pipeline with the goal of training only the edit-tree lemmatizer, we only need `lemma` annotations; no `pos` or `tag` information is required. The edit-tree lemmatizer can also learn from partially annotated data, which means that you can train on your partial gold data if you'd like to try.
You can most definitely merge your gold annotations with the silver data produced by your other model. You just need to create a `DocBin` as usual and add the `token.lemma_` information. If a token is assigned the empty lemma `""`, then it is skipped during training.
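A sketch of building such a `DocBin` with partial lemma annotations (the blank English pipeline, example words, and output path are assumptions for illustration):

```python
import spacy
from spacy.tokens import Doc, DocBin

# A blank pipeline just to get a Vocab; the language is a placeholder here.
nlp = spacy.blank("en")

words = ["The", "mice", "ran"]
doc = Doc(nlp.vocab, words=words)

# Set gold lemmas only where you have them; tokens left with the
# empty lemma "" are skipped by the edit-tree lemmatizer during training.
doc[1].lemma_ = "mouse"

# LEMMA is among DocBin's default serialized attributes.
db = DocBin(docs=[doc])
db.to_disk("train.spacy")  # hypothetical output path
```

Tokens whose lemma you don't know are simply left unset, so gold and silver annotations can live side by side in the same training corpus.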