Training a model for multiple tasks on a dataset with missing values #12307
-
Hi everyone,

I have two datasets, dataset A and dataset B. One contains information on POS, dependency parsing, and lemmatization; the other one contains information on named entities. I have now merged the two into a third dataset, dataset C, and I am looking for a way to train a transformer model on dataset C. Since every example in this dataset will have missing values, I want to make sure that the components/heads for the individual tasks are not updated on examples with missing values. For example: training on an example that contains no information on named entities should not be interpreted as an example with no named-entity tags, but rather as an example that should be excluded from the update of the NER head. My hope and guess is that spaCy already behaves this way automatically, but I just want to make sure that this is what happens.

What I found out so far:
My questions are therefore the following:
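For reference, this is roughly how I am assembling dataset C so that annotations a dataset does not provide stay missing instead of being written out as negative labels (a minimal sketch assuming spaCy v3's `Doc` and `DocBin` APIs; the texts and the file name are just placeholders):

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

# Example from dataset A: POS, dependency and lemma annotation, no NER.
# The entity annotation is simply never set, so it stays "missing".
doc_a = Doc(
    nlp.vocab,
    words=["She", "sells", "seashells"],
    tags=["PRP", "VBZ", "NNS"],
    lemmas=["she", "sell", "seashell"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

# Example from dataset B: NER only; tags, lemmas and heads stay unset.
doc_b = nlp.make_doc("Apple expanded to Berlin")
doc_b.set_ents([doc_b.char_span(0, 5, label="ORG")], default="missing")

DocBin(docs=[doc_a, doc_b]).to_disk("dataset_c.spacy")
```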
Best,
-
Hi!

I understand that you want to train a `tagger`, `parser`, `ner` and `trainable_lemmatizer`, all using the same `transformer` model, correct?

As one alternative to merging the datasets and running the training simultaneously, you could also train the NER separately, then source its `transformer` and `ner`, freeze both and train the other components on top of it. Or vice versa: train the other components with a transformer, then source all of it, freeze everything and train an NER model. Even if this is not what you set out to do initially, this might actually obtain better performance than working with partially annotated datasets for all of the components.

If you do want to stick to the original setup you mentioned using the merged datasets, let me give you an overview of how to set partial annotations with spaCy. Some of these features are experimental, which is why we haven't documented them properly. I'd be interested to hear whether you'll get satisfying performance out of this, and how it would compare to the more basic approach I described in the first paragraph.
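Roughly, the idea looks like this (a minimal sketch: with `Example.from_dict`, `None` values and absent keys mark annotation as missing rather than negative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Apple expanded to Berlin")

# Gold tags: only "Apple" is annotated; None marks the other tokens'
# tags as missing, so the tagger leaves them out of its loss.
# Leaving out the "entities" key entirely keeps the NER annotation
# missing for every token – different from "entities": [], which would
# mean "annotated, and contains no entities".
example = Example.from_dict(doc, {"tags": ["NNP", None, None, None]})

print(example.reference[1].tag_)  # "" – the gold tag for "expanded" is missing
```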
Hope this helps you get started!
-
Actually, there's a third alternative to approaching this: train both pipelines separately on the different datasets, and then combine the components from the different models by sourcing them into one pipeline. I've explained something similar to this in this post: #12302 (comment)
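A minimal sketch of what that could look like (the model paths here are hypothetical, and `replace_listeners` is one way to make the sourced `ner` self-contained, since it would otherwise still listen to the `transformer` of its original pipeline):

```python
import spacy

# Hypothetical paths to the two separately trained pipelines.
nlp = spacy.load("model_tagger_parser_lemmatizer")  # transformer + tagger, parser, ...
nlp_ner = spacy.load("model_ner")                   # transformer + ner

# Embed a copy of the transformer directly into the ner component, so it
# no longer depends on a listener into nlp_ner's shared transformer.
nlp_ner.replace_listeners("transformer", "ner", ["model.tok2vec"])

# Source the now self-contained ner into the main pipeline and save it.
nlp.add_pipe("ner", source=nlp_ner)
nlp.to_disk("combined_model")
```

The combined pipeline then effectively runs two transformers (the shared one plus the copy embedded under `ner`), so you trade some speed and model size for being able to train the two sets of components completely separately.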