Training a model for multiple tasks on a dataset with missing values #12307
-
Hi everyone,

I have two datasets, dataset A and dataset B. One contains information on POS, dependency parsing, and lemmatization; the other one contains information on named entities. I have now merged the two into a third dataset, dataset C, and I am looking for a way to train a transformer model on dataset C. Since every example in this dataset will have missing values, I want to make sure that the components/heads for the individual tasks are not updated on examples with missing values. For example: training on an example that contains no information on named entities should not be interpreted as an example with no named-entity tags, but rather as an example that should be excluded from the update of the NER head. My hope and guess is that spaCy already behaves this way automatically, but I just want to make sure that this is what happens.

What I found out so far:
My questions are therefore the following:
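For reference, this is roughly how I am assembling dataset C so that annotations a dataset does not provide stay missing instead of being written out as negative labels (a minimal sketch assuming spaCy v3's `Doc` and `DocBin` APIs; the texts and the file name are just placeholders):

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

# Example from dataset A: POS, dependency and lemma annotation, no NER.
# The entity annotation is simply never set, so it stays "missing".
doc_a = Doc(
    nlp.vocab,
    words=["She", "sells", "seashells"],
    tags=["PRP", "VBZ", "NNS"],
    lemmas=["she", "sell", "seashell"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

# Example from dataset B: NER only; tags, lemmas and heads stay unset.
doc_b = nlp.make_doc("Apple expanded to Berlin")
doc_b.set_ents([doc_b.char_span(0, 5, label="ORG")], default="missing")

DocBin(docs=[doc_a, doc_b]).to_disk("dataset_c.spacy")
```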
Best,
-
Hi!

I understand that you want to train a `tagger`, `parser`, `ner` and `trainable_lemmatizer`, all using the same `transformer` model, correct?

As one alternative to merging the datasets and running the training simultaneously, you could also train the NER separately, then source its `transformer` and `ner`, freeze both and train the other components on top of it. Or vice versa: train the other components with a transformer, then source all of it, freeze everything and train an NER model. Even if this is not what you set out to do initially, this might actually obtain better performance than working with partially annotated datasets for all of the components.

If you do want to stick to the original setup you mentioned using the merged datasets, let me give you an overview of how to set partial annotations with spaCy. Some of these features are experimental, which is why we haven't documented them properly. I'd be interested to hear whether you'll get satisfying performance out of this, and how it would compare to the more basic approach I described in the first paragraph.
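Roughly, the idea looks like this (a minimal sketch: with `Example.from_dict`, `None` values and absent keys mark annotation as missing rather than negative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("Apple expanded to Berlin")

# Gold tags: only "Apple" is annotated; None marks the other tokens'
# tags as missing, so the tagger leaves them out of its loss.
# Leaving out the "entities" key entirely keeps the NER annotation
# missing for every token – different from "entities": [], which would
# mean "annotated, and contains no entities".
example = Example.from_dict(doc, {"tags": ["NNP", None, None, None]})

print(example.reference[1].tag_)  # "" – the gold tag for "expanded" is missing
```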
Hope this helps you get started!
-
Actually, there's a third alternative to approaching this: train both pipelines separately on the different datasets, and then combine the components from the different models by sourcing them into one pipeline. I've explained something similar to this in this post: #12302 (comment)
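A minimal sketch of what that could look like (the model paths here are hypothetical, and `replace_listeners` is one way to make the sourced `ner` self-contained, since it would otherwise still listen to the `transformer` of its original pipeline):

```python
import spacy

# Hypothetical paths to the two separately trained pipelines.
nlp = spacy.load("model_tagger_parser_lemmatizer")  # transformer + tagger, parser, ...
nlp_ner = spacy.load("model_ner")                   # transformer + ner

# Embed a copy of the transformer directly into the ner component, so it
# no longer depends on a listener into nlp_ner's shared transformer.
nlp_ner.replace_listeners("transformer", "ner", ["model.tok2vec"])

# Source the now self-contained ner into the main pipeline and save it.
nlp.add_pipe("ner", source=nlp_ner)
nlp.to_disk("combined_model")
```

The combined pipeline then effectively runs two transformers (the shared one plus the copy embedded under `ner`), so you trade some speed and model size for being able to train the two sets of components completely separately.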