Coreference with incompletely annotated data? #12533

KennethEnevoldsen · 2023-04-17T03:07:02Z


I want to perform coreference resolution with an incomplete dataset where only some of the documents are annotated.

I know that for some components (e.g. NER), it is possible to train using an incompletely annotated dataset. Is that also the case for the coreference component?

If it is, how does the model differentiate between "not annotated" and "annotated but no clusters"?

Answered by adrianeboyd

adrianeboyd · 2023-04-17T09:33:04Z


This isn't true for coref yet (we should take a look at this), but in general all components should be able to support training from misaligned tokenization. The get_loss methods should basically ignore instances where the tokens can't be aligned. (But be aware that the alignment code is full of special cases that depend on the attribute, in order to keep as much annotation as possible. For example, only the token start char matters for SENT_START, and it will align AB/TAG1 to A/TAG2 B/TAG2 as long as the tags are the same for the whole sequence.)
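To make the alignment idea concrete, here is a minimal sketch in plain Python. It is not spaCy's alignment code and skips all of the attribute-specific special cases mentioned above. It assumes both tokenizations cover the same text with no whitespace, and the helper names `char_spans` and `project_tags` are hypothetical. A predicted token gets a gold tag only when every gold token it overlaps carries the same tag; otherwise it is marked as missing (`None`) so that a loss function can skip it.

```python
from typing import List, Optional, Tuple


def char_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character offsets per token, assuming tokens concatenate to the text."""
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans


def project_tags(
    gold_tokens: List[str], gold_tags: List[str], pred_tokens: List[str]
) -> List[Optional[str]]:
    """Project gold tags onto a different tokenization (hypothetical helper).

    Returns None for predicted tokens whose overlapping gold tokens disagree.
    """
    if "".join(gold_tokens) != "".join(pred_tokens):
        raise ValueError("both tokenizations must cover the same text")
    gold_spans = char_spans(gold_tokens)
    projected: List[Optional[str]] = []
    for p_start, p_end in char_spans(pred_tokens):
        # Collect the tags of all gold tokens that overlap this predicted token.
        tags = {
            gold_tags[i]
            for i, (g_start, g_end) in enumerate(gold_spans)
            if g_start < p_end and p_start < g_end
        }
        # Keep the tag only if it is unambiguous; otherwise treat it as missing.
        projected.append(tags.pop() if len(tags) == 1 else None)
    return projected
```

With this sketch, a gold token `AB` tagged `X` projects onto predicted tokens `A` and `B` as `X`, `X`, while gold tokens `A`/`X` and `B`/`Y` projected onto a single predicted token `AB` yield a missing value.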

Not all components can train from partial annotation, though, mainly because there's not always a way to mark partial annotation. For example, when spancat trains from annotation in doc.spans, there's no way to mark unannotated sections of a text. I think this will be similar for coref, at least initially.
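As a rough illustration of why a marker for missing annotation matters, the sketch below uses plain Python rather than any spaCy API. The names `masked_nll` and `CorefDoc` and the convention "`None` means unannotated, an empty list means annotated with no clusters" are assumptions made for this example only, not something the coref or spancat components support. Token-level labels can carry a per-token missing value that the loss skips, whereas a bare list of spans or clusters needs an extra signal to tell "nothing here" apart from "not annotated".

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


def masked_nll(
    probs: Sequence[Dict[str, float]], gold: Sequence[Optional[str]]
) -> float:
    """Mean negative log-likelihood over annotated tokens only.

    Tokens whose gold label is None (missing) contribute nothing.
    """
    total, count = 0.0, 0
    for dist, label in zip(probs, gold):
        if label is None:
            continue  # unannotated token: skip instead of penalizing
        total += -math.log(dist[label])
        count += 1
    return total / count if count else 0.0


@dataclass
class CorefDoc:
    """Hypothetical container: clusters are lists of (start, end) mentions."""

    text: str
    # None = document was never annotated; [] = annotated, no clusters found.
    clusters: Optional[List[List[Tuple[int, int]]]] = None


def docs_with_supervision(docs: Sequence[CorefDoc]) -> List[CorefDoc]:
    """Keep only documents that should contribute to a coref loss."""
    return [doc for doc in docs if doc.clusters is not None]
```

Under this convention, a document with `clusters=[]` still provides a training signal (the model should predict no clusters), while a document with `clusters=None` is dropped from the loss entirely.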

Also be aware that we've tried to train transformer components with partial annotation as in en_core_web_trf and it just doesn't work well and we're not sure what's going on. See the discussion (well, it's just me) starting here: #7493 (comment)


KennethEnevoldsen Apr 17, 2023
Author

Thanks, this is exactly the info I was looking for! I will start out by training the coref component using frozen weights.

I will also try some experiments with partial annotation (for NER+POS without dep) and will add to the thread if I find something promising.
