Coreference with incompletely annotated data? #12533
I want to perform coreference resolution with an incomplete dataset where only some of the documents are annotated. I know that some components (e.g. NER) can be trained on an incompletely annotated dataset. Is that also the case for the coreference component? If so, how does the model differentiate between "not annotated" and "annotated but no clusters"?
Replies: 1 comment 1 reply
So this isn't true for coref (yet, we should take a look at this), but in general all components should be able to support training from misaligned tokenization. The `get_loss` methods should basically ignore instances where the tokens can't be aligned. (But be aware that the alignment code is full of special cases depending on the attribute to try to keep as much annotation as possible, like only the token start char matters for `SENT_START`, and it will align `AB/TAG1` to `A/TAG2 B/TAG2` as long as the tags are the same for the whole sequence.)

Not all components can train from partial annotation, though, mainly because there's not always a way to mark partial annotation, like for `spancat` traini…
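
To make the idea of a loss that ignores unusable instances concrete, here is a minimal sketch in plain Python. It is not spaCy's `get_loss` and not how the coref component works; the function name `masked_squared_error` and the convention that `None` means "no annotation" are assumptions made up for illustration. The point is only that a missing gold value contributes no loss and no gradient, while an explicit `0.0` is a real negative annotation.

```python
from typing import List, Optional, Tuple


def masked_squared_error(
    scores: List[float], gold: List[Optional[float]]
) -> Tuple[float, List[float]]:
    """Hypothetical helper: squared-error loss that skips unannotated positions.

    A gold value of None means "no usable annotation" (for example, the token
    could not be aligned). A gold value of 0.0 means "annotated as negative".
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    loss = 0.0
    grad = [0.0] * len(scores)
    for i, (score, target) in enumerate(zip(scores, gold)):
        if target is None:
            # Missing annotation: leave the gradient at zero and add no loss.
            continue
        diff = score - target
        loss += diff ** 2
        grad[i] = 2 * diff
    return loss, grad


# Position 1 is unannotated; position 2 is annotated as negative.
loss, grad = masked_squared_error([0.9, 0.2, 0.6], [1.0, None, 0.0])
print(loss, grad)
```

In this sketch the model is never penalized for its score at position 1, whereas the score at position 2 is pushed toward zero. A component can only make that distinction if its annotation format has some way to say "missing", which is the limitation described above.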