Does coreference work when the tokenisation isn't correct? #12532
I am currently trying to train a Danish coreference model. It took some time to get the dataset into the correct format (at least I think it is correct now). Feel free to check it out here. When I run the following training command:
I get the following error:
Running it in debug mode, I can confirm that the error is indeed correct: some of the gold token boundaries don't match what the tokenizer produces.
I can of course go in and fix these token errors in the dataset by hand, but that seems problematic, especially if you want to use the annotations on noisy texts. Alternatively, one could train assuming the "gold" tokens, but that would overestimate the model's performance (and the model would not learn to deal with the tokenizer's errors). I am unsure what the best solution is here and would love to hear a second opinion.
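For reference, here is a minimal sketch of one way to locate the offending spans: it re-tokenizes each gold text with a blank Danish tokenizer and flags any annotated span whose character offsets don't fall on token boundaries. The corpus path and the assumption that the clusters are stored as span groups in `doc.spans` (as in `spacy-experimental`'s coref component) are placeholders, not taken from my actual setup.

```python
# Sketch: find gold spans that don't align with spaCy's Danish tokenization.
# Assumptions: gold data is a DocBin at "corpus/train.spacy" (hypothetical path)
# and coref clusters are stored as span groups in doc.spans.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("da")  # tokenizer only; swap in your own pipeline if it tokenizes differently

doc_bin = DocBin().from_disk("corpus/train.spacy")
for gold in doc_bin.get_docs(nlp.vocab):
    pred = nlp.make_doc(gold.text)  # re-tokenize the raw text
    for key, group in gold.spans.items():
        for span in group:
            # With alignment_mode="strict", char_span returns None whenever the gold
            # character offsets do not coincide with token boundaries in the
            # predicted tokenization.
            if pred.char_span(span.start_char, span.end_char, alignment_mode="strict") is None:
                print(f"Misaligned span in {key!r}: {span.text!r} ({span.start_char}-{span.end_char})")
```

From there one could decide per span whether to fix the annotation, snap it to the nearest token boundaries with `alignment_mode="expand"`, or drop it.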
Replies: 1 comment 1 reply
I think this has been answered in #12533 (comment), right?