Does coreference work when the tokenisation isn't correct? #12532
I am currently trying to train a Danish coreference model. It took some time to get the dataset into the correct format (at least I think it is correct now). Feel free to check it out here. When I run the following training command:
I get the following error:
Running it in debug mode, I can confirm that the error is indeed correct: some of the gold token boundaries don't match what the tokenizer produces.
I can of course go in and fix these token errors in the dataset by hand, but that seems problematic, especially if you want to use the annotations on noisy texts. Alternatively, one could train assuming the "gold" tokens, but that would overestimate the model's performance (and the model would not learn to deal with the tokenizer's errors). I am unsure what the best solution is here and would love to hear a second opinion.
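For reference, here is a minimal sketch of one way to locate the offending spans: it re-tokenizes each gold text with a blank Danish tokenizer and flags any annotated span whose character offsets don't fall on token boundaries. The corpus path and the assumption that the clusters are stored as span groups in `doc.spans` (as in `spacy-experimental`'s coref component) are placeholders, not taken from my actual setup.

```python
# Sketch: find gold spans that don't align with spaCy's Danish tokenization.
# Assumptions: gold data is a DocBin at "corpus/train.spacy" (hypothetical path)
# and coref clusters are stored as span groups in doc.spans.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("da")  # tokenizer only; swap in your own pipeline if it tokenizes differently

doc_bin = DocBin().from_disk("corpus/train.spacy")
for gold in doc_bin.get_docs(nlp.vocab):
    pred = nlp.make_doc(gold.text)  # re-tokenize the raw text
    for key, group in gold.spans.items():
        for span in group:
            # With alignment_mode="strict", char_span returns None whenever the gold
            # character offsets do not coincide with token boundaries in the
            # predicted tokenization.
            if pred.char_span(span.start_char, span.end_char, alignment_mode="strict") is None:
                print(f"Misaligned span in {key!r}: {span.text!r} ({span.start_char}-{span.end_char})")
```

From there one could decide per span whether to fix the annotation, snap it to the nearest token boundaries with `alignment_mode="expand"`, or drop it.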
Replies: 1 comment 1 reply
I think this has been answered in #12533 (comment), right?