Sudden drop in the accuracy of the parser #10517
Could you please help me understand the behavior of the parser? I trained a spaCy transformer model with the experimental lemmatizer and tokenizer on my own custom Korean dataset. Initially, the dataset had end-of-sentence punctuation marks attached to the words (e.g. hello.), and only 35 sentences had the end-of-sentence punctuation mark correctly placed on the next line. The example sentence is attached below. The accuracy of the transformer model on this dataset was the following:
It was a very good result; however, after I removed the end-of-sentence punctuation marks (except for the 35 sentences that were already correctly marked), the accuracy of the parser (UAS and LAS) dropped by
Both models were trained on datasets with the same number of sentences, the same deprel tag distribution, and the same UPOS tag distribution. The only difference between the two datasets was the removal of the end-of-sentence punctuation marks. The dataset statistics are provided below. If possible, could you please help me understand what could cause such a drastic drop in the accuracy scores? I have also tried moving the end-of-sentence punctuation marks to the next line for every sentence, which resulted in
Dataset statistics:
Deprel tags distribution (token-level):
Replies: 1 comment 11 replies
The parser is also learning where to split sentences, and I think what's going on is that if you remove the `.` characters, you're removing a really strong clue about where to put sentence boundaries, so you end up with a lot of longer or shorter parses and more errors.

Instead of removing `.`, I'd recommend splitting `.` into a separate token and attaching it with `punct` to the previous word. `punct` and `p` relations are ignored by the scorer by default, but you can configure that with a custom scorer if you like.
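
For illustration, here is a rough sketch of the splitting step on CoNLL-U-style rows. It is not from the thread: the helper name `split_final_punct`, the set of sentence-final marks, and the toy English sentence are all assumptions, and it expects plain integer token IDs (no multiword-token ranges) with one sentence per call.

```python
# Hypothetical helper: split sentence-final punctuation off the last token
# of one sentence and attach it to that token with the `punct` relation.
# Rows follow the 10 CoNLL-U columns:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SENT_FINAL = ".?!"  # assumed set of sentence-final marks


def split_final_punct(rows):
    rows = [list(r) for r in rows]  # work on a copy
    if not rows:
        return rows
    last = rows[-1]
    form = last[1]
    stem = form.rstrip(SENT_FINAL)
    # Nothing to do if there is no trailing mark, or if the token is
    # already punctuation only (for example a correctly split ".").
    if not stem or stem == form:
        return rows
    mark = form[len(stem):]
    last[1] = stem
    lemma = last[2]
    if lemma.endswith(mark) and len(lemma) > len(mark):
        last[2] = lemma[: -len(mark)]
    # Record that no space separated the word and the punctuation.
    if "SpaceAfter=No" not in last[9]:
        last[9] = "SpaceAfter=No" if last[9] == "_" else last[9] + "|SpaceAfter=No"
    # The new token goes at the end, so no other IDs or heads change.
    punct = [str(int(last[0]) + 1), mark, mark, "PUNCT", "_", "_",
             last[0], "punct", "_", "_"]
    rows.append(punct)
    return rows


# Toy sentence (made up for this example, not from the dataset).
toy = [
    ["1", "I", "I", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "said", "say", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "hello.", "hello.", "INTJ", "_", "_", "2", "obj", "_", "_"],
]
for row in split_final_punct(toy):
    print("\t".join(row))
```

Because the punctuation token is appended at the end of the sentence, the existing heads stay valid, and sentences that already have the mark as its own token pass through unchanged.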