Sudden drop in the accuracy of the parser #10517

kanayer · 2022-03-18T06:42:22Z

Could you please help me understand the behavior of the parser?

I trained a spaCy transformer model with the experimental lemmatizer and tokenizer on my own custom Korean dataset. Initially, the dataset had sentence-final punctuation marks attached to the preceding word (e.g. hello.), and only 35 sentences had the sentence-final punctuation mark correctly on its own line, as a separate token. The example sentence is attached below.

The accuracy of the transformer model for this dataset was the following:

POS 97.85%
UAS 94.81%
LAS 92.37%
Lemmas 95.06%

This was a very good result. However, after I removed the sentence-final punctuation marks (except in the 35 sentences that were already marked correctly), the parser accuracy dropped: UAS fell by about 7.6 percentage points and LAS by about 9.8. The example sentence and the accuracy are shown below.

POS 95.50%
UAS 87.22%
LAS 82.55%
Lemmas 95.56%

Both models were trained on datasets with the same number of sentences, the same deprel tag distribution, and the same UPOS tag distribution. The only difference between the two datasets was the removal of the sentence-final punctuation marks. The dataset statistics are provided below. If possible, could you please help me understand what could cause such a drastic drop in the accuracy scores? I have also tried moving the sentence-final punctuation mark onto its own line for every sentence, which resulted in punct and root making up almost 30% of the deprel tag distribution. With this change, the parser accuracy dropped to 50%. Can sentence-final punctuation marks affect the accuracy that much?

Dataset statistics:

Size: 50643 sentences
Train: 40514 sentences
Dev: 5064 sentences
Test: 5065 sentences

Deprel tag distribution (token-level):

root 50643
advmod 40476
nsubj 35488
obj 24984
xcomp 24474
obl 23432
acl 21557
discourse 15489
compound 14232
det 13803
ccomp 10461
amod 8176
cc 6139
nmod 5658
dep 5591
conj 4716
nummod 2745
aux 2345
nmod:poss 1717
advcl 1668
csubj 542
vocative 36
punct 35
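The "almost 30%" figure mentioned above can be checked with a little arithmetic on these counts. This is only a sketch: it assumes each sentence that lacks a separate punctuation token gains exactly one punct token and that nothing else in the annotation changes.

```python
# Token-level deprel counts copied from the list above.
counts = {
    "root": 50643, "advmod": 40476, "nsubj": 35488, "obj": 24984,
    "xcomp": 24474, "obl": 23432, "acl": 21557, "discourse": 15489,
    "compound": 14232, "det": 13803, "ccomp": 10461, "amod": 8176,
    "cc": 6139, "nmod": 5658, "dep": 5591, "conj": 4716,
    "nummod": 2745, "aux": 2345, "nmod:poss": 1717, "advcl": 1668,
    "csubj": 542, "vocative": 36, "punct": 35,
}

total = sum(counts.values())
n_sents = counts["root"]  # one root per sentence

# Assumption: the 35 existing punct tokens sit in 35 different sentences,
# and every other sentence gains exactly one new punct token.
added = n_sents - counts["punct"]
new_total = total + added
share = (counts["root"] + counts["punct"] + added) / new_total

print(f"tokens before: {total}, after: {new_total}")
print(f"root + punct share after the change: {share:.1%}")
```

Under these assumptions root and punct together come to a bit under 28% of all tokens, which is consistent with the figure reported in the question.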

Answered by adrianeboyd

adrianeboyd · 2022-03-18T07:43:59Z

The parser is also learning where to split sentences, and I think what's going on is that if you remove the . characters, you're removing a really strong clue about where to put sentence boundaries, so you end up with a lot of longer or shorter parses and more errors.

Instead of removing ., I'd recommend splitting . into a separate token and attaching it with punct to the previous word.
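The recommendation above can be sketched as a small preprocessing step over the training data. This is a minimal illustration, not the author's script: the `Token` tuple layout and the `split_final_period` helper are hypothetical, the example annotation is made up, and only the ID, form, head and deprel columns are modelled (a real CoNLL-U file would also need its lemma and POS columns filled in for the new token).

```python
from typing import List, Tuple

# (id, form, head, deprel); ids are 1-based and head 0 marks the root.
Token = Tuple[int, str, int, str]

def split_final_period(sent: List[Token]) -> List[Token]:
    """Detach a sentence-final '.' and attach it as punct to the previous word."""
    if not sent:
        return sent
    tok_id, form, head, deprel = sent[-1]
    if len(form) < 2 or not form.endswith("."):
        # No attached period, or the period is already its own token.
        return sent
    word = (tok_id, form[:-1], head, deprel)
    # The new token goes last, so no existing ids or heads need to shift.
    period = (tok_id + 1, ".", tok_id, "punct")
    return sent[:-1] + [word, period]

# Hypothetical two-word sentence with the period glued to the last word.
sentence = [(1, "내가", 2, "nsubj"), (2, "확인하다.", 0, "root")]
for row in split_final_period(sentence):
    print(row)
```

Because the period is always appended as the last token, the heads of the other tokens stay valid. The sketch only handles a single trailing "."; other sentence-final marks would need the same treatment.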

punct and p relations are ignored by the scorer by default, but you can configure that with a custom scorer if you like.
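To make the effect of ignoring punct concrete, here is a simplified, pure-Python sketch of UAS/LAS with an ignore list. It is not spaCy's implementation: `attachment_scores` and the toy arcs are hypothetical, and tokens are filtered by their gold label only. In spaCy itself, as far as I know, the corresponding knob is the `ignore_labels` argument of `Scorer.score_deps`.

```python
from typing import Iterable, List, Tuple

Arc = Tuple[int, str]  # (head id, deprel) for one token; head 0 marks the root

def attachment_scores(
    gold: List[Arc],
    pred: List[Arc],
    ignore_labels: Iterable[str] = ("p", "punct"),
) -> Tuple[float, float]:
    """Return (UAS, LAS), skipping tokens whose gold label is ignored."""
    ignore = {label.lower() for label in ignore_labels}
    scored = [(g, p) for g, p in zip(gold, pred) if g[1].lower() not in ignore]
    if not scored:
        return 0.0, 0.0
    # UAS: head is correct. LAS: head and label are both correct.
    uas = sum(g[0] == p[0] for g, p in scored) / len(scored)
    las = sum(g[0] == p[0] and g[1].lower() == p[1].lower() for g, p in scored) / len(scored)
    return uas, las

# Toy sentence: the parser gets everything right except the period's head.
gold = [(2, "nsubj"), (0, "root"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (1, "punct")]

print(attachment_scores(gold, pred))                    # punct ignored
print(attachment_scores(gold, pred, ignore_labels=()))  # punct counted
```

With punct ignored, the misattached period does not count against the parser; with an empty ignore list it does. This is why adding a punct token to every sentence should not by itself lower the default UAS/LAS.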

11 replies

kanayer Mar 21, 2022
Author

Thank you for your instructions. Could you please explain in more detail what you mean by language defaults? I tried the code you kindly provided after installing all the needed dependencies (mecab + natto-py); however, I get an error message saying:

AttributeError: 'KoreanTokenizer' object has no attribute 'explain'

For background info on my model: I used the command python -m spacy project clone benchmarks/ud_benchmark and kept the existing configuration in both the config files and the project.yml file. The transformer model is xlm-roberta-base. The model trained fine on the Kaist dataset (parser UAS 89% and LAS 87%); however, on my dataset the accuracy is significantly lower (UAS 60% and LAS 57%). And if I put the . back next to the word, the accuracy is above 90%. If the models trained on Kaist and on my dataset had exactly the same settings, what could cause such a drastic difference in accuracy?

I have also experimented with bert-base-multilingual-uncased on both Kaist and my dataset; the parser accuracy dropped to 30% and 50% respectively.

adrianeboyd Mar 21, 2022

Okay, let's back up a step and check the tokenizer settings under [nlp.tokenizer] in your configs. The options for Korean are a bit confusing, and since we don't want to break existing workflows, the default is still the mecab tokenizer, which probably isn't what you want.

If you're using spacy.Tokenizer.v1 as your tokenizer, the tokenizer settings that are used by default come from nlp.Defaults. Currently, with spaCy v3.2, the settings for lang = "ko" are the same as for lang = "xx" (whitespace + basic punctuation splitting). (Some minor improvements in the defaults that are customized for UD Korean Kaist will be coming in v3.3.)

If you're using spacy.ko.KoreanTokenizer, this is the mecab-ko tokenizer that doesn't use the defaults at all.

To get spacy.Tokenizer.v1 with the example above, you can use:

nlp = spacy.blank("ko", config={"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}})

(Sorry for the confusion, I just copied an existing English example.)

The ud_benchmark project uses yet another completely separate trainable tokenizer (spacy-experimental.char_pretokenizer.v1 as the tokenizer and then experimental_char_ner_tokenizer as a trainable retokenizing component). You can also train it for your data if you want, but it's kind of overkill for UD Korean Kaist-ish tokenization, which is relatively easy to handle with regexes, see #10322.

If you run spacy init config -l ko, you get the mecab tokenizer by default, which again probably isn't what you want. Currently the best way to modify the tokenizer is to just edit the config by hand.

kanayer Mar 21, 2022
Author

Thank you a lot for such a detailed answer!

I ran the code below on the Korean word 확인하다.

import spacy
nlp = spacy.blank("ko", config={"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}})
suffixes = nlp.Defaults.suffixes + [r"."]
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
print(nlp.tokenizer.explain("확인하다."))

As you can see below, the tokenizer identified the punctuation as a separate token, which is what I am aiming for.

[('SUFFIX', '확'), ('SUFFIX', '인'), ('SUFFIX', '하'), ('SUFFIX', '다'), ('SUFFIX', '.')]

Currently, I am using experimental_char_ner_tokenizer, the way it was given in the project template. ud_benchmark has 3 config files: assemble.cfg, tokenizer.cfg and transformer.cfg. In order for me to switch to using spacy.Tokenizer.v1, is it better to edit the tokenizer.cfg and transformer.cfg or to delete tokenizer.cfg and only edit transformer.cfg?

adrianeboyd Mar 21, 2022

But now every character is a suffix, because the unescaped r"." is a regex that matches any character, so this isn't working the way you intended. Don't you want 확인하다 . (two tokens)?
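The difference between the unescaped and the escaped pattern can be reproduced with the standard `re` module alone. The sketch below is not spaCy's tokenizer: `peel_suffixes` is a hypothetical helper that only mimics a suffix-stripping loop, and the `$` anchors reflect my understanding that `spacy.util.compile_suffix_regex` anchors each suffix pattern at the end of the string.

```python
import re

text = "확인하다."

unescaped = re.compile(r".$")  # "." is a wildcard: any final character matches
escaped = re.compile(r"\.$")   # "\." matches only a literal final period

def peel_suffixes(pattern: re.Pattern, s: str) -> list:
    """Strip matching suffixes from the right until the pattern stops matching."""
    pieces = []
    while s:
        m = pattern.search(s)
        if m is None:
            break
        pieces.insert(0, s[m.start():])
        s = s[:m.start()]
    if s:
        pieces.insert(0, s)  # whatever is left is the word itself
    return pieces

print(peel_suffixes(unescaped, text))  # every character becomes its own piece
print(peel_suffixes(escaped, text))    # the word, then the period
```

In the snippet from the earlier reply, the corresponding change would be to add r"\." rather than r"." to the suffix list.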

I think it should be fine to use the transformer.cfg after modifying the tokenizer and removing the experimental tokenizer pipeline component everywhere.

kanayer Mar 22, 2022
Author

Thank you so much, it worked! The accuracy for the parser now is UAS 91%, LAS 95%! And UPOS 97%! Thank you for your kind and detailed help!
