Compare two CoNLL-X files or directories to obtain the tokenization F-score, POS tag accuracy, and the LAS, UAS, and label scores.
Since comparison usually occurs between gold and parsed files, the two files/directories are differentiated using the gold and parsed keywords. That said, you do not actually need a gold file and a parsed file to compare; any two will do.
The tree alignment part of the code uses ced_word_alignment.
Note: the evaluator is also CoNLL-U compatible.
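For orientation, here is a minimal sketch of how UAS, label accuracy, and LAS are conventionally computed over already-aligned tokens; the function and the (head, label) tuple layout are illustrative assumptions, not this evaluator's actual API:

```python
# Illustrative sketch only: conventional attachment-score arithmetic over
# two equal-length, already-aligned lists of (head, label) pairs.

def attachment_scores(gold, parsed):
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / n    # heads match
    label = sum(g[1] == p[1] for g, p in zip(gold, parsed)) / n  # labels match
    las = sum(g == p for g, p in zip(gold, parsed)) / n          # both match
    return 100 * uas, 100 * label, 100 * las

# e.g. one wrong head out of three tokens:
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (3, "obj")]
print(attachment_scores(gold, parsed))  # (66.66..., 100.0, 66.66...)
```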
- Two files or directories are passed to the evaluator. If two directories are passed, the directories must have matching file names.
- The files are read, and the trees in each pair of files are compared.
- The trees are aligned using ced_word_alignment.
  - This involves inserting null alignment tokens.
- The evaluation scores are then calculated.
  - The tokenization F-score is calculated on all aligned tokens, while the remaining metrics are calculated after removing insertions (null alignment tokens added to the gold tree); see the sketch after this list.
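A rough sketch of that last step, assuming an alignment represented as (gold, parsed) pairs where either side may be a null alignment token (`None` here); this is an illustration, not the evaluator's actual code:

```python
# Illustrative sketch: tokenization scores are computed over the full
# alignment; the other metrics use only pairs whose gold side is real.

def tokenization_scores(alignment):
    matches = sum(g == p for g, p in alignment
                  if g is not None and p is not None)
    gold_count = sum(g is not None for g, _ in alignment)    # gold tokens
    parsed_count = sum(p is not None for _, p in alignment)  # parsed tokens
    precision = matches / parsed_count
    recall = matches / gold_count
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score

def drop_insertions(alignment):
    # remove insertions: null alignment tokens added to the gold tree
    return [(g, p) for g, p in alignment if g is not None]
```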
Since ced_word_alignment is used, the second and third assumptions below are inherited from it.
- No words are added to either the parsed or gold files.
- No changes to the word order.
- Text is in the same script and encoding.
- align_trees.py: aligns trees using the ced_word_alignment algorithm
- class_conllx: used to read CoNLL-X files
- classes: dataclasses used throughout the code
- conllx_counts: gets different statistics after comparing 2 CoNLL-X files
- conllx_scores: calculates scores given counts
- evaluate_conllx_driver: the main script
- handle_args: simplifies use of the argparse library
- requirements.txt: the necessary dependencies needed to run the scripts
- ced_word_alignment/: the ced alignment library
- README.md: this document
- Python 3.8 and above.
To use the evaluator, first install the necessary dependencies by running the following command:
```
pip install -r requirements.txt
```

```
usage: evaluate_conllx_driver.py [-h] [-g] [-p] [-gd] [-pd]

This script takes 2 CoNLL-X files or 2 directories of CoNLL-X files and evaluates the scores.

required arguments:
  -g , --gold          the gold CoNLL-X file
  -p , --parsed        the parsed CoNLL-X file

or:
  -gd , --gold_dir     the gold directory containing CoNLL-X files
  -pd , --parsed_dir   the parsed directory containing CoNLL-X files
```
The sentences used are taken from CamelTB_1001_introduction_1.conllx and CamelTB_1001_night_1_1.conllx (the data can be obtained from The Camel Treebank).
In this first sample, the tokenization is the same, so the F-score is 100%, and the insertion/deletion counts are both 0.
```
python src/main.py -g data/samples_gold/sample_1.conllx -p data/samples_parsed/sample_1.conllx
```
| metric | score |
| --- | --- |
| tokenization_f_score | 100.0 |
| tokenization_precision | 100.0 |
| tokenization_recall | 100.0 |
| word_accuracy | 100.0 |
| pos | 81.579 |
| uas | 55.263 |
| label | 65.789 |
| las | 44.737 |
| pp_uas_score | 0 |
| pp_label_score | 0 |
| pp_las_score | 0 |
```
python src/main.py -g data/samples_gold/sample_2.conllx -p data/samples_parsed/sample_2.conllx
```
| metric | score |
| --- | --- |
| tokenization_f_score | 90.385 |
| tokenization_precision | 90.385 |
| tokenization_recall | 90.385 |
| word_accuracy | 97.222 |
| pos | 86.538 |
| uas | 65.385 |
| label | 75.0 |
| las | 57.692 |
| pp_uas_score | 0.0 |
| pp_label_score | 0.0 |
| pp_las_score | 0.0 |
Using the arguments x (punctuation), n (numbers), and a (alef, yeh, and ta marbuta), the evaluation will ignore differences in the corresponding characters when comparing tokens. With these arguments, the following pairs compare as equal (a normalization sketch follows the list):

- `1` and `١`
- `,` and `،`
- `ي` and `ى`
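A minimal sketch of the kind of character normalization these flags imply; the exact character mappings below are assumptions for illustration, not necessarily the tool's own tables:

```python
# Illustrative sketch; the precise character sets covered by -x, -n, and -a
# are defined by the tool itself.
import re

# Arabic-Indic digits -> ASCII digits (assumed mapping for -n)
DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize(token, punct=False, num=False, alef=False):
    if num:                                    # -n
        token = token.translate(DIGITS)
    if punct:                                  # -x (assumed examples)
        token = token.replace("،", ",").replace("؟", "?").replace("؛", ";")
    if alef:                                   # -a
        token = re.sub("[أإآٱ]", "ا", token)   # alef variants -> bare alef
        token = token.replace("ى", "ي")        # alef maksura -> yeh
        token = token.replace("ة", "ه")        # ta marbuta -> heh
    return token

assert normalize("١", num=True) == normalize("1", num=True)
assert normalize("ى", alef=True) == normalize("ي", alef=True)
```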
```
python src/main.py -g data/samples_gold/sample_4_norm.conllx -p data/samples_parsed/sample_4_norm.conllx
```
| metric | score |
| --- | --- |
| tokenization_f_score | 80.0 |
| tokenization_precision | 80.0 |
| tokenization_recall | 80.0 |
| word_accuracy | 75.0 |
| pos | 80.0 |
| uas | 80.0 |
| label | 80.0 |
| las | 80.0 |
| pp_uas_score | 50.0 |
| pp_label_score | 50.0 |
| pp_las_score | 50.0 |
```
python src/main.py -g data/samples_gold/sample_4_norm.conllx -p data/samples_parsed/sample_4_norm.conllx -xn
```
| metric | score |
| --- | --- |
| tokenization_f_score | 100.0 |
| tokenization_precision | 100.0 |
| tokenization_recall | 100.0 |
| word_accuracy | 100.0 |
| pos | 100.0 |
| uas | 100.0 |
| label | 100.0 |
| las | 100.0 |
| pp_uas_score | 100.0 |
| pp_label_score | 100.0 |
| pp_las_score | 100.0 |
```
python src/main.py --gold_dir=data/samples_gold --parsed_dir=data/samples_parsed
```
| file | tokenization_f_score | tokenization_precision | tokenization_recall | word_accuracy | pos | uas | label | las | pp_uas_score | pp_label_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample_4_norm | 80.0 | 80.0 | 80.0 | 75.0 | 80.0 | 80.0 | 80.0 | 80.0 | 50.0 | 50.0 |
| sample_2 | 90.385 | 90.385 | 90.385 | 97.222 | 86.538 | 65.385 | 75.0 | 57.692 | 0.0 | 0.0 |
| sample_1 | 100.0 | 100.0 | 100.0 | 100.0 | 81.579 | 55.263 | 65.789 | 44.737 | 0.0 | 0 |
| sample_3 | 80.0 | 80.0 | 80.0 | 75.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | |
conllx_evaluator is available under the MIT license. See the LICENSE file for more info.