This repo contains the code for "Evaluating the Evaluators: Are readability metrics good measures of readability?"
If you use this repo, please cite the following paper:
<INSERT BIBTEX>
```bash
$ conda create python==3.9.16 --name eval-readability
$ conda activate eval-readability
$ python setup.py clean install
```
We use the following summarization datasets:
Our human-annotated readability data is from August et al. (2024).
Data format: the code expects the data in a plain-text file, with one summary per line.
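For reference, here is a minimal sketch of producing a file in this format (the file name and summaries are just placeholders):

```python
# Write summaries to a text file, one per line, which is the input
# format the rest of the pipeline expects.
summaries = [
    "Aspirin reduces the risk of heart attack in some adults.",
    "The study finds that sleep quality affects memory consolidation.",
]

with open("summaries.txt", "w", encoding="utf-8") as f:
    for summary in summaries:
        # Collapse internal newlines so each summary stays on a single line.
        f.write(summary.replace("\n", " ").strip() + "\n")
```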
For the HuggingFace datasets, use the following command to load and format the data:
```bash
$ python scripts/format_data.py \
    --dataset_name <HF_DATASET_NAME> \
    --subset <DATA_SUBSET> \
    --split <DATA_SPLIT> \
    --summary_col <COL_NAME_WITH_SUMMARIES> \
    --outfile </PATH/TO/SAVE/FORMATTED/DATA>
```
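If you want to see what this step amounts to, the sketch below does roughly the same thing by hand; the dataset name, subset, and column are illustrative examples, and `scripts/format_data.py` may differ in its details.

```python
from datasets import load_dataset

# Illustrative values -- substitute your own dataset, subset, split, and summary column.
dataset_name = "cnn_dailymail"
subset = "3.0.0"
split = "test"
summary_col = "highlights"

ds = load_dataset(dataset_name, subset, split=split)

# Write one summary per line, matching the expected data format.
with open("formatted_data.txt", "w", encoding="utf-8") as f:
    for example in ds:
        f.write(example[summary_col].replace("\n", " ").strip() + "\n")
```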
** This is the official, released version of this dataset. We found multiple grammatical errors and re-collected the dataset for this paper. We are currently working with the original authors of the SJK paper to re-release the cleaned data.
We use the following language models:
- Mistral 7B Instruct
- Mixtral-8x7B Instruct
- Gemma 1.1 7B Instruct
- Llama 3.1 8B Instruct
- Llama 3.3 70B Instruct
The following command prompts the model to rate the readability of the summaries in the input file:
```bash
$ python main.py \
    <MODEL> \
    </PATH/TO/INPUT_FILE> \
    </PATH/TO/OUTPUT_FILE> \
    -bsz <BATCH_SIZE>
```
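Under the hood, this step amounts to prompting an instruction-tuned model with each summary and asking for a rating. Below is a minimal sketch using the Hugging Face `transformers` chat template API; the prompt wording and model ID are illustrative, not necessarily what `main.py` uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any of the instruct models listed above would work similarly.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

summary = "Aspirin reduces the risk of heart attack in some adults."
messages = [{
    "role": "user",
    "content": (
        "Rate the readability of the following summary on a scale from 1 "
        "(hard to read) to 5 (easy to read). Respond with only the number.\n\n"
        f"Summary: {summary}"
    ),
}]

# Build the chat-formatted prompt and generate a short answer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```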
The following command runs a script to extract the model ratings:
```bash
$ python scripts/get_rating.py </PATH/TO/OUTPUT_FILE>
```
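The extraction boils down to parsing the numeric rating out of the generated text. A minimal sketch of that logic (the actual parsing in `scripts/get_rating.py` may be more robust):

```python
import re
from typing import Optional

def extract_rating(generation: str) -> Optional[int]:
    """Return the first 1-5 rating found in the model output, or None if absent."""
    match = re.search(r"\b([1-5])\b", generation)
    return int(match.group(1)) if match else None

print(extract_rating("I would rate this summary a 4 out of 5."))  # -> 4
```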
We release the results of our literature survey (Sec 4.1) here.
Jupyter notebooks with the analysis code can be found in `analysis/`.
- `analysis/model_analysis.ipynb` contains the code for comparing the human judgements to traditional metrics and LM readability judgements (Sec 4.2-4.3).
- `analysis/dataset_analysis.ipynb` contains the code for the LM-based evaluation of readability datasets (Sec 4.4).
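As a rough illustration of what the metric comparison involves, the sketch below correlates made-up human readability ratings with Flesch-Kincaid grade level using `textstat` and `scipy`; the notebooks use the actual annotations and metrics from the paper.

```python
import textstat
from scipy.stats import spearmanr

# Toy example: summaries paired with made-up mean human readability ratings (1-5).
summaries = [
    "Aspirin reduces the risk of heart attack in some adults.",
    "Regular exercise improves mood and helps people sleep better.",
    "Pharmacological inhibition of cyclooxygenase attenuates thrombotic cardiovascular events.",
    "Longitudinal multimodal neuroimaging reveals heterogeneous trajectories of cortical atrophy.",
]
human_ratings = [4.5, 4.8, 2.0, 1.5]  # illustrative values only

# Lower Flesch-Kincaid grade level roughly means easier to read, so we
# expect a negative correlation with human readability ratings.
fk_grades = [textstat.flesch_kincaid_grade(s) for s in summaries]
rho, p = spearmanr(human_ratings, fk_grades)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```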