This repository contains the code for the paper *Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation*.
The `nlpstats` library is required and must be installed locally from this repository, since parts of it have been modified (e.g., item-level correlation was added). Other dependencies include `numpy` and `pandas`.
To set up the environment and install dependencies:
```
cd nlpstats
pip install --editable .
pip install numpy
pip install pandas
```

- Download the dataset from this link, and place the `data.csv` file under the `DP_RC` directory.
- For sensitivity to score granularity, download from this link, and place the `data_all_rescaled.json` file under the `score_granularity` directory.
To calculate ranking consistency, you can use the following command. This will save the individual group results:
```
cd DP_RC
python ranking_consistency.py --input-file data.csv --output-file results.json --world-size 32 --number-trials 1000 --save-group-results
```

To calculate the discriminative power using permutation tests, use the following command. It will also save the individual group results:
```
cd DP_RC
python discriminative_power_permutaion_test.py --input-file data.csv --output-file results.json --world-size 32 --number-trials 1000 --save-group-results
```

To measure the sensitivity to score granularity using the scores of GPT-3.5/4/4o, use this command. You can adjust the number of workers with the `--num_workers` argument (default is 4):
```
cd score_granularity
python re_sampling.py --data_type summarization --model GPT-3.5 --input_file data_all_rescaled.json --output_file gpt3.5_summ_result.csv --num_workers 4
```
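The exact grouping scheme for ranking consistency lives in `ranking_consistency.py`. As a rough illustration only (the function names and splitting scheme here are hypothetical, not the repository's implementation), ranking consistency can be estimated by repeatedly splitting the evaluation items into two random halves, ranking the systems on each half, and averaging the rank correlation between the two rankings:

```python
import random
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau between two equal-length score lists (naive O(n^2))."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def ranking_consistency(scores, n_trials=100, seed=0):
    """scores: dict mapping system name -> list of per-item metric scores.
    Repeatedly split the items into two random halves, rank the systems by
    their mean score on each half, and average the rank correlation."""
    rng = random.Random(seed)
    systems = sorted(scores)
    n_items = len(next(iter(scores.values())))
    taus = []
    for _ in range(n_trials):
        items = list(range(n_items))
        rng.shuffle(items)
        half = n_items // 2
        g1, g2 = items[:half], items[half:]
        r1 = [sum(scores[s][i] for i in g1) / len(g1) for s in systems]
        r2 = [sum(scores[s][i] for i in g2) / len(g2) for s in systems]
        taus.append(kendall_tau(r1, r2))
    return sum(taus) / len(taus)
```

A metric whose system ranking is stable across item subsets scores near 1; an unstable one drifts toward 0.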
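The script above implements the full discriminative-power analysis; its core building block, a paired permutation test that decides whether two systems' per-item scores differ significantly by randomly swapping paired scores, can be sketched as follows (the function name and details are illustrative, not the script's actual code):

```python
import random

def paired_permutation_test(x, y, n_trials=1000, seed=0):
    """Two-sided paired permutation test for the difference in mean
    per-item scores between two systems x and y (equal-length lists).
    Returns an estimated p-value."""
    rng = random.Random(seed)
    n = len(x)
    observed = abs(sum(x) - sum(y)) / n
    count = 0
    for _ in range(n_trials):
        sx = sy = 0.0
        for xi, yi in zip(x, y):
            if rng.random() < 0.5:  # randomly swap the paired scores
                xi, yi = yi, xi
            sx += xi
            sy += yi
        if abs(sx - sy) / n >= observed:
            count += 1
    return count / n_trials
```

Discriminative power is then, roughly, the fraction of system pairs the metric can separate at a chosen significance level.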
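Conceptually, sensitivity to score granularity asks how a metric's correlation with reference scores changes when its scores are quantized onto coarser scales. `re_sampling.py` runs this analysis on the GPT-3.5/4/4o scores; this toy sketch (all names hypothetical) only illustrates the quantization idea:

```python
import math

def rescale(scores, levels, lo=0.0, hi=1.0):
    """Quantize scores in [lo, hi] onto `levels` equally spaced values."""
    step = (hi - lo) / (levels - 1)
    return [lo + round((s - lo) / step) * step for s in scores]

def pearson(a, b):
    """Pearson correlation between two equal-length score lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def granularity_sensitivity(metric_scores, human_scores, levels=(3, 5, 10, 100)):
    """Correlation with human scores after quantizing the metric scores
    onto each granularity level."""
    return {L: pearson(rescale(metric_scores, L), human_scores) for L in levels}
```

Comparing the correlations across granularity levels shows how much a coarse scoring scale (e.g., 3 levels vs. 100) degrades the metric's agreement with human judgments.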