Official code for the paper "CAPE: Context-Aware Personality Evaluation Framework for Large Language Models", accepted at EMNLP 2025 (Findings). If you use this code, please cite our paper.
Some of the important software dependencies are as follows:
- Python 3.9.21
- anthropic 0.49.0
- google-generativeai 0.8.3
- openai 1.60.2
- scikit-learn 1.6.1
We assume that you have conda installed beforehand. Use the following commands to replicate our environment from openai.yml; you can find the dependency details in these .yml files. Make sure you export your OpenAI API key to the environment using export OPENAI_API_KEY=YOUR_KEY.
conda env create -f openai.yml
conda activate openai
cd codes
Execute all scripts from the ./codes directory.
Using the following script, you can apply CAPE to any LLM. Make sure you have an API key for the respective LLM provider.
python run_CAPE.py --model_name gpt-3.5-turbo --Experiment_name _stability_ --run_id 1 --test_temperature 0.0 --Activate_full_context true --shuffle_questions 1 --instruction_setting 1 --option_ordering_setting 1 --option_wording_setting 1 --response_sensitivity_setting 1 --item_paraphrasing_setting 1
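Depending on the model you query, the corresponding provider's API key must be available in the environment. A sketch of the conventional variable names (only OPENAI_API_KEY is confirmed in this README; the other names follow each SDK's documented defaults):

```shell
# Export the key for whichever provider hosts your target model.
# Only OPENAI_API_KEY is confirmed by this README; ANTHROPIC_API_KEY and
# GOOGLE_API_KEY are the defaults documented by the respective SDKs.
export OPENAI_API_KEY=YOUR_KEY      # gpt-3.5-turbo, gpt-4-turbo
export ANTHROPIC_API_KEY=YOUR_KEY   # claude-3-5-haiku-20241022
export GOOGLE_API_KEY=YOUR_KEY      # gemini-1.5-flash
```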
Using the following flags, you can generate results for all the combinations reported in Table 1. To activate our CAPE framework, use --Activate_full_context true. For details on the other flags, refer to the code or Appendix D in the paper.
- model_name: gpt-3.5-turbo, gpt-4-turbo, gemini-1.5-flash, claude-3-5-haiku-20241022, llama-3.1-8b-instant, meta-llama/Llama-3.3-70B-Instruct-Turbo-Free, meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
- Experiment_name: _stability_, _temperature_, _option_wording_, _option_ordering_, _item_paraphrase_
- run_id: 1, 2, 3
- test_temperature: 0.5, 1.0, 1.5
- Activate_full_context: true, false
- shuffle_questions: 1, 2, 3
- instruction_setting: 1, 2, 3
- option_ordering_setting: 1, 2, 3
- option_wording_setting: 1, 2, 3
- item_paraphrasing_setting: 1, 2, 3
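As a convenience, the repeated runs can be swept with a small shell loop. This is a hypothetical dry-run sketch built from the example command above: it only echoes each command, so delete "echo" to actually launch the runs.

```shell
# Dry-run sweep over run_id for the stability experiment.
# Flags mirror the example command above; remove "echo" to execute.
for run_id in 1 2 3; do
  echo python run_CAPE.py --model_name gpt-3.5-turbo \
    --Experiment_name _stability_ --run_id "$run_id" \
    --test_temperature 0.0 --Activate_full_context true \
    --shuffle_questions 1 --instruction_setting 1 \
    --option_ordering_setting 1 --option_wording_setting 1 \
    --response_sensitivity_setting 1 --item_paraphrasing_setting 1
done
```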
We have stored all assessment reports in the Results folder. The following script uses these stored reports to evaluate the results. Please note that if you choose to regenerate the reports using run_CAPE.py, you may not get the exact numbers reported in Table 1 due to factors such as the stochastic nature of LLMs and the exact model version; however, you should observe the same trends as in Table 1.
bash CAPE_results.sh
For the analysis, refer to the CAPE_Analysis_Results.ipynb notebook.
We used the Chatharushi library in our implementation, so you need to create a new conda environment using the chatharushi.yml file. Activate this new environment to assess all 32 characters' personas with and without our framework. Each RPA is run 3 times to evaluate consistency.
conda env create -f chatharushi.yml
conda activate chatharushi
bash RPA_tests.sh
The personality assessment tests of the RPAs are stored in the Results/Character-BFI folder. The following script produces the overall scores reported in Table 2 for the respective GPT version. Please note that the evaluation script takes a little time because the GPR model needs time to learn the function.
python RPA_Results.py --With_CAPE 0 --GPT "3.5"
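To compare both settings in one go, the call above can be wrapped in a loop over --With_CAPE. This is a hypothetical dry-run sketch that only echoes each command; remove "echo" to actually run the evaluation.

```shell
# Score GPT-3.5 RPAs without (0) and with (1) CAPE.
# Remove "echo" to actually run the evaluation script.
for cape in 0 1; do
  echo python RPA_Results.py --With_CAPE "$cape" --GPT "3.5"
done
```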
@misc{sandhan2025capecontextawarepersonalityevaluation,
title={CAPE: Context-Aware Personality Evaluation Framework for Large Language Models},
author={Jivnesh Sandhan and Fei Cheng and Tushar Sandhan and Yugo Murawaki},
year={2025},
eprint={2508.20385},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.20385},
}
This project is licensed under the terms of the Apache license 2.0.
This work was supported by the “R&D Hub Aimed at Ensuring Transparency and Reliability of Generative AI Models” project of the Ministry of Education, Culture, Sports, Science and Technology.