Repository for the paper "Evaluating Language Model Reasoning about Confidential Information" and the PasswordEval benchmark. You can also find our dataset on HuggingFace.
Our dataset is contained in `data/generated_passwords.json`, and the multi-turn version is contained in `data/generated_multi_passwords.json`. To re-generate examples (or to generate additional examples) for both of these datasets, you can run:
```
src/generate_password.py
src/generate_password_multiple.py
```
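For a quick look at the generated data, here is a minimal sketch, assuming each JSON file holds a list of example records (the exact fields depend on the generation scripts):

```python
import json

# Load the single-turn dataset; data/generated_multi_passwords.json
# (the multi-turn version) can be loaded the same way.
with open("data/generated_passwords.json") as f:
    examples = json.load(f)  # assumed: a list of example records

print(f"Loaded {len(examples)} examples")
print(examples[0])  # inspect one record to see its fields
```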
To evaluate models on this task, please run:

```
python -m src.grade_passwords --model gpt-4o-mini \
    --output_path ./eval_out/gpt-4o-mini.jsonl --context_examples

python -m src.grade_passwords --model gpt-4o-mini \
    --jailbreak --output_path ./eval_out/gpt-4o-mini_jailbreak.jsonl --context_examples
```
Note that the `--jailbreak` flag uses a fixed adversarial template in the user request, and the `--context_examples` flag adds in-context examples, which is the primary evaluation setting for this benchmark. To evaluate with adaptive jailbreaks (see the next section), use `src/grade_passwords_gcg.py` and `src/grade_passwords_pair.py`.
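After an evaluation run finishes, the per-example results can be read back from the JSONL output. A minimal sketch, assuming one JSON record per line (the exact field names are determined by the grading script):

```python
import json

# Read the results written by src.grade_passwords via --output_path.
records = []
with open("eval_out/gpt-4o-mini.jsonl") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(f"{len(records)} graded examples")
print(records[0].keys())  # field names depend on the grading script
```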
If you are evaluating API-based models (e.g., GPT or o-series models, or Gemini models), please make sure to set your API keys in `src/utils.py`.
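As an illustration only (the actual variable names inside `src/utils.py` may differ), one common pattern is to read the keys from environment variables rather than hard-coding them:

```python
# Illustrative sketch; match the names that src/utils.py actually uses.
import os

# Pull keys from the environment so they never get committed to the repo.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
```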
To run more advanced jailbreaking strategies, you can use the files in nanoGCG and HarmBench (for PAIR). GCG can be run with the command:
```
python gcg.py \
    --start_index 0 \
    --end_index 1 \
    --model_id meta-llama/Llama-3.1-8B-Instruct \
    --model_name Llama-3.1-8B-Instruct \
    --num_steps 500 \
    --context_examples
```
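Since `--start_index` and `--end_index` select which examples a run attacks, one way to sweep the dataset is to launch one chunk at a time. A rough sketch (the index range and chunk size here are arbitrary; the flags are taken verbatim from the command above):

```python
import subprocess

# Attack examples 0..9 one at a time; adjust the range to cover the full dataset.
for i in range(10):
    subprocess.run(
        [
            "python", "gcg.py",
            "--start_index", str(i),
            "--end_index", str(i + 1),
            "--model_id", "meta-llama/Llama-3.1-8B-Instruct",
            "--model_name", "Llama-3.1-8B-Instruct",
            "--num_steps", "500",
            "--context_examples",
        ],
        check=True,
    )
```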
If you found our paper or repository helpful, please cite our work as:
```
@misc{sam2025evaluating,
  title={Evaluating Language Model Reasoning about Confidential Information},
  author={Dylan Sam and Alexander Robey and Andy Zou and Matt Fredrikson and J. Zico Kolter},
  year={2025},
  eprint={2508.19980},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.19980},
}
```