confidential_llms

Repository for the paper "Evaluating Language Model Reasoning about Confidential Information" and the PasswordEval benchmark. You can also find our dataset on HuggingFace.

Data

Our dataset is contained in 'data/generated_passwords.json', and the multi-turn version is contained in 'data/generated_multi_passwords.json'. To re-generate examples (or to generate additional examples) for both of these datasets, run:

python src/generate_password.py
python src/generate_password_multiple.py
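
For a quick sanity check, the snippet below loads the single-turn file and reports how many examples it contains. It assumes the top-level JSON structure is a list of example objects; if the file is organized differently, adjust accordingly.

import json

# Load the single-turn PasswordEval examples.
# Assumption: the top-level JSON structure is a list of example objects.
with open("data/generated_passwords.json") as f:
    examples = json.load(f)

print(f"Loaded {len(examples)} examples")
if isinstance(examples, list) and examples and isinstance(examples[0], dict):
    print("Fields in the first example:", sorted(examples[0].keys()))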

Evaluating Models

To evaluate models on this task, please run:

python -m src.grade_passwords --model gpt-4o-mini \
    --output_path ./eval_out/gpt-4o-mini.jsonl --context_examples

python -m src.grade_passwords --model gpt-4o-mini \
    --jailbreak --output_path ./eval_out/gpt-4o-mini_jailbreak.jsonl --context_examples

Note that the '--jailbreak' flag uses a fixed adversarial template in the user request, while the '--context_examples' flag adds in-context examples, which is the primary evaluation setting for this benchmark. To evaluate with adaptive jailbreaks (see the next section), use the 'src/grade_passwords_gcg.py' and 'src/grade_passwords_pair.py' files.
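
Assuming the '.jsonl' output contains one JSON object per line, the sketch below is a quick way to skim the results. It does not assume any particular field names, only that each line is valid JSON; the path matches the first example command above.

import json

# Summarize the JSONL written by src.grade_passwords.
# No field names are assumed; we only report the record count
# and the keys present in the first record.
records = []
with open("eval_out/gpt-4o-mini.jsonl") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(f"{len(records)} evaluation records")
if records and isinstance(records[0], dict):
    print("Keys in the first record:", sorted(records[0].keys()))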

If you are evaluating API-based models (e.g., GPT and o-series models, or Gemini models), please make sure to set your API keys in 'src/utils.py'.
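
The repository expects keys in 'src/utils.py'; as a hedged alternative sketch (not the repository's actual layout), keys could instead be sourced from environment variables so they never land in the file itself. The variable names below are illustrative assumptions.

import os

# Hypothetical sketch: pull provider API keys from the environment
# instead of hard-coding them. The names below are assumptions,
# not identifiers taken from src/utils.py.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")

if not OPENAI_API_KEY:
    raise RuntimeError("Set OPENAI_API_KEY before evaluating OpenAI models.")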

Jailbreaks

To run more advanced jailbreaking strategies, you can use the files in nanoGCG and HarmBench (for PAIR). GCG can be run with the command:

python gcg.py \
    --start_index 0 \
    --end_index 1 \
    --model_id    meta-llama/Llama-3.1-8B-Instruct \
    --model_name  Llama-3.1-8B-Instruct \
    --num_steps   500 \
    --context_examples
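
To cover a larger slice of the dataset, the same script can be driven over consecutive index ranges. The sketch below is one way to do that with subprocess, reusing only the flags shown in the command above; the total count and chunk size are placeholders, not values from the repository.

import subprocess

# Run gcg.py over the dataset in consecutive index chunks.
# TOTAL_EXAMPLES and CHUNK are placeholder values; the flags mirror
# the example GCG command above.
TOTAL_EXAMPLES = 100
CHUNK = 10

for start in range(0, TOTAL_EXAMPLES, CHUNK):
    end = min(start + CHUNK, TOTAL_EXAMPLES)
    subprocess.run(
        [
            "python", "gcg.py",
            "--start_index", str(start),
            "--end_index", str(end),
            "--model_id", "meta-llama/Llama-3.1-8B-Instruct",
            "--model_name", "Llama-3.1-8B-Instruct",
            "--num_steps", "500",
            "--context_examples",
        ],
        check=True,
    )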

Citation

If you found our paper or repository helpful, please cite our work as:

@misc{sam2025evaluating,
      title={Evaluating Language Model Reasoning about Confidential Information}, 
      author={Dylan Sam and Alexander Robey and Andy Zou and Matt Fredrikson and J. Zico Kolter},
      year={2025},
      eprint={2508.19980},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.19980}, 
}
