confidential_llms

Repository for the paper "Evaluating Language Model Reasoning about Confidential Information" and the PasswordEval benchmark. You can also find our dataset on HuggingFace.

Data

Our dataset is contained in 'data/generated_passwords.json', and the multi-turn version is contained in 'data/generated_multi_passwords.json'. To re-generate examples (or to generate additional examples) for both of these datasets, run:

python src/generate_password.py
python src/generate_password_multiple.py
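
For a quick sanity check, the snippet below loads the single-turn file and reports how many examples it contains. It assumes the top-level JSON structure is a list of example objects; if the file is organized differently, adjust accordingly.

import json

# Load the single-turn PasswordEval examples.
# Assumption: the top-level JSON structure is a list of example objects.
with open("data/generated_passwords.json") as f:
    examples = json.load(f)

print(f"Loaded {len(examples)} examples")
if isinstance(examples, list) and examples and isinstance(examples[0], dict):
    print("Fields in the first example:", sorted(examples[0].keys()))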

Evaluating Models

To evaluate models on this task, please run:

python -m src.grade_passwords --model gpt-4o-mini \
    --output_path ./eval_out/gpt-4o-mini.jsonl --context_examples

python -m src.grade_passwords --model gpt-4o-mini \
    --jailbreak --output_path ./eval_out/gpt-4o-mini_jailbreak.jsonl --context_examples

Note that the '--jailbreak' flag uses a fixed adversarial template in the user request, while the '--context_examples' flag adds in-context examples, which is the primary evaluation setting for this benchmark. To evaluate with adaptive jailbreaks (see the next section), use the 'src/grade_passwords_gcg.py' and 'src/grade_passwords_pair.py' files.
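
Assuming the '.jsonl' output contains one JSON object per line, the sketch below is a quick way to skim the results. It does not assume any particular field names, only that each line is valid JSON; the path matches the first example command above.

import json

# Summarize the JSONL written by src.grade_passwords.
# No field names are assumed; we only report the record count
# and the keys present in the first record.
records = []
with open("eval_out/gpt-4o-mini.jsonl") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(f"{len(records)} evaluation records")
if records and isinstance(records[0], dict):
    print("Keys in the first record:", sorted(records[0].keys()))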

If you are evaluating API-based models (e.g., GPT and o-series models, or Gemini models), please make sure to set your API keys in 'src/utils.py'.
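
The repository expects keys in 'src/utils.py'; as a hedged alternative sketch (not the repository's actual layout), keys could instead be sourced from environment variables so they never land in the file itself. The variable names below are illustrative assumptions.

import os

# Hypothetical sketch: pull provider API keys from the environment
# instead of hard-coding them. The names below are assumptions,
# not identifiers taken from src/utils.py.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")

if not OPENAI_API_KEY:
    raise RuntimeError("Set OPENAI_API_KEY before evaluating OpenAI models.")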

Jailbreaks

To run more advanced jailbreaking strategies, you can use the files in nanoGCG and HarmBench (for PAIR). GCG can be run with the command:

python gcg.py \
    --start_index 0 \
    --end_index 1 \
    --model_id    meta-llama/Llama-3.1-8B-Instruct \
    --model_name  Llama-3.1-8B-Instruct \
    --num_steps   500 \
    --context_examples
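
To cover a larger slice of the dataset, the same script can be driven over consecutive index ranges. The sketch below is one way to do that with subprocess, reusing only the flags shown in the command above; the total count and chunk size are placeholders, not values from the repository.

import subprocess

# Run gcg.py over the dataset in consecutive index chunks.
# TOTAL_EXAMPLES and CHUNK are placeholder values; the flags mirror
# the example GCG command above.
TOTAL_EXAMPLES = 100
CHUNK = 10

for start in range(0, TOTAL_EXAMPLES, CHUNK):
    end = min(start + CHUNK, TOTAL_EXAMPLES)
    subprocess.run(
        [
            "python", "gcg.py",
            "--start_index", str(start),
            "--end_index", str(end),
            "--model_id", "meta-llama/Llama-3.1-8B-Instruct",
            "--model_name", "Llama-3.1-8B-Instruct",
            "--num_steps", "500",
            "--context_examples",
        ],
        check=True,
    )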

Citation

If you found our paper or repository helpful, please cite our work as:

@misc{sam2025evaluating,
      title={Evaluating Language Model Reasoning about Confidential Information}, 
      author={Dylan Sam and Alexander Robey and Andy Zou and Matt Fredrikson and J. Zico Kolter},
      year={2025},
      eprint={2508.19980},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.19980}, 
}
