Commit 4270243

Updating data description
1 parent 4584be7 commit 4270243

1 file changed, +8 -5 lines changed

docs/ref/checks/jailbreak.md

Lines changed: 8 additions & 5 deletions
@@ -89,13 +89,16 @@ When conversation history is available, the guardrail automatically:

 ### Dataset Description

-This benchmark evaluates model performance on a diverse set of prompts:
+This benchmark combines multiple public datasets and synthetic benign conversations:

-- **Subset of the open source jailbreak dataset [JailbreakV-28k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)** (n=2,000)
-- **Synthetic prompts** covering a diverse range of benign topics (n=1,000)
-- **Open source [Toxicity](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv) dataset** containing harmful content that does not involve jailbreak attempts (n=1,000)
+- **Red Queen jailbreak corpus ([GitHub](https://github.com/kriti-hippo/red_queen/blob/main/Data/Red_Queen_Attack.zip))**: 14,000 positive samples collected with gpt-4o attacks.
+- **Tom Gibbs multi-turn jailbreak attacks ([Hugging Face](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets/tree/main))**: 4,136 positive samples.
+- **Scale MHJ dataset ([Hugging Face](https://huggingface.co/datasets/ScaleAI/mhj))**: 537 positive samples.
+- **Synthetic benign conversations**: 12,433 negative samples generated by seeding prompts from [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix?utm_source=chatgpt.com) where `adversarial=false` and `prompt_harm_label=false`, then expanding each single-turn input into five-turn dialogues using gpt-4.1.

-**Total n = 4,000; positive class prevalence = 2,000 (50.0%)**
+**Total n = 31,106; positives = 18,673; negatives = 12,433**
+
+For benchmarking, we randomly sampled 4,000 conversations from this pool while preserving the overall class balance.

 ### Results
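
The added text states the pool sizes and says the 4,000-conversation benchmark sample preserves the overall class balance. The commit does not show how the sampling was done, so the following is only a minimal sketch of one plausible reading: proportional (stratified) allocation followed by random sampling without replacement. The helper names `stratified_counts` and `stratified_sample`, the seed, and the placeholder IDs are hypothetical and not taken from the repository.

```python
import random

# Counts stated in the updated dataset description.
POSITIVES = 14_000 + 4_136 + 537   # Red Queen + Tom Gibbs + Scale MHJ
NEGATIVES = 12_433                 # synthetic benign conversations
SAMPLE_SIZE = 4_000


def stratified_counts(n_pos, n_neg, sample_size):
    """Split sample_size across the two classes in proportion to pool sizes."""
    total = n_pos + n_neg
    k_pos = round(sample_size * n_pos / total)
    return k_pos, sample_size - k_pos


def stratified_sample(pos_items, neg_items, sample_size, seed=0):
    """Draw a class-proportional sample without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    k_pos, k_neg = stratified_counts(len(pos_items), len(neg_items), sample_size)
    sample = rng.sample(pos_items, k_pos) + rng.sample(neg_items, k_neg)
    rng.shuffle(sample)
    return sample


# Placeholder IDs stand in for the real conversations.
pool_pos = [("pos", i) for i in range(POSITIVES)]
pool_neg = [("neg", i) for i in range(NEGATIVES)]

sample = stratified_sample(pool_pos, pool_neg, SAMPLE_SIZE)
n_pos_sampled = sum(1 for label, _ in sample if label == "pos")
print(POSITIVES + NEGATIVES, n_pos_sampled, len(sample) - n_pos_sampled)
# prints: 31106 2401 1599
```

Under this reading, the source totals check out (14,000 + 4,136 + 537 = 18,673 positives; 18,673 + 12,433 = 31,106), and a balance-preserving sample of 4,000 would hold about 2,401 positives and 1,599 negatives, roughly 60% positive rather than the 50% of the previous benchmark description.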
