@@ -101,25 +101,25 @@ This benchmark evaluates model performance on a diverse set of prompts:
 
 #### ROC Curve
 
-![ROC Curve](../../benchmarking/jailbreak_roc_curve.png)
+![ROC Curve](../../benchmarking/Jailbreak_roc_curves.png)
 
 #### Metrics Table
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
+| gpt-5 | 0.994 | 0.993 | 0.993 | 0.993 | 0.997 |
+| gpt-5-mini | 0.813 | 0.832 | 0.832 | 0.832 | 0.000 |
+| gpt-4.1 | 0.999 | 0.999 | 0.999 | 0.999 | 1.000 |
+| gpt-4.1-mini (default) | 0.928 | 0.968 | 0.968 | 0.500 | 0.000 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
-| gpt-5 | 4,569 | 7,256 |
-| gpt-5-mini | 5,019 | 9,212 |
-| gpt-4.1 | 841 | 1,861 |
-| gpt-4.1-mini | 749 | 1,291 |
+| gpt-5 | 7,369.9 | 12,218.1 |
+| gpt-5-mini | 7,054.6 | 11,578.6 |
+| gpt-4.1 | 2,998.1 | 4,203.8 |
+| gpt-4.1-mini | 1,537.8 | 2,089.3 |
 
 **Notes:**
 
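The Prec@R and Recall@FPR columns in the tables above follow the standard operating-point definitions: precision at the best threshold whose recall meets the target, and recall at the best threshold whose false-positive rate stays under the cap. A minimal sketch of those two metrics, using made-up toy labels and scores rather than the benchmark's actual data (how the benchmark itself computes them is an assumption):

```python
def precision_at_recall(labels, scores, target_recall):
    """Best precision over thresholds whose recall >= target_recall."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for _score, label in ranked:  # lower the threshold one item at a time
        tp += label
        fp += 1 - label
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best

def recall_at_fpr(labels, scores, max_fpr):
    """Best recall over thresholds whose false-positive rate <= max_fpr."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    tp = fp = 0
    best = 0.0
    for _score, label in ranked:
        tp += label
        fp += 1 - label
        if fp / total_neg <= max_fpr:
            best = max(best, tp / total_pos)
    return best

# Toy data: 1 = jailbreak, 0 = benign, scores from a hypothetical classifier.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
print(precision_at_recall(labels, scores, 0.80))  # → 0.8
print(recall_at_fpr(labels, scores, 0.01))        # → 0.75
```

This per-item threshold sweep ignores tied scores, which is fine for a sketch; a production implementation would group ties into a single operating point.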