Skip to content

Commit f75a8c1

Browse files
CopilotXunzhuo
andcommitted
Update documentation for category-level jailbreak detection
- Add category-level jailbreak configuration to jailbreak-protection.md - Update category configuration docs with jailbreak_enabled parameter - Add security-focused configuration example - Update global configuration docs with category override notes - Update README to mention fine-grained security control Co-authored-by: Xunzhuo <[email protected]>
1 parent 458d7e7 commit f75a8c1

File tree

4 files changed

+148
-5
lines changed

4 files changed

+148
-5
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -76,7 +76,7 @@ Detect PII in the prompt, avoiding sending PII to the LLM so as to protect the p
7676

7777
#### Prompt guard
7878

79-
Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving.
79+
Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving. Can be configured globally or at the category level for fine-grained security control.
8080

8181
### Similarity Caching ⚡️
8282

website/docs/installation/configuration.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ tools:
3838

3939
# Jailbreak protection
4040
prompt_guard:
41-
enabled: false
41+
enabled: false # Global default - can be overridden per category
4242
use_modernbert: true
4343
model_id: "models/jailbreak_classifier_modernbert-base_model"
4444
threshold: 0.7
@@ -84,6 +84,8 @@ categories:
8484
# Optional: Category-level cache settings
8585
# semantic_cache_enabled: true
8686
# semantic_cache_similarity_threshold: 0.9 # Higher threshold for math
87+
# Optional: Category-level jailbreak settings
88+
# jailbreak_enabled: true # Override global jailbreak detection
8789
- name: computer science
8890
model_scores:
8991
- model: your-model

website/docs/overview/categories/configuration.md

Lines changed: 68 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -83,6 +83,34 @@ curl -X PUT http://localhost:8080/config/system-prompts \
8383

8484
### Reasoning Configuration
8585

86+
#### `jailbreak_enabled` (Optional)
87+
88+
- **Type**: Boolean
89+
- **Description**: Whether to enable jailbreak detection for this category
90+
- **Default**: Inherits from global `prompt_guard.enabled` setting
91+
- **Impact**: Enables or disables jailbreak protection for this specific category
92+
93+
```yaml
94+
categories:
95+
- name: customer_support
96+
jailbreak_enabled: true # Explicitly enable for public-facing
97+
model_scores:
98+
- model: qwen3
99+
score: 0.8
100+
101+
- name: code_generation
102+
jailbreak_enabled: false # Disable for internal tools
103+
model_scores:
104+
- model: qwen3
105+
score: 0.9
106+
107+
- name: general
108+
# No jailbreak_enabled - inherits from global prompt_guard.enabled
109+
model_scores:
110+
- model: qwen3
111+
score: 0.5
112+
```
113+
86114
#### `use_reasoning` (Required)
87115

88116
- **Type**: Boolean
@@ -196,7 +224,46 @@ categories:
196224
score: 0.2
197225
```
198226

199-
### Example 3: Multi-Category Configuration
227+
### Example 3: Security-Focused Configuration (Jailbreak Protection)
228+
229+
```yaml
230+
categories:
231+
# High-security public-facing category
232+
- name: "customer_support"
233+
description: "Customer support and general inquiries"
234+
jailbreak_enabled: true # Strict jailbreak protection
235+
use_reasoning: false
236+
model_scores:
237+
- model: "phi4"
238+
score: 0.9
239+
- model: "mistral-small3.1"
240+
score: 0.7
241+
242+
# Trusted internal development category
243+
- name: "code_generation"
244+
description: "Internal code generation for developers"
245+
jailbreak_enabled: false # Allow broader input for trusted users
246+
use_reasoning: true
247+
reasoning_effort: "medium"
248+
model_scores:
249+
- model: "gemma3:27b"
250+
score: 0.9
251+
- model: "phi4"
252+
score: 0.7
253+
254+
# General category using global default
255+
- name: "general"
256+
description: "General queries"
257+
# jailbreak_enabled not specified - inherits from global prompt_guard.enabled
258+
use_reasoning: false
259+
model_scores:
260+
- model: "phi4"
261+
score: 0.6
262+
- model: "mistral-small3.1"
263+
score: 0.6
264+
```
265+
266+
### Example 4: Multi-Category Configuration
200267

201268
```yaml
202269
categories:

website/docs/tutorials/content-safety/jailbreak-protection.md

Lines changed: 76 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -43,14 +43,59 @@ Enable jailbreak detection in your configuration:
4343
```yaml
4444
# config/config.yaml
4545
prompt_guard:
46-
enabled: true
46+
enabled: true # Global default - can be overridden per category
4747
model_id: "models/jailbreak_classifier_modernbert-base_model"
4848
threshold: 0.7 # Detection sensitivity (0.0-1.0)
4949
use_cpu: true # Run on CPU
5050
use_modernbert: true # Use ModernBERT architecture
5151
jailbreak_mapping_path: "config/jailbreak_type_mapping.json" # Path to jailbreak type mapping
5252
```
5353
54+
### Category-Level Jailbreak Protection
55+
56+
You can enable or disable jailbreak detection at the category level for fine-grained security control:
57+
58+
```yaml
59+
# Global default setting
60+
prompt_guard:
61+
enabled: true # Default for all categories
62+
63+
categories:
64+
# High-security category - explicitly enable
65+
- name: customer_support
66+
jailbreak_enabled: true # Strict protection for public-facing
67+
model_scores:
68+
- model: qwen3
69+
score: 0.8
70+
71+
# Internal tool - disable for trusted environment
72+
- name: code_generation
73+
jailbreak_enabled: false # Allow broader input for developers
74+
model_scores:
75+
- model: qwen3
76+
score: 0.9
77+
78+
# General category - inherits global setting
79+
- name: general
80+
# No jailbreak_enabled specified - uses global prompt_guard.enabled
81+
model_scores:
82+
- model: qwen3
83+
score: 0.5
84+
```
85+
86+
**Category-Level Behavior**:
87+
88+
- **When `jailbreak_enabled` is not specified**: Category inherits from global `prompt_guard.enabled`
89+
- **When `jailbreak_enabled: true`**: Jailbreak detection is explicitly enabled for this category
90+
- **When `jailbreak_enabled: false`**: Jailbreak detection is explicitly disabled for this category
91+
- **Category-specific setting always overrides global setting** when explicitly configured
92+
93+
**Use Cases**:
94+
95+
- **Enable for public-facing categories**: Customer support, business advice
96+
- **Disable for internal tools**: Code generation for developers, testing environments
97+
- **Inherit for general categories**: Use global default for most categories
98+
5499
## How Jailbreak Protection Works
55100

56101
The jailbreak protection system works as follows:
@@ -134,9 +179,38 @@ security_policy_violations_total 45
134179
### 4. Integration with Routing
135180
136181
- Apply stricter protection to sensitive models
137-
- Use different thresholds for different categories
182+
- Use category-level jailbreak settings for different domains
138183
- Combine with PII detection for comprehensive security
139184
185+
**Example**: Configure different jailbreak policies per category:
186+
187+
```yaml
188+
prompt_guard:
189+
enabled: true # Global default
190+
191+
categories:
192+
# Strict protection for customer-facing categories
193+
- name: customer_support
194+
jailbreak_enabled: true
195+
model_scores:
196+
- model: safe-model
197+
score: 0.9
198+
199+
# Relaxed protection for internal development
200+
- name: code_generation
201+
jailbreak_enabled: false # Allow broader input
202+
model_scores:
203+
- model: code-model
204+
score: 0.9
205+
206+
# Use global default for general queries
207+
- name: general
208+
# Inherits from prompt_guard.enabled
209+
model_scores:
210+
- model: general-model
211+
score: 0.7
212+
```
213+
140214
## Troubleshooting
141215

142216
### High False Positives

0 commit comments

Comments
 (0)