Commit ea50e6c

Copilot and Xunzhuo committed
Add category-level jailbreak threshold configuration
- Add JailbreakThreshold field to Category struct
- Add GetJailbreakThresholdForCategory helper method
- Create CheckForJailbreakWithThreshold and AnalyzeContentForJailbreakWithThreshold methods
- Update performSecurityChecks to use category-specific threshold
- Add 5 comprehensive tests for threshold configuration
- Update example configs with threshold tuning examples
- Update documentation with threshold configuration and tuning guidelines
- Add threshold tuning guide with recommendations for different category types

Co-authored-by: Xunzhuo <[email protected]>
1 parent f75a8c1 commit ea50e6c

8 files changed: +219 -37 lines changed

config/config.yaml

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ categories:
   - name: business
     system_prompt: "You are a senior business consultant and strategic advisor with expertise in corporate strategy, operations management, financial analysis, marketing, and organizational development. Provide practical, actionable business advice backed by proven methodologies and industry best practices. Consider market dynamics, competitive landscape, and stakeholder interests in your recommendations."
     # jailbreak_enabled: true # Optional: Override global jailbreak detection per category
+    # jailbreak_threshold: 0.8 # Optional: Override global jailbreak threshold per category
     model_scores:
       - model: qwen3
         score: 0.7

config/examples/jailbreak_category_example.yaml

Lines changed: 32 additions & 17 deletions
@@ -1,13 +1,13 @@
 # Category-Level Jailbreak Detection Example
 # This example demonstrates how to configure jailbreak detection at the category level
-# Different categories can have different jailbreak detection settings based on their risk profiles
+# Different categories can have different jailbreak detection settings and thresholds based on their risk profiles

 # Global jailbreak detection configuration (can be overridden per category)
 prompt_guard:
   enabled: true # Global default - can be overridden per category
   use_modernbert: true
   model_id: "models/jailbreak_classifier_modernbert-base_model"
-  threshold: 0.7
+  threshold: 0.7 # Global default threshold - can be overridden per category
   use_cpu: true
   jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json"

@@ -22,30 +22,33 @@ classifier:

 # Categories with different jailbreak detection settings
 categories:
-  # High-security category: Enable jailbreak detection
+  # High-security category: Strict jailbreak detection with high threshold
   - name: business
     description: "Business queries, strategy, and professional advice"
     jailbreak_enabled: true # Explicitly enable (inherits from global by default)
+    jailbreak_threshold: 0.9 # Higher threshold for stricter detection
     system_prompt: "You are a professional business consultant. Provide practical, actionable business advice."
     model_scores:
       - model: qwen3
         score: 0.7
         use_reasoning: false

-  # Public-facing category: Enable jailbreak detection
+  # Public-facing category: Enable with standard threshold
   - name: customer_support
     description: "Customer support and general inquiries"
     jailbreak_enabled: true # Explicitly enable for customer-facing content
+    jailbreak_threshold: 0.8 # Slightly higher than global for public-facing
     system_prompt: "You are a friendly customer support agent. Help users with their questions."
     model_scores:
       - model: qwen3
         score: 0.8
         use_reasoning: false

-  # Internal tool category: Disable jailbreak detection (trusted environment)
+  # Internal tool category: Relaxed threshold (trusted environment)
   - name: code_generation
     description: "Internal code generation and development tools"
-    jailbreak_enabled: false # Disable for internal developer tools
+    jailbreak_enabled: true # Keep enabled but with relaxed threshold
+    jailbreak_threshold: 0.5 # Lower threshold to reduce false positives for code
     system_prompt: "You are a code generation assistant for internal developers."
     model_scores:
       - model: qwen3
@@ -62,10 +65,11 @@ categories:
         score: 0.6
         use_reasoning: false

-  # Default category: Uses global setting (inherits prompt_guard.enabled)
+  # Default category: Uses global setting (inherits prompt_guard.enabled and threshold)
   - name: general
     description: "General queries that don't fit into specific categories"
     # jailbreak_enabled not specified - will inherit from global prompt_guard.enabled
+    # jailbreak_threshold not specified - will inherit from global prompt_guard.threshold (0.7)
     system_prompt: "You are a helpful assistant."
     model_scores:
       - model: qwen3
@@ -98,14 +102,25 @@ vllm_endpoints:

 # Usage Notes:
 # =============
-# 1. Global Setting (prompt_guard.enabled): Sets the default for all categories
-# 2. Category Override (jailbreak_enabled): Override global setting per category
-# 3. Inheritance: If jailbreak_enabled is not specified, inherits from prompt_guard.enabled
-# 4. Use Cases:
-#    - Set jailbreak_enabled: true for high-security, public-facing categories
-#    - Set jailbreak_enabled: false for internal tools or trusted environments
-#    - Omit jailbreak_enabled to use the global default
-# 5. Security Best Practices:
+# 1. Global Settings:
+#    - prompt_guard.enabled: Sets the default enabled/disabled for all categories
+#    - prompt_guard.threshold: Sets the default detection threshold (0.0-1.0) for all categories
+# 2. Category Overrides:
+#    - jailbreak_enabled: Override global enabled/disabled setting per category
+#    - jailbreak_threshold: Override global threshold per category
+# 3. Inheritance:
+#    - If jailbreak_enabled is not specified, inherits from prompt_guard.enabled
+#    - If jailbreak_threshold is not specified, inherits from prompt_guard.threshold
+# 4. Threshold Tuning:
+#    - Higher threshold (0.8-0.95): Stricter detection, fewer false positives, may miss subtle attacks
+#    - Lower threshold (0.5-0.7): More sensitive detection, catches more attacks, higher false positive rate
+#    - Recommended: Start with 0.7 globally, adjust per category based on risk profile
+# 5. Use Cases:
+#    - High-security categories (business, customer_support): Use higher thresholds (0.8-0.9)
+#    - Internal tools with code/technical content: Use lower thresholds (0.5-0.6) to reduce false positives
+#    - General categories: Use global default threshold
+# 6. Security Best Practices:
 #    - Enable jailbreak detection by default (prompt_guard.enabled: true)
-#    - Only disable for specific categories where the risk is managed differently
-#    - Consider the consequences of disabling protection on a per-category basis
+#    - Only disable or use very low thresholds for specific categories where the risk is managed differently
+#    - Consider the consequences of threshold settings on a per-category basis
+#    - Monitor false positive and false negative rates to tune thresholds appropriately
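
The tuning notes can be checked against the comparison the classifier applies (`result.Confidence >= threshold` in classifier.go). The sketch below is a minimal illustration, not code from this commit: `countFlagged` and the sample scores are hypothetical. It counts how many inputs labeled "jailbreak" would be flagged at each of the three thresholds used in the example config. Under that comparison, a lower threshold flags more inputs, so it raises the false positive rate rather than reducing it.

```go
package main

import "fmt"

// countFlagged returns how many confidence scores meet or exceed the
// threshold, using the same `>=` comparison as the classifier.
// Hypothetical helper, not part of the commit.
func countFlagged(scores []float32, threshold float32) int {
	n := 0
	for _, s := range scores {
		if s >= threshold {
			n++
		}
	}
	return n
}

func main() {
	// Made-up confidences for four prompts the model labeled "jailbreak".
	scores := []float32{0.55, 0.72, 0.85, 0.93}

	// Thresholds taken from the example config: code_generation, global, business.
	for _, t := range []float32{0.5, 0.7, 0.9} {
		fmt.Printf("threshold %.2f flags %d of %d inputs\n", t, countFlagged(scores, t), len(scores))
	}
}
```

Running a sweep like this over logged confidences is one way to pick a per-category value before editing the config.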

src/semantic-router/pkg/config/config.go

Lines changed: 14 additions & 0 deletions
@@ -373,6 +373,9 @@ type Category struct {
 	// JailbreakEnabled controls whether jailbreak detection is enabled for this category
 	// If nil, inherits from global PromptGuard.Enabled setting
 	JailbreakEnabled *bool `yaml:"jailbreak_enabled,omitempty"`
+	// JailbreakThreshold defines the confidence threshold for jailbreak detection (0.0-1.0)
+	// If nil, uses the global threshold from PromptGuard.Threshold
+	JailbreakThreshold *float32 `yaml:"jailbreak_threshold,omitempty"`
 }

 // GetModelReasoningFamily returns the reasoning family configuration for a given model name
@@ -829,3 +832,14 @@ func (c *RouterConfig) IsJailbreakEnabledForCategory(categoryName string) bool {
 	// Fall back to global setting
 	return c.PromptGuard.Enabled
 }
+
+// GetJailbreakThresholdForCategory returns the effective jailbreak detection threshold for a category
+// Priority: category-specific > global prompt_guard threshold
+func (c *RouterConfig) GetJailbreakThresholdForCategory(categoryName string) float32 {
+	category := c.GetCategoryByName(categoryName)
+	if category != nil && category.JailbreakThreshold != nil {
+		return *category.JailbreakThreshold
+	}
+	// Fall back to global threshold
+	return c.PromptGuard.Threshold
+}
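
The `*float32` field is what lets an omitted `jailbreak_threshold` be told apart from an explicit `0.0`. Below is a self-contained sketch of the same override-then-fallback lookup. The trimmed types `category` and `routerConfig`, the method `thresholdFor`, and the helper `float32Ptr` are stand-ins written for illustration; they are not the project's real definitions.

```go
package main

import "fmt"

// category is a trimmed, hypothetical stand-in for config.Category.
type category struct {
	Name               string
	JailbreakThreshold *float32 // nil means "inherit the global threshold"
}

// routerConfig is a trimmed, hypothetical stand-in for config.RouterConfig.
type routerConfig struct {
	GlobalThreshold float32
	Categories      []category
}

// thresholdFor returns the category override when one is set,
// and the global threshold otherwise (including for unknown names).
func (c *routerConfig) thresholdFor(name string) float32 {
	for i := range c.Categories {
		cat := &c.Categories[i]
		if cat.Name == name && cat.JailbreakThreshold != nil {
			return *cat.JailbreakThreshold
		}
	}
	return c.GlobalThreshold
}

// float32Ptr returns a pointer to v, for building optional fields inline.
func float32Ptr(v float32) *float32 { return &v }

func main() {
	cfg := &routerConfig{
		GlobalThreshold: 0.7,
		Categories: []category{
			{Name: "business", JailbreakThreshold: float32Ptr(0.9)},
			{Name: "zero", JailbreakThreshold: float32Ptr(0.0)}, // explicit 0.0 is honored
			{Name: "general"}, // no override, inherits 0.7
		},
	}
	for _, name := range []string{"business", "zero", "general", "missing"} {
		fmt.Println(name, cfg.thresholdFor(name))
	}
}
```

With a plain `float32` field, the "zero" category above would be indistinguishable from "general", which is the case the pointer avoids.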

src/semantic-router/pkg/config/config_test.go

Lines changed: 84 additions & 0 deletions
@@ -2029,4 +2029,88 @@ categories:
 			})
 		})
 	})
+
+	Describe("GetJailbreakThresholdForCategory", func() {
+		Context("when global threshold is set", func() {
+			It("should return global threshold for category without explicit setting", func() {
+				category := config.Category{
+					Name: "test",
+					ModelScores: []config.ModelScore{{Model: "test", Score: 1.0}},
+				}
+
+				cfg := &config.RouterConfig{
+					PromptGuard: config.PromptGuardConfig{
+						Threshold: 0.7,
+					},
+					Categories: []config.Category{category},
+				}
+
+				Expect(cfg.GetJailbreakThresholdForCategory("test")).To(Equal(float32(0.7)))
+			})
+
+			It("should return category-specific threshold when set", func() {
+				category := config.Category{
+					Name: "test",
+					JailbreakThreshold: config.Float32Ptr(0.9),
+					ModelScores: []config.ModelScore{{Model: "test", Score: 1.0}},
+				}
+
+				cfg := &config.RouterConfig{
+					PromptGuard: config.PromptGuardConfig{
+						Threshold: 0.7,
+					},
+					Categories: []config.Category{category},
+				}
+
+				Expect(cfg.GetJailbreakThresholdForCategory("test")).To(Equal(float32(0.9)))
+			})
+
+			It("should allow lower threshold override", func() {
+				category := config.Category{
+					Name: "test",
+					JailbreakThreshold: config.Float32Ptr(0.5),
+					ModelScores: []config.ModelScore{{Model: "test", Score: 1.0}},
+				}
+
+				cfg := &config.RouterConfig{
+					PromptGuard: config.PromptGuardConfig{
+						Threshold: 0.7,
+					},
+					Categories: []config.Category{category},
+				}
+
+				Expect(cfg.GetJailbreakThresholdForCategory("test")).To(Equal(float32(0.5)))
+			})
+
+			It("should allow higher threshold override", func() {
+				category := config.Category{
+					Name: "test",
+					JailbreakThreshold: config.Float32Ptr(0.95),
+					ModelScores: []config.ModelScore{{Model: "test", Score: 1.0}},
+				}
+
+				cfg := &config.RouterConfig{
+					PromptGuard: config.PromptGuardConfig{
+						Threshold: 0.7,
+					},
+					Categories: []config.Category{category},
+				}
+
+				Expect(cfg.GetJailbreakThresholdForCategory("test")).To(Equal(float32(0.95)))
+			})
+		})
+
+		Context("when category does not exist", func() {
+			It("should fall back to global threshold", func() {
+				cfg := &config.RouterConfig{
+					PromptGuard: config.PromptGuardConfig{
+						Threshold: 0.8,
+					},
+					Categories: []config.Category{},
+				}
+
+				Expect(cfg.GetJailbreakThresholdForCategory("nonexistent")).To(Equal(float32(0.8)))
+			})
+		})
+	})
 })

src/semantic-router/pkg/extproc/request_handler.go

Lines changed: 7 additions & 1 deletion
@@ -438,14 +438,20 @@ func (r *OpenAIRouter) performSecurityChecks(ctx *RequestContext, userContent st
 		jailbreakEnabled = jailbreakEnabled && r.Config.IsJailbreakEnabledForCategory(categoryName)
 	}

+	// Get category-specific threshold
+	jailbreakThreshold := r.Config.PromptGuard.Threshold
+	if categoryName != "" && r.Config != nil {
+		jailbreakThreshold = r.Config.GetJailbreakThresholdForCategory(categoryName)
+	}
+
 	// Perform jailbreak detection on all message content
 	if jailbreakEnabled {
 		// Start jailbreak detection span
 		spanCtx, span := observability.StartSpan(ctx.TraceContext, observability.SpanJailbreakDetection)
 		defer span.End()

 		startTime := time.Now()
-		hasJailbreak, jailbreakDetections, err := r.Classifier.AnalyzeContentForJailbreak(allContent)
+		hasJailbreak, jailbreakDetections, err := r.Classifier.AnalyzeContentForJailbreakWithThreshold(allContent, jailbreakThreshold)
 		detectionTime := time.Since(startTime).Milliseconds()

 		observability.SetSpanAttributes(span,
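
In the hunk above, `r.Config.PromptGuard.Threshold` is read on the line before the `r.Config != nil` check, so that check cannot protect the first read. The sketch below shows one possible ordering that performs the nil guard first and then resolves both settings in one place. All names here (`settings`, `categoryOverride`, `resolve`, `fallbackThreshold`) are hypothetical, and letting the category value simply win over the global one is a simplification of the handler's logic, not a copy of it.

```go
package main

import "fmt"

// categoryOverride holds optional per-category values; nil means "inherit".
// Hypothetical type for illustration.
type categoryOverride struct {
	Enabled   *bool
	Threshold *float32
}

// settings is a hypothetical, trimmed stand-in for the router config.
type settings struct {
	GlobalEnabled   bool
	GlobalThreshold float32
	Categories      map[string]categoryOverride
}

// fallbackThreshold is an assumed default used only when no config is loaded.
const fallbackThreshold float32 = 0.7

// resolve reports whether detection should run and with which threshold.
// The nil check comes first, before any field of cfg is read.
func resolve(cfg *settings, categoryName string) (bool, float32) {
	if cfg == nil {
		return false, fallbackThreshold
	}
	enabled, threshold := cfg.GlobalEnabled, cfg.GlobalThreshold
	if cat, ok := cfg.Categories[categoryName]; ok {
		if cat.Enabled != nil {
			enabled = *cat.Enabled
		}
		if cat.Threshold != nil {
			threshold = *cat.Threshold
		}
	}
	return enabled, threshold
}

func main() {
	strict := float32(0.9)
	cfg := &settings{
		GlobalEnabled:   true,
		GlobalThreshold: 0.7,
		Categories:      map[string]categoryOverride{"business": {Threshold: &strict}},
	}
	fmt.Println(resolve(cfg, "business"))
	fmt.Println(resolve(cfg, "general"))
	fmt.Println(resolve(nil, "business"))
}
```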

src/semantic-router/pkg/utils/classification/classifier.go

Lines changed: 14 additions & 4 deletions
@@ -425,6 +425,11 @@ func (c *Classifier) initializeJailbreakClassifier() error {

 // CheckForJailbreak analyzes the given text for jailbreak attempts
 func (c *Classifier) CheckForJailbreak(text string) (bool, string, float32, error) {
+	return c.CheckForJailbreakWithThreshold(text, c.Config.PromptGuard.Threshold)
+}
+
+// CheckForJailbreakWithThreshold analyzes the given text for jailbreak attempts with a custom threshold
+func (c *Classifier) CheckForJailbreakWithThreshold(text string, threshold float32) (bool, string, float32, error) {
 	if !c.IsJailbreakEnabled() {
 		return false, "", 0.0, fmt.Errorf("jailbreak detection is not enabled or properly configured")
 	}
@@ -453,21 +458,26 @@ func (c *Classifier) CheckForJailbreak(text string) (bool, string, float32, erro
 	}

 	// Check if confidence meets threshold and indicates jailbreak
-	isJailbreak := result.Confidence >= c.Config.PromptGuard.Threshold && jailbreakType == "jailbreak"
+	isJailbreak := result.Confidence >= threshold && jailbreakType == "jailbreak"

 	if isJailbreak {
 		observability.Warnf("JAILBREAK DETECTED: '%s' (confidence: %.3f, threshold: %.3f)",
-			jailbreakType, result.Confidence, c.Config.PromptGuard.Threshold)
+			jailbreakType, result.Confidence, threshold)
 	} else {
 		observability.Infof("BENIGN: '%s' (confidence: %.3f, threshold: %.3f)",
-			jailbreakType, result.Confidence, c.Config.PromptGuard.Threshold)
+			jailbreakType, result.Confidence, threshold)
 	}

 	return isJailbreak, jailbreakType, result.Confidence, nil
 }

 // AnalyzeContentForJailbreak analyzes multiple content pieces for jailbreak attempts
 func (c *Classifier) AnalyzeContentForJailbreak(contentList []string) (bool, []JailbreakDetection, error) {
+	return c.AnalyzeContentForJailbreakWithThreshold(contentList, c.Config.PromptGuard.Threshold)
+}
+
+// AnalyzeContentForJailbreakWithThreshold analyzes multiple content pieces for jailbreak attempts with a custom threshold
+func (c *Classifier) AnalyzeContentForJailbreakWithThreshold(contentList []string, threshold float32) (bool, []JailbreakDetection, error) {
 	if !c.IsJailbreakEnabled() {
 		return false, nil, fmt.Errorf("jailbreak detection is not enabled or properly configured")
 	}
@@ -480,7 +490,7 @@ func (c *Classifier) AnalyzeContentForJailbreak(contentList []string) (bool, []J
 			continue
 		}

-		isJailbreak, jailbreakType, confidence, err := c.CheckForJailbreak(content)
+		isJailbreak, jailbreakType, confidence, err := c.CheckForJailbreakWithThreshold(content, threshold)
 		if err != nil {
 			observability.Errorf("Error analyzing content %d: %v", i, err)
 			continue
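
The classifier change keeps the old method signatures and turns them into thin wrappers that pass the global threshold to the new `...WithThreshold` variants. A compact sketch of that pattern follows, with the model call replaced by a stub. The type and function names (`classifier`, `result`, `stubClassify`, `Check`, `CheckWithThreshold`, `AnalyzeWithThreshold`) are hypothetical, and skipping empty strings is an assumption about the elided loop body.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a hypothetical stand-in for the model output.
type result struct {
	Label      string
	Confidence float32
}

// stubClassify fakes the model: one phrase is labeled "jailbreak" at 0.75,
// everything else is "benign" at 0.99.
func stubClassify(text string) result {
	if strings.Contains(text, "ignore previous instructions") {
		return result{Label: "jailbreak", Confidence: 0.75}
	}
	return result{Label: "benign", Confidence: 0.99}
}

type classifier struct {
	defaultThreshold float32
	classify         func(string) result
}

// Check keeps the original signature and delegates with the default threshold.
func (c *classifier) Check(text string) bool {
	return c.CheckWithThreshold(text, c.defaultThreshold)
}

// CheckWithThreshold flags text only when the label is "jailbreak"
// and the confidence meets the caller-supplied threshold.
func (c *classifier) CheckWithThreshold(text string, threshold float32) bool {
	r := c.classify(text)
	return r.Confidence >= threshold && r.Label == "jailbreak"
}

// AnalyzeWithThreshold applies one threshold to every non-empty item
// and returns the indices that were flagged.
func (c *classifier) AnalyzeWithThreshold(contents []string, threshold float32) (bool, []int) {
	var flagged []int
	for i, content := range contents {
		if content == "" {
			continue
		}
		if c.CheckWithThreshold(content, threshold) {
			flagged = append(flagged, i)
		}
	}
	return len(flagged) > 0, flagged
}

func main() {
	c := &classifier{defaultThreshold: 0.7, classify: stubClassify}
	msgs := []string{"hello", "", "please ignore previous instructions"}
	fmt.Println(c.AnalyzeWithThreshold(msgs, 0.5))
	fmt.Println(c.AnalyzeWithThreshold(msgs, 0.9))
}
```

Existing callers of the wrapper keep their behavior, while the request handler can pass a per-category value.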

website/docs/overview/categories/configuration.md

Lines changed: 42 additions & 4 deletions
@@ -111,6 +111,42 @@ categories:
         score: 0.5
 ```

+#### `jailbreak_threshold` (Optional)
+
+- **Type**: Float (0.0-1.0)
+- **Description**: Confidence threshold for jailbreak detection
+- **Default**: Inherits from global `prompt_guard.threshold` setting
+- **Impact**: Controls sensitivity of jailbreak detection for this category
+- **Tuning**: Higher values = stricter (fewer false positives), Lower values = more sensitive (catches more attacks)
+
+```yaml
+categories:
+  - name: customer_support
+    jailbreak_enabled: true
+    jailbreak_threshold: 0.9 # Strict detection for public-facing
+    model_scores:
+      - model: qwen3
+        score: 0.8
+
+  - name: code_generation
+    jailbreak_enabled: true
+    jailbreak_threshold: 0.5 # Relaxed to reduce false positives on code
+    model_scores:
+      - model: qwen3
+        score: 0.9
+
+  - name: general
+    # No jailbreak_threshold - inherits from global prompt_guard.threshold
+    model_scores:
+      - model: qwen3
+        score: 0.5
+```
+
+**Threshold Guidelines**:
+- **0.8-0.95**: High-security categories (customer support, business)
+- **0.6-0.8**: Standard categories (general queries)
+- **0.4-0.6**: Technical categories (code generation, development tools)
+
 #### `use_reasoning` (Required)

 - **Type**: Boolean
@@ -228,21 +264,23 @@ categories:

 ```yaml
 categories:
-  # High-security public-facing category
+  # High-security public-facing category with strict threshold
   - name: "customer_support"
     description: "Customer support and general inquiries"
     jailbreak_enabled: true # Strict jailbreak protection
+    jailbreak_threshold: 0.9 # High threshold for public-facing
     use_reasoning: false
     model_scores:
       - model: "phi4"
         score: 0.9
       - model: "mistral-small3.1"
         score: 0.7

-  # Trusted internal development category
+  # Technical category with relaxed threshold
   - name: "code_generation"
-    description: "Internal code generation for developers"
-    jailbreak_enabled: false # Allow broader input for trusted users
+    description: "Code generation for developers"
+    jailbreak_enabled: true # Keep enabled
+    jailbreak_threshold: 0.5 # Lower threshold to reduce false positives on code
     use_reasoning: true
     reasoning_effort: "medium"
     model_scores:
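
The docs give the type as Float (0.0-1.0), but no range check appears in the hunks shown on this page. If one were wanted at config load time, it could look roughly like the sketch below. `validateThreshold` and `float32Ptr` are hypothetical helpers written for this illustration; they are not functions from the repository.

```go
package main

import (
	"fmt"
	"math"
)

// validateThreshold accepts nil (inherit the global value) or a value
// within [0.0, 1.0], and rejects NaN and anything out of range.
// Hypothetical helper, not part of the commit.
func validateThreshold(t *float32) error {
	if t == nil {
		return nil
	}
	v := float64(*t)
	if math.IsNaN(v) || v < 0 || v > 1 {
		return fmt.Errorf("jailbreak_threshold must be within 0.0-1.0, got %v", *t)
	}
	return nil
}

// float32Ptr returns a pointer to v, for building optional fields inline.
func float32Ptr(v float32) *float32 { return &v }

func main() {
	for _, t := range []*float32{nil, float32Ptr(0.5), float32Ptr(1.5)} {
		if err := validateThreshold(t); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok")
	}
}
```

Under the `>=` comparison, a value above 1.0 would effectively disable detection for the category and a negative value would flag every input labeled "jailbreak", which is why rejecting both at load time may be preferable to silently accepting them.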