Update documentation for category-level jailbreak detection

Copilot · Xunzhuo · Copilot · commit f75a8c172254 · 2025-10-22T07:28:16.000Z
- Add category-level jailbreak configuration to jailbreak-protection.md
- Update category configuration docs with jailbreak_enabled parameter
- Add security-focused configuration example
- Update global configuration docs with category override notes
- Update README to mention fine-grained security control

Co-authored-by: Xunzhuo <48784001+Xunzhuo@users.noreply.github.com>
diff --git a/README.md b/README.md
@@ -76,7 +76,7 @@ Detect PII in the prompt, avoiding sending PII to the LLM so as to protect the p
 
 #### Prompt guard
 
-Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving.
+Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving. Detection can be configured globally or at the category level for fine-grained security control.
 
 ### Similarity Caching ⚡️
 
diff --git a/website/docs/installation/configuration.md b/website/docs/installation/configuration.md
@@ -38,7 +38,7 @@ tools:
 
 # Jailbreak protection
 prompt_guard:
-  enabled: false
+  enabled: false  # Global default - can be overridden per category
   use_modernbert: true
   model_id: "models/jailbreak_classifier_modernbert-base_model"
   threshold: 0.7
@@ -84,6 +84,8 @@ categories:
   # Optional: Category-level cache settings
   # semantic_cache_enabled: true
   # semantic_cache_similarity_threshold: 0.9  # Higher threshold for math
+  # Optional: Category-level jailbreak settings
+  # jailbreak_enabled: true  # Override global jailbreak detection
 - name: computer science
   model_scores:
   - model: your-model
diff --git a/website/docs/overview/categories/configuration.md b/website/docs/overview/categories/configuration.md
@@ -83,6 +83,34 @@ curl -X PUT http://localhost:8080/config/system-prompts \
 
 ### Reasoning Configuration
 
+#### `jailbreak_enabled` (Optional)
+
+- **Type**: Boolean
+- **Description**: Whether to enable jailbreak detection for this category
+- **Default**: Inherits from global `prompt_guard.enabled` setting
+- **Impact**: Enables or disables jailbreak protection for this specific category
+
+```yaml
+categories:
+  - name: customer_support
+    jailbreak_enabled: true  # Explicitly enable for public-facing categories
+    model_scores:
+      - model: qwen3
+        score: 0.8
+
+  - name: code_generation
+    jailbreak_enabled: false  # Disable for internal tools
+    model_scores:
+      - model: qwen3
+        score: 0.9
+
+  - name: general
+    # No jailbreak_enabled - inherits from global prompt_guard.enabled
+    model_scores:
+      - model: qwen3
+        score: 0.5
+```
+
 #### `use_reasoning` (Required)
 
 - **Type**: Boolean
@@ -196,7 +224,46 @@ categories:
         score: 0.2
 ```
 
-### Example 3: Multi-Category Configuration
+### Example 3: Security-Focused Configuration (Jailbreak Protection)
+
+```yaml
+categories:
+  # High-security public-facing category
+  - name: "customer_support"
+    description: "Customer support and general inquiries"
+    jailbreak_enabled: true  # Strict jailbreak protection
+    use_reasoning: false
+    model_scores:
+      - model: "phi4"
+        score: 0.9
+      - model: "mistral-small3.1"
+        score: 0.7
+
+  # Trusted internal development category
+  - name: "code_generation"
+    description: "Internal code generation for developers"
+    jailbreak_enabled: false  # Allow broader input for trusted users
+    use_reasoning: true
+    reasoning_effort: "medium"
+    model_scores:
+      - model: "gemma3:27b"
+        score: 0.9
+      - model: "phi4"
+        score: 0.7
+
+  # General category using global default
+  - name: "general"
+    description: "General queries"
+    # jailbreak_enabled not specified - inherits from global prompt_guard.enabled
+    use_reasoning: false
+    model_scores:
+      - model: "phi4"
+        score: 0.6
+      - model: "mistral-small3.1"
+        score: 0.6
+```
+
+### Example 4: Multi-Category Configuration
 
 ```yaml
 categories:
diff --git a/website/docs/tutorials/content-safety/jailbreak-protection.md b/website/docs/tutorials/content-safety/jailbreak-protection.md
@@ -43,14 +43,59 @@ Enable jailbreak detection in your configuration:
 ```yaml
 # config/config.yaml
 prompt_guard:
-  enabled: true
+  enabled: true  # Global default - can be overridden per category
   model_id: "models/jailbreak_classifier_modernbert-base_model"
   threshold: 0.7                   # Detection sensitivity (0.0-1.0)
   use_cpu: true                    # Run on CPU
   use_modernbert: true             # Use ModernBERT architecture
   jailbreak_mapping_path: "config/jailbreak_type_mapping.json"  # Path to jailbreak type mapping
 ```
 
+### Category-Level Jailbreak Protection
+
+You can enable or disable jailbreak detection at the category level for fine-grained security control:
+
+```yaml
+# Global default setting
+prompt_guard:
+  enabled: true  # Default for all categories
+
+categories:
+  # High-security category - explicitly enable
+  - name: customer_support
+    jailbreak_enabled: true  # Strict protection for public-facing categories
+    model_scores:
+      - model: qwen3
+        score: 0.8
+
+  # Internal tool - disable for trusted environment
+  - name: code_generation
+    jailbreak_enabled: false  # Allow broader input for developers
+    model_scores:
+      - model: qwen3
+        score: 0.9
+
+  # General category - inherits global setting
+  - name: general
+    # No jailbreak_enabled specified - uses global prompt_guard.enabled
+    model_scores:
+      - model: qwen3
+        score: 0.5
+```
+
+**Category-Level Behavior**:
+
+- **When `jailbreak_enabled` is not specified**: Category inherits from global `prompt_guard.enabled`
+- **When `jailbreak_enabled: true`**: Jailbreak detection is explicitly enabled for this category
+- **When `jailbreak_enabled: false`**: Jailbreak detection is explicitly disabled for this category
+- **A category-specific setting always overrides the global setting** when explicitly configured
+
+**Use Cases**:
+
+- **Enable for public-facing categories**: Customer support, business advice
+- **Disable for internal tools**: Code generation for developers, testing environments
+- **Inherit for general categories**: Use global default for most categories
+
 ## How Jailbreak Protection Works
 
 The jailbreak protection system works as follows:
@@ -134,9 +179,38 @@ security_policy_violations_total 45
 ### 4. Integration with Routing
 
 - Apply stricter protection to sensitive models
-- Use different thresholds for different categories
+- Use category-level jailbreak settings for different domains
 - Combine with PII detection for comprehensive security
 
+**Example**: Configure different jailbreak policies per category:
+
+```yaml
+prompt_guard:
+  enabled: true  # Global default
+
+categories:
+  # Strict protection for customer-facing categories
+  - name: customer_support
+    jailbreak_enabled: true
+    model_scores:
+      - model: safe-model
+        score: 0.9
+
+  # Relaxed protection for internal development
+  - name: code_generation
+    jailbreak_enabled: false  # Allow broader input
+    model_scores:
+      - model: code-model
+        score: 0.9
+
+  # Use global default for general queries
+  - name: general
+    # Inherits from prompt_guard.enabled
+    model_scores:
+      - model: general-model
+        score: 0.7
+```
+
 ## Troubleshooting
 
 ### High False Positives