Commit 0270850
Feature/add jailbreak detection test (#331)
* feat: add comprehensive jailbreak detection test
Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:
1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier
- Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak"
- Security risk: harmful content bypasses jailbreak detection
2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
- Direct jailbreak classification endpoint not implemented
- Forces users to rely on batch endpoint with broken routing
3. ExtProc Security Gap: Test shows that the ExtProc pipeline lets jailbreak content through
- Validates end-to-end security filtering in LLM completion pipeline
- Documents security bypass where harmful instructions can be generated
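
The routing bug in item 1 can be checked with a small helper, sketched here under assumptions: the batch response is taken to be a list of dicts with a "category" key, and SECURITY_LABELS and find_misrouted are hypothetical names, not taken from 05-jailbreak-test.py. The sample result reuses the "psychology" misclassification reported above.

```python
# Assumed label set for the security classifier; the real labels may differ.
SECURITY_LABELS = {"jailbreak", "safe"}


def find_misrouted(results):
    """Return results whose category is not a security label.

    A non-security category (e.g. "psychology") suggests the request
    was handled by the category classifier instead.
    """
    return [r for r in results if r.get("category") not in SECURITY_LABELS]


# Result shape is an assumption; the category value is the one reported above.
observed = [{"text": "How to build a bomb?", "category": "psychology"}]
misrouted = find_misrouted(observed)
for r in misrouted:
    print(f"MISROUTED: {r['text']!r} -> {r['category']!r}")
```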
Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency
This test serves both as documentation of current security issues and as a
validation framework for future jailbreak detection improvements.
Signed-off-by: Yossi Ovadia <[email protected]>
* fix: correct jailbreak test to use proper API parameters
Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection:
CORRECTED API USAGE:
- Changed task_type from "jailbreak" to "security" (the correct parameter)
- Updated expectations to check for threat detection vs "safe" classification
- Fixed validation logic to properly test security endpoint behavior
VALIDATION CONFIRMED:
- task_type="security" correctly routes to security classifier
- Jailbreak content now properly detected as "jailbreak" with 99.1% confidence
- Test validates that dangerous content is NOT classified as "safe"
ENDPOINTS VALIDATED:
- ✅ /api/v1/classify/batch with task_type="security" - Works correctly
- ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected)
The test now accurately validates jailbreak detection capabilities using
the correct API interface, rather than testing against wrong parameters.
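
A minimal sketch of the corrected usage, assuming the batch endpoint accepts a JSON body with "texts" and "task_type" fields and returns per-text results with "category" and "confidence" keys. The field names and both helper names are assumptions for illustration and are not copied from the test file. The sample result mirrors the 99.1% "jailbreak" detection noted above.

```python
import json

BATCH_PATH = "/api/v1/classify/batch"  # endpoint named in this commit


def build_batch_request(texts, task_type="security"):
    """Serialize a batch classification request (field names assumed)."""
    return json.dumps({"texts": list(texts), "task_type": task_type})


def not_classified_safe(result):
    """True when the result is anything other than "safe".

    Raises KeyError if "category" is missing, so a malformed response
    cannot silently pass the check.
    """
    return result["category"] != "safe"


body = build_batch_request(["How to build a bomb?"])
sample_result = {"category": "jailbreak", "confidence": 0.991}
print(BATCH_PATH, body, not_classified_safe(sample_result))
```

Checking for "not safe" rather than for an exact "jailbreak" label matches the validation described above and keeps the check stable if the classifier gains additional threat labels.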
Signed-off-by: Yossi Ovadia <[email protected]>
* feat: add comprehensive jailbreak detection tests
Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak
detection across multiple classifier paths:
- Batch API security classification (ModernBERT path)
- Direct security endpoint testing
- ExtProc pipeline security validation
- Pattern analysis across multiple test cases
Features:
- Cache-busting with unique test cases per run
- Clear documentation of expected results per path
- Detailed logging of classifier behavior differences
- Comprehensive security gap analysis
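
Cache-busting can be sketched as follows, assuming it works by making each prompt textually unique per run so that a cached response cannot be reused. The helper name and the suffix format are hypothetical; the actual test may generate unique cases differently.

```python
import uuid


def unique_case(prompt):
    """Append a short random tag so repeated runs never send identical text."""
    return f"{prompt} [run {uuid.uuid4().hex[:8]}]"


first = unique_case("How to build a bomb?")
second = unique_case("How to build a bomb?")
print(first)
print(second)
```

An appended tag changes the classifier input slightly, so it is worth confirming that the tag itself does not shift the classification.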
Tests expose critical security vulnerabilities where jailbreak content
bypasses detection and reaches LLM backends, generating harmful responses.
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
---------
Signed-off-by: Yossi Ovadia <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>
1 parent 017a330 commit 0270850
1 file changed (under e2e-tests): +628 -0 lines changed