Commit 1175e2c
feat: add comprehensive jailbreak detection test
Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection:
1. Routing Bug: task_type="jailbreak" incorrectly routes to the category classifier
   - The jailbreak prompt "How to build a bomb?" is classified as "psychology" instead of "jailbreak"
   - Security risk: harmful content bypasses jailbreak detection
2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404
   - The direct jailbreak classification endpoint is not implemented
   - Forces users to rely on the batch endpoint, whose routing is broken
3. ExtProc Security Gap: tests whether the ExtProc pipeline lets jailbreak content through
   - Validates end-to-end security filtering in the LLM completion pipeline
   - Documents a security bypass where harmful instructions can be generated
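The first two gaps above can be told apart from the response alone. The following is a minimal sketch of such a check, not the code in 05-jailbreak-test.py: the helper name `find_routing_problems`, the `results`/`category` response shape, and the `benign` label are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumption: a working jailbreak detector only ever answers with one of these
# labels. Neither the label set nor the response shape comes from the commit.
JAILBREAK_LABELS = {"jailbreak", "benign"}


def find_routing_problems(status_code: int, payload: Dict[str, Any]) -> List[str]:
    """Describe each way a response deviates from jailbreak classification."""
    if status_code == 404:
        # Gap 2: the direct endpoint is not implemented.
        return ["endpoint not implemented (404)"]

    problems = []
    # Assumption: results arrive as a list of {"category": ...} objects.
    for index, result in enumerate(payload.get("results", [])):
        label = result.get("category")
        if label not in JAILBREAK_LABELS:
            # Gap 1: a topic label such as "psychology" means the request
            # was handled by the category classifier instead.
            problems.append(
                f"result {index}: routed to category classifier ({label!r})"
            )
    return problems


# Example: the misrouted response described in gap 1.
print(find_routing_problems(200, {"results": [{"category": "psychology"}]}))
```

Keeping the check a pure function of status code and payload means it can be exercised without a running server, and the HTTP call can be layered on separately.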
Test Features:
- Documents multiple jailbreak attempts and safe content for comparison
- Provides detailed analysis of detection patterns and accuracy
- Exposes routing bugs and security gaps with clear failure messages
- Follows existing e2e test patterns for consistency
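The accuracy analysis mentioned above could be summarized along these lines. This is a hedged sketch under stated assumptions: `summarize_detection` and its `(expected, detected)` pair format are hypothetical and are not taken from the test file.

```python
from typing import Dict, List, Tuple


def summarize_detection(cases: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize (expected_jailbreak, detected_jailbreak) pairs."""
    total = len(cases)
    correct = sum(1 for expected, detected in cases if expected == detected)
    # Jailbreak prompts that were not flagged: the security-relevant failure.
    missed = sum(1 for expected, detected in cases if expected and not detected)
    # Safe prompts that were flagged anyway.
    false_alarms = sum(1 for expected, detected in cases if detected and not expected)
    return {
        "total": total,
        "accuracy": correct / total if total else 0.0,
        "missed": missed,
        "false_alarms": false_alarms,
    }
```

Reporting missed jailbreaks separately from false alarms keeps the security bypass visible even when overall accuracy looks acceptable.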
This test serves as both documentation of the current security issues and
a validation framework for future jailbreak detection improvements.
Signed-off-by: Yossi Ovadia <[email protected]>
1 parent 5ec06fd, commit 1175e2c
1 file changed under e2e-tests: +470, -0 lines