Skip to content

Commit 022a1eb

Browse files
yossiovadiarootfs
authored andcommitted
Feature/add jailbreak detection test (vllm-project#331)
* feat: add comprehensive jailbreak detection test Adds 05-jailbreak-test.py to expose critical security gaps in jailbreak detection: 1. Routing Bug: task_type="jailbreak" incorrectly routes to category classifier - Jailbreak content "How to build a bomb?" classified as "psychology" instead of "jailbreak" - Security risk: harmful content bypasses jailbreak detection 2. Missing Endpoint: POST /api/v1/classify/jailbreak returns 404 - Direct jailbreak classification endpoint not implemented - Forces users to rely on batch endpoint with broken routing 3. ExtProc Security Gap: Tests that ExtProc pipeline allows jailbreak content through - Validates end-to-end security filtering in LLM completion pipeline - Documents security bypass where harmful instructions can be generated Test Features: - Documents multiple jailbreak attempts and safe content for comparison - Provides detailed analysis of detection patterns and accuracy - Exposes routing bugs and security gaps with clear failure messages - Follows existing e2e test patterns for consistency This test serves as both documentation of current security issues and validation framework for future jailbreak detection improvements. Signed-off-by: Yossi Ovadia <[email protected]> * fix: correct jailbreak test to use proper API parameters Updates 05-jailbreak-test.py to use the correct API parameters for jailbreak detection: CORRECTED API USAGE: - Changed task_type from "jailbreak" to "security" (the correct parameter) - Updated expectations to check for threat detection vs "safe" classification - Fixed validation logic to properly test security endpoint behavior VALIDATION CONFIRMED: - task_type="security" correctly routes to security classifier - Jailbreak content now properly detected as "jailbreak" with 99.1% confidence - Test validates that dangerous content is NOT classified as "safe" ENDPOINTS VALIDATED: - ✅ /api/v1/classify/batch with task_type="security" - Works correctly - ❌ /api/v1/classify/jailbreak - Confirmed missing (404 as expected) The test now accurately validates jailbreak detection capabilities using the correct API interface, rather than testing against wrong parameters. Signed-off-by: Yossi Ovadia <[email protected]> * feat: add comprehensive jailbreak detection tests Adds 05-jailbreak-test.py with comprehensive test coverage for jailbreak detection across multiple classifier paths: - Batch API security classification (ModernBERT path) - Direct security endpoint testing - ExtProc pipeline security validation - Pattern analysis across multiple test cases Features: - Cache-busting with unique test cases per run - Clear documentation of expected results per path - Detailed logging of classifier behavior differences - Comprehensive security gap analysis Tests expose critical security vulnerabilities where jailbreak content bypasses detection and reaches LLM backends, generating harmful responses. Co-Authored-By: Claude <[email protected]> Signed-off-by: Yossi Ovadia <[email protected]> --------- Signed-off-by: Yossi Ovadia <[email protected]> Co-authored-by: Huamin Chen <[email protected]> Signed-off-by: liuhy <[email protected]>
1 parent bc8b6ea commit 022a1eb

File tree

1 file changed

+628
-0
lines changed

1 file changed

+628
-0
lines changed

0 commit comments

Comments
 (0)