Enable LoRA auto-detection for Jailbreak Detection #725

@yossiovadia

Summary

Following the request from @Xunzhuo in PR #718, this issue tracks the migration of Jailbreak Detection to support LoRA auto-detection.

Background

Currently, PII detection has LoRA auto-detection (PR #709), and Intent Classification will have it (#724), but Jailbreak Detection does not. This creates an inconsistency where some classification features can leverage LoRA models while others cannot.

Note: This work depends on #724 (Intent Classification LoRA) being merged first, as it will establish the pattern to follow.

Current Behavior (BEFORE)

Problem: Jailbreak detection cannot automatically use LoRA models.

How it works:

  1. Configuration has a use_modernbert flag that determines which initializer to use
  2. System makes a hardcoded choice between two paths:
    • use_modernbert: false → Uses LinearJailbreakInitializer (Traditional BERT only)
    • use_modernbert: true → Uses ModernBertJailbreakInitializer (ModernBERT only)
  3. Neither path can detect or use LoRA models automatically
  4. Pointing model_id at a LoRA jailbreak model fails, because neither initializer can load LoRA weights

Current config:

prompt_guard:
  use_modernbert: true
  model_id: "models/jailbreak_classifier_modernbert-base_model"
  # Cannot use: models/lora_jailbreak_classifier_bert-base-uncased_model

Expected Behavior (AFTER)

Solution: Jailbreak detection should auto-detect LoRA models (just like PII and Intent).

How it should work:

  1. Single auto-detecting initializer that intelligently routes based on model type
  2. Detection happens automatically by checking:
    • LoRA weights in model.safetensors file
    • Presence of lora_config.json
  3. Smart fallback chain: LoRA → Traditional BERT → ModernBERT
  4. The use_modernbert config flag becomes optional/ignored (backward compatible)
  5. No extra configuration needed: point model_id at the model path and the system selects the matching backend

Example (can now use LoRA models):

prompt_guard:
  model_id: "models/lora_jailbreak_classifier_bert-base-uncased_model"
  use_modernbert: false  # Ignored - auto-detection finds LoRA and uses it

Implementation Notes

  • Depends on: issue #724 (Enable LoRA auto-detection for Intent/Category Classification) being merged first (provides the pattern)
  • Follow the same pattern as Intent Classification and PII detection
  • Update both Go layer (classifier.go) and Rust layer (init.rs, classify.rs)
  • LoRA jailbreak models already exist in models/ directory:
    • lora_jailbreak_classifier_bert-base-uncased_model
    • lora_jailbreak_classifier_modernbert-base_model
    • lora_jailbreak_classifier_roberta-base_model
  • Add auto-detection test similar to Intent Classification
  • Check if Rust FFI functions need to be added (like InitCandleBertJailbreakClassifier)

Implementation Approach

Go Layer Changes (classifier.go)

  • Replace LinearJailbreakInitializer and ModernBertJailbreakInitializer with JailbreakInitializerImpl
  • Replace LinearJailbreakInference and ModernBertJailbreakInference with JailbreakInferenceImpl
  • Remove useModernBERT parameter from factory functions
  • Update call sites (similar to Intent Classification changes)

Rust Layer Changes

  • Add LORA_JAILBREAK_CLASSIFIER static variable to init.rs
  • Update init_candle_bert_jailbreak_classifier (or create it) with intelligent routing
  • Update classify_candle_bert_jailbreak_text to try LoRA first, then fall back
  • May need to add a helper method to the existing LoRA jailbreak classifier
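
The "try LoRA first, then fall back" routing is sketched below in Go purely for illustration; the actual change would live in classify.rs. The package-level loraClassifier plays the role of the proposed LORA_JAILBREAK_CLASSIFIER static. classifyFunc, SetClassifiers, and ClassifyJailbreakText are hypothetical names, and the sketch assumes fallback happens only when no LoRA classifier was initialized, not when a LoRA call returns an error.

```go
package main

import (
	"errors"
	"sync"
)

// classifyFunc stands in for a loaded classifier (hypothetical signature):
// it returns a class index and a confidence score.
type classifyFunc func(text string) (int, float32, error)

var (
	mu               sync.RWMutex
	loraClassifier   classifyFunc // analogue of LORA_JAILBREAK_CLASSIFIER
	legacyClassifier classifyFunc // existing Traditional BERT / ModernBERT path
)

// SetClassifiers records whichever classifiers initialization managed to load.
// Either argument may be nil.
func SetClassifiers(lora, legacy classifyFunc) {
	mu.Lock()
	defer mu.Unlock()
	loraClassifier, legacyClassifier = lora, legacy
}

// ClassifyJailbreakText prefers the LoRA classifier when one is loaded and
// otherwise uses the legacy classifier.
func ClassifyJailbreakText(text string) (int, float32, error) {
	mu.RLock()
	lora, legacy := loraClassifier, legacyClassifier
	mu.RUnlock()

	if lora != nil {
		return lora(text)
	}
	if legacy != nil {
		return legacy(text)
	}
	return -1, 0, errors.New("jailbreak classifier not initialized")
}
```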

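Related Work

  • PR #709: LoRA auto-detection for PII detection
  • #724: LoRA auto-detection for Intent/Category Classification
  • PR #718: where @Xunzhuo requested this change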

Success Criteria

  • LoRA jailbreak models are automatically detected and used
  • Traditional BERT and ModernBERT fallback paths still work
  • No configuration changes required for users
  • All existing tests continue to pass
  • New tests demonstrate LoRA auto-detection working
  • Consistent pattern across all three classification types (PII, Intent, Jailbreak)
