Problem Statement
The current semantic router uses fine-tuned models as classifiers to determine routing decisions (model selection, reasoning chain selection, jailbreak detection, PII detection). These classifiers currently run through the candle-binding framework (a Rust FFI layer built on the Candle ML framework) for inference.
While this approach works well for CPU-based models like BERT, which run fast enough for classification tasks, it lacks the optimizations needed for larger guardrail models or general LLM models that require GPU acceleration. Specifically, the current framework does not include:
- Paged Attention: Efficient memory management for variable-length sequences
- Continuous Batching: Dynamic batching to maximize GPU utilization
- Tensor Parallelism: Distribution across multiple GPUs for large models
- Optimized KV Cache Management: Efficient caching for repeated computations
- High Throughput Serving: Optimizations for serving multiple concurrent requests
This limitation prevents the router from efficiently serving larger classification models that require GPU resources, such as:
- Large guardrail models (e.g., 7B+ parameter models for advanced security classification)
- General LLM models used for in-context configuration of routing/classification
- Models that benefit from GPU acceleration but are too large to run efficiently on CPU
Current Architecture
Classification Flow
The router currently uses an interface-based architecture for classification:
// Current interfaces in classifier.go
type CategoryInitializer interface {
    Init(modelID string, useCPU bool, numClasses ...int) error
}

type CategoryInference interface {
    Classify(text string) (candle_binding.ClassResult, error)
    ClassifyWithProbabilities(text string) (candle_binding.ClassResultWithProbs, error)
}
Current Implementation
- Initialization: Models are initialized via `candle_binding.InitClassifier()`, `candle_binding.InitModernBertClassifier()`, etc.
- Inference: Classification calls `candle_binding.ClassifyText()`, `candle_binding.ClassifyModernBertText()`, etc.
- Backend: Rust FFI bindings to the Candle ML framework
- Limitations:
  - Single-request inference (no batching optimizations)
  - No GPU-specific optimizations (paged attention, continuous batching)
  - Limited to models that can run efficiently on CPU or with simple GPU inference
Use Cases Affected
- Large Guardrail Models: Advanced security/jailbreak detection models that require GPU acceleration
- In-Context Routing: LLM-based classifiers that use in-context learning for routing decisions
- High-Throughput Scenarios: Production deployments requiring high concurrent request handling
- Multi-GPU Deployments: Large models that need to be distributed across multiple GPUs
Proposed Solution
Integrate vLLM (a popular and efficient open-source inference engine) as an alternative inference backend for GPU-based classification models, while maintaining backward compatibility with the existing candle-binding implementation.
Architecture Design
┌─────────────────────────────────────────────────────────┐
│                  Classification Layer                   │
│  (CategoryInference, JailbreakInference, PIIInference)  │
└────────────────────────────┬────────────────────────────┘
                             │
             ┌───────────────┴───────────────┐
             │                               │
    ┌────────▼────────┐            ┌─────────▼──────────┐
    │  Candle Binding │            │    vLLM Backend    │
    │   (CPU/Simple)  │            │  (GPU Optimized)   │
    │                 │            │                    │
    │ - BERT          │            │ - Paged Attention  │
    │ - ModernBERT    │            │ - Continuous Batch │
    │ - Small Models  │            │ - Tensor Parallel  │
    └─────────────────┘            │ - KV Cache Opt     │
                                   └────────────────────┘
Key Components
- vLLM Backend Integration
  - Create a new `VLLMInference` implementation that implements the existing interfaces
  - Support for vLLM's OpenAI-compatible API or direct Python bindings
  - Configuration-based selection between candle-binding and vLLM backends
- Model Initialization
  - Extend the `CategoryInitializer`, `JailbreakInitializer`, and `PIIInitializer` interfaces
  - Add `VLLMCategoryInitializer`, `VLLMJailbreakInitializer`, and `VLLMPIIInitializer`
  - Support vLLM model loading and configuration
- Inference Interface
  - Implement `VLLMCategoryInference`, `VLLMJailbreakInference`, and `VLLMPIIInference` (see the sketch after this list)
  - Maintain compatibility with the existing `ClassResult` and `ClassResultWithProbs` structures
  - Support batch inference for improved throughput
- Configuration
  - Add configuration options to select the inference backend (candle-binding vs. vLLM)
  - Support per-classifier backend selection (e.g., use vLLM for category, candle-binding for PII)
  - Configuration for vLLM-specific settings (tensor parallelism, max batch size, etc.)
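To keep the change transparent to callers, the vLLM path only has to satisfy the interfaces quoted above. A minimal sketch of that shape, assuming it lives in the vLLM backend package proposed in Phase 1 below; the `ClassResult` fields and the `callVLLM` helper are illustrative placeholders, not existing code:

```go
// Sketch only: one way a vLLM-backed classifier could satisfy the existing
// CategoryInitializer / CategoryInference interfaces. The ClassResult field
// names and the callVLLM helper are illustrative assumptions, not the
// repository's actual definitions.
package vllmbackend

import "fmt"

// Assumed result shapes, mirroring what the candle-binding path returns today.
type ClassResult struct {
	Class      int
	Confidence float32
}

type ClassResultWithProbs struct {
	ClassResult
	Probabilities []float32
}

// VLLMCategoryInference talks to a running vLLM server instead of running
// the model in-process through candle-binding.
type VLLMCategoryInference struct {
	endpoint string // OpenAI-compatible base URL, e.g. "http://localhost:8000/v1"
	model    string
}

// Init mirrors CategoryInitializer.Init; useCPU is ignored because the model
// is served by vLLM, not loaded in-process.
func (v *VLLMCategoryInference) Init(modelID string, useCPU bool, numClasses ...int) error {
	if v.endpoint == "" {
		return fmt.Errorf("vllm backend: endpoint not configured")
	}
	v.model = modelID
	return nil
}

// Classify satisfies CategoryInference by delegating to the server call.
func (v *VLLMCategoryInference) Classify(text string) (ClassResult, error) {
	withProbs, err := v.ClassifyWithProbabilities(text)
	if err != nil {
		return ClassResult{}, err
	}
	return withProbs.ClassResult, nil
}

// ClassifyWithProbabilities issues the actual request; the HTTP call itself
// is sketched under Implementation Considerations below.
func (v *VLLMCategoryInference) ClassifyWithProbabilities(text string) (ClassResultWithProbs, error) {
	return callVLLM(v.endpoint, v.model, text)
}

// callVLLM is a placeholder for the OpenAI-compatible client sketched later.
func callVLLM(endpoint, model, text string) (ClassResultWithProbs, error) {
	return ClassResultWithProbs{}, fmt.Errorf("not implemented in this sketch")
}
```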
Technical Approach
Phase 1: vLLM Backend Foundation
- Create vLLM Integration Package
  - New package: `pkg/classification/vllm_backend/`
  - vLLM client wrapper (OpenAI-compatible API or gRPC)
  - Model loading and initialization logic
- Implement Base Interfaces
  - `VLLMInitializer` interface
  - `VLLMInference` interface
  - Error handling and fallback mechanisms
- Configuration Extensions
  - Extend `config.RouterConfig` with vLLM settings (see the sketch after this list)
  - Add backend selection flags (`UseVLLM`, `VLLMEndpoint`, etc.)
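One possible shape for those configuration extensions, matching the YAML example further down. All field names here are assumptions rather than existing fields in the project's config package:

```go
// Sketch only: illustrative configuration fields for backend selection.
// The names (UseVLLM, VLLMEndpoint, VLLMConfig) are proposals and would need
// to be reconciled with the existing config.RouterConfig definitions.
package config

// VLLMConfig carries vLLM-specific serving options.
type VLLMConfig struct {
	TensorParallelSize int `yaml:"tensor_parallel_size"`
	MaxModelLen        int `yaml:"max_model_len"`
	MaxBatchSize       int `yaml:"max_batch_size"`
}

// ClassifierModelConfig is extended per classifier so each classifier can
// pick its own backend (candle-binding or vLLM) independently.
type ClassifierModelConfig struct {
	ModelID       string     `yaml:"model_id"`
	UseModernBERT bool       `yaml:"use_modern_bert"`
	UseVLLM       bool       `yaml:"use_vllm"`      // selects the vLLM backend
	VLLMEndpoint  string     `yaml:"vllm_endpoint"` // e.g. "http://localhost:8000/v1"
	VLLM          VLLMConfig `yaml:"vllm_config"`
}
```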
Phase 2: Category Classification Integration
- Implement vLLM Category Classifier
  - `VLLMCategoryInitializer` implementation
  - `VLLMCategoryInference` implementation
  - Support for probability distributions (entropy-based reasoning)
- Integration with Existing Flow
  - Update the `createCategoryInitializer()` and `createCategoryInference()` functions (see the sketch after this list)
  - Maintain backward compatibility with candle-binding
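The integration could stay confined to the existing factory functions. A sketch building on the types sketched earlier; the function signature and the `newCandleCategoryInference` name are assumptions about the current code, not its actual form:

```go
// Sketch only: backend selection inside the existing factory function,
// building on the VLLMCategoryInference and ClassifierModelConfig sketches
// above. newCandleCategoryInference stands in for the current code path.
func createCategoryInference(cfg ClassifierModelConfig) (CategoryInference, error) {
	if cfg.UseVLLM {
		// Route classification through the vLLM server.
		return &VLLMCategoryInference{
			endpoint: cfg.VLLMEndpoint,
			model:    cfg.ModelID,
		}, nil
	}
	// Preserve backward compatibility: fall through to candle-binding.
	return newCandleCategoryInference(cfg)
}
```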
Phase 3: Jailbreak and PII Integration
- Implement vLLM Jailbreak Classifier
  - `VLLMJailbreakInitializer` implementation
  - `VLLMJailbreakInference` implementation
- Implement vLLM PII Classifier
  - `VLLMPIIInitializer` implementation
  - `VLLMPIIInference` implementation
  - Support for token-level classification (see the sketch after this list)
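For PII, token-level classification means the backend returns labeled spans rather than one class per input. A rough sketch of that output shape, with all type and field names as illustrative assumptions that would need to line up with what the candle-binding PII path already returns:

```go
// Sketch only: a token-level PII result shape and the method that would
// produce it. All names are illustrative.
package vllmbackend

// PIIEntity marks one detected span in the input text.
type PIIEntity struct {
	Label string  // e.g. "EMAIL", "PHONE_NUMBER"
	Start int     // byte offset where the span begins
	End   int     // byte offset where the span ends
	Score float32 // model confidence for this span
}

// VLLMPIIInference asks the served model to emit the entity list (for
// example as a JSON array) and maps it onto []PIIEntity.
type VLLMPIIInference struct {
	endpoint string
	model    string
}

// ClassifyTokens returns all detected PII spans; request handling would
// mirror the category sketch above.
func (v *VLLMPIIInference) ClassifyTokens(text string) ([]PIIEntity, error) {
	return nil, nil // placeholder in this sketch
}
```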
Phase 4: Performance Optimizations
- Batch Inference Support
  - Implement batch classification endpoints (see the sketch after this list)
  - Optimize for continuous batching in vLLM
- Caching and Optimization
  - KV cache management
  - Request queuing and prioritization
- Monitoring and Observability
  - Metrics for vLLM inference latency and throughput
  - Comparison metrics between backends
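Because vLLM's continuous batching operates across in-flight requests, a router-side batch endpoint can simply issue concurrent requests and let the server coalesce them. A sketch building on the category classifier above; the method name and error policy are assumptions, and it presumes the standard `sync` package is imported:

```go
// Sketch only: batch classification by fanning out concurrent requests and
// letting vLLM's continuous batching coalesce them on the server side.
func (v *VLLMCategoryInference) ClassifyBatch(texts []string) ([]ClassResult, error) {
	results := make([]ClassResult, len(texts))
	errs := make([]error, len(texts))

	var wg sync.WaitGroup
	for i, text := range texts {
		wg.Add(1)
		go func(i int, text string) {
			defer wg.Done()
			results[i], errs[i] = v.Classify(text)
		}(i, text)
	}
	wg.Wait()

	// Surface the first failure; partial results could also be returned.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}
```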
Benefits
- Performance Improvements
  - Throughput: 10-100x improvement for GPU-based models through continuous batching
  - Latency: Reduced P99 latency through optimized attention mechanisms
  - Scalability: Support for larger models through tensor parallelism
- Model Support
  - Enable use of larger, more accurate classification models
  - Support for general LLM models in classification tasks
  - Better handling of variable-length input sequences
- Production Readiness
  - Industry-standard inference engine (vLLM)
  - Better resource utilization (GPU memory, compute)
  - Improved reliability and observability
- Backward Compatibility
  - Existing candle-binding implementations remain functional
  - Gradual migration path
  - Per-classifier backend selection
Implementation Considerations
Dependencies
- vLLM: Python-based inference engine (may require Python bindings or an HTTP/gRPC client)
- Go integration options (see the sketch after this list):
  - HTTP client to the vLLM OpenAI-compatible API (recommended)
  - Python CGO bindings (more complex)
  - gRPC client (if vLLM supports gRPC)
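A minimal sketch of the recommended HTTP approach against vLLM's standard OpenAI-compatible /v1/chat/completions route. It returns the raw assistant message; turning that into a `ClassResult` is covered under Open Questions. Only standard-library calls are used, and the function and type names are illustrative:

```go
// Sketch only: minimal client for vLLM's OpenAI-compatible API.
// It returns the raw assistant message; mapping it to ClassResult is
// handled separately by each classifier.
package vllmbackend

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatRequest struct {
	Model       string        `json:"model"`
	Messages    []chatMessage `json:"messages"`
	Temperature float64       `json:"temperature"`
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// Complete sends one chat completion request to a vLLM server.
// endpoint is the OpenAI-compatible base URL, e.g. "http://localhost:8000/v1".
func Complete(endpoint, model, system, user string) (string, error) {
	body, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: system},
			{Role: "user", Content: user},
		},
		Temperature: 0,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(endpoint+"/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vllm: unexpected status %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("vllm: empty choices in response")
	}
	return out.Choices[0].Message.Content, nil
}
```

Each classifier (category, jailbreak, PII) could then reuse this one client with its own prompt template and response parser.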
Configuration Example
category_model:
  model_id: "meta-llama/Llama-3.1-8B-Instruct"
  use_vllm: true
  vllm_endpoint: "http://localhost:8000/v1"
  vllm_config:
    tensor_parallel_size: 1
    max_model_len: 8192
    max_batch_size: 256
  use_modern_bert: false # Ignored when use_vllm=true

jailbreak_model:
  model_id: "./models/jailbreak_classifier_bert"
  use_vllm: false # Continue using candle-binding
  use_modern_bert: true
Error Handling
- Graceful fallback to candle-binding if vLLM unavailable
- Health checks and retry logic
- Clear error messages for misconfiguration
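One way to realize the graceful fallback is a thin wrapper that implements the same interface and only consults candle-binding when the vLLM call fails; health checks and retry/backoff would sit alongside it. A sketch building on the earlier types, with the wrapper name and policy as illustrative assumptions:

```go
// Sketch only: graceful fallback from the vLLM backend to candle-binding.
// Both fields satisfy the same CategoryInference interface (see the sketches
// above); the wrapper name and policy are illustrative.
type fallbackCategoryInference struct {
	primary  CategoryInference // vLLM-backed
	fallback CategoryInference // candle-binding-backed
}

func (f *fallbackCategoryInference) Classify(text string) (ClassResult, error) {
	res, err := f.primary.Classify(text)
	if err == nil {
		return res, nil
	}
	// Record the failure (metrics/log) and serve the request from the
	// fallback backend instead of failing it outright.
	return f.fallback.Classify(text)
}

// ClassifyWithProbabilities would be wrapped the same way.
```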
Testing Strategy
- Unit Tests: Test vLLM backend implementations in isolation (see the sketch after this list)
- Integration Tests: Test end-to-end classification flow with vLLM
- Performance Tests: Benchmark throughput and latency improvements
- Compatibility Tests: Ensure backward compatibility with existing configurations
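As a concrete example of the unit-test item, the HTTP client sketched above can be exercised against a fake vLLM server with the standard `httptest` package; the canned response body and expectations here are illustrative:

```go
// Sketch only: unit test for the OpenAI-compatible client using a fake
// vLLM server. Function names follow the earlier sketches.
package vllmbackend

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestCompleteReturnsAssistantMessage(t *testing.T) {
	// Fake vLLM server returning a canned chat-completion response.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/chat/completions" {
			t.Errorf("unexpected path: %s", r.URL.Path)
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"choices":[{"message":{"role":"assistant","content":"math"}}]}`))
	}))
	defer srv.Close()

	reply, err := Complete(srv.URL+"/v1", "test-model", "classify the text", "some text")
	if err != nil {
		t.Fatalf("Complete returned error: %v", err)
	}
	if reply != "math" {
		t.Errorf("got reply %q, want %q", reply, "math")
	}
}
```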
Acceptance Criteria
- vLLM backend can be selected via configuration for category classification
- vLLM backend can be selected via configuration for jailbreak detection
- vLLM backend can be selected via configuration for PII detection
- Existing candle-binding implementations continue to work unchanged
- Configuration allows per-classifier backend selection
- vLLM integration supports batch inference for improved throughput
- Error handling includes graceful fallback mechanisms
- Documentation includes vLLM setup and configuration guide
- Performance benchmarks show improvement for GPU-based models
- Integration tests pass for all classification types with vLLM backend
Open Questions
- Deployment Model: Should vLLM be deployed as a separate service or embedded?
  - Recommendation: Separate service, for better resource isolation and scaling
- Model Format: What model formats does vLLM support? (HuggingFace, Safetensors, etc.)
  - Recommendation: Support HuggingFace format as primary, document others
- Prompt Format: How should classification prompts be formatted for LLM-based classifiers?
  - Recommendation: Define standard prompt templates for each classification type
- Response Parsing: How to parse LLM outputs into the `ClassResult` format?
  - Recommendation: Use structured output (JSON mode) or function calling (see the sketch after this list)
- Cost Considerations: GPU resources and vLLM deployment costs
  - Recommendation: Document resource requirements and cost estimates
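Addressing the prompt-format and response-parsing questions together, one workable convention is a fixed system prompt that pins the label set and demands a strict JSON object, which is then unmarshalled into the `ClassResult` shape sketched earlier. The template wording and helper names are assumptions, not a project standard:

```go
// Sketch only: a classification prompt template plus strict-JSON parsing of
// the model reply into the ClassResult shape sketched earlier. The template
// wording is illustrative, not a project standard.
package vllmbackend

import (
	"encoding/json"
	"fmt"
	"strings"
)

const categoryPromptTemplate = `You are a text classifier.
Allowed categories (index: name):
%s
Reply with exactly one JSON object: {"class": <index>, "confidence": <number between 0 and 1>}`

// BuildCategoryPrompt renders the system prompt for a given label set.
func BuildCategoryPrompt(labels []string) string {
	var b strings.Builder
	for i, label := range labels {
		fmt.Fprintf(&b, "%d: %s\n", i, label)
	}
	return fmt.Sprintf(categoryPromptTemplate, b.String())
}

// ParseClassResult maps the model's JSON reply onto the result structure
// used by the rest of the router.
func ParseClassResult(reply string) (ClassResult, error) {
	var res ClassResult
	if err := json.Unmarshal([]byte(strings.TrimSpace(reply)), &res); err != nil {
		return ClassResult{}, fmt.Errorf("vllm: could not parse classifier reply %q: %w", reply, err)
	}
	return res, nil
}
```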
Related Work
- Existing MCP classifier integration (`mcp_classifier.go`) provides a pattern for external classifier integration
- vLLM project: https://github.com/vllm-project/vllm
- vLLM OpenAI-compatible API documentation
Timeline Estimate
- Phase 1 (Foundation): 2-3 weeks
- Phase 2 (Category): 2-3 weeks
- Phase 3 (Jailbreak/PII): 2-3 weeks
- Phase 4 (Optimization): 2-3 weeks
Total: ~8-12 weeks for complete implementation
Priority
High - Enables support for larger, more accurate classification models and improves production readiness for high-throughput scenarios.