
feature: Integrate vLLM Inference Engine for GPU-Based Classification Models #723

@wangchen615

Description

Problem Statement

The current semantic router uses fine-tuned models as classifiers to make routing decisions (model selection, reasoning chain selection, jailbreak detection, PII detection). These classifiers currently run through the candle-binding framework (Rust FFI bindings to the Candle ML framework) for inference.

While this approach works well for CPU-based models like BERT (which run fast enough for classification tasks), it lacks the optimizations needed for larger GPU-based guardrail models or general-purpose LLMs that require GPU acceleration. Specifically, the current framework does not include:

  • Paged Attention: Efficient memory management for variable-length sequences
  • Continuous Batching: Dynamic batching to maximize GPU utilization
  • Tensor Parallelism: Distribution across multiple GPUs for large models
  • Optimized KV Cache Management: Efficient caching for repeated computations
  • High Throughput Serving: Optimizations for serving multiple concurrent requests

This limitation prevents the router from efficiently serving larger classification models that require GPU resources, such as:

  • Large guardrail models (e.g., 7B+ parameter models for advanced security classification)
  • General-purpose LLMs used for in-context configuration of routing/classification
  • Models that benefit from GPU acceleration but are too large for efficient CPU inference

Current Architecture

Classification Flow

The router currently uses an interface-based architecture for classification:

// Current interfaces in classifier.go
type CategoryInitializer interface {
    Init(modelID string, useCPU bool, numClasses ...int) error
}

type CategoryInference interface {
    Classify(text string) (candle_binding.ClassResult, error)
    ClassifyWithProbabilities(text string) (candle_binding.ClassResultWithProbs, error)
}

Current Implementation

  • Initialization: Models are initialized via candle_binding.InitClassifier(), candle_binding.InitModernBertClassifier(), etc.
  • Inference: Classification calls candle_binding.ClassifyText(), candle_binding.ClassifyModernBertText(), etc.
  • Backend: Rust FFI bindings to Candle ML framework
  • Limitations:
    • Single-request inference (no batching optimizations)
    • No GPU-specific optimizations (paged attention, continuous batching)
    • Limited to models that can run efficiently on CPU or simple GPU inference

Use Cases Affected

  1. Large Guardrail Models: Advanced security/jailbreak detection models that require GPU acceleration
  2. In-Context Routing: LLM-based classifiers that use in-context learning for routing decisions
  3. High-Throughput Scenarios: Production deployments requiring high concurrent request handling
  4. Multi-GPU Deployments: Large models that need to be distributed across multiple GPUs

Proposed Solution

Integrate vLLM (a popular, efficient open-source inference engine) as an alternative inference backend for GPU-based classification models, while maintaining backward compatibility with the existing candle-binding implementation.

Architecture Design

┌──────────────────────────────────────────────────────────┐
│                   Classification Layer                   │
│  (CategoryInference, JailbreakInference, PIIInference)   │
└────────────────────────────┬─────────────────────────────┘
                             │
             ┌───────────────┴───────────────┐
             │                               │
    ┌────────▼────────┐            ┌─────────▼──────────┐
    │ Candle Binding  │            │    vLLM Backend    │
    │  (CPU/Simple)   │            │  (GPU Optimized)   │
    │                 │            │                    │
    │ - BERT          │            │ - Paged Attention  │
    │ - ModernBERT    │            │ - Continuous Batch │
    │ - Small Models  │            │ - Tensor Parallel  │
    └─────────────────┘            │ - KV Cache Opt     │
                                   └────────────────────┘

Key Components

  1. vLLM Backend Integration

    • Create a new VLLMInference implementation of the existing classification interfaces (a minimal sketch follows this list)
    • Support for vLLM's OpenAI-compatible API or direct Python bindings
    • Configuration-based selection between candle-binding and vLLM backends
  2. Model Initialization

    • Extend CategoryInitializer, JailbreakInitializer, PIIInitializer interfaces
    • Add VLLMCategoryInitializer, VLLMJailbreakInitializer, VLLMPIIInitializer
    • Support vLLM model loading and configuration
  3. Inference Interface

    • Implement VLLMCategoryInference, VLLMJailbreakInference, VLLMPIIInference
    • Maintain compatibility with existing ClassResult and ClassResultWithProbs structures
    • Support batch inference for improved throughput
  4. Configuration

    • Add configuration options to select inference backend (candle-binding vs vLLM)
    • Support per-classifier backend selection (e.g., use vLLM for category, candle for PII)
    • Configuration for vLLM-specific settings (tensor parallelism, max batch size, etc.)
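
To make the interface reuse in point 1 concrete, here is a minimal, hypothetical sketch of a vLLM-backed category classifier that talks to a vLLM server over its OpenAI-compatible chat completions API. Everything in it is illustrative: the local ClassResult struct stands in for candle_binding.ClassResult (whose exact fields may differ), and the prompt/response contract is an assumption, not a settled design.

package vllmbackend

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// ClassResult stands in for candle_binding.ClassResult in this sketch; take
// the real field names and types from the existing package when wiring it up.
type ClassResult struct {
    Class      int     `json:"class"`
    Confidence float32 `json:"confidence"`
}

// VLLMCategoryInference classifies text by prompting an LLM served by vLLM
// through its OpenAI-compatible API.
type VLLMCategoryInference struct {
    Endpoint string // e.g. "http://localhost:8000/v1"
    Model    string // e.g. "meta-llama/Llama-3.1-8B-Instruct"
    Client   *http.Client
}

func (v *VLLMCategoryInference) Classify(text string) (ClassResult, error) {
    // Ask the model to answer with a small JSON object we can parse directly.
    reqBody, err := json.Marshal(map[string]any{
        "model": v.Model,
        "messages": []map[string]string{
            {"role": "system", "content": `Classify the user text. Reply only with JSON: {"class": <int>, "confidence": <float>}`},
            {"role": "user", "content": text},
        },
        "temperature": 0,
    })
    if err != nil {
        return ClassResult{}, err
    }

    resp, err := v.Client.Post(v.Endpoint+"/chat/completions", "application/json", bytes.NewReader(reqBody))
    if err != nil {
        return ClassResult{}, fmt.Errorf("vLLM request failed: %w", err)
    }
    defer resp.Body.Close()

    // Decode only the fields we need from the OpenAI-style response.
    var out struct {
        Choices []struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
        } `json:"choices"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return ClassResult{}, fmt.Errorf("decoding vLLM response: %w", err)
    }
    if len(out.Choices) == 0 {
        return ClassResult{}, fmt.Errorf("vLLM returned no choices")
    }

    // Parse the model's JSON answer into the result struct.
    var result ClassResult
    if err := json.Unmarshal([]byte(out.Choices[0].Message.Content), &result); err != nil {
        return ClassResult{}, fmt.Errorf("could not parse classification output: %w", err)
    }
    return result, nil
}

A ClassifyWithProbabilities variant could follow the same pattern, with the prompt asking for a probability per category.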

Technical Approach

Phase 1: vLLM Backend Foundation

  1. Create vLLM Integration Package

    • New package: pkg/classification/vllm_backend/
    • vLLM client wrapper (OpenAI-compatible API or gRPC)
    • Model loading and initialization logic
  2. Implement Base Interfaces

    • VLLMInitializer interface
    • VLLMInference interface
    • Error handling and fallback mechanisms
  3. Configuration Extensions

    • Extend config.RouterConfig with vLLM settings
    • Add backend selection flags (UseVLLM, VLLMEndpoint, etc.)
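
As a rough illustration of the configuration extension, the structs below sketch how per-classifier vLLM settings might hang off the existing model config. The field and tag names (use_vllm, vllm_endpoint, vllm_config) simply mirror the flags listed above and the YAML example later in this issue; they are placeholders, not a committed schema.

// Hypothetical vLLM settings attached to a classifier's model configuration.
type VLLMSettings struct {
    TensorParallelSize int `yaml:"tensor_parallel_size"`
    MaxModelLen        int `yaml:"max_model_len"`
    MaxBatchSize       int `yaml:"max_batch_size"`
}

// ClassifierModelConfig shows where the backend-selection flags could live.
type ClassifierModelConfig struct {
    ModelID       string       `yaml:"model_id"`
    UseVLLM       bool         `yaml:"use_vllm"`        // select the vLLM backend for this classifier
    VLLMEndpoint  string       `yaml:"vllm_endpoint"`   // e.g. "http://localhost:8000/v1"
    VLLMConfig    VLLMSettings `yaml:"vllm_config"`
    UseModernBERT bool         `yaml:"use_modern_bert"` // existing candle-binding option; ignored when use_vllm is true
}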

Phase 2: Category Classification Integration

  1. Implement VLLM Category Classifier

    • VLLMCategoryInitializer implementation
    • VLLMCategoryInference implementation
    • Support for probability distributions (entropy-based reasoning)
  2. Integration with Existing Flow

    • Update createCategoryInitializer() and createCategoryInference() functions
    • Maintain backward compatibility with candle-binding
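
A possible shape for the updated factory, continuing the hypothetical types from the sketches above (ClassifierModelConfig, VLLMCategoryInference) and assuming imports of fmt, net/http, and time; newCandleCategoryInference stands in for whatever constructor the candle-binding path uses today.

// createCategoryInference selects the inference backend for category
// classification from configuration. The vLLM branch is new; the
// candle-binding branch is left untouched for backward compatibility.
func createCategoryInference(cfg ClassifierModelConfig) (CategoryInference, error) {
    if cfg.UseVLLM {
        if cfg.VLLMEndpoint == "" {
            return nil, fmt.Errorf("use_vllm is set but vllm_endpoint is empty")
        }
        return &VLLMCategoryInference{
            Endpoint: cfg.VLLMEndpoint,
            Model:    cfg.ModelID,
            Client:   &http.Client{Timeout: 30 * time.Second},
        }, nil
    }
    // Existing path: candle-binding (BERT/ModernBERT) inference.
    return newCandleCategoryInference(cfg)
}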

Phase 3: Jailbreak and PII Integration

  1. Implement VLLM Jailbreak Classifier

    • VLLMJailbreakInitializer implementation
    • VLLMJailbreakInference implementation
  2. Implement VLLM PII Classifier

    • VLLMPIIInitializer implementation
    • VLLMPIIInference implementation
    • Support for token-level classification
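
Token-level PII classification does not map cleanly onto a single ClassResult, so the vLLM path would probably need a span-level representation along these lines. The struct and the JSON shape the model is asked to emit are assumptions for illustration only.

// PIIEntity is a hypothetical span-level result for token/entity classification.
type PIIEntity struct {
    Type  string `json:"type"`  // e.g. "EMAIL", "PHONE_NUMBER"
    Start int    `json:"start"` // character offset of the span in the input text
    End   int    `json:"end"`
    Text  string `json:"text"`
}

// parsePIIEntities decodes the JSON array the model is prompted to return,
// e.g. [{"type":"EMAIL","start":8,"end":25,"text":"alice@example.com"}].
func parsePIIEntities(llmOutput string) ([]PIIEntity, error) {
    var entities []PIIEntity
    if err := json.Unmarshal([]byte(llmOutput), &entities); err != nil {
        return nil, fmt.Errorf("could not parse PII entities: %w", err)
    }
    return entities, nil
}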

Phase 4: Performance Optimizations

  1. Batch Inference Support

    • Implement batch classification endpoints
    • Optimize for continuous batching in vLLM (a client-side fan-out sketch follows this list)
  2. Caching and Optimization

    • KV cache management
    • Request queuing and prioritization
  3. Monitoring and Observability

    • Metrics for vLLM inference latency and throughput
    • Comparison metrics between backends
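
Because vLLM performs continuous batching on the server side, client-side batch support (point 1 above) can be as simple as fanning requests out concurrently and letting the engine coalesce them. The helper below is a sketch that builds on the VLLMCategoryInference type from earlier and assumes the sync package is imported; a dedicated batch endpoint could replace it later.

// ClassifyBatch classifies many texts concurrently. vLLM's continuous
// batching coalesces the in-flight requests on the server, so the client
// only needs to bound its own concurrency (maxInFlight).
func (v *VLLMCategoryInference) ClassifyBatch(texts []string, maxInFlight int) ([]ClassResult, error) {
    results := make([]ClassResult, len(texts))
    errs := make([]error, len(texts))
    sem := make(chan struct{}, maxInFlight)
    var wg sync.WaitGroup

    for i, t := range texts {
        wg.Add(1)
        sem <- struct{}{}
        go func(i int, t string) {
            defer wg.Done()
            defer func() { <-sem }()
            results[i], errs[i] = v.Classify(t)
        }(i, t)
    }
    wg.Wait()

    for _, err := range errs {
        if err != nil {
            return nil, err // fail the whole batch on the first error, for simplicity
        }
    }
    return results, nil
}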

Benefits

  1. Performance Improvements

    • Throughput: 10-100x improvement for GPU-based models through continuous batching
    • Latency: Reduced P99 latency through optimized attention mechanisms
    • Scalability: Support for larger models through tensor parallelism
  2. Model Support

    • Enable use of larger, more accurate classification models
    • Support for general-purpose LLMs in classification tasks
    • Better handling of variable-length input sequences
  3. Production Readiness

    • Industry-standard inference engine (vLLM)
    • Better resource utilization (GPU memory, compute)
    • Improved reliability and observability
  4. Backward Compatibility

    • Existing candle-binding implementations remain functional
    • Gradual migration path
    • Per-classifier backend selection

Implementation Considerations

Dependencies

  • vLLM: Python-based inference engine (may require Python bindings or HTTP/gRPC client)
  • Go Integration: Consider options:
    • HTTP client to vLLM OpenAI-compatible API (recommended)
    • Python CGO bindings (more complex)
    • gRPC client (if vLLM supports gRPC)

Configuration Example

category_model:
  model_id: "meta-llama/Llama-3.1-8B-Instruct"
  use_vllm: true
  vllm_endpoint: "http://localhost:8000/v1"
  vllm_config:
    tensor_parallel_size: 1
    max_model_len: 8192
    max_batch_size: 256
  use_modern_bert: false  # Ignored when use_vllm=true

jailbreak_model:
  model_id: "./models/jailbreak_classifier_bert"
  use_vllm: false  # Continue using candle-binding
  use_modern_bert: true

Error Handling

  • Graceful fallback to candle-binding if vLLM is unavailable (a wrapper sketch follows this list)
  • Health checks and retry logic
  • Clear error messages for misconfiguration
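
The graceful-fallback requirement could be met with a thin wrapper that holds both backends and degrades to candle-binding when the vLLM call fails. This is only a sketch (it assumes the standard log package and the CategoryInference/ClassResult shapes used in the earlier sketches); health checks and retry policy would still need to be designed.

// FallbackCategoryInference tries the vLLM backend first and falls back to
// the candle-binding backend when the vLLM call fails (e.g. service down).
type FallbackCategoryInference struct {
    Primary  CategoryInference // vLLM-backed
    Fallback CategoryInference // candle-binding
}

func (f *FallbackCategoryInference) Classify(text string) (ClassResult, error) {
    res, err := f.Primary.Classify(text)
    if err == nil {
        return res, nil
    }
    log.Printf("vLLM backend failed (%v); falling back to candle-binding", err)
    return f.Fallback.Classify(text)
}

A real wrapper would also cover ClassifyWithProbabilities and could short-circuit to the fallback based on periodic health checks rather than per-request failures.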

Testing Strategy

  1. Unit Tests: Test vLLM backend implementations in isolation
  2. Integration Tests: Test end-to-end classification flow with vLLM
  3. Performance Tests: Benchmark throughput and latency improvements
  4. Compatibility Tests: Ensure backward compatibility with existing configurations

Acceptance Criteria

  • vLLM backend can be selected via configuration for category classification
  • vLLM backend can be selected via configuration for jailbreak detection
  • vLLM backend can be selected via configuration for PII detection
  • Existing candle-binding implementations continue to work unchanged
  • Configuration allows per-classifier backend selection
  • vLLM integration supports batch inference for improved throughput
  • Error handling includes graceful fallback mechanisms
  • Documentation includes vLLM setup and configuration guide
  • Performance benchmarks show improvement for GPU-based models
  • Integration tests pass for all classification types with vLLM backend

Open Questions

  1. Deployment Model: Should vLLM be deployed as a separate service or embedded?

    • Recommendation: Separate service for better resource isolation and scaling
  2. Model Format: What model formats does vLLM support? (HuggingFace, Safetensors, etc.)

    • Recommendation: Support HuggingFace format as primary, document others
  3. Prompt Format: How should classification prompts be formatted for LLM-based classifiers?

    • Recommendation: Define standard prompt templates for each classification type
  4. Response Parsing: How to parse LLM outputs into ClassResult format?

    • Recommendation: Use structured output (JSON mode) or function calling
  5. Cost Considerations: GPU resources and vLLM deployment costs

    • Recommendation: Document resource requirements and cost estimates

Related Work

  • Existing MCP classifier integration (mcp_classifier.go) provides a pattern for external classifier integration
  • vLLM project: https://github.com/vllm-project/vllm
  • vLLM OpenAI-compatible API documentation

Timeline Estimate

  • Phase 1 (Foundation): 2-3 weeks
  • Phase 2 (Category): 2-3 weeks
  • Phase 3 (Jailbreak/PII): 2-3 weeks
  • Phase 4 (Optimization): 2-3 weeks

Total: ~8-12 weeks for complete implementation

Priority

High - Enables support for larger, more accurate classification models and improves production readiness for high-throughput scenarios.

