Problem Statement
The current semantic router uses fine-tuned models as classifiers to determine routing decisions (model selection, reasoning chain selection, jailbreak detection, PII detection). These classifiers currently run through the candle-binding framework (a Rust FFI layer built on the Candle ML framework) for inference.
While this approach works well for CPU-based models like BERT, which run fast enough for classification tasks, it lacks the optimizations needed for larger guardrail models or general LLM models that require GPU acceleration. Specifically, the current framework does not include:
- Paged Attention: Efficient memory management for variable-length sequences
- Continuous Batching: Dynamic batching to maximize GPU utilization
- Tensor Parallelism: Distribution across multiple GPUs for large models
- Optimized KV Cache Management: Efficient caching for repeated computations
- High Throughput Serving: Optimizations for serving multiple concurrent requests
This limitation prevents the router from efficiently serving larger classification models that require GPU resources, such as:
- Large guardrail models (e.g., 7B+ parameter models for advanced security classification)
- General LLM models used for in-context configuration of routing/classification
- Models that benefit from GPU acceleration but are too large to run efficiently on CPU
Current Architecture
Classification Flow
The router currently uses an interface-based architecture for classification:
// Current interfaces in classifier.go
type CategoryInitializer interface {
    Init(modelID string, useCPU bool, numClasses ...int) error
}

type CategoryInference interface {
    Classify(text string) (candle_binding.ClassResult, error)
    ClassifyWithProbabilities(text string) (candle_binding.ClassResultWithProbs, error)
}
Current Implementation
- Initialization: Models are initialized via `candle_binding.InitClassifier()`, `candle_binding.InitModernBertClassifier()`, etc.
- Inference: Classification calls `candle_binding.ClassifyText()`, `candle_binding.ClassifyModernBertText()`, etc.
- Backend: Rust FFI bindings to the Candle ML framework
- Limitations:
  - Single-request inference (no batching optimizations)
  - No GPU-specific optimizations (paged attention, continuous batching)
  - Limited to models that can run efficiently on CPU or with simple GPU inference
Use Cases Affected
- Large Guardrail Models: Advanced security/jailbreak detection models that require GPU acceleration
- In-Context Routing: LLM-based classifiers that use in-context learning for routing decisions
- High-Throughput Scenarios: Production deployments requiring high concurrent request handling
- Multi-GPU Deployments: Large models that need to be distributed across multiple GPUs
Proposed Solution
Integrate vLLM (a popular and efficient open-source inference engine) as an alternative inference backend for GPU-based classification models, while maintaining backward compatibility with the existing candle-binding implementation.
Architecture Design
┌─────────────────────────────────────────────────────────┐
│                  Classification Layer                   │
│  (CategoryInference, JailbreakInference, PIIInference)  │
└────────────────────────────┬────────────────────────────┘
                             │
             ┌───────────────┴───────────────┐
             │                               │
    ┌────────▼────────┐            ┌─────────▼──────────┐
    │  Candle Binding │            │    vLLM Backend    │
    │   (CPU/Simple)  │            │  (GPU Optimized)   │
    │                 │            │                    │
    │ - BERT          │            │ - Paged Attention  │
    │ - ModernBERT    │            │ - Continuous Batch │
    │ - Small Models  │            │ - Tensor Parallel  │
    └─────────────────┘            │ - KV Cache Opt     │
                                   └────────────────────┘
Key Components
- vLLM Backend Integration
  - Create a new `VLLMInference` implementation that implements the existing interfaces
  - Support for vLLM's OpenAI-compatible API or direct Python bindings
  - Configuration-based selection between candle-binding and vLLM backends
- Model Initialization
  - Extend the `CategoryInitializer`, `JailbreakInitializer`, and `PIIInitializer` interfaces
  - Add `VLLMCategoryInitializer`, `VLLMJailbreakInitializer`, and `VLLMPIIInitializer`
  - Support vLLM model loading and configuration
- Inference Interface
  - Implement `VLLMCategoryInference`, `VLLMJailbreakInference`, and `VLLMPIIInference` (see the sketch after this list)
  - Maintain compatibility with the existing `ClassResult` and `ClassResultWithProbs` structures
  - Support batch inference for improved throughput
- Configuration
  - Add configuration options to select the inference backend (candle-binding vs. vLLM)
  - Support per-classifier backend selection (e.g., use vLLM for category, candle-binding for PII)
  - Configuration for vLLM-specific settings (tensor parallelism, max batch size, etc.)
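To keep the change transparent to callers, the vLLM path only has to satisfy the interfaces quoted above. A minimal sketch of that shape, assuming it lives in the vLLM backend package proposed in Phase 1 below; the `ClassResult` fields and the `callVLLM` helper are illustrative placeholders, not existing code:

```go
// Sketch only: one way a vLLM-backed classifier could satisfy the existing
// CategoryInitializer / CategoryInference interfaces. The ClassResult field
// names and the callVLLM helper are illustrative assumptions, not the
// repository's actual definitions.
package vllmbackend

import "fmt"

// Assumed result shapes, mirroring what the candle-binding path returns today.
type ClassResult struct {
	Class      int
	Confidence float32
}

type ClassResultWithProbs struct {
	ClassResult
	Probabilities []float32
}

// VLLMCategoryInference talks to a running vLLM server instead of running
// the model in-process through candle-binding.
type VLLMCategoryInference struct {
	endpoint string // OpenAI-compatible base URL, e.g. "http://localhost:8000/v1"
	model    string
}

// Init mirrors CategoryInitializer.Init; useCPU is ignored because the model
// is served by vLLM, not loaded in-process.
func (v *VLLMCategoryInference) Init(modelID string, useCPU bool, numClasses ...int) error {
	if v.endpoint == "" {
		return fmt.Errorf("vllm backend: endpoint not configured")
	}
	v.model = modelID
	return nil
}

// Classify satisfies CategoryInference by delegating to the server call.
func (v *VLLMCategoryInference) Classify(text string) (ClassResult, error) {
	withProbs, err := v.ClassifyWithProbabilities(text)
	if err != nil {
		return ClassResult{}, err
	}
	return withProbs.ClassResult, nil
}

// ClassifyWithProbabilities issues the actual request; the HTTP call itself
// is sketched under Implementation Considerations below.
func (v *VLLMCategoryInference) ClassifyWithProbabilities(text string) (ClassResultWithProbs, error) {
	return callVLLM(v.endpoint, v.model, text)
}

// callVLLM is a placeholder for the OpenAI-compatible client sketched later.
func callVLLM(endpoint, model, text string) (ClassResultWithProbs, error) {
	return ClassResultWithProbs{}, fmt.Errorf("not implemented in this sketch")
}
```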
Technical Approach
Phase 1: vLLM Backend Foundation
- Create vLLM Integration Package
  - New package: `pkg/classification/vllm_backend/`
  - vLLM client wrapper (OpenAI-compatible API or gRPC)
  - Model loading and initialization logic
- Implement Base Interfaces
  - `VLLMInitializer` interface
  - `VLLMInference` interface
  - Error handling and fallback mechanisms
- Configuration Extensions
  - Extend `config.RouterConfig` with vLLM settings (see the sketch after this list)
  - Add backend selection flags (`UseVLLM`, `VLLMEndpoint`, etc.)
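One possible shape for those configuration extensions, matching the YAML example further down. All field names here are assumptions rather than existing fields in the project's config package:

```go
// Sketch only: illustrative configuration fields for backend selection.
// The names (UseVLLM, VLLMEndpoint, VLLMConfig) are proposals and would need
// to be reconciled with the existing config.RouterConfig definitions.
package config

// VLLMConfig carries vLLM-specific serving options.
type VLLMConfig struct {
	TensorParallelSize int `yaml:"tensor_parallel_size"`
	MaxModelLen        int `yaml:"max_model_len"`
	MaxBatchSize       int `yaml:"max_batch_size"`
}

// ClassifierModelConfig is extended per classifier so each classifier can
// pick its own backend (candle-binding or vLLM) independently.
type ClassifierModelConfig struct {
	ModelID       string     `yaml:"model_id"`
	UseModernBERT bool       `yaml:"use_modern_bert"`
	UseVLLM       bool       `yaml:"use_vllm"`      // selects the vLLM backend
	VLLMEndpoint  string     `yaml:"vllm_endpoint"` // e.g. "http://localhost:8000/v1"
	VLLM          VLLMConfig `yaml:"vllm_config"`
}
```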
Phase 2: Category Classification Integration
- Implement vLLM Category Classifier
  - `VLLMCategoryInitializer` implementation
  - `VLLMCategoryInference` implementation
  - Support for probability distributions (entropy-based reasoning)
- Integration with Existing Flow
  - Update the `createCategoryInitializer()` and `createCategoryInference()` functions (see the sketch after this list)
  - Maintain backward compatibility with candle-binding
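The integration could stay confined to the existing factory functions. A sketch building on the types sketched earlier; the function signature and the `newCandleCategoryInference` name are assumptions about the current code, not its actual form:

```go
// Sketch only: backend selection inside the existing factory function,
// building on the VLLMCategoryInference and ClassifierModelConfig sketches
// above. newCandleCategoryInference stands in for the current code path.
func createCategoryInference(cfg ClassifierModelConfig) (CategoryInference, error) {
	if cfg.UseVLLM {
		// Route classification through the vLLM server.
		return &VLLMCategoryInference{
			endpoint: cfg.VLLMEndpoint,
			model:    cfg.ModelID,
		}, nil
	}
	// Preserve backward compatibility: fall through to candle-binding.
	return newCandleCategoryInference(cfg)
}
```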
Phase 3: Jailbreak and PII Integration
- Implement vLLM Jailbreak Classifier
  - `VLLMJailbreakInitializer` implementation
  - `VLLMJailbreakInference` implementation
- Implement vLLM PII Classifier
  - `VLLMPIIInitializer` implementation
  - `VLLMPIIInference` implementation
  - Support for token-level classification (see the sketch after this list)
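For PII, token-level classification means the backend returns labeled spans rather than one class per input. A rough sketch of that output shape, with all type and field names as illustrative assumptions that would need to line up with what the candle-binding PII path already returns:

```go
// Sketch only: a token-level PII result shape and the method that would
// produce it. All names are illustrative.
package vllmbackend

// PIIEntity marks one detected span in the input text.
type PIIEntity struct {
	Label string  // e.g. "EMAIL", "PHONE_NUMBER"
	Start int     // byte offset where the span begins
	End   int     // byte offset where the span ends
	Score float32 // model confidence for this span
}

// VLLMPIIInference asks the served model to emit the entity list (for
// example as a JSON array) and maps it onto []PIIEntity.
type VLLMPIIInference struct {
	endpoint string
	model    string
}

// ClassifyTokens returns all detected PII spans; request handling would
// mirror the category sketch above.
func (v *VLLMPIIInference) ClassifyTokens(text string) ([]PIIEntity, error) {
	return nil, nil // placeholder in this sketch
}
```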
Phase 4: Performance Optimizations
- Batch Inference Support
  - Implement batch classification endpoints (see the sketch after this list)
  - Optimize for continuous batching in vLLM
- Caching and Optimization
  - KV cache management
  - Request queuing and prioritization
- Monitoring and Observability
  - Metrics for vLLM inference latency and throughput
  - Comparison metrics between backends
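Because vLLM's continuous batching operates across in-flight requests, a router-side batch endpoint can simply issue concurrent requests and let the server coalesce them. A sketch building on the category classifier above; the method name and error policy are assumptions, and it presumes the standard `sync` package is imported:

```go
// Sketch only: batch classification by fanning out concurrent requests and
// letting vLLM's continuous batching coalesce them on the server side.
func (v *VLLMCategoryInference) ClassifyBatch(texts []string) ([]ClassResult, error) {
	results := make([]ClassResult, len(texts))
	errs := make([]error, len(texts))

	var wg sync.WaitGroup
	for i, text := range texts {
		wg.Add(1)
		go func(i int, text string) {
			defer wg.Done()
			results[i], errs[i] = v.Classify(text)
		}(i, text)
	}
	wg.Wait()

	// Surface the first failure; partial results could also be returned.
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}
```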
Benefits
- Performance Improvements
  - Throughput: 10-100x improvement for GPU-based models through continuous batching
  - Latency: Reduced P99 latency through optimized attention mechanisms
  - Scalability: Support for larger models through tensor parallelism
- Model Support
  - Enable use of larger, more accurate classification models
  - Support for general LLM models in classification tasks
  - Better handling of variable-length input sequences
- Production Readiness
  - Industry-standard inference engine (vLLM)
  - Better resource utilization (GPU memory, compute)
  - Improved reliability and observability
- Backward Compatibility
  - Existing candle-binding implementations remain functional
  - Gradual migration path
  - Per-classifier backend selection
Implementation Considerations
Dependencies
- vLLM: Python-based inference engine (may require Python bindings or an HTTP/gRPC client)
- Go integration options (see the sketch after this list):
  - HTTP client to the vLLM OpenAI-compatible API (recommended)
  - Python CGO bindings (more complex)
  - gRPC client (if vLLM supports gRPC)
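A minimal sketch of the recommended HTTP approach against vLLM's standard OpenAI-compatible /v1/chat/completions route. It returns the raw assistant message; turning that into a `ClassResult` is covered under Open Questions. Only standard-library calls are used, and the function and type names are illustrative:

```go
// Sketch only: minimal client for vLLM's OpenAI-compatible API.
// It returns the raw assistant message; mapping it to ClassResult is
// handled separately by each classifier.
package vllmbackend

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatRequest struct {
	Model       string        `json:"model"`
	Messages    []chatMessage `json:"messages"`
	Temperature float64       `json:"temperature"`
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// Complete sends one chat completion request to a vLLM server.
// endpoint is the OpenAI-compatible base URL, e.g. "http://localhost:8000/v1".
func Complete(endpoint, model, system, user string) (string, error) {
	body, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: system},
			{Role: "user", Content: user},
		},
		Temperature: 0,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(endpoint+"/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vllm: unexpected status %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("vllm: empty choices in response")
	}
	return out.Choices[0].Message.Content, nil
}
```

Each classifier (category, jailbreak, PII) could then reuse this one client with its own prompt template and response parser.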
Configuration Example
category_model:
  model_id: "meta-llama/Llama-3.1-8B-Instruct"
  use_vllm: true
  vllm_endpoint: "http://localhost:8000/v1"
  vllm_config:
    tensor_parallel_size: 1
    max_model_len: 8192
    max_batch_size: 256
  use_modern_bert: false # Ignored when use_vllm=true

jailbreak_model:
  model_id: "./models/jailbreak_classifier_bert"
  use_vllm: false # Continue using candle-binding
  use_modern_bert: true
Error Handling
- Graceful fallback to candle-binding if vLLM unavailable
- Health checks and retry logic
- Clear error messages for misconfiguration
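One way to realize the graceful fallback is a thin wrapper that implements the same interface and only consults candle-binding when the vLLM call fails; health checks and retry/backoff would sit alongside it. A sketch building on the earlier types, with the wrapper name and policy as illustrative assumptions:

```go
// Sketch only: graceful fallback from the vLLM backend to candle-binding.
// Both fields satisfy the same CategoryInference interface (see the sketches
// above); the wrapper name and policy are illustrative.
type fallbackCategoryInference struct {
	primary  CategoryInference // vLLM-backed
	fallback CategoryInference // candle-binding-backed
}

func (f *fallbackCategoryInference) Classify(text string) (ClassResult, error) {
	res, err := f.primary.Classify(text)
	if err == nil {
		return res, nil
	}
	// Record the failure (metrics/log) and serve the request from the
	// fallback backend instead of failing it outright.
	return f.fallback.Classify(text)
}

// ClassifyWithProbabilities would be wrapped the same way.
```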
Testing Strategy
- Unit Tests: Test vLLM backend implementations in isolation (see the sketch after this list)
- Integration Tests: Test end-to-end classification flow with vLLM
- Performance Tests: Benchmark throughput and latency improvements
- Compatibility Tests: Ensure backward compatibility with existing configurations
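As a concrete example of the unit-test item, the HTTP client sketched above can be exercised against a fake vLLM server with the standard `httptest` package; the canned response body and expectations here are illustrative:

```go
// Sketch only: unit test for the OpenAI-compatible client using a fake
// vLLM server. Function names follow the earlier sketches.
package vllmbackend

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestCompleteReturnsAssistantMessage(t *testing.T) {
	// Fake vLLM server returning a canned chat-completion response.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/chat/completions" {
			t.Errorf("unexpected path: %s", r.URL.Path)
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"choices":[{"message":{"role":"assistant","content":"math"}}]}`))
	}))
	defer srv.Close()

	reply, err := Complete(srv.URL+"/v1", "test-model", "classify the text", "some text")
	if err != nil {
		t.Fatalf("Complete returned error: %v", err)
	}
	if reply != "math" {
		t.Errorf("got reply %q, want %q", reply, "math")
	}
}
```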
Acceptance Criteria
- vLLM backend can be selected via configuration for category classification
- vLLM backend can be selected via configuration for jailbreak detection
- vLLM backend can be selected via configuration for PII detection
- Existing candle-binding implementations continue to work unchanged
- Configuration allows per-classifier backend selection
- vLLM integration supports batch inference for improved throughput
- Error handling includes graceful fallback mechanisms
- Documentation includes vLLM setup and configuration guide
- Performance benchmarks show improvement for GPU-based models
- Integration tests pass for all classification types with vLLM backend
Open Questions
- Deployment Model: Should vLLM be deployed as a separate service or embedded?
  - Recommendation: Separate service, for better resource isolation and scaling
- Model Format: What model formats does vLLM support? (HuggingFace, Safetensors, etc.)
  - Recommendation: Support HuggingFace format as primary, document others
- Prompt Format: How should classification prompts be formatted for LLM-based classifiers?
  - Recommendation: Define standard prompt templates for each classification type
- Response Parsing: How to parse LLM outputs into the `ClassResult` format?
  - Recommendation: Use structured output (JSON mode) or function calling (see the sketch after this list)
- Cost Considerations: GPU resources and vLLM deployment costs
  - Recommendation: Document resource requirements and cost estimates
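Addressing the prompt-format and response-parsing questions together, one workable convention is a fixed system prompt that pins the label set and demands a strict JSON object, which is then unmarshalled into the `ClassResult` shape sketched earlier. The template wording and helper names are assumptions, not a project standard:

```go
// Sketch only: a classification prompt template plus strict-JSON parsing of
// the model reply into the ClassResult shape sketched earlier. The template
// wording is illustrative, not a project standard.
package vllmbackend

import (
	"encoding/json"
	"fmt"
	"strings"
)

const categoryPromptTemplate = `You are a text classifier.
Allowed categories (index: name):
%s
Reply with exactly one JSON object: {"class": <index>, "confidence": <number between 0 and 1>}`

// BuildCategoryPrompt renders the system prompt for a given label set.
func BuildCategoryPrompt(labels []string) string {
	var b strings.Builder
	for i, label := range labels {
		fmt.Fprintf(&b, "%d: %s\n", i, label)
	}
	return fmt.Sprintf(categoryPromptTemplate, b.String())
}

// ParseClassResult maps the model's JSON reply onto the result structure
// used by the rest of the router.
func ParseClassResult(reply string) (ClassResult, error) {
	var res ClassResult
	if err := json.Unmarshal([]byte(strings.TrimSpace(reply)), &res); err != nil {
		return ClassResult{}, fmt.Errorf("vllm: could not parse classifier reply %q: %w", reply, err)
	}
	return res, nil
}
```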
Related Work
- Existing MCP classifier integration (`mcp_classifier.go`) provides a pattern for external classifier integration
- vLLM project: https://github.com/vllm-project/vllm
- vLLM OpenAI-compatible API documentation
Timeline Estimate
- Phase 1 (Foundation): 2-3 weeks
- Phase 2 (Category): 2-3 weeks
- Phase 3 (Jailbreak/PII): 2-3 weeks
- Phase 4 (Optimization): 2-3 weeks
Total: ~8-12 weeks for complete implementation
Priority
High - Enables support for larger, more accurate classification models and improves production readiness for high-throughput scenarios.