Feature Request: Dynamic CPU/GPU Switching #338

@FeiDaLI

Description

Is your feature request related to a problem? Please describe.

Currently, BERT processing typically runs on a fixed device (CPU or GPU), without dynamic adaptation based on query complexity. This leads to suboptimal resource utilization:

  • Simple queries processed on GPU waste valuable GPU resources.
  • Complex queries processed on CPU suffer from slow inference times.
  • There is no mechanism to dynamically switch between CPU and GPU based on actual computational demands.

Describe the solution you'd like

Automatic CPU/GPU Switching

Implement a dynamic resource manager that:

  • Estimates the computational complexity of incoming queries.
  • Automatically routes simple queries to CPU and complex ones to GPU.
  • Balances latency, throughput, and hardware utilization.

This could leverage profiling metrics such as token length, syntactic complexity, or historical processing times to make routing decisions in real time.
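As a minimal sketch of the profiling idea, the snippet below estimates complexity from whitespace token count alone. The `ComputeProfiler` type name matches the example design further down; the threshold value and the token-count heuristic are illustrative assumptions, not a proposed final implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// ComplexityClass labels a query as CPU-bound or GPU-bound.
type ComplexityClass int

const (
	CPUBound ComplexityClass = iota
	GPUBound
)

// ComputeProfiler estimates query complexity from cheap proxies.
// TokenThreshold is a placeholder; in practice it would be calibrated
// from historical processing times on the target hardware.
type ComputeProfiler struct {
	TokenThreshold int // queries with more tokens than this are routed to GPU
}

// EstimateComplexity uses whitespace token count as a rough proxy for
// BERT inference cost, since sequence length dominates runtime.
func (p *ComputeProfiler) EstimateComplexity(query string) ComplexityClass {
	if len(strings.Fields(query)) > p.TokenThreshold {
		return GPUBound
	}
	return CPUBound
}

func main() {
	prof := &ComputeProfiler{TokenThreshold: 16}
	fmt.Println(prof.EstimateComplexity("what is the weather")) // short query: CPUBound
}
```

Syntactic-complexity or learned cost models could later replace the token heuristic behind the same method signature.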

To reduce GPU transfer overhead and improve throughput, introduce a queue-based batching mechanism inspired by continuous batching in vLLM:

  • GPU-bound queries are batched efficiently.
  • Data transfers and computation are overlapped where possible.
  • CPU and GPU workloads are decoupled via a shared batch queue.

Example design:

type ResourceManager struct {
    CPUProcessor  *CPUProcessor
    GPUProcessor  *GPUProcessor
    Profiler      *ComputeProfiler
    Queue         *BatchQueue
}

func (rm *ResourceManager) ProcessQuery(query string) (*ClassificationResult, error) {
    // 1. Estimate query complexity
    complexity := rm.Profiler.EstimateComplexity(query)

    // 2. Route based on compute bounds
    if complexity.IsCPUBound() {
        return rm.CPUProcessor.Process(query)
    } else if complexity.IsGPUBound() {
        // Enqueue for GPU batch processing; blocks until the batch completes
        return rm.Queue.AddToBatch(query)
    }

    // 3. Fall back to CPU when the estimate is inconclusive
    return rm.CPUProcessor.Process(query)
}

Describe alternatives you've considered

Static Profiling at Startup:

Run a benchmark with sample data during initialization to calibrate CPU compute-bound vs. I/O-bound thresholds based on system configuration.

Additional context

If this feature aligns with the project's roadmap, would it be possible to assign this issue to me? I’d appreciate the opportunity to start working on it.
