Problem Description
I would like to enhance the cache mechanism for LLM serving in Volcengine. The goal is to make the cache usage more adaptive to heterogeneous query workloads by dynamically adjusting reuse strategies.
Proposed Solution
The proposed enhancement is a Query-Aware Cache Selection (QACS) mechanism:
- Lightweight Query Profiling: Efficiently estimate query complexity (length, entropy, semantic redundancy).
- Dynamic Cache Selection: Adaptively choose between aggressive reuse, partial reuse, or fresh recomputation.
- Policy Control: Expose configurable knobs (`reuse_ratio`, `cache_budget`, `latency_accuracy_tradeoff`) for fine-grained control.
- Scheduler Integration: Integrate into the Volcengine serving scheduler for runtime-aware resource allocation.
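To make the mechanism concrete, here is a minimal sketch of the profiling and selection steps described above. The function names, thresholds, and the entropy-based heuristic are illustrative assumptions, not the actual TinyServe implementation:

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class CachePolicy:
    reuse_ratio: float  # fraction of cached KV entries to reuse
    strategy: str       # "aggressive", "partial", or "fresh"

def profile_query(tokens: list[str]) -> dict:
    """Lightweight profile: query length and token-level entropy.

    Both statistics are O(n) in the number of tokens, keeping the
    profiling overhead small, in line with the <1ms target.
    """
    counts = Counter(tokens)
    n = len(tokens)
    entropy = (
        -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    )
    return {"length": n, "entropy": entropy}

def select_policy(profile: dict,
                  entropy_threshold: float = 3.5,
                  length_threshold: int = 256) -> CachePolicy:
    """Map a query profile to a reuse strategy.

    Low-entropy, short queries are highly redundant, so aggressive
    reuse is safe; high-entropy queries get fresh recomputation.
    Threshold values here are placeholders for tuning.
    """
    if profile["entropy"] < entropy_threshold and profile["length"] < length_threshold:
        return CachePolicy(reuse_ratio=0.9, strategy="aggressive")
    if profile["entropy"] < entropy_threshold * 1.5:
        return CachePolicy(reuse_ratio=0.5, strategy="partial")
    return CachePolicy(reuse_ratio=0.0, strategy="fresh")
```

A scheduler would call `profile_query` once per incoming request and attach the resulting `CachePolicy` to the request before dispatch, so the cache layer can honor `reuse_ratio` without any per-deployment manual tuning.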
Alternatives Considered
- Uniform cache reuse across all queries, which is simple but cannot adapt to different workloads.
- Manual configuration at deployment time, which lacks flexibility and incurs higher operational overhead.
In contrast, the proposed TinyServe mechanism provides adaptive, per-query cache control without requiring manual tuning.
Implementation Plan
- Implement a query profiler module (<1ms overhead).
- Add a policy manager to select caching strategies on the fly.
- Integrate with existing cache storage and serving scheduler.
- Provide developer-facing APIs for configuration.
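The developer-facing configuration surface could look like the following sketch. The class name, defaults, and validation ranges are assumptions for illustration; only the three knob names come from the proposal above:

```python
from dataclasses import dataclass

@dataclass
class QACSConfig:
    """Configuration knobs for Query-Aware Cache Selection.

    Defaults and units are hypothetical placeholders.
    """
    reuse_ratio: float = 0.8                # baseline fraction of KV cache to reuse
    cache_budget: int = 4096                # max cached KV blocks per model instance
    latency_accuracy_tradeoff: float = 0.5  # 0.0 = latency-first, 1.0 = accuracy-first

    def __post_init__(self) -> None:
        # Fail fast on invalid knob values instead of surfacing
        # hard-to-debug behavior at serving time.
        if not 0.0 <= self.reuse_ratio <= 1.0:
            raise ValueError("reuse_ratio must be in [0, 1]")
        if self.cache_budget <= 0:
            raise ValueError("cache_budget must be positive")
        if not 0.0 <= self.latency_accuracy_tradeoff <= 1.0:
            raise ValueError("latency_accuracy_tradeoff must be in [0, 1]")
```

Validating at construction time keeps the policy manager simple: it can trust any `QACSConfig` it receives and focus purely on strategy selection.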
Additional Context
- This enhancement has been validated at scale across multiple LLMs, consistently improving efficiency and robustness.
- Reference: Dong Liu et al., “TinyServe: Query-Aware Cache Selection for Efficient LLM Serving,” ACM Multimedia 2025 (Oral).
- Expected benefits:
- ⚡ Efficiency: Up to 3x latency reduction under mixed workloads.
- 🎯 Quality: Maintains high generation quality even with aggressive cache usage.
- 📈 Scalability: Supports larger models within the same GPU budget.