Problem Description
I would like to enhance the cache mechanism for LLM serving in Volcengine. The goal is to make the cache usage more adaptive to heterogeneous query workloads by dynamically adjusting reuse strategies.
Proposed Solution
The proposed enhancement is a Query-Aware Cache Selection (QACS) mechanism:
- Lightweight Query Profiling: Efficiently estimate query complexity (length, entropy, semantic redundancy).
- Dynamic Cache Selection: Adaptively choose between aggressive reuse, partial reuse, or fresh recomputation.
- Policy Control: Expose configurable knobs (`reuse_ratio`, `cache_budget`, `latency_accuracy_tradeoff`) for fine-grained control.
- Scheduler Integration: Integrate into the Volcengine serving scheduler for runtime-aware resource allocation.
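To make the mechanism concrete, here is a minimal sketch of the profiling and selection steps described above. The function names, thresholds, and the entropy-based heuristic are illustrative assumptions, not the actual TinyServe implementation:

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class CachePolicy:
    reuse_ratio: float  # fraction of cached KV entries to reuse
    strategy: str       # "aggressive", "partial", or "fresh"

def profile_query(tokens: list[str]) -> dict:
    """Lightweight profile: query length and token-level entropy.

    Both statistics are O(n) in the number of tokens, keeping the
    profiling overhead small, in line with the <1ms target.
    """
    counts = Counter(tokens)
    n = len(tokens)
    entropy = (
        -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    )
    return {"length": n, "entropy": entropy}

def select_policy(profile: dict,
                  entropy_threshold: float = 3.5,
                  length_threshold: int = 256) -> CachePolicy:
    """Map a query profile to a reuse strategy.

    Low-entropy, short queries are highly redundant, so aggressive
    reuse is safe; high-entropy queries get fresh recomputation.
    Threshold values here are placeholders for tuning.
    """
    if profile["entropy"] < entropy_threshold and profile["length"] < length_threshold:
        return CachePolicy(reuse_ratio=0.9, strategy="aggressive")
    if profile["entropy"] < entropy_threshold * 1.5:
        return CachePolicy(reuse_ratio=0.5, strategy="partial")
    return CachePolicy(reuse_ratio=0.0, strategy="fresh")
```

A scheduler would call `profile_query` once per incoming request and attach the resulting `CachePolicy` to the request before dispatch, so the cache layer can honor `reuse_ratio` without any per-deployment manual tuning.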
Alternatives Considered
- Uniform cache reuse across all queries, which is simple but cannot adapt to different workloads.
- Manual configuration at deployment time, which lacks flexibility and incurs higher operational overhead.
In contrast, the proposed TinyServe mechanism provides adaptive, per-query cache control without requiring manual tuning.
Implementation Plan
- Implement a query profiler module (<1ms overhead).
- Add a policy manager to select caching strategies on the fly.
- Integrate with existing cache storage and serving scheduler.
- Provide developer-facing APIs for configuration.
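The developer-facing configuration surface could look like the following sketch. The class name, defaults, and validation ranges are assumptions for illustration; only the three knob names come from the proposal above:

```python
from dataclasses import dataclass

@dataclass
class QACSConfig:
    """Configuration knobs for Query-Aware Cache Selection.

    Defaults and units are hypothetical placeholders.
    """
    reuse_ratio: float = 0.8                # baseline fraction of KV cache to reuse
    cache_budget: int = 4096                # max cached KV blocks per model instance
    latency_accuracy_tradeoff: float = 0.5  # 0.0 = latency-first, 1.0 = accuracy-first

    def __post_init__(self) -> None:
        # Fail fast on invalid knob values instead of surfacing
        # hard-to-debug behavior at serving time.
        if not 0.0 <= self.reuse_ratio <= 1.0:
            raise ValueError("reuse_ratio must be in [0, 1]")
        if self.cache_budget <= 0:
            raise ValueError("cache_budget must be positive")
        if not 0.0 <= self.latency_accuracy_tradeoff <= 1.0:
            raise ValueError("latency_accuracy_tradeoff must be in [0, 1]")
```

Validating at construction time keeps the policy manager simple: it can trust any `QACSConfig` it receives and focus purely on strategy selection.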
Additional Context
- This enhancement has been validated at scale across multiple LLMs, consistently improving efficiency and robustness.
- Reference: Dong Liu et al., “TinyServe: Query-Aware Cache Selection for Efficient LLM Serving,” ACM Multimedia 2025 (Oral).
- Expected benefits:
- ⚡ Efficiency: Up to 3x latency reduction under mixed workloads.
- 🎯 Quality: Maintains high generation quality even with aggressive cache usage.
- 📈 Scalability: Supports larger models within the same GPU budget.