[ENHANCEMENT] Query-Aware Cache Selection for Efficient LLM Serving #65

@NoakLiu

Problem Description

I would like to enhance the cache mechanism for LLM serving in Volcengine. The goal is to make cache usage adapt to heterogeneous query workloads by dynamically adjusting the reuse strategy for each query.

Proposed Solution

The proposed enhancement is a Query-Aware Cache Selection (QACS) mechanism:

  • Lightweight Query Profiling: Efficiently estimate query complexity (length, entropy, semantic redundancy).
  • Dynamic Cache Selection: Adaptively choose between aggressive reuse, partial reuse, or fresh recomputation.
  • Policy Control: Expose configurable knobs (reuse_ratio, cache_budget, latency_accuracy_tradeoff) for fine-grained control.
  • Scheduler Integration: Integrate into the Volcengine serving scheduler for runtime-aware resource allocation.
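
To make the profiling step concrete, here is a minimal sketch of what a lightweight profiler could compute. It is not the TinyServe implementation: `QueryProfile` and `profile_query` are hypothetical names, the input is assumed to be an already tokenized query, and semantic redundancy is approximated by a simple lexical proxy (the fraction of repeated bigrams).

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryProfile:
    """Hypothetical container for the three signals named in the proposal."""
    length: int        # number of tokens in the query
    entropy: float     # Shannon entropy of the token distribution, in bits
    redundancy: float  # fraction of bigrams that repeat an earlier bigram


def profile_query(tokens: list[str]) -> QueryProfile:
    """Estimate query complexity in a single pass over the tokens.

    Assumption: repeated bigrams stand in for "semantic redundancy".
    A real profiler might use embeddings or attention statistics instead.
    """
    n = len(tokens)
    if n == 0:
        return QueryProfile(0, 0.0, 0.0)

    # Shannon entropy: sum over distinct tokens of p * log2(1 / p).
    counts = Counter(tokens)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())

    # Lexical redundancy: 1 - (unique bigrams / total bigrams).
    bigrams = list(zip(tokens, tokens[1:]))
    redundancy = 1.0 - len(set(bigrams)) / len(bigrams) if bigrams else 0.0

    return QueryProfile(n, entropy, redundancy)
```

For example, `profile_query("the cat sat on the cat".split())` yields a profile whose `redundancy` is above zero because the bigram `("the", "cat")` occurs twice. All three signals are computed in time linear in the query length, which is in the spirit of the low profiling overhead the proposal targets.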

Alternatives Considered

  • Uniform cache reuse across all queries, which is simple but cannot adapt to different workloads.
  • Manual configuration at deployment time, which lacks flexibility and incurs higher operational overhead.
By contrast, the proposed QACS mechanism (TinyServe) provides adaptive, per-query cache control without requiring manual tuning.

Implementation Plan

  1. Implement a query profiler module (target overhead: <1ms).
  2. Add a policy manager to select caching strategies on the fly.
  3. Integrate with existing cache storage and serving scheduler.
  4. Provide developer-facing APIs for configuration.
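
As a rough illustration of steps 2 and 4, the sketch below shows one way a policy manager could map a query's complexity score to one of the three strategies using the proposed knobs. Everything here is an assumption made for illustration: the class and function names, the default values, the interpretation of each knob, and the threshold formula are hypothetical and are not taken from TinyServe or from any Volcengine API.

```python
from dataclasses import dataclass
from enum import Enum


class CacheStrategy(Enum):
    AGGRESSIVE_REUSE = "aggressive_reuse"
    PARTIAL_REUSE = "partial_reuse"
    FRESH_RECOMPUTE = "fresh_recompute"


@dataclass(frozen=True)
class CachePolicyConfig:
    """Hypothetical developer-facing configuration (assumed semantics)."""
    reuse_ratio: float = 0.5                # share of usable cache kept under partial reuse
    cache_budget: int = 4096                # max cached tokens one query may reuse
    latency_accuracy_tradeoff: float = 0.5  # 0.0 favors accuracy, 1.0 favors latency

    def __post_init__(self) -> None:
        if not 0.0 <= self.reuse_ratio <= 1.0:
            raise ValueError("reuse_ratio must be in [0, 1]")
        if not 0.0 <= self.latency_accuracy_tradeoff <= 1.0:
            raise ValueError("latency_accuracy_tradeoff must be in [0, 1]")
        if self.cache_budget < 0:
            raise ValueError("cache_budget must be non-negative")


def select_strategy(
    complexity: float, cached_tokens: int, cfg: CachePolicyConfig
) -> tuple[CacheStrategy, int]:
    """Return the chosen strategy and how many cached tokens to reuse.

    `complexity` is assumed to be a profiler score normalized to [0, 1].
    """
    usable = min(cached_tokens, cfg.cache_budget)
    if usable <= 0:
        return CacheStrategy.FRESH_RECOMPUTE, 0

    # Illustrative thresholds: a stronger latency preference widens the
    # range of queries that are allowed to reuse cached state.
    aggressive_below = 0.5 * cfg.latency_accuracy_tradeoff
    fresh_above = 0.5 + 0.5 * cfg.latency_accuracy_tradeoff

    if complexity <= aggressive_below:
        return CacheStrategy.AGGRESSIVE_REUSE, usable
    if complexity >= fresh_above:
        return CacheStrategy.FRESH_RECOMPUTE, 0
    return CacheStrategy.PARTIAL_REUSE, int(usable * cfg.reuse_ratio)
```

With the default configuration, a low-complexity query such as `select_strategy(0.1, 1000, CachePolicyConfig())` is routed to aggressive reuse, a mid-range score to partial reuse, and a high score to fresh recomputation. Validating the knobs at construction time keeps misconfiguration errors out of the serving path.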

Additional Context

  • This enhancement has been validated at scale across multiple LLMs, consistently improving efficiency and robustness.
  • Reference: Dong Liu et al., “TinyServe: Query-Aware Cache Selection for Efficient LLM Serving,” ACM Multimedia 2025 (Oral).
  • Expected benefits:
    • Efficiency: Up to 3x latency reduction under mixed workloads.
    • Quality: Maintains high generation quality even with aggressive cache usage.
    • Scalability: Supports larger models within the same GPU budget.
