Recently, KV cache compression has emerged as a critical optimization technique for large language models (LLMs). The KV cache exhibits strong temporal and spatial locality, much like time-series data: adjacent tokens and neighboring attention heads often share redundant patterns. Given these characteristics, can this algorithm (or approach) be adapted effectively to KV cache compression while maintaining efficient GPU execution?
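A quick way to check whether time-series-style compression could pay off here is to verify that adjacent tokens in the cache really are near-duplicates. Below is a minimal NumPy sketch (not this repository's API; the tensor shape and synthetic data are my assumptions) that delta-encodes a KV tensor along the token axis and compares residual magnitudes against the raw values:

```python
import numpy as np

# Hypothetical KV cache slice with shape [num_heads, seq_len, head_dim].
# Synthetic data drifts smoothly along the token axis to mimic the
# temporal locality observed in real KV caches.
rng = np.random.default_rng(0)
num_heads, seq_len, head_dim = 8, 512, 64
base = rng.standard_normal((num_heads, 1, head_dim))
drift = np.cumsum(0.05 * rng.standard_normal((num_heads, seq_len, head_dim)), axis=1)
kv = (base + drift).astype(np.float32)

# Delta encoding along the token axis: store the first token verbatim,
# then only the differences between adjacent tokens.
deltas = np.diff(kv, axis=1)

print("mean |value|:", np.abs(kv).mean())
print("mean |delta|:", np.abs(deltas).mean())
# Smaller residual magnitudes need fewer significant bits per element,
# which is exactly the redundancy time-series codecs exploit.
```

If the residuals are consistently much smaller than the raw values (as in this synthetic case), a time-series codec should compress well; the remaining question from the original post is whether the encode/decode path can run efficiently on GPU alongside attention reads.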