Description
Motivation
We plan to implement an indexer in Mooncake to track the global status of the KV cache (including VRAM, DRAM, SSD, etc.), which will help the gateway achieve precise cache-aware request scheduling.
Goals
1. Use KVEvents API as a Public Contract
The KVEvents schema is clearly defined in both vLLM and SGLang, so we will use KVEvents as the data structure for indexer updates (see the illustrative sketch after this list).
2. Support Different Storage Media
Enable KV-cache management across multiple tiers, including:
- G1: Inference engine layer (VRAM/HBM/..)
- G2: CPU offload / pooled DRAM storage/..
- G3: SSD offload / 3FS/DFS/NFS/..
3. DP-Rank Awareness
Track KV-cache status across different DP ranks to help the gateway achieve global cache-aware data-parallel load balancing (DPLB).
4. Support Querying by Token ID or Hash Key
Long-sequence contexts impose high network overhead if token IDs are transmitted directly, so we should provide an API that accepts hash keys as query input.
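As a rough illustration of Goal 1 (plus the tier and DP-rank information from Goals 2 and 3), here is a minimal Python sketch of the kind of update payload the indexer could consume. It is loosely modeled on the BlockStored/BlockRemoved events published by vLLM's KV event stream; the tier and dp_rank fields are assumptions added here for this proposal, and the exact field names should follow the upstream KVEvents schema rather than this sketch.

# Illustrative sketch only: loosely modeled on vLLM's BlockStored/BlockRemoved
# KV events, extended with the tier / DP-rank fields this indexer needs.
# The upstream KVEvents schema is the source of truth, not this file.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class CacheTier(str, Enum):
    G1_DEVICE = "gpu"   # VRAM/HBM managed by the inference engine
    G2_DRAM = "cpu"     # CPU offload / pooled DRAM
    G3_DISK = "disk"    # SSD / 3FS / DFS / NFS


@dataclass
class BlockStored:
    block_hashes: list[int]           # hashes of the newly stored blocks
    parent_block_hash: Optional[int]  # preceding block in the prefix chain
    token_ids: list[int]              # tokens covered by these blocks
    block_size: int                   # tokens per block for this model service
    lora_id: Optional[int] = None
    tier: CacheTier = CacheTier.G1_DEVICE  # indexer extension (assumption)
    dp_rank: Optional[int] = None          # indexer extension (assumption)


@dataclass
class BlockRemoved:
    block_hashes: list[int]
    tier: CacheTier = CacheTier.G1_DEVICE
    dp_rank: Optional[int] = None


@dataclass
class EventBatch:
    engine_id: str  # corresponds to api_server_unique_name in the /query output below
    events: list[BlockStored | BlockRemoved] = field(default_factory=list)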
Proposed Change
/query
POST api
// input
{
"model": "deepseek",
"lora_name": "xx-adapter",
"lora_id": 12, // defined for backward compatibility and should not be used together with `lora_name`
"token_ids": [1, 15, 100],
"tenant_id": 99, // used for multi-tenant scenario
"cache_salt": 334455, // ensure cached data blocks are kept separate for different customers
}
// output
{
"tenant_id": {
"api_server_unique_name": {
"longest_matched": 100, // the number of longest prefix matched token among multiple DPs(if there are)
"GPU": 20, // device pool cache, managered by inference engine (vLLM/SGLang, etc.)
"DP": {
"0": 10,
"1": 20
},
// G2 cache, p2p/mooncake-master store (DRAM)
"CPU": 60,
// G3 cache, p2p/mooncake-master store (SSD, 3fs, nfs, dfs, etc.)
"DISK": 10 // disk-pool space matched prefix token number
},
... // other engine instance
},
... // other tenant
}
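To show how the gateway might consume this response, below is a minimal Python sketch that queries the indexer and picks the engine instance with the longest matched prefix. The URL, helper name, and selection policy are illustrative assumptions, not a finalized client API; it only assumes the response layout shown above (top-level key is the tenant id, second-level key is api_server_unique_name).

# Minimal sketch of a gateway-side consumer of the proposed /query response.
# URL, field names, and scoring policy are assumptions for illustration only.
import requests


def pick_best_instance(indexer_url: str, model: str, token_ids: list[int],
                       tenant_id: int, cache_salt: int) -> tuple[str, int]:
    """Return (api_server_unique_name, longest_matched) with the best prefix hit."""
    resp = requests.post(f"{indexer_url}/query", json={
        "model": model,
        "token_ids": token_ids,
        "tenant_id": tenant_id,
        "cache_salt": cache_salt,
    })
    resp.raise_for_status()
    per_tenant = resp.json().get(str(tenant_id), {})

    best_name, best_match = "", -1
    for name, stats in per_tenant.items():
        if stats.get("longest_matched", 0) > best_match:
            best_name, best_match = name, stats["longest_matched"]
    return best_name, best_match

In practice the scheduler would likely also weigh where the match lives (GPU vs. CPU vs. DISK) and the per-DP-rank counts, not just the total, but that policy is out of scope here.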
/query_by_hash
POST api
// input
{
"model": "deepseek",
"lora_name": "xx-adapter",
"lora_id": 12, // defined for backward compatibility and should not be used together with `lora_name`
"block_hash": ["hash_key_by_chunked_tokens"]
"tenant_id": 99,
"cache_salt": 334455,
}
// output
{
"tenant_id": {
"api_server_unique_name": {
// Each model service uses its own block_size, so matched token count = block_size * number of matched hash keys
"longest_matched": 100,
"GPU": 20,
"DP": {
"0": 10,
"1": 20
},
"CPU": 60,
"DISK": 10
},
... // other engine instance
},
... // other tenant
}
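To make the hash-key path of Goal 4 concrete, here is a rough sketch of how a caller could derive block_hash keys from token IDs: chunk the prompt at the model service's block_size and hash each full block chained to its parent (seeded with cache_salt). Everything here, including the choice of SHA-256 and the chaining rule, is an assumption for illustration; the real hash must match whatever the engines and the Mooncake indexer agree on.

# Rough sketch of deriving block_hash keys from token IDs for /query_by_hash.
# The chunking and chained-hash scheme below is an assumption; the real rule
# must match whatever hash the engines and the Mooncake indexer agree on.
import hashlib


def block_hashes(token_ids: list[int], block_size: int,
                 cache_salt: int = 0) -> list[str]:
    """Chunk token_ids into full blocks and hash each block chained to its parent."""
    keys: list[str] = []
    parent = str(cache_salt)
    # Only complete blocks are hashed; a trailing partial block has no stable key.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        chunk = token_ids[start:start + block_size]
        payload = f"{parent}:{','.join(map(str, chunk))}".encode()
        parent = hashlib.sha256(payload).hexdigest()
        keys.append(parent)
    return keys

With such keys, a gateway that has already tokenized the request can send a short list of hashes instead of the full token list, which keeps long-context queries cheap on the wire.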
CC List: @stmatengss @doujiang24 @Asher-XunZhang

Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation