
[RFC]: Mooncake KV-Store Indexer API Standardization #1403

@yejj710


Motivation

Currently, we are planning to implement an indexer on Mooncake to track the global status of the KVCache across storage tiers (VRAM, DRAM, SSD, etc.), so that the gateway can perform precise cache-aware request scheduling.

Goals

1. Use the KVEvents API as a public contract

The KVEvents schema is clearly defined in both vLLM and SGLang, so we will use KVEvents as the data structure for indexer updates (a minimal consumer sketch is given after this list).

2. Support different storage media

Enable KV-cache management across multiple tiers, including:

  • G1: inference-engine layer (VRAM/HBM, etc.)
  • G2: CPU offload / pooled DRAM storage, etc.
  • G3: SSD offload / 3FS/DFS/NFS, etc.

3. DP-rank awareness

Track KV-cache status across different DP ranks to assist the gateway in achieving global cache-aware DP load balancing (DPLB).

4. Support querying by token IDs or hash keys

​Long-sequence contexts impose high network overhead if token IDs are transmitted directly, so we should also provide an API that accepts hash keys as query input.
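As a rough illustration of goals 1–4, the following is a minimal sketch of how the indexer could ingest simplified KVEvents-style updates and answer per-tier, per-DP-rank prefix-match queries. The event classes and field names (block_hashes, medium, dp_rank) are assumptions for illustration only and do not claim to match the exact vLLM/SGLang KVEvents schema.

# Hypothetical indexer state, assuming simplified KVEvents-style payloads.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BlockStored:
    block_hashes: list[str]  # hashes of newly cached blocks
    medium: str              # "GPU" (G1), "CPU" (G2) or "DISK" (G3)
    dp_rank: int = 0         # DP rank that owns the G1 copy


@dataclass
class BlockRemoved:
    block_hashes: list[str]
    medium: str
    dp_rank: int = 0


class KVIndexer:
    """Tracks which blocks each engine instance holds, per tier and DP rank."""

    def __init__(self) -> None:
        # (instance, medium, dp_rank) -> set of block hashes
        self._blocks: dict[tuple[str, str, int], set[str]] = defaultdict(set)

    def apply(self, instance: str, event) -> None:
        key = (instance, event.medium, event.dp_rank)
        if isinstance(event, BlockStored):
            self._blocks[key].update(event.block_hashes)
        elif isinstance(event, BlockRemoved):
            self._blocks[key].difference_update(event.block_hashes)

    def matched_prefix_blocks(self, instance: str, medium: str,
                              dp_rank: int, block_hashes: list[str]) -> int:
        """Length (in blocks) of the longest prefix of block_hashes held in this tier."""
        held = self._blocks[(instance, medium, dp_rank)]
        count = 0
        for h in block_hashes:
            if h not in held:
                break
            count += 1
        return count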

Proposed Change

/query

POST API

// input
{
    "model": "deepseek",
    "lora_name": "xx-adapter",
    "lora_id": 12, // defined for backward compatibility; should not be used together with `lora_name`
    "token_ids": [1, 15, 100],
    "tenant_id": 99, // used for multi-tenant scenarios
    "cache_salt": 334455 // keeps cached data blocks separate for different customers
}

// output
{
    "tenant_id": {
        "api_server_unique_name": {
            "longest_matched": 100, // longest matched prefix length in tokens across DP ranks (if any)
            "GPU": 20, // device-pool cache, managed by the inference engine (vLLM/SGLang, etc.)
            "DP": {
                "0": 10,
                "1": 20
            },
            // G2 cache, p2p/mooncake-master store (DRAM)
            "CPU": 60,
            // G3 cache, p2p/mooncake-master store (SSD, 3FS, NFS, DFS, etc.)
            "DISK": 10 // matched prefix token count in the disk-pool space
        },
        ... // other engine instances
    },
    ... // other tenants
}
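A hedged example of how the gateway might call this endpoint from Python; the indexer base URL is an assumption, and the request/response fields follow the schema above.

# Hypothetical client call for /query; the base URL is an assumption.
import requests

resp = requests.post(
    "http://mooncake-indexer:8080/query",  # assumed indexer address
    json={
        "model": "deepseek",
        "lora_name": "xx-adapter",
        "token_ids": [1, 15, 100],
        "tenant_id": 99,
        "cache_salt": 334455,
    },
    timeout=2,
)
for tenant, instances in resp.json().items():
    for instance, hit in instances.items():
        # e.g. route to the instance with the longest matched prefix
        print(tenant, instance, hit["longest_matched"], hit.get("DP"))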

/query_by_hash

POST API

// input
{
    "model": "deepseek",
    "lora_name": "xx-adapter",
    "lora_id": 12, // defined for backward compatibility; should not be used together with `lora_name`
    "block_hash": ["hash_key_by_chunked_tokens"],
    "tenant_id": 99,
    "cache_salt": 334455
}
// output
{
    "tenant_id": {
        "api_server_unique_name": {
            // each model service uses its own block_size, so matched token count = block_size * number of matched hash keys
            "longest_matched": 100,
            "GPU": 20,
            "DP": {
                "0": 10,
                "1": 20
            },
            "CPU": 60,
            "DISK": 10
        },
        ... // other engine instances
    },
    ... // other tenants
}
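To show how block_hash could be produced from token IDs, here is an illustrative chunked hashing scheme; the actual block size and hash function are engine-specific, so the chained SHA-256 below is only an assumption for the sake of example.

# Illustrative chunked-token hashing for /query_by_hash (scheme is an assumption).
import hashlib

def block_hashes(token_ids: list[int], block_size: int = 16,
                 cache_salt: int | None = None) -> list[str]:
    hashes, parent = [], str(cache_salt or "")
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        chunk = token_ids[i:i + block_size]
        digest = hashlib.sha256(
            (parent + ",".join(map(str, chunk))).encode()
        ).hexdigest()
        hashes.append(digest)
        parent = digest  # chain blocks so different prefixes yield different keys
    return hashes

payload = {
    "model": "deepseek",
    "block_hash": block_hashes(list(range(48)), block_size=16, cache_salt=334455),
    "tenant_id": 99,
    "cache_salt": 334455,
}
# POST payload to /query_by_hash exactly as in the /query example above.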

CC List: @stmatengss @doujiang24 @Asher-XunZhang

