Description
Motivation
We plan to implement an indexer in Mooncake to track the global status of the KV cache (including VRAM, DRAM, SSD, etc.), which will help the gateway achieve precise cache-aware request scheduling.
Goals
1. Use KVEvents API as a Public Contract
The KVEvents schema is clearly defined in both vLLM and SGLang, so we will use KVEvents as the data structure for indexer updates (see the illustrative sketch after this list).
2. Support Different Storage Media
Enable KV-cache management across multiple tiers, including:
- G1: Inference engine layer (VRAM/HBM/..)
- G2: CPU offload / pooled DRAM storage/..
- G3: SSD offload / 3FS/DFS/NFS/..
3. DP-Rank Awareness
Track KV-cache status across different DP ranks to help the gateway achieve global cache-aware data-parallel load balancing (DPLB).
4. Support Querying by Token ID or Hash Key
Long-sequence contexts impose high network overhead if token IDs are transmitted directly, so we should provide an API that accepts hash keys as query input.
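As a rough illustration of Goal 1 (plus the tier and DP-rank information from Goals 2 and 3), here is a minimal Python sketch of the kind of update payload the indexer could consume. It is loosely modeled on the BlockStored/BlockRemoved events published by vLLM's KV event stream; the tier and dp_rank fields are assumptions added here for this proposal, and the exact field names should follow the upstream KVEvents schema rather than this sketch.

# Illustrative sketch only: loosely modeled on vLLM's BlockStored/BlockRemoved
# KV events, extended with the tier / DP-rank fields this indexer needs.
# The upstream KVEvents schema is the source of truth, not this file.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class CacheTier(str, Enum):
    G1_DEVICE = "gpu"   # VRAM/HBM managed by the inference engine
    G2_DRAM = "cpu"     # CPU offload / pooled DRAM
    G3_DISK = "disk"    # SSD / 3FS / DFS / NFS


@dataclass
class BlockStored:
    block_hashes: list[int]           # hashes of the newly stored blocks
    parent_block_hash: Optional[int]  # preceding block in the prefix chain
    token_ids: list[int]              # tokens covered by these blocks
    block_size: int                   # tokens per block for this model service
    lora_id: Optional[int] = None
    tier: CacheTier = CacheTier.G1_DEVICE  # indexer extension (assumption)
    dp_rank: Optional[int] = None          # indexer extension (assumption)


@dataclass
class BlockRemoved:
    block_hashes: list[int]
    tier: CacheTier = CacheTier.G1_DEVICE
    dp_rank: Optional[int] = None


@dataclass
class EventBatch:
    engine_id: str  # corresponds to api_server_unique_name in the /query output below
    events: list[BlockStored | BlockRemoved] = field(default_factory=list)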
Proposed Change
/query
POST api
// input
{
"model": "deepseek",
"lora_name": "xx-adapter",
"lora_id": 12, // defined for backward compatibility and should not be used together with `lora_name`
"token_ids": [1, 15, 100],
"tenant_id": 99, // used for multi-tenant scenario
"cache_salt": 334455, // ensure cached data blocks are kept separate for different customers
}
// output
{
"tenant_id": {
"api_server_unique_name": {
"longest_matched": 100, // the number of longest prefix matched token among multiple DPs(if there are)
"GPU": 20, // device pool cache, managered by inference engine (vLLM/SGLang, etc.)
"DP": {
"0": 10,
"1": 20
},
// G2 cache, p2p/mooncake-master store (DRAM)
"CPU": 60,
// G3 cache, p2p/mooncake-master store (SSD, 3fs, nfs, dfs, etc.)
"DISK": 10 // disk-pool space matched prefix token number
},
... // other engine instance
},
... // other tenant
}
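To show how the gateway might consume this response, below is a minimal Python sketch that queries the indexer and picks the engine instance with the longest matched prefix. The URL, helper name, and selection policy are illustrative assumptions, not a finalized client API; it only assumes the response layout shown above (top-level key is the tenant id, second-level key is api_server_unique_name).

# Minimal sketch of a gateway-side consumer of the proposed /query response.
# URL, field names, and scoring policy are assumptions for illustration only.
import requests


def pick_best_instance(indexer_url: str, model: str, token_ids: list[int],
                       tenant_id: int, cache_salt: int) -> tuple[str, int]:
    """Return (api_server_unique_name, longest_matched) with the best prefix hit."""
    resp = requests.post(f"{indexer_url}/query", json={
        "model": model,
        "token_ids": token_ids,
        "tenant_id": tenant_id,
        "cache_salt": cache_salt,
    })
    resp.raise_for_status()
    per_tenant = resp.json().get(str(tenant_id), {})

    best_name, best_match = "", -1
    for name, stats in per_tenant.items():
        if stats.get("longest_matched", 0) > best_match:
            best_name, best_match = name, stats["longest_matched"]
    return best_name, best_match

In practice the scheduler would likely also weigh where the match lives (GPU vs. CPU vs. DISK) and the per-DP-rank counts, not just the total, but that policy is out of scope here.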
/query_by_hash
POST api
// input
{
"model": "deepseek",
"lora_name": "xx-adapter",
"lora_id": 12, // defined for backward compatibility and should not be used together with `lora_name`
"block_hash": ["hash_key_by_chunked_tokens"]
"tenant_id": 99,
"cache_salt": 334455,
}
// output
{
"tenant_id": {
"api_server_unique_name": {
// Each model service uses its own block_size, so matched token count = block_size * number of matched hash keys
"longest_matched": 100,
"GPU": 20,
"DP": {
"0": 10,
"1": 20
},
"CPU": 60,
"DISK": 10
},
... // other engine instance
},
... // other tenant
}
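To make the hash-key path of Goal 4 concrete, here is a rough sketch of how a caller could derive block_hash keys from token IDs: chunk the prompt at the model service's block_size and hash each full block chained to its parent (seeded with cache_salt). Everything here, including the choice of SHA-256 and the chaining rule, is an assumption for illustration; the real hash must match whatever the engines and the Mooncake indexer agree on.

# Rough sketch of deriving block_hash keys from token IDs for /query_by_hash.
# The chunking and chained-hash scheme below is an assumption; the real rule
# must match whatever hash the engines and the Mooncake indexer agree on.
import hashlib


def block_hashes(token_ids: list[int], block_size: int,
                 cache_salt: int = 0) -> list[str]:
    """Chunk token_ids into full blocks and hash each block chained to its parent."""
    keys: list[str] = []
    parent = str(cache_salt)
    # Only complete blocks are hashed; a trailing partial block has no stable key.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        chunk = token_ids[start:start + block_size]
        payload = f"{parent}:{','.join(map(str, chunk))}".encode()
        parent = hashlib.sha256(payload).hexdigest()
        keys.append(parent)
    return keys

With such keys, a gateway that has already tokenized the request can send a short list of hashes instead of the full token list, which keeps long-context queries cheap on the wire.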
CC List: @stmatengss @doujiang24 @Asher-XunZhang

Before submitting a new issue...
- Make sure you already searched for relevant issues and read the documentation