diff --git a/docs/proposals/002-api-proposal/README.md b/docs/proposals/002-api-proposal/README.md
index b8bf14b35..e88600844 100644
--- a/docs/proposals/002-api-proposal/README.md
+++ b/docs/proposals/002-api-proposal/README.md
@@ -2,7 +2,11 @@
 # Gateway API Inference Extension
 
 ## Proposal Status
- ***Draft***
+ ***Implemented/Obsolete***
+ - Refer to [the InferencePool v1 API review](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1173) for the InferencePool modifications
+ - Refer to [the InferenceModel evolution proposal](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/1199-inferencemodel-api-evolution) for the InferenceModel modifications
+ - Refer to the `/api/` & `/apix/` directories for the current status
+
 
 ## Table of Contents
diff --git a/docs/proposals/003-model-server-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md
index 4b277436a..65e440a12 100644
--- a/docs/proposals/003-model-server-protocol/README.md
+++ b/docs/proposals/003-model-server-protocol/README.md
@@ -2,6 +2,12 @@
 
 This is the protocol between the EPP and the model servers.
 
+## Proposal Status
+***Partially implemented***
+
+Note
+- With the creation of the [pluggable architecture](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal), this protocol cannot, by definition, be as strict
+
 ### Inference API Protocol
 
 The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
diff --git a/docs/proposals/004-endpoint-picker-protocol/README.md b/docs/proposals/004-endpoint-picker-protocol/README.md
index 18cb10d5b..03b96bb71 100644
--- a/docs/proposals/004-endpoint-picker-protocol/README.md
+++ b/docs/proposals/004-endpoint-picker-protocol/README.md
@@ -1,5 +1,10 @@
 # Endpoint Picker Protocol
 
+## Proposal Status
+***Implemented***
+
+## Proposal
+
 The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's responsible for
 picking an endpoint from the `InferencePool`. A reference implementation can be found [here](../../../pkg/epp/).
diff --git a/docs/proposals/006-scheduler/README.md b/docs/proposals/006-scheduler/README.md
index 77fc4c258..21aa31f3a 100644
--- a/docs/proposals/006-scheduler/README.md
+++ b/docs/proposals/006-scheduler/README.md
@@ -3,7 +3,7 @@
 Authors: @kfswain, @smarterclayton
 
 ## Proposal Status
- ***Draft***
+ ***Implemented***
 
 ## Table of Contents
diff --git a/docs/proposals/0602-prefix-cache-aware-routing-proposal/README.md b/docs/proposals/0602-prefix-cache-aware-routing-proposal/README.md
index 468e3be8e..7388d9126 100644
--- a/docs/proposals/0602-prefix-cache-aware-routing-proposal/README.md
+++ b/docs/proposals/0602-prefix-cache-aware-routing-proposal/README.md
@@ -1,5 +1,10 @@
 # Prefix Cache Aware Request Scheduling
 
+## Proposal Status
+***Implemented***
+
+## Proposal
+
 ## Overview
 
 Prefix caching is a well-known technique in LLM inference to save duplicate tensor computation for prompts with the same prefix tokens, and is available in many model servers or model-as-a-service providers. Leveraging prefix caching can significantly boost system performance, especially the time to first token (TTFT). Given that EPP has a global view of requests and model servers in the `InferencePool`, it can schedule requests intelligently to maximize the global prefix cache hit rate.
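To ground the overview above, here is a minimal Go sketch of prefix-cache-aware scoring. It is not code from this proposal or from the EPP reference implementation: the 16-byte chunk size, the FNV chain hashing, and the `hashChunks`/`prefixScore` helpers are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashChunks splits a prompt into fixed-size chunks and hashes each one,
// chaining in the previous chunk's hash so that equal hashes imply an
// equal prefix, not merely an equal chunk.
func hashChunks(prompt string, chunkSize int) []uint64 {
	var hashes []uint64
	var prev uint64
	for start := 0; start+chunkSize <= len(prompt); start += chunkSize {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", prev, prompt[start:start+chunkSize])
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// prefixScore counts how many leading chunks of a request are already
// cached on an endpoint; the match must be consecutive from the start,
// mirroring how model servers reuse prefix KV-cache blocks.
func prefixScore(reqHashes []uint64, cached map[uint64]bool) int {
	score := 0
	for _, h := range reqHashes {
		if !cached[h] {
			break
		}
		score++
	}
	return score
}

func main() {
	req := hashChunks("You are a helpful assistant. Summarize the following text:", 16)
	cached := map[uint64]bool{req[0]: true, req[1]: true} // pretend the first two chunks are warm
	fmt.Println("matched prefix chunks:", prefixScore(req, cached))
}
```

Chaining each chunk's hash with its predecessor's is what lets a single map lookup stand in for comparing whole prefixes; a scheduler following this pattern would favor the endpoint with the highest score and break ties on load.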
diff --git a/docs/proposals/0683-epp-architecture-proposal/README.md b/docs/proposals/0683-epp-architecture-proposal/README.md
index 7bd688c73..c788e45c8 100644
--- a/docs/proposals/0683-epp-architecture-proposal/README.md
+++ b/docs/proposals/0683-epp-architecture-proposal/README.md
@@ -2,7 +2,7 @@
 Author(s): @kfswain
 
 ## Proposal Status
- ***Draft***
+ ***Implemented***
 
 ## Summary
diff --git a/docs/proposals/0845-scheduler-architecture-proposal/README.md b/docs/proposals/0845-scheduler-architecture-proposal/README.md
index 4141ce6a2..8c3e5d941 100644
--- a/docs/proposals/0845-scheduler-architecture-proposal/README.md
+++ b/docs/proposals/0845-scheduler-architecture-proposal/README.md
@@ -2,7 +2,7 @@
 Author(s): @kfswain, @ahg-g, @nirrozenbaum
 
 ## Proposal Status
- ***Draft***
+ ***Implemented***
 
 ## Summary
 The Scheduling Subsystem is a framework used to implement scheduling algorithms. High level definition [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/006-scheduler) & EPP Architecture [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
diff --git a/docs/proposals/1023-data-layer-architecture/README.md b/docs/proposals/1023-data-layer-architecture/README.md
index e39264319..5af12ab5f 100644
--- a/docs/proposals/1023-data-layer-architecture/README.md
+++ b/docs/proposals/1023-data-layer-architecture/README.md
@@ -4,7 +4,7 @@
 Author(s): @elevran @nirrozenbaum
 
 ## Proposal Status
-***Draft***
+***Accepted***
 
 ## Summary
diff --git a/docs/proposals/1199-inferencemodel-api-evolution/README.md b/docs/proposals/1199-inferencemodel-api-evolution/README.md
index 1ad687d15..5478fd402 100644
--- a/docs/proposals/1199-inferencemodel-api-evolution/README.md
+++ b/docs/proposals/1199-inferencemodel-api-evolution/README.md
@@ -2,7 +2,11 @@
 Author(s): @kfswain, @ahg-g, @lukeavandrie
 
 ## Proposal Status
- ***Draft***
+ ***Implemented***
+
+ Note
+ - Phase 1 is complete
+ - Phase 2 is still WIP
 
 ## Summary
 Multiple docs have discussed the restructuring of the InferenceModel API. This [doc](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo) proposes an InferenceSchedulingObjective CRD, and this [doc](https://docs.google.com/document/d/1G-CQ17CM4j1vNE3T6u9uP2q-m6jK14ANPCwTfJ2qLS4/edit?tab=t.0) builds upon the previous document to solidify the requirement for the new iteration of the InferenceModel API to continue to solve the identity problem. Both these documents were useful in continuing to gather feedback & iterate on a proper solution.
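As a reading aid for the API evolution summarized above, here is a hypothetical Go sketch of the per-workload object the linked documents converge on. The type name and both fields are assumptions made for illustration, not the project's published schema; the authoritative definitions live in the `/api/` and `/apix/` directories referenced earlier.

```go
// Package v1alpha2 sketches, hypothetically, the shape of the object the
// linked documents propose as the successor to InferenceModel. All names
// here are illustrative assumptions; see /apix/ for the real types.
package v1alpha2

// InferenceObjectiveSpec captures per-workload scheduling intent.
type InferenceObjectiveSpec struct {
	// Priority orders requests under contention; a higher value is
	// served preferentially. (Assumed field.)
	Priority *int `json:"priority,omitempty"`

	// PoolRef binds this objective to a single InferencePool, carrying
	// the identity role the old InferenceModel played. (Assumed field.)
	PoolRef PoolObjectReference `json:"poolRef"`
}

// PoolObjectReference identifies the target InferencePool.
type PoolObjectReference struct {
	Group string `json:"group,omitempty"`
	Kind  string `json:"kind,omitempty"`
	Name  string `json:"name"`
}
```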