Merged
6 changes: 5 additions & 1 deletion docs/proposals/002-api-proposal/README.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,11 @@
# Gateway API Inference Extension

## Proposal Status
-***Draft***
+***Implemented/Obsolete***
+- Refer to [the InferencePool v1 API review](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1173) for the InferencePool modifications
+- Refer to [the InferenceModel evolution proposal](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/1199-inferencemodel-api-evolution) for the InferenceModel modifications
+- Refer to the `/api/` & `/apix/` directories for the current status


## Table of Contents

6 changes: 6 additions & 0 deletions docs/proposals/003-model-server-protocol/README.md
@@ -2,6 +2,12 @@

This is the protocol between the EPP and the model servers.

+## Proposal Status
+***Partially implemented***
+
+Note:
+- With the introduction of the [pluggable architecture](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal), this protocol is, by design, less strict.

### Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
5 changes: 5 additions & 0 deletions docs/proposals/004-endpoint-picker-protocol/README.md
@@ -1,5 +1,10 @@
# Endpoint Picker Protocol

+## Proposal Status
+***Implemented***
+
+# Proposal

The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's
responsible for picking an endpoint from the `InferencePool`. A reference implementation can be
found [here](../../../pkg/epp/).
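As a rough illustration of the EPP's core job, the sketch below picks one endpoint from a pool. Everything here is hypothetical (the `Pod` type, the least-queue-depth heuristic, and the function names are invented for this example); the actual reference implementation linked above is more elaborate.

```go
package main

import "fmt"

// Pod is a simplified stand-in for an endpoint in an InferencePool
// (illustrative only; not the real data model).
type Pod struct {
	Name     string
	QueueLen int // number of requests currently queued on this endpoint
}

// pickEndpoint returns the pod with the shortest request queue.
// This least-queue heuristic is an assumption for the sketch, not
// the algorithm the EPP reference implementation uses.
func pickEndpoint(pods []Pod) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, fmt.Errorf("no endpoints in pool")
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if p.QueueLen < best.QueueLen {
			best = p
		}
	}
	return best, nil
}

func main() {
	pods := []Pod{{"pod-a", 5}, {"pod-b", 2}, {"pod-c", 7}}
	best, _ := pickEndpoint(pods)
	fmt.Println(best.Name) // pod-b
}
```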
2 changes: 1 addition & 1 deletion docs/proposals/006-scheduler/README.md
@@ -3,7 +3,7 @@
Authors: @kfswain, @smarterclayton

## Proposal Status
-***Draft***
+***Implemented***

## Table of Contents

@@ -1,5 +1,10 @@
# Prefix Cache Aware Request Scheduling

+## Proposal Status
+***Implemented***
+
+# Proposal

## Overview

Prefix caching is a well-known technique in LLM inference to save duplicate tensor computation for prompts with the same prefix tokens, and is available in many model servers or model as a service providers. Leveraging prefix caching can significantly boost system performance, especially the time to first token (TTFT). Given that EPP has a global view of requests and model servers in the `InferencePool`, it can schedule requests intelligently to maximize the global prefix cache hit rate.
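One common way to make a scheduler prefix-cache aware is to hash the prompt in fixed-size chunks with a rolling hash, remember which chunk hashes each server has served, and score servers by how long a cached prefix they already hold. The sketch below is a minimal, hypothetical version of that idea (chunk size, names, and data structures are all assumptions for illustration, not this proposal's actual design):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// chunkSize bytes per prefix block; a real system would chunk by tokens
// (this value is an assumption for the sketch).
const chunkSize = 8

// hashPrefixChunks splits a prompt into fixed-size chunks and returns a
// cumulative (rolling) hash per chunk, so two prompts sharing a prefix
// share their leading hashes.
func hashPrefixChunks(prompt string) []uint64 {
	var hashes []uint64
	h := fnv.New64a()
	for i := 0; i < len(prompt); i += chunkSize {
		end := i + chunkSize
		if end > len(prompt) {
			end = len(prompt)
		}
		h.Write([]byte(prompt[i:end]))
		hashes = append(hashes, h.Sum64())
	}
	return hashes
}

// scoreServers counts, per server, how many leading chunk hashes of the
// prompt that server has already served, approximating its prefix cache
// hit length for this request.
func scoreServers(prompt string, cache map[string]map[uint64]bool) map[string]int {
	scores := map[string]int{}
	for server, seen := range cache {
		for _, ch := range hashPrefixChunks(prompt) {
			if !seen[ch] {
				break
			}
			scores[server]++
		}
	}
	return scores
}

func main() {
	// server-a previously served a prompt with the same system-prompt prefix.
	cache := map[string]map[uint64]bool{"server-a": {}, "server-b": {}}
	for _, ch := range hashPrefixChunks("You are a helpful assistant. Hi") {
		cache["server-a"][ch] = true
	}
	scores := scoreServers("You are a helpful assistant. What is 2+2?", cache)
	fmt.Println(scores["server-a"] > scores["server-b"]) // true
}
```

Routing the request to the highest-scoring server maximizes reuse of already-computed KV-cache entries, which is what improves TTFT.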
2 changes: 1 addition & 1 deletion docs/proposals/0683-epp-architecture-proposal/README.md
@@ -2,7 +2,7 @@

Author(s): @kfswain
## Proposal Status
-***Draft***
+***Implemented***

## Summary

@@ -2,7 +2,7 @@

Author(s): @kfswain, @ahg-g, @nirrozenbaum
## Proposal Status
-***Draft***
+***Implemented***

## Summary
The Scheduling Subsystem is a framework used to implement scheduling algorithms. High level definition [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/006-scheduler) & EPP Architecture [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
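A scheduling framework of this kind is commonly structured as a pipeline of pluggable filters (which drop ineligible endpoints) followed by pluggable scorers (which rank the survivors). The sketch below shows that shape in miniature; the type names, the health filter, and the KV-utilization scorer are all hypothetical stand-ins, not the subsystem's real interfaces:

```go
package main

import "fmt"

// Endpoint is a simplified model-server endpoint (illustrative only).
type Endpoint struct {
	Name    string
	Healthy bool
	KVUtil  float64 // KV-cache utilization in [0, 1]
}

// Filter removes endpoints that cannot serve the request.
type Filter func([]Endpoint) []Endpoint

// Scorer assigns a preference score to a surviving endpoint.
type Scorer func(Endpoint) float64

// schedule runs every filter in order, then picks the highest-scoring
// survivor; ok is false when the filters eliminate all endpoints.
func schedule(eps []Endpoint, filters []Filter, score Scorer) (best Endpoint, ok bool) {
	for _, f := range filters {
		eps = f(eps)
	}
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	best = eps[0]
	for _, e := range eps[1:] {
		if score(e) > score(best) {
			best = e
		}
	}
	return best, true
}

func main() {
	eps := []Endpoint{
		{"pod-a", true, 0.9},
		{"pod-b", true, 0.3},
		{"pod-c", false, 0.1},
	}
	healthy := func(in []Endpoint) []Endpoint {
		var out []Endpoint
		for _, e := range in {
			if e.Healthy {
				out = append(out, e)
			}
		}
		return out
	}
	lowKV := func(e Endpoint) float64 { return 1 - e.KVUtil }
	best, ok := schedule(eps, []Filter{healthy}, lowKV)
	fmt.Println(ok, best.Name) // true pod-b
}
```

Separating filtering from scoring is what lets individual scheduling algorithms be swapped in as plugins without changing the surrounding pipeline.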
2 changes: 1 addition & 1 deletion docs/proposals/1023-data-layer-architecture/README.md
@@ -4,7 +4,7 @@ Author(s): @elevran @nirrozenbaum

## Proposal Status

-***Draft***
+***Accepted***

## Summary

6 changes: 5 additions & 1 deletion docs/proposals/1199-inferencemodel-api-evolution/README.md
@@ -2,7 +2,11 @@

Author(s): @kfswain, @ahg-g, @lukeavandrie
## Proposal Status
-***Draft***
+***Implemented***
+
+Notes:
+- Phase 1 is complete.
+- Phase 2 is still in progress.

## Summary
Multiple docs have discussed the restructuring of the InferenceModel API. This [doc](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo) proposes an InferenceSchedulingObjective CRD, and this [doc](https://docs.google.com/document/d/1G-CQ17CM4j1vNE3T6u9uP2q-m6jK14ANPCwTfJ2qLS4/edit?tab=t.0) builds upon the previous document to solidify the requirement for the new iteration of the InferenceModel API to continue to solve the identity problem. Both these documents were useful in continuing to gather feedback & iterate on a proper solution.