Labels: triage/accepted (indicates an issue or PR is ready to be actively worked on)
The InferencePool API is fairly lightweight right now, and what it conceptually represents is well understood: it is a special type of Service for inference deployments.

The InferenceModel API, however, is less well understood. This issue proposes revising the semantics and name of the InferenceModel API for the following reasons:
- The API name gives the impression that it represents and manages the deployment of a model/adapter, which is not true.
- The API specifies the model name as the only request-matching rule. This means only one InferenceModel object can apply to requests targeting a given model within a single InferencePool, which limits the ability to define different policies for different users or applications targeting the same model.
- Defining an InferenceModel for each adapter served by the pool can be an operational overhead. Consider the case where the pool is serving hundreds of adapters.
The proposal is to introduce a new API named InferenceSchedulingObjective that is focused on defining scheduling policies for a matching request flow. Concretely:
- Drop the InferenceModel API; no changes to the InferencePool API.
- Create a new API named InferenceSchedulingObjective. The API will be limited to defining endpoint-picking scheduling policies (aka serving objectives) for matching requests. The expectation is that the inference scheduler (run by the EPP) will be the main controller actuating on this API.
- Extend request matching beyond model name to include headers. This allows defining different scheduling policies for different request flows (apps or users) while targeting the same model.
- Allow defining a default InferenceSchedulingObjective per InferencePool as a fallback policy when no objective matches the request.
- Traffic splitting is not part of the InferenceSchedulingObjective API. Traffic splitting is not an endpoint scheduling objective; it is a request routing objective. As described below, with some creativity, we can offload traffic splitting to HTTPRoute. An intended side effect of this is that users will be able to define different scheduling policies for different target models, something they can't do with the current API.
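To make the shape concrete, a hypothetical InferenceSchedulingObjective manifest might look like the following. Every field name here, the API group, and the values are illustrative assumptions for discussion, not an agreed-upon spec:

```yaml
# Hypothetical sketch only: field names and API group are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceSchedulingObjective
metadata:
  name: premium-llama-traffic
spec:
  # The pool this objective applies to (assumed reference shape).
  poolRef:
    name: my-inference-pool
  # Matching extends beyond model name to headers, so different request
  # flows targeting the same model can get different policies.
  matches:
  - modelName: meta-llama/Llama-3-8b
    headers:
    - name: x-tenant-tier
      value: premium
  # Endpoint-picking objective actuated by the inference scheduler (EPP).
  priority: 10
```

A default objective per InferencePool would be the same shape with no `matches` stanza, serving as the fallback when no other objective matches a request.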
Please check the following doc for a more detailed discussion of the proposal: https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0
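For the traffic-splitting piece offloaded to HTTPRoute, one possible arrangement is weighted `backendRefs` across two InferencePools, each carrying its own scheduling policies. The pool and gateway names below are hypothetical; the InferencePool backend group follows the inference-extension convention:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-split
spec:
  parentRefs:
  - name: inference-gateway   # hypothetical Gateway name
  rules:
  - backendRefs:
    # 90/10 split between two InferencePools; because each pool is a
    # separate backend, each can have distinct scheduling policies.
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llama-base-pool
      weight: 90
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llama-finetuned-pool
      weight: 10
```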