# InferenceModel API Evolution

Author(s): @kfswain, @ahg-g, @lukeavandrie

## Proposal Status

***Draft***

## Summary

Multiple docs have discussed restructuring the InferenceModel API. This [doc](https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo) proposes an InferenceSchedulingObjective CRD, and this [doc](https://docs.google.com/document/d/1G-CQ17CM4j1vNE3T6u9uP2q-m6jK14ANPCwTfJ2qLS4/edit?tab=t.0) builds on the previous one, solidifying the requirement that the next iteration of the InferenceModel API continue to solve the identity problem. Both documents were useful for gathering feedback & iterating toward a proper solution.

This proposal is intended to act as the plan of record for the solution that will be implemented.

## Implementation Phases

### Phase 1 - Rename, Split, & Modify InferenceModel

A few points informed the justification & structure of this change:

- The Criticality field of InferenceModel is in use & provides functionality
- InferenceModel is an Alpha API
- InferenceModel is not depended upon by upstream or downstream components

Phase 1 will retain the Criticality functionality, but will rename the InferenceModel API and slim down the spec. Additionally, the slimmed-down spec can be applied on a _per-request_ basis. Justification is in [Phase 1](#phase-1).

### Phase 2 - Introduce new Usage Tracking, Fairness, & SLO CRDs

Phase 2 will happen over a longer period of time & gradually introduce new CRDs to Inference Gateway. Much of this proposal is written with Phase 2 in mind, but Phase 2 should be considered experimental & subject to change.

Phase 2 will primarily introduce and develop these CRDs:

- Usage tracking (used in fairness)
- Fairness configuration
- SLO configuration

## Design Principles

### Goals

- Reliable and predictable fairness allocation
- Disconnect identity from policy-like objects where possible
- Anonymous identity/defaults are graceful (fault-tolerant) & unsurprising
- Scalable, simple, and reusable config
- Retain the functionality of InferenceModel
  - Traffic splitting across models & modelName rewrite
  - Criticality

### Non-Goals

- Addressing security concerns with the API; the API is currently expected to be either:
  - Entirely contained within a trusted system
  - Or used with auth handled upstream
- IGW implementing a custom auth mechanism

## Definitions

- **Tenant**: Kubernetes uses the term ***tenant*** as described [here](https://kubernetes.io/docs/concepts/security/multi-tenancy/#tenants). Fairness APIs _may_ be used in multi-tenant scenarios, so multi-tenancy is used as an example throughout this proposal.

# Proposal

Discussion of the problem(s) can be found in the linked documents. This section describes the new API surface.

## Phase 1

### Structure change

This API addresses 3 general problem pillars, which can be grouped into 2 areas:

Higher-order Request Grouping (Usage tracking):

- This API describes Resource Sharing (Criticality/Fairness)
- This API describes Identification (used in Fairness)

Request specific objectives:

- This API describes Specific Request Policy (SLO/Criticality)

As such, the InferenceModel API will be split into separate CRDs to reflect the difference in these scopes. Phase 1 will focus on the **Request specific objectives**; specifically, it will retain criticality. Other Phase 1 changes:

- The EPP will expose a flag to define the header key (default: `x-gateway-inference-objectives`) used to assign InferenceObjectives to requests
- The EPP will expose a flag to define the header key (default: `x-gateway-inference-fairness-id`) used to track Request Usage (this will act as the identifier for a simple fairness implementation)
- The EPP will expose a flag to define the header key (default: `x-gateway-model-name-rewrite`) used when the provided model name should be overwritten
- The modelName rewrite functionality will be built into EPP as a core feature (also driven by a header). **NOTE**: _Relying on this feature to write a proper model name disables the fail-open feature_
- Continue to support traffic splitting across models, although not necessarily via GIE CRDs directly (e.g., delegated to GW API/HTTPRoute) - example [here](https://docs.google.com/document/d/1s4U4T_cjQkk4UeIDyAJl2Ox6FZoBigXBXn9Ai0qV7As/edit?tab=t.0#heading=h.bkttj79mzxlz)

### Naming

The planned name for the CRD that will house **Request specific objectives** is `InferenceObjectives`.

### CRD spec

This CRD definition is a slimmed-down version of InferenceModel with a name change. Example:

```golang
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type InferenceObjectives struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec InferenceObjectivesSpec `json:"spec,omitempty"`
}

type InferenceObjectivesSpec struct {
	// PoolRef identifies the InferencePool these objectives apply to.
	PoolRef InferenceObjectReference `json:"poolRef"`

	// This is a departure from InferenceModel, which used a string for criticality.
	// We received quite a bit of feedback asking for custom criticality bands, so an
	// int/enum is more flexible & carries inherent stack-rank value.
	Criticality *int `json:"criticality,omitempty"`
}
```

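To illustrate the "inherent stack rank value" of an integer criticality, here is a small Go sketch that orders objectives by criticality. It assumes larger values mean more critical and that an unset `Criticality` falls back to a hypothetical `defaultCriticality` of 0; the proposal defines neither convention. The `objective` type is a local stand-in for `InferenceObjectivesSpec`.

```go
package main

import (
	"fmt"
	"sort"
)

// objective is a stand-in for InferenceObjectivesSpec, reduced to the
// Criticality field. Assumption: a larger value means more critical.
type objective struct {
	Name        string
	Criticality *int
}

// defaultCriticality is a hypothetical value used when Criticality is unset.
const defaultCriticality = 0

func criticalityOf(o objective) int {
	if o.Criticality == nil {
		return defaultCriticality
	}
	return *o.Criticality
}

// rankByCriticality returns a copy ordered from most to least critical.
// Because Criticality is an int, operators can define as many bands as they
// need and ordering is a plain numeric comparison.
func rankByCriticality(objs []objective) []objective {
	out := append([]objective(nil), objs...)
	sort.SliceStable(out, func(i, j int) bool {
		return criticalityOf(out[i]) > criticalityOf(out[j])
	})
	return out
}

func intPtr(v int) *int { return &v }

func main() {
	objs := []objective{
		{Name: "batch", Criticality: intPtr(-10)},
		{Name: "unset"},
		{Name: "interactive", Criticality: intPtr(100)},
		{Name: "standard", Criticality: intPtr(50)},
	}
	for _, o := range rankByCriticality(objs) {
		fmt.Println(o.Name, criticalityOf(o))
	}
}
```

With a string enum, adding a band between two existing ones would require a new enum value and an explicit ordering table; with an int, a new band is just a new number.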
## Phase 2 - SUBJECT TO CHANGE

***NOTE: `InferenceUsageMeter` Name is a placeholder***

### CRD spec

```golang
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type InferenceUsageMeter struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec InferenceUsageMeterSpec `json:"spec,omitempty"`
}

type InferenceUsageMeterSpec struct {
	// Optional; defaults to the kube object name if not set.
	ID      *string                  `json:"id,omitempty"`
	PoolRef InferenceObjectReference `json:"poolRef"`

	// One of: this allows for embedded configuration or a reference to a commonly used config.
	UsageLimits    *NotYetDefinedFairnessCRD `json:"usageLimits,omitempty"`
	UsageLimitsRef *InferenceObjectReference `json:"usageLimitsRef,omitempty"`
}

type InferenceObjectives struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec InferenceObjectivesSpec `json:"spec,omitempty"`
}

type InferenceObjectivesSpec struct {
	PoolRef InferenceObjectReference `json:"poolRef"`

	// This is a departure from InferenceModel, which used a string for criticality.
	// We received quite a bit of feedback asking for custom criticality bands, so an
	// int/enum is more flexible & carries inherent stack-rank value.
	Criticality *int `json:"criticality,omitempty"`

	// Embedded SLO configuration or a reference to a shared config
	// (mirrors UsageLimits/UsageLimitsRef).
	// Doc on SLO CRD here: https://docs.google.com/document/d/1j2KRAT68_FYxq1iVzG0xVL-DHQhGVUZBqiM22Hd_0hc/edit?resourcekey=0-5cSovS8QcRQNYXj0_kRMiw&tab=t.0#heading=h.emkaixupvf39
	PerformanceObjectives    *NotYetDefinedSLOCRD      `json:"performanceObjectives,omitempty"`
	PerformanceObjectivesRef *InferenceObjectReference `json:"performanceObjectivesRef,omitempty"`
}
```

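The `UsageLimits`/`UsageLimitsRef` pair is documented as "one of". The Go sketch below makes that rule concrete using local stand-in types (the real `InferenceObjectReference` and fairness config are not defined here). Whether leaving both fields unset should be valid is not specified in this proposal; this sketch allows it.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for types the proposal leaves undefined or imports.
type InferenceObjectReference struct{ Name string }
type NotYetDefinedFairnessCRD struct{} // placeholder, as in the proposal

type InferenceUsageMeterSpec struct {
	ID             *string
	PoolRef        InferenceObjectReference
	UsageLimits    *NotYetDefinedFairnessCRD
	UsageLimitsRef *InferenceObjectReference
}

// validateUsageLimits enforces the "one of" rule: a spec may embed its
// limits or reference a shared config, but not both. Setting neither is
// permitted here (an assumption, e.g. to allow a default config).
func validateUsageLimits(s InferenceUsageMeterSpec) error {
	if s.UsageLimits != nil && s.UsageLimitsRef != nil {
		return errors.New("only one of usageLimits or usageLimitsRef may be set")
	}
	return nil
}

func main() {
	both := InferenceUsageMeterSpec{
		PoolRef:        InferenceObjectReference{Name: "pool-a"},
		UsageLimits:    &NotYetDefinedFairnessCRD{},
		UsageLimitsRef: &InferenceObjectReference{Name: "shared-limits"},
	}
	fmt.Println(validateUsageLimits(both))

	refOnly := InferenceUsageMeterSpec{
		UsageLimitsRef: &InferenceObjectReference{Name: "shared-limits"},
	}
	fmt.Println(validateUsageLimits(refOnly))
}
```

In an actual CRD this constraint would more likely be declared in the schema (for example, via a CEL validation rule) than checked in controller code; the function form is only meant to make the rule explicit.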
### Intent

The purposes of the `InferenceUsageMeter` are to:

- Create a strong concept of usage tracking within the inference pool, used to associate groups of requests together for the purpose of Flow Control (Fairness), which can enforce:
  - Fair resource sharing
  - Inter-tenant prioritization
  - SLO attainment
- Detach identification from the modelName field

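As a rough illustration of associating groups of requests for Flow Control, the Go sketch below accumulates usage per fairness ID and dispatches next from the queued ID that has used the least so far. The proposal does not define a fairness algorithm; the token-based usage metric, the `usageTracker` type, and the least-used policy are all assumptions chosen for simplicity.

```go
package main

import (
	"fmt"
	"sort"
)

// usageTracker accumulates a usage measure (here, tokens) per fairness ID.
// Both the metric and the selection policy are illustrative assumptions.
type usageTracker struct {
	used map[string]int64
}

func newUsageTracker() *usageTracker {
	return &usageTracker{used: make(map[string]int64)}
}

// record attributes usage to the ID carried by a request, e.g. the value
// of the fairness-ID header.
func (u *usageTracker) record(id string, tokens int64) {
	u.used[id] += tokens
}

// next returns, among IDs with queued work, the one that has consumed the
// least so far. Ties are broken by ID so the result is deterministic.
func (u *usageTracker) next(queued []string) (string, bool) {
	if len(queued) == 0 {
		return "", false
	}
	ids := append([]string(nil), queued...)
	sort.Slice(ids, func(i, j int) bool {
		ui, uj := u.used[ids[i]], u.used[ids[j]]
		if ui != uj {
			return ui < uj
		}
		return ids[i] < ids[j]
	})
	return ids[0], true
}

func main() {
	u := newUsageTracker()
	u.record("tenant-a", 1200)
	u.record("tenant-b", 300)
	id, _ := u.next([]string{"tenant-a", "tenant-b", "tenant-c"})
	fmt.Println(id) // tenant-c: it has no recorded usage yet
}
```

Keying on an explicit ID rather than on `modelName` is what lets two tenants calling the same model be tracked separately.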
## Design points

This section discusses specific choices made in the API design.

### Identification

**Note**: The ID field would default to the kube name.

The only field associated with identification is the `ID` field. An optional ID field was chosen (rather than strictly using the metadata name) because:

- A user may not want to put the same restrictions on the ID that are enforced on a kube resource name
- The ID name may be duplicated across different pools
  - This could also be solved by allowing the UsageMeter & Objectives to reference multiple pools
- Use of a kube-generated name would force an upstream Auth mechanism to be aware of the `InferenceObjectives` API

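The ID defaulting and pool scoping described above could be sketched in Go as follows. `usageMeter`, `effectiveID`, and the `(pool, ID)` map key are hypothetical; the one behavior taken from the proposal is that `ID` falls back to the object name. Treating an empty string as unset is an additional assumption.

```go
package main

import "fmt"

// Minimal stand-ins; the real types would come from the CRD definitions.
type InferenceObjectReference struct{ Name string }

type usageMeter struct {
	Name    string  // metadata.name of the InferenceUsageMeter object
	ID      *string // optional spec.ID
	PoolRef InferenceObjectReference
}

// effectiveID returns spec.ID if set, otherwise the object's name.
// Assumption: an empty ID is treated the same as an unset one.
func effectiveID(m usageMeter) string {
	if m.ID != nil && *m.ID != "" {
		return *m.ID
	}
	return m.Name
}

// meterKey scopes an ID to its pool, since the same ID may appear in
// different pools.
type meterKey struct {
	Pool string
	ID   string
}

func indexMeters(meters []usageMeter) map[meterKey]usageMeter {
	idx := make(map[meterKey]usageMeter, len(meters))
	for _, m := range meters {
		idx[meterKey{Pool: m.PoolRef.Name, ID: effectiveID(m)}] = m
	}
	return idx
}

func main() {
	custom := "Tenant_A@corp" // not a valid kube object name, but fine as an ID
	idx := indexMeters([]usageMeter{
		{Name: "tenant-a", ID: &custom, PoolRef: InferenceObjectReference{Name: "pool-1"}},
		{Name: "tenant-b", PoolRef: InferenceObjectReference{Name: "pool-1"}},
		{Name: "tenant-a-pool2", ID: &custom, PoolRef: InferenceObjectReference{Name: "pool-2"}},
	})
	fmt.Println(len(idx))
}
```

The example ID would be rejected as a Kubernetes object name, which is the first motivation listed above, and the same ID is reused across two pools without colliding.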
***Discussion point***: To support a high volume of tenants, we could allow IGW to accept unique IDs that do not have an explicit InferenceUsageMeter object defined, applying a default fairness configuration instead. **Feedback here requested.**

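One possible shape for this fallback, as a hypothetical Go sketch: IDs with an explicit `InferenceUsageMeter` resolve to their own configuration, and any other ID resolves to a shared default. `fairnessResolver`, `fairnessConfig`, and its `Weight` field are placeholders for the not-yet-defined fairness CRD.

```go
package main

import "fmt"

// fairnessConfig is a hypothetical stand-in for the fairness configuration.
type fairnessConfig struct {
	Weight int
}

// fairnessResolver maps fairness IDs to configuration, falling back to a
// default when no InferenceUsageMeter exists for the ID.
type fairnessResolver struct {
	explicit map[string]fairnessConfig // keyed by ID from InferenceUsageMeter objects
	fallback fairnessConfig
}

// resolve reports the config for id and whether it came from an explicit
// InferenceUsageMeter (true) or the default (false).
func (r fairnessResolver) resolve(id string) (cfg fairnessConfig, explicit bool) {
	if c, ok := r.explicit[id]; ok {
		return c, true
	}
	return r.fallback, false
}

func main() {
	r := fairnessResolver{
		explicit: map[string]fairnessConfig{"tenant-a": {Weight: 10}},
		fallback: fairnessConfig{Weight: 1},
	}
	for _, id := range []string{"tenant-a", "tenant-z"} {
		cfg, explicit := r.resolve(id)
		fmt.Println(id, cfg.Weight, explicit)
	}
}
```

One open choice this surfaces is whether each unknown ID is tracked separately under the default config or whether all such IDs share a single bucket; the sketch only resolves configuration and leaves that tracking decision open.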
#### Alternative consideration(s)

- Expanding the PoolRef field to be plural was considered; however, it was not selected in order to maintain simplicity. This decision can be revisited in the future.
