Commit 084227c

shmuelk authored and kfswain committed
Add documentation for the new Configuration via text feature (kubernetes-sigs#1110)
* Add documentation for the new text based configuration
  Signed-off-by: Shmuel Kallner <[email protected]>
* Update from review comment
  Signed-off-by: Shmuel Kallner <[email protected]>
* Formatting changes
  Signed-off-by: Shmuel Kallner <[email protected]>
* Updated plugin types
  Signed-off-by: Shmuel Kallner <[email protected]>
* Updated plugin types and reformatted to remove HTML
  Signed-off-by: Shmuel Kallner <[email protected]>
---------
Signed-off-by: Shmuel Kallner <[email protected]>
1 parent 7812b81 commit 084227c

2 files changed: +256 -0 lines changed

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ nav:
 - InferencePool Rollout: guides/inferencepool-rollout.md
 - Metrics and Observability: guides/metrics-and-observability.md
 - Configuration Guide:
+- Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
 - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
 - Implementer Guides:
 - Getting started: guides/implementers.md
Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
# Configuring Plugins via text

The set of lifecycle hooks (plugins) used by the Inference Gateway (IGW) is determined by how
it is configured. The IGW can be configured in several ways, either by code or via text.

If configured by code, either a set of predetermined environment variables must be used or one must
fork the IGW and change the code.

A simpler way to configure the IGW is to use a text-based configuration. This text is in YAML format
and can either be in a file or specified in-line as a parameter. The configuration defines the set of
plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
the same plugin type to be instantiated multiple times, if needed. Also defined is a set of
SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set
of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles
will be used for a particular request.

Note that while the configuration text looks like a Kubernetes Custom Resource, it is
**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
text and in the future will also help in versioning the text.

Note also that even when the configuration text is loaded from a file, it is loaded at
the Endpoint Picker's (EPP) startup, and changes to the file at runtime are ignored.

The configuration text has the following form:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- ....
- ....
schedulingProfiles:
- ....
- ....
```

The first two lines of the configuration are constant and must appear as is.

The plugins section defines the set of plugins that will be instantiated and their parameters.
Each entry in this section has the following form:

```yaml
- name: aName
  type: a-type
  parameters:
    parm1: val1
    parm2: val2
```

The fields in a plugin entry are:

- *name* (optional) provides a name by which the plugin instance can be referenced. If this
  field is omitted, the plugin's type will be used as its name.
- *type* specifies the type of the plugin to be instantiated.
- *parameters* (optional) defines the set of parameters used to configure the plugin in question.
  The actual set of parameters varies from plugin to plugin.

The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
requests to pods. The number of scheduling profiles one defines depends on the use case. For simple
serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry
in this section has the following form:

```yaml
- name: aName
  plugins:
  - pluginRef: plugin1
  - pluginRef: plugin2
    weight: 50
```

The fields in a schedulingProfile entry are:

- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
  Each entry in the schedulingProfile's plugins section has the following fields:
    - *pluginRef* is a reference to the name of the plugin instance to be used.
    - *weight* is the weight to be used if the referenced plugin is a scorer.

A complete configuration might look like this:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
  parameters:
    hashBlockSize: 5
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
- type: max-score-picker
- type: single-profile-handler
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: max-score-picker
  - pluginRef: prefix-cache-scorer
    weight: 50
```

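The naming rules above (a plugin without a *name* is referenced by its *type*, and every *pluginRef* must point at an instantiated plugin) can be checked mechanically. The following is a minimal sketch, not the EPP's own validation: the configuration is written as a plain Python dict mirroring the YAML, and the helper names `plugin_names` and `check_references` are hypothetical. Flagging duplicate names is an assumption of this sketch; the guide does not say how the EPP treats them.

```python
from typing import Any

# The complete configuration from this guide, written as a Python dict.
EXAMPLE: dict[str, Any] = {
    "apiVersion": "inference.networking.x-k8s.io/v1alpha1",
    "kind": "EndpointPickerConfig",
    "plugins": [
        {
            "type": "prefix-cache-scorer",
            "parameters": {
                "hashBlockSize": 5,
                "maxPrefixBlocksToMatch": 256,
                "lruCapacityPerServer": 31250,
            },
        },
        {"type": "max-score-picker"},
        {"type": "single-profile-handler"},
    ],
    "schedulingProfiles": [
        {
            "name": "default",
            "plugins": [
                {"pluginRef": "max-score-picker"},
                {"pluginRef": "prefix-cache-scorer", "weight": 50},
            ],
        }
    ],
}


def plugin_names(config: dict[str, Any]) -> list[str]:
    """Return each plugin's instance name; a missing name falls back to the type."""
    return [p.get("name", p["type"]) for p in config.get("plugins", [])]


def check_references(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found (empty when every reference resolves)."""
    problems: list[str] = []
    known: set[str] = set()
    for name in plugin_names(config):
        if name in known:
            # Assumption of this sketch: duplicate names are treated as an error.
            problems.append(f"duplicate plugin name {name!r}")
        known.add(name)
    for profile in config.get("schedulingProfiles", []):
        for entry in profile.get("plugins", []):
            ref = entry["pluginRef"]
            if ref not in known:
                problems.append(
                    f"profile {profile['name']!r} references unknown plugin {ref!r}"
                )
    return problems
```

Calling `check_references(EXAMPLE)` returns an empty list, since both references in the `default` profile resolve to plugins whose names default to their types.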
If the configuration is in a file, the EPP command line argument `--configFile`
should be used to specify the full path of the file in question. For example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configFile
        - "/etc/epp/epp-config.yaml"
```

If the configuration is passed as in-line text, the EPP command line argument `--configText`
should be used. For example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${EPP_NAME}
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: epp
        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest
        imagePullPolicy: IfNotPresent
        args:
        - -poolName
        - "${POOL_NAME}"
        ...
        - --configText
        - |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: prefix-cache-scorer
            parameters:
              hashBlockSize: 5
              maxPrefixBlocksToMatch: 256
              lruCapacityPerServer: 31250
          - type: max-score-picker
          - type: single-profile-handler
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: max-score-picker
            - pluginRef: prefix-cache-scorer
              weight: 50
```

## Plugin Configuration

This section describes how to set up the various plugins that are available with the IGW.

#### **SingleProfileHandler**

Selects a single profile which is always the primary profile.

- *Type*: single-profile-handler
- *Parameters*: none

#### **LeastKVCacheFilter**

Finds the max and min KV cache of all pods, divides the whole range (max-min) by the
number of pods, and finds the pods that fall into the first range.

- *Type*: least-kv-cache-filter
- *Parameters*: none

#### **LeastQueueFilter**

Finds the max and min queue size of all pods, divides the whole range (max-min) by the
number of pods, and finds the pods that fall into the first range.

- *Type*: least-queue-filter
- *Parameters*: none

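The range-based selection described for LeastKVCacheFilter and LeastQueueFilter can be illustrated with a short sketch. This is one reading of the description above, not the EPP's implementation: the function name `first_range` and the pod names are hypothetical, and treating the upper edge of the first range as inclusive is an assumption.

```python
def first_range(values: dict[str, float]) -> list[str]:
    """Keep the pods whose metric falls in the lowest of len(values) equal-width ranges.

    `values` maps a pod name to the metric being compared (for example the
    waiting queue size or the KV cache usage).
    """
    if not values:
        return []
    lowest = min(values.values())
    highest = max(values.values())
    # Divide the whole range (max - min) by the number of pods.
    width = (highest - lowest) / len(values)
    # Assumption: a value exactly on the upper edge still counts as "in" the first range.
    return [pod for pod, value in values.items() if value <= lowest + width]


queue_sizes = {"pod-a": 0, "pod-b": 2, "pod-c": 9}
print(first_range(queue_sizes))  # ['pod-a', 'pod-b']
```

With queue sizes 0, 2 and 9, the range is 9 and each of the three sub-ranges is 3 wide, so only the pods with at most 3 waiting requests remain. When every pod reports the same value, the width is zero and all pods are kept.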
#### **LoraAffinityFilter**

Implements a pod selection strategy that, when the use of a LoRA adapter is requested, prioritizes pods
that are believed to have the specific LoRA adapter loaded. It also allows for load balancing through
some randomization.

- *Type*: lora-affinity-filter
- *Parameters*:
    - `threshold` a probability threshold to sometimes select pods that don't seem to have the LoRA
      adapter loaded to enable load balancing. If not specified, defaults to `0.999`

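To make the role of `threshold` concrete, here is a hedged sketch of the randomized choice. It assumes the filter splits the candidates into pods believed to have the adapter and pods that are not, then draws one random number per request; the function name, the `pods` mapping, and the injectable `rand` argument are all hypothetical and exist only to make the sketch testable.

```python
import random
from typing import Callable


def lora_affinity_filter(
    pods: dict[str, set[str]],
    adapter: str,
    threshold: float = 0.999,
    rand: Callable[[], float] = random.random,
) -> list[str]:
    """Return the pods to keep for a request that asks for `adapter`.

    `pods` maps a pod name to the set of adapters believed to be loaded on it.
    """
    with_adapter = [name for name, loaded in pods.items() if adapter in loaded]
    without_adapter = [name for name, loaded in pods.items() if adapter not in loaded]
    if with_adapter and without_adapter:
        # Most of the time prefer pods with the adapter; occasionally pick the
        # others so that load can spread to pods that have not loaded it yet.
        return with_adapter if rand() < threshold else without_adapter
    # Only one group is non-empty (or both are empty): nothing to choose between.
    return with_adapter or without_adapter
```

With the default of `0.999`, roughly one request in a thousand would be steered to pods without the adapter under this reading.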
#### **LowQueueFilter**

Filters out pods whose waiting queue size is greater than the specified threshold.

- *Type*: low-queue-filter
- *Parameters*:
    - `threshold` the waiting queue threshold. If not specified, defaults to `128`

#### **PrefixCachePlugin**

Scores pods based on how much of the prompt is believed to be in the pod's KV cache.

- *Type*: prefix-cache-scorer
- *Parameters*:
    - `hashBlockSize` specifies the size of the blocks into which the input prompt is split when
      calculating the block hashes. If not specified, defaults to `64`
    - `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If
      not specified, defaults to `256`
    - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries
      per server (pod). If not specified, defaults to `31250`

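The interaction between `hashBlockSize` and `maxPrefixBlocksToMatch` is easier to see in code. The sketch below is illustrative only: the use of SHA-256, the chaining of each block's hash with the previous one, and the decision to hash only full blocks are assumptions made for this example, and the function names are hypothetical rather than taken from the plugin.

```python
import hashlib


def block_hashes(
    prompt: str, hash_block_size: int = 64, max_prefix_blocks: int = 256
) -> list[str]:
    """Split the prompt into fixed-size blocks and hash each one.

    Each hash also covers the previous block's hash, so two prompts share
    hash i only if they agree on every block up to and including block i.
    """
    data = prompt.encode("utf-8")
    hashes: list[str] = []
    previous = b""
    # Assumption: a trailing partial block is ignored.
    for start in range(0, len(data) - hash_block_size + 1, hash_block_size):
        if len(hashes) == max_prefix_blocks:
            break
        block = data[start:start + hash_block_size]
        digest = hashlib.sha256(previous + block).digest()
        hashes.append(digest.hex())
        previous = digest
    return hashes


def matched_prefix_blocks(a: list[str], b: list[str]) -> int:
    """Count how many leading block hashes two prompts have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count
```

Under these assumptions a smaller `hashBlockSize` gives finer-grained prefix matching at the cost of more hashes per prompt, while `maxPrefixBlocksToMatch` bounds the work done for very long prompts.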
#### **MaxScorePicker**

Picks the pod with the maximum score from the list of candidates.

- *Type*: max-score-picker
- *Parameters*: none

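As an illustration of how scorer weights and a maximum-score pick could fit together, the sketch below combines per-scorer scores into a weighted sum and then selects the highest total. The weighted-sum combination is an assumption of this example; the guide only states that *weight* applies to scorers. The function names, pod names, and tie-breaking (first pod encountered wins) are likewise hypothetical.

```python
def total_scores(
    scores_by_scorer: dict[str, dict[str, float]], weights: dict[str, float]
) -> dict[str, float]:
    """Sum each pod's scores, multiplying every scorer's score by its weight."""
    totals: dict[str, float] = {}
    for scorer, scores in scores_by_scorer.items():
        for pod, score in scores.items():
            totals[pod] = totals.get(pod, 0.0) + weights[scorer] * score
    return totals


def pick_max(totals: dict[str, float]) -> str:
    """Return the pod with the highest total; ties go to the first pod seen."""
    return max(totals, key=lambda pod: totals[pod])


scores = {
    "prefix-cache-scorer": {"pod-a": 0.5, "pod-b": 0.25},
    "queue-scorer": {"pod-a": 0.0, "pod-b": 1.0},
}
weights = {"prefix-cache-scorer": 50, "queue-scorer": 10}
print(pick_max(total_scores(scores, weights)))  # pod-a
```

Here `pod-a` totals 25.0 and `pod-b` totals 22.5, so the heavier weight on the prefix cache score outweighs `pod-b`'s shorter queue.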
#### **RandomPicker**

Picks a random pod from the list of candidates.

- *Type*: random-picker
- *Parameters*: none

#### **KvCacheScorer**

Scores the candidate pods based on their KV cache utilization.

- *Type*: kv-cache-scorer
- *Parameters*: none

#### **QueueScorer**

Scores the candidate pods based on each pod's waiting queue size. The lower the
waiting queue size the pod has, the higher the score it will get (since it's more
available to serve new requests).

- *Type*: queue-scorer
- *Parameters*: none
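
The inverse relationship between queue size and score can be sketched as a simple normalization. This is an illustrative assumption, not a statement of the scorer's exact formula: it maps the shortest queue to 1.0 and the longest to 0.0, and the function name `queue_scores` is hypothetical.

```python
def queue_scores(queue_sizes: dict[str, int]) -> dict[str, float]:
    """Map each pod's waiting queue size to a score in [0, 1]; shorter is better."""
    if not queue_sizes:
        return {}
    shortest = min(queue_sizes.values())
    longest = max(queue_sizes.values())
    if shortest == longest:
        # All pods are equally loaded, so none is preferred over another.
        return {pod: 1.0 for pod in queue_sizes}
    return {
        pod: (longest - size) / (longest - shortest)
        for pod, size in queue_sizes.items()
    }


print(queue_scores({"pod-a": 0, "pod-b": 5, "pod-c": 10}))
# {'pod-a': 1.0, 'pod-b': 0.5, 'pod-c': 0.0}
```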
