The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how
4
-
it is configured. The IGW can be configured in several ways, either by code or via text.
5
-
6
-
If configured by code either a set of predetermined environment variables must be used or one must
7
-
fork the IGW and change code.
3
+
The Inference Gateway (IGW) can be configured via a text-based configuration.
8
4
9
-
A simpler way to configure the IGW is to use a text-based configuration. This text is in YAML format
10
-
and can either be in a file or specified in-line as a parameter. The configuration defines the set of
11
-
plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
12
-
the same plugin type to be instantiated multiple times, if needed.
5
+
At this time, the text-based configuration allows for:
13
6
14
-
Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. If one is not defined, a default one named `default` will be added and will reference all of the
15
-
instantiated plugins.
16
-
17
-
The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
18
-
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
19
-
contains one profile, in which case the `SingleProfileHandler` will be used.
7
+
1. The configuration of the lifecycle hooks (plugins) that are used by the IGW.
8
+
2. The configuration of the saturation detector.
9
+
3. A set of feature gates that are used to enable experimental features.
20
10
21
-
In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
22
-
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
23
-
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
11
+
The configuration text is in YAML format and can either be in a file or specified in-line as a parameter.
24
12
25
13
It should be noted that while the configuration text looks like a Kubernetes Custom Resource, it is
26
14
**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
@@ -39,10 +27,49 @@ plugins:
39
27
schedulingProfiles:
40
28
- ....
41
29
- ....
30
+
saturationDetector:
31
+
...
32
+
featureGates:
33
+
...
42
34
```
43
35
44
36
The first two lines of the configuration are constant and must appear as is.
45
37
38
+
The plugins section defines the set of plugins that will be instantiated and their parameters. This section is described in more detail in the section [Configuring Plugins via text](#configuring-plugins-via-text).
39
+
40
+
The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling
41
+
requests to pods. This section is described in more detail in the section [Configuring Plugins via text](#configuring-plugins-via-text).
42
+
43
+
The saturationDetector section configures the saturation detector, which is used to determine if special
44
+
action needs to be taken due to the system being overloaded or saturated. This section is described in more detail in the section [Saturation Detector configuration](#saturation-detector-configuration).
45
+
46
+
The featureGates section allows experimental features of the IGW to be enabled. This section is
47
+
described in more detail in the section [Feature Gates](#feature-gates).
48
+
49
+
## Configuring Plugins via text
50
+
51
+
The set of plugins that are used by the IGW is determined by how
52
+
it is configured. The IGW can be configured in several ways, either by code or via text.
53
+
54
+
If configured by code either a set of predetermined environment variables must be used or one must
55
+
fork the IGW and change code.
56
+
57
+
A simpler way to configure the IGW is to use a text-based configuration. The configuration defines the
58
+
set of plugins to be instantiated along with their parameters. Each plugin can also be given a name,
59
+
enabling the same plugin type to be instantiated multiple times, if needed.
60
+
61
+
Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling
62
+
a request. If one is not defined, a default one named `default` will be added and will reference all of
63
+
the instantiated plugins.
64
+
65
+
The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
66
+
will be used for a particular request. A Profile Handler must be specified, unless the configuration only
67
+
contains one profile, in which case the `SingleProfileHandler` will be used.
68
+
69
+
In addition, the set of instantiated plugins can also include a picker, which chooses the actual pod to which
70
+
the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
71
+
instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
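
The defaulting rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the IGW's actual implementation: the `apply_defaults` function, the dictionary layout, the requirement that every plugin entry carries an explicit `name`, and the `single-profile-handler` type string are all hypothetical.

```python
from copy import deepcopy

# Hypothetical type strings; the text calls these plugins
# SingleProfileHandler and MaxScorePicker.
SINGLE_PROFILE_HANDLER = "single-profile-handler"
MAX_SCORE_PICKER = "max-score-picker"


def apply_defaults(config, pickers=(MAX_SCORE_PICKER,),
                   handlers=(SINGLE_PROFILE_HANDLER,)):
    """Return a copy of `config` with the documented defaults filled in.

    Assumes every plugin entry has both a "name" and a "type" key.
    `pickers` and `handlers` list the plugin types treated as pickers
    and Profile Handlers (an assumption made for this sketch).
    """
    cfg = deepcopy(config)
    plugins = cfg.setdefault("plugins", [])
    profiles = cfg.setdefault("schedulingProfiles", [])
    type_of = {p["name"]: p["type"] for p in plugins}

    # No profile defined: add `default`, referencing all instantiated plugins.
    if not profiles:
        profiles.append({
            "name": "default",
            "plugins": [{"pluginRef": p["name"]} for p in plugins],
        })

    # No Profile Handler: only allowed when there is exactly one profile.
    if not any(t in handlers for t in type_of.values()):
        if len(profiles) != 1:
            raise ValueError("a Profile Handler must be specified "
                             "when more than one profile is configured")
        plugins.append({"name": SINGLE_PROFILE_HANDLER,
                        "type": SINGLE_PROFILE_HANDLER})
        type_of[SINGLE_PROFILE_HANDLER] = SINGLE_PROFILE_HANDLER

    # Any profile that references no picker gets a MaxScorePicker.
    for profile in profiles:
        refs = profile.setdefault("plugins", [])
        if not any(type_of.get(r["pluginRef"]) in pickers for r in refs):
            if MAX_SCORE_PICKER not in type_of:
                plugins.append({"name": MAX_SCORE_PICKER,
                                "type": MAX_SCORE_PICKER})
                type_of[MAX_SCORE_PICKER] = MAX_SCORE_PICKER
            refs.append({"pluginRef": MAX_SCORE_PICKER})
    return cfg


# A configuration with a single scorer and no profiles, handler, or picker.
example = {"plugins": [{"name": "lora", "type": "lora-affinity-scorer"}]}
result = apply_defaults(example)
```

In this sketch, `result` ends up with a single `default` profile that references the scorer plus an added picker, and the plugin list gains a profile handler and a picker.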
72
+
46
73
The plugins section defines the set of plugins that will be instantiated and their parameters.
47
74
Each entry in this section has the following form:
48
75
@@ -190,7 +217,7 @@ schedulingProfiles:
190
217
- pluginRef: max-score-picker
191
218
```
192
219
193
-
## Plugin Configuration
220
+
### Plugin Configuration
194
221
195
222
This section describes how to set up the various plugins that are available with the IGW.
196
223
@@ -266,3 +293,58 @@ scored higher (since it's more available to serve new request).
266
293
267
294
- *Type*: lora-affinity-scorer
268
295
- *Parameters*: none
296
+
297
+
## Saturation Detector configuration
298
+
299
+
The Saturation Detector is used to determine if the cluster is overloaded, i.e. saturated. When
300
+
the cluster is saturated, special actions will be taken depending on what has been enabled. At this time, sheddable requests will be dropped.
301
+
302
+
The Saturation Detector determines that the cluster is saturated by looking at the following metrics provided by the inference servers:
303
+
304
+
- Backend waiting queue size
305
+
- KV cache utilization
306
+
- Metrics staleness
307
+
308
+
The Saturation Detector is configured via the saturationDetector section of the overall configuration.
309
+
It has the following form:
310
+
311
+
```yaml
312
+
saturationDetector:
313
+
  queueDepthThreshold: 8
314
+
  kvCacheUtilThreshold: 0.75
315
+
  metricsStalenessThreshold: 150ms
316
+
```
317
+
318
+
The various sub-fields of the saturationDetector section are:
319
+
320
+
- The `queueDepthThreshold` field which defines the backend waiting queue size above which a
321
+
pod is considered to have insufficient capacity for new requests. This field is optional; if
322
+
omitted a value of `5` will be used.
323
+
- The `kvCacheUtilThreshold` field which defines the KV cache utilization (0.0 to 1.0) above
324
+
which a pod is considered to have insufficient capacity. This field is optional; if omitted,
325
+
a value of `0.8` will be used.
326
+
- The `metricsStalenessThreshold` field which defines how old a pod's metrics can be. If a pod's
327
+
metrics are older than this, it might be excluded from "good capacity" considerations or treated
328
+
as having no capacity for safety. This field is optional; if omitted, a value of `200ms` will be used.
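
To make the role of the three thresholds concrete, here is a minimal Python sketch of how they might be applied. The default values come from the field descriptions above; everything else is an assumption made for illustration: the `SaturationConfig` and `PodMetrics` classes, the function names, the choice to treat stale metrics as "no capacity", and the rule that the cluster counts as saturated when no pod has capacity. It is not the IGW's actual implementation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SaturationConfig:
    # Defaults taken from the field descriptions above.
    queue_depth_threshold: int = 5
    kv_cache_util_threshold: float = 0.8
    metrics_staleness_threshold: timedelta = timedelta(milliseconds=200)


@dataclass
class PodMetrics:
    # Hypothetical per-pod view of the inference server metrics.
    waiting_queue_size: int
    kv_cache_utilization: float  # 0.0 to 1.0
    age: timedelta               # time since the metrics were refreshed


def has_capacity(pod: PodMetrics, cfg: SaturationConfig) -> bool:
    """True if the pod's metrics are fresh and within both thresholds."""
    if pod.age > cfg.metrics_staleness_threshold:
        # Assumption: stale metrics are treated as no capacity, for safety.
        return False
    return (pod.waiting_queue_size <= cfg.queue_depth_threshold
            and pod.kv_cache_utilization <= cfg.kv_cache_util_threshold)


def is_saturated(pods, cfg: SaturationConfig) -> bool:
    """Assumption: the cluster is saturated when no pod has capacity."""
    return not any(has_capacity(p, cfg) for p in pods)


# Same values as the YAML example above.
cfg = SaturationConfig(
    queue_depth_threshold=8,
    kv_cache_util_threshold=0.75,
    metrics_staleness_threshold=timedelta(milliseconds=150),
)
busy = PodMetrics(12, 0.50, timedelta(milliseconds=20))   # queue too deep
stale = PodMetrics(0, 0.10, timedelta(milliseconds=500))  # metrics too old
free = PodMetrics(3, 0.60, timedelta(milliseconds=20))    # within limits
```

With these values, `is_saturated([busy, stale], cfg)` evaluates to `True`, while adding `free` to the list makes it `False`.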
329
+
330
+
## Feature Gates
331
+
332
+
The Feature Gates section allows for the enabling of experimental features of the IGW. These experimental
333
+
features are all disabled unless you explicitly enable them one by one.
334
+
335
+
The Feature Gates section has the following form:
336
+
337
+
```yaml
338
+
featureGates:
339
+
  enableDataLayer: true
340
+
  enableFlowControl: false
341
+
```
342
+
343
+
Each sub-field of the Feature Gates section enables one experimental feature. The sub-fields are:
344
+
345
+
- `enableDataLayer` which, if present with a value of true, enables the experimental Datalayer APIs.
346
+
- `enableFlowControl` which, if present with a value of true, enables the experimental FlowControl
347
+
feature.
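
The "disabled unless explicitly enabled" behavior can be illustrated with a short Python sketch. The `resolve_feature_gates` helper and the already-parsed dictionary input are hypothetical and only show the intended semantics; they are not the IGW's actual code.

```python
# Gate names as listed above.
KNOWN_GATES = ("enableDataLayer", "enableFlowControl")


def resolve_feature_gates(section=None):
    """Map each known gate to True only if it is explicitly set to true.

    `section` is the parsed `featureGates` mapping, or None if the
    section is absent from the configuration.
    """
    section = section or {}
    return {gate: section.get(gate, False) is True for gate in KNOWN_GATES}


# Same values as the YAML example above.
gates = resolve_feature_gates({"enableDataLayer": True,
                               "enableFlowControl": False})
# With no featureGates section at all, every gate stays off.
all_off = resolve_feature_gates()
```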
348
+
349
+
In all cases if the sub-field isn't present or has a value of false, that experimental feature will