| 1 | +# Configuring Plugins via text |
| 2 | + |
| 3 | +The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how |
| 4 | +it is configured. The IGW can be configured in several ways, either by code or via text. |
| 5 | + |
| 6 | +If configured by code, either a set of predetermined environment variables must be used or one must |
| 7 | +fork the IGW and change the code. |
| 8 | + |
| 9 | +A simpler way to configure the IGW is to use a text-based configuration. This text is in YAML format |
| 10 | +and can either be in a file or specified in-line as a parameter. The configuration defines the set of |
| 11 | +plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling |
| 12 | +the same plugin type to be instantiated multiple times, if needed. Also defined is a set of |
| 13 | +SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. The set |
| 14 | +of plugins instantiated must also include a Profile Handler, which determines which SchedulingProfiles |
| 15 | +will be used for a particular request. |
| 16 | + |
| 17 | +It should be noted that while the configuration text looks like a Kubernetes Custom Resource, it is |
| 18 | +**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration |
| 19 | +text and in the future will also help in versioning the text. |
| 20 | + |
| 21 | +Note also that even when the configuration text is loaded from a file, it is read at |
| 22 | +the Endpoint-Picker's (EPP) startup, and changes to the file at runtime are ignored. |
| 23 | + |
| 24 | +The configuration text has the following form: |
| 25 | +```yaml |
| 26 | +apiVersion: inference.networking.x-k8s.io/v1alpha1 |
| 27 | +kind: EndpointPickerConfig |
| 28 | +plugins: |
| 29 | +- .... |
| 30 | +- .... |
| 31 | +schedulingProfiles: |
| 32 | +- .... |
| 33 | +- .... |
| 34 | +``` |
| 35 | + |
| 36 | +The first two lines of the configuration are constant and must appear as is. |
| 37 | + |
| 38 | +The plugins section defines the set of plugins that will be instantiated and their parameters. |
| 39 | +Each entry in this section has the following form: |
| 40 | + |
| 41 | +```yaml |
| 42 | +- name: aName |
| 43 | +  type: a-type |
| 44 | +  parameters: |
| 45 | +    parm1: val1 |
| 46 | +    parm2: val2 |
| 47 | +``` |
| 48 | + |
| 49 | +The fields in a plugin entry are: |
| 50 | + |
| 51 | +- *name*, which is optional, provides a name by which the plugin instance can be referenced. If this |
| 52 | +field is omitted, the plugin's type is used as its name. |
| 53 | +- *type* specifies the type of the plugin to be instantiated. |
| 54 | +- *parameters*, which is optional, defines the set of parameters used to configure the plugin in question. |
| 55 | +The actual set of parameters varies from plugin to plugin. |
| 56 | + |
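| | +As an illustrative sketch of named instances, the fragment below instantiates the `low-queue-filter` type twice under two hypothetical names, `strict-queue-filter` and `relaxed-queue-filter`, each with its own `threshold`. The names and threshold values are assumptions chosen for the example, not recommended settings: |
| | + |
| | +```yaml |
| | +plugins: |
| | +# Two instances of the same plugin type, told apart by name. |
| | +- name: strict-queue-filter     # hypothetical name |
| | +  type: low-queue-filter |
| | +  parameters: |
| | +    threshold: 16               # illustrative value |
| | +- name: relaxed-queue-filter    # hypothetical name |
| | +  type: low-queue-filter |
| | +  parameters: |
| | +    threshold: 256              # illustrative value |
| | +``` |
| | + |
| | +A scheduling profile would then refer to either instance by its name through `pluginRef`. |
| | + |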
| 57 | +The schedulingProfiles section defines the set of scheduling profiles that can be used in scheduling |
| 58 | +requests to pods. The number of scheduling profiles one defines depends on the use case. For simple |
| 59 | +serving of requests, one is enough. For disaggregated prefill, two profiles are required. Each entry |
| 60 | +in this section has the following form: |
| 61 | + |
| 62 | +```yaml |
| 63 | +- name: aName |
| 64 | +  plugins: |
| 65 | +  - pluginRef: plugin1 |
| 66 | +  - pluginRef: plugin2 |
| 67 | +    weight: 50 |
| 68 | +``` |
| 69 | + |
| 70 | +The fields in a schedulingProfile entry are: |
| 71 | + |
| 72 | +- *name* specifies the scheduling profile's name. |
| 73 | +- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request. |
| 74 | +Each entry in the schedulingProfile's plugins section has the following fields: |
| 75 | +  - *pluginRef* is a reference to the name of the plugin instance to be used. |
| 76 | +  - *weight* is the weight to be used if the referenced plugin is a scorer. |
| 77 | + |
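| | +A minimal sketch of a profile that combines two scorers with different weights is shown below. It assumes the `plugins` section contains unnamed entries of type `queue-scorer`, `kv-cache-scorer` and `max-score-picker`, so that each can be referenced by its type. The weight values are illustrative, and the example assumes that a larger weight gives a scorer more influence on the final score: |
| | + |
| | +```yaml |
| | +schedulingProfiles: |
| | +- name: default |
| | +  plugins: |
| | +  - pluginRef: queue-scorer |
| | +    weight: 2                  # illustrative value |
| | +  - pluginRef: kv-cache-scorer |
| | +    weight: 1                  # illustrative value |
| | +  # The picker is not a scorer, so no weight is given. |
| | +  - pluginRef: max-score-picker |
| | +``` |
| | + |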
| 78 | +A complete configuration might look like this: |
| 79 | +```yaml |
| 80 | +apiVersion: inference.networking.x-k8s.io/v1alpha1 |
| 81 | +kind: EndpointPickerConfig |
| 82 | +plugins: |
| 83 | +- type: prefix-cache-scorer |
| 84 | +  parameters: |
| 85 | +    hashBlockSize: 5 |
| 86 | +    maxPrefixBlocksToMatch: 256 |
| 87 | +    lruCapacityPerServer: 31250 |
| 88 | +- type: max-score-picker |
| 89 | +- type: single-profile-handler |
| 90 | +schedulingProfiles: |
| 91 | +- name: default |
| 92 | +  plugins: |
| 93 | +  - pluginRef: max-score-picker |
| 94 | +  - pluginRef: prefix-cache-scorer |
| 95 | +    weight: 50 |
| 96 | +``` |
| 97 | + |
| 98 | +If the configuration is in a file, the EPP command line argument `--configFile` |
| 99 | +should be used to specify the full path of the file in question. For example: |
| 100 | + |
| 101 | +```yaml |
| 102 | +apiVersion: apps/v1 |
| 103 | +kind: Deployment |
| 104 | +metadata: |
| 105 | +  name: ${EPP_NAME} |
| 106 | +  ... |
| 107 | +spec: |
| 108 | +  ... |
| 109 | +  template: |
| 110 | +    ... |
| 111 | +    spec: |
| 112 | +      ... |
| 113 | +      containers: |
| 114 | +      - name: epp |
| 115 | +        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest |
| 116 | +        imagePullPolicy: IfNotPresent |
| 117 | +        args: |
| 118 | +        - -poolName |
| 119 | +        - "${POOL_NAME}" |
| 120 | +        ... |
| 121 | +        - --configFile |
| 122 | +        - "/etc/epp/epp-config.yaml" |
| 123 | +``` |
| 124 | + |
| 125 | +If the configuration is passed as in-line text, the EPP command line argument `--configText` |
| 126 | +should be used. For example: |
| 127 | + |
| 128 | +```yaml |
| 129 | +apiVersion: apps/v1 |
| 130 | +kind: Deployment |
| 131 | +metadata: |
| 132 | +  name: ${EPP_NAME} |
| 133 | +  ... |
| 134 | +spec: |
| 135 | +  ... |
| 136 | +  template: |
| 137 | +    ... |
| 138 | +    spec: |
| 139 | +      ... |
| 140 | +      containers: |
| 141 | +      - name: epp |
| 142 | +        image: ghcr.io/llm-d/llm-d-inference-scheduler:latest |
| 143 | +        imagePullPolicy: IfNotPresent |
| 144 | +        args: |
| 145 | +        - -poolName |
| 146 | +        - "${POOL_NAME}" |
| 147 | +        ... |
| 148 | +        - --configText |
| 149 | +        - | |
| 150 | +          apiVersion: inference.networking.x-k8s.io/v1alpha1 |
| 151 | +          kind: EndpointPickerConfig |
| 152 | +          plugins: |
| 153 | +          - type: prefix-cache-scorer |
| 154 | +            parameters: |
| 155 | +              hashBlockSize: 5 |
| 156 | +              maxPrefixBlocksToMatch: 256 |
| 157 | +              lruCapacityPerServer: 31250 |
| 158 | +          - type: max-score-picker |
| 159 | +          - type: single-profile-handler |
| 160 | +          schedulingProfiles: |
| 161 | +          - name: default |
| 162 | +            plugins: |
| 163 | +            - pluginRef: max-score-picker |
| 164 | +            - pluginRef: prefix-cache-scorer |
| 165 | +              weight: 50 |
| 166 | +``` |
| 167 | + |
| 168 | +## Plugin Configuration |
| 169 | + |
| 170 | +This section describes how to set up the various plugins that are available with the IGW. |
| 171 | + |
| 172 | +#### **SingleProfileHandler** |
| 173 | + |
| 174 | +Selects a single profile, which is always the primary profile. |
| 175 | + |
| 176 | +- *Type*: single-profile-handler |
| 177 | +- *Parameters*: none |
| 178 | + |
| 179 | +#### **LeastKVCacheFilter** |
| 180 | + |
| 181 | +Finds the maximum and minimum KV cache utilization across all pods, divides that range (max-min) by the |
| 182 | +number of pods, and keeps the pods that fall into the first (lowest) sub-range. |
| 183 | + |
| 184 | +- *Type*: least-kv-cache-filter |
| 185 | +- *Parameters*: none |
| 186 | + |
| 187 | +#### **LeastQueueFilter** |
| 188 | + |
| 189 | +Finds the maximum and minimum queue size across all pods, divides that range (max-min) by the |
| 190 | +number of pods, and keeps the pods that fall into the first (lowest) sub-range. |
| 191 | + |
| 192 | +- *Type*: least-queue-filter |
| 193 | +- *Parameters*: none |
| 194 | + |
| 195 | +#### **LoraAffinityFilter** |
| 196 | + |
| 197 | +Implements a pod selection strategy that, when the use of a LoRA adapter is requested, prioritizes pods |
| 198 | +that are believed to have that LoRA adapter loaded. It also allows for load balancing through |
| 199 | +some randomization. |
| 200 | + |
| 201 | +- *Type*: lora-affinity-filter |
| 202 | +- *Parameters*: |
| 203 | +  - `threshold` is a probability threshold used to sometimes select pods that don't seem to have the LoRA |
| 204 | +    adapter loaded, to enable load balancing. If not specified, defaults to `0.999`. |
| 205 | + |
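| | +As a hedged example, a plugin entry that overrides the default `threshold` might look like the fragment below. The value `0.99` is purely illustrative and is not a recommended setting: |
| | + |
| | +```yaml |
| | +plugins: |
| | +- type: lora-affinity-filter |
| | +  parameters: |
| | +    threshold: 0.99   # illustrative value; defaults to 0.999 when omitted |
| | +``` |
| | + |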
| 206 | +#### **LowQueueFilter** |
| 207 | + |
| 208 | +Filters out pods whose waiting queue size is greater than the specified threshold. |
| 209 | + |
| 210 | +- *Type*: low-queue-filter |
| 211 | +- *Parameters*: |
| 212 | +  - `threshold` is the waiting queue threshold. If not specified, defaults to `128`. |
| 213 | + |
| 214 | +#### **PrefixCachePlugin** |
| 215 | + |
| 216 | +Scores pods based on how much of the prompt is believed to be in the pod's KV cache. |
| 217 | + |
| 218 | +- *Type*: prefix-cache-scorer |
| 219 | +- *Parameters*: |
| 220 | +  - `hashBlockSize` specifies the size of the blocks into which the input prompt is broken when |
| 221 | +    calculating the block hashes. If not specified, defaults to `64`. |
| 222 | +  - `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If |
| 223 | +    not specified, defaults to `256`. |
| 224 | +  - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries |
| 225 | +    per server (pod). If not specified, defaults to `31250`. |
| 226 | + |
| 227 | +#### **MaxScorePicker** |
| 228 | + |
| 229 | +Picks the pod with the maximum score from the list of candidates. |
| 230 | + |
| 231 | +- *Type*: max-score-picker |
| 232 | +- *Parameters*: none |
| 233 | + |
| 234 | +#### **RandomPicker** |
| 235 | + |
| 236 | +Picks a random pod from the list of candidates. |
| 237 | + |
| 238 | +- *Type*: random-picker |
| 239 | +- *Parameters*: none |
| 240 | + |
| 241 | +#### **KvCacheScorer** |
| 242 | + |
| 243 | +Scores the candidate pods based on their KV cache utilization. |
| 244 | + |
| 245 | +- *Type*: kv-cache-scorer |
| 246 | +- *Parameters*: none |
| 247 | + |
| 248 | +#### **QueueScorer** |
| 249 | + |
| 250 | +Scores the candidate pods based on each pod's waiting queue size. The smaller a pod's |
| 251 | +waiting queue, the higher its score (since it is more |
| 252 | +available to serve new requests). |
| 253 | + |
| 254 | +- *Type*: queue-scorer |
| 255 | +- *Parameters*: none |