diff --git a/site-src/concepts/roles-and-personas.md b/site-src/concepts/roles-and-personas.md index f1d17a59d..3342984ab 100644 --- a/site-src/concepts/roles-and-personas.md +++ b/site-src/concepts/roles-and-personas.md @@ -6,12 +6,12 @@ Before diving into the details of the API, descriptions of the personas these AP The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for: - - Hardware - - Model Server - - Base Model - - Resource Allocation for Workloads - - Gateway configuration - - etc +- Hardware +- Model Server +- Base Model +- Resource Allocation for Workloads +- Gateway configuration +- etc ## Inference Workload Owner @@ -19,8 +19,8 @@ An Inference Workload Owner persona owns and manages one or many Generative AI W - Defining priority - Managing fine-tunes - - LoRA Adapters - - System Prompts - - Prompt Cache - - etc. + - LoRA Adapters + - System Prompts + - Prompt Cache + - etc. - Managing rollout of adapters diff --git a/site-src/guides/epp-configuration/config-text.md b/site-src/guides/epp-configuration/config-text.md index 43a0e6cf7..684c0b034 100644 --- a/site-src/guides/epp-configuration/config-text.md +++ b/site-src/guides/epp-configuration/config-text.md @@ -74,9 +74,9 @@ The fields in a schedulingProfile entry are: - *name* specifies the scheduling profile's name. - *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request. Each entry in the schedulingProfile's plugins section has the following fields: - - *pluginRef* is a reference to the name of the plugin instance to be used - - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one - will be used. + - *pluginRef* is a reference to the name of the plugin instance to be used + - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one + will be used. A complete configuration might look like this: ```yaml @@ -201,12 +201,12 @@ Scores pods based on the amount of the prompt is believed to be in the pod's KvC - *Type*: prefix-cache-scorer - *Parameters*: - - `blockSize` specified the size of the blocks to break up the input prompt when - calculating the block hashes. If not specified defaults to `64` - - `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If - not specified defaults to `256` - - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries - per server (pod). If not specified defaults to `31250` + - `blockSize` specified the size of the blocks to break up the input prompt when + calculating the block hashes. If not specified defaults to `64` + - `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If + not specified defaults to `256` + - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries + per server (pod). If not specified defaults to `31250` #### **LoRAAffinityScorer** @@ -222,18 +222,18 @@ Picks the pod with the maximum score from the list of candidates. This is the de if not specified. - *Type*: max-score-picker -- *Parameters*: - - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates, based on - the scores of those endpoints. If not specified defaults to `1`. +- *Parameters*: + - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates, based on + the scores of those endpoints. If not specified defaults to `1`. #### **RandomPicker** Picks a random pod from the list of candidates. - *Type*: random-picker -- *Parameters*: - - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not - specified defaults to `1`. +- *Parameters*: + - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not + specified defaults to `1`. #### **WeightedRandomPicker** @@ -241,8 +241,8 @@ Picks pod(s) from the list of candidates based on weighted random sampling using - *Type*: weighted-random-picker - *Parameters*: - - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not - specified defaults to `1`. + - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not + specified defaults to `1`. #### **KvCacheScorer** diff --git a/site-src/implementations/gateways.md b/site-src/implementations/gateways.md index 8c7ee8dea..7307a6996 100644 --- a/site-src/implementations/gateways.md +++ b/site-src/implementations/gateways.md @@ -3,12 +3,12 @@ This project has several implementations that are planned or in progress: - [Gateway Implementations](#gateway-implementations) - - [Alibaba Cloud Container Service for Kubernetes](#alibaba-cloud-container-service-for-kubernetes) - - [Envoy AI Gateway](#envoy-ai-gateway) - - [Google Kubernetes Engine](#google-kubernetes-engine) - - [Istio](#istio) - - [Kgateway](#kgateway) - - [Kubvernor](#kubvernor) + - [Alibaba Cloud Container Service for Kubernetes](#alibaba-cloud-container-service-for-kubernetes) + - [Envoy AI Gateway](#envoy-ai-gateway) + - [Google Kubernetes Engine](#google-kubernetes-engine) + - [Istio](#istio) + - [Kgateway](#kgateway) + - [Kubvernor](#kubvernor) [1]:#alibaba-cloud-container-service-for-kubernetes [2]:#envoy-ai-gateway diff --git a/site-src/performance/regression-testing/index.md b/site-src/performance/regression-testing/index.md index e4974c1e4..5f2f4cbf0 100644 --- a/site-src/performance/regression-testing/index.md +++ b/site-src/performance/regression-testing/index.md @@ -60,9 +60,9 @@ Refer to example manifest: - **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) - **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical) - **Traffic Distribution:** - - 60 % on first 5 adapters (12 % each) - - 30 % on next 5 adapters (6 % each) - - 10 % on last 5 adapters (2 % each) + - 60 % on first 5 adapters (12 % each) + - 30 % on next 5 adapters (6 % each) + - 10 % on last 5 adapters (2 % each) - **Max LoRA:** 3 - **Replicas:** 10 (vLLM) - **Request Rates:** 20–200 QPS (increments of 20) @@ -99,8 +99,8 @@ Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analy - Update benchmark IDs to `regression-before` and `regression-after`. - Compare latency and throughput metrics, performing regression analysis. - Check R² values specifically: - - **Prompts Attempted/Succeeded:** Expect R² ≈ 1 - - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance). + - **Prompts Attempted/Succeeded:** Expect R² ≈ 1 + - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance). Identify significant deviations, investigate causes, and confirm performance meets expected standards. diff --git a/site-src/reference/spec.md b/site-src/reference/spec.md index 666ebe36b..3eb37756d 100644 --- a/site-src/reference/spec.md +++ b/site-src/reference/spec.md @@ -32,11 +32,13 @@ Invalid values include: * "foo.example.com" - must include path _Validation:_ + - MaxLength: 253 - MinLength: 1 - Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$` _Appears in:_ + - [ParentStatus](#parentstatus) @@ -49,9 +51,11 @@ EndpointPickerFailureMode defines the options for how the parent handles the cas Endpoint Picker extension is non-responsive. _Validation:_ + - Enum: [FailOpen FailClose] _Appears in:_ + - [EndpointPickerRef](#endpointpickerref) | Field | Description | @@ -70,6 +74,7 @@ associated configuration. _Appears in:_ + - [InferencePoolSpec](#inferencepoolspec) | Field | Description | Default | Validation | @@ -102,11 +107,13 @@ Invalid values include: * "example.com/bar" - "/" is an invalid character _Validation:_ + - MaxLength: 253 - MinLength: 0 - Pattern: `^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$` _Appears in:_ + - [EndpointPickerRef](#endpointpickerref) - [ParentReference](#parentreference) @@ -145,6 +152,7 @@ InferencePoolSpec defines the desired state of the InferencePool. _Appears in:_ + - [InferencePool](#inferencepool) | Field | Description | Default | Validation | @@ -163,6 +171,7 @@ InferencePoolStatus defines the observed state of the InferencePool. _Appears in:_ + - [InferencePool](#inferencepool) | Field | Description | Default | Validation | @@ -186,11 +195,13 @@ Invalid values include: * "invalid/kind" - "/" is an invalid character _Validation:_ + - MaxLength: 63 - MinLength: 1 - Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` _Appears in:_ + - [EndpointPickerRef](#endpointpickerref) - [ParentReference](#parentreference) @@ -220,11 +231,13 @@ Invalid values include: * example.com. - can not start or end with "." _Validation:_ + - MaxLength: 253 - MinLength: 1 - Pattern: `^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$` _Appears in:_ + - [LabelSelector](#labelselector) @@ -239,6 +252,7 @@ This simplified version uses only the matchLabels field. _Appears in:_ + - [InferencePoolSpec](#inferencepoolspec) | Field | Description | Default | Validation | @@ -263,11 +277,13 @@ Valid values include: * 123-my-value _Validation:_ + - MaxLength: 63 - MinLength: 0 - Pattern: `^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$` _Appears in:_ + - [LabelSelector](#labelselector) @@ -293,11 +309,13 @@ Invalid values include: * "example.com" - "." is an invalid character _Validation:_ + - MaxLength: 63 - MinLength: 1 - Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` _Appears in:_ + - [ParentReference](#parentreference) @@ -311,10 +329,12 @@ Object names can have a variety of forms, including RFC 1123 subdomains, RFC 1123 labels, or RFC 1035 labels. _Validation:_ + - MaxLength: 253 - MinLength: 1 _Appears in:_ + - [EndpointPickerRef](#endpointpickerref) - [ParentReference](#parentreference) @@ -330,6 +350,7 @@ parent resource, such as a Gateway. _Appears in:_ + - [ParentStatus](#parentstatus) | Field | Description | Default | Validation | @@ -349,6 +370,7 @@ ParentStatus defines the observed state of InferencePool from a Parent, i.e. Gat _Appears in:_ + - [InferencePoolStatus](#inferencepoolstatus) | Field | Description | Default | Validation | @@ -367,6 +389,7 @@ Port defines the network port that will be exposed by this InferencePool. _Appears in:_ + - [EndpointPickerRef](#endpointpickerref) - [InferencePoolSpec](#inferencepoolspec) @@ -382,11 +405,10 @@ _Underlying type:_ _integer_ PortNumber defines a network port. _Validation:_ + - Maximum: 65535 - Minimum: 1 _Appears in:_ -- [Port](#port) - - +- [Port](#port) diff --git a/site-src/reference/x-v1a1-spec.md b/site-src/reference/x-v1a1-spec.md index 55bec76e0..efa91f2fa 100644 --- a/site-src/reference/x-v1a1-spec.md +++ b/site-src/reference/x-v1a1-spec.md @@ -22,10 +22,12 @@ _Underlying type:_ _string_ ClusterName is the name of a cluster that exported the InferencePool. _Validation:_ + - MaxLength: 253 - MinLength: 1 _Appears in:_ + - [ExportingCluster](#exportingcluster) @@ -38,19 +40,21 @@ ControllerName is the name of a controller that manages a resource. It must be a Valid values include: - - "example.com/bar" +- "example.com/bar" Invalid values include: - - "example.com" - must include path - - "foo.example.com" - must include path +- "example.com" - must include path +- "foo.example.com" - must include path _Validation:_ + - MaxLength: 253 - MinLength: 1 - Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$` _Appears in:_ + - [ImportController](#importcontroller) @@ -64,6 +68,7 @@ ExportingCluster defines a cluster that exported the InferencePool that backs th _Appears in:_ + - [ImportController](#importcontroller) | Field | Description | Default | Validation | @@ -80,6 +85,7 @@ ImportController defines a controller that is responsible for managing the Infer _Appears in:_ + - [InferencePoolImportStatus](#inferencepoolimportstatus) | Field | Description | Default | Validation | @@ -117,10 +123,9 @@ InferencePoolImportStatus defines the observed state of the InferencePoolImport. _Appears in:_ + - [InferencePoolImport](#inferencepoolimport) | Field | Description | Default | Validation | | --- | --- | --- | --- | | `controllers` _[ImportController](#importcontroller) array_ | Controllers is a list of controllers that are responsible for managing the InferencePoolImport. | | MaxItems: 8
Required: \{\}
| - - diff --git a/site-src/reference/x-v1a2-spec.md b/site-src/reference/x-v1a2-spec.md index c1a57ce3f..e00d5f592 100644 --- a/site-src/reference/x-v1a2-spec.md +++ b/site-src/reference/x-v1a2-spec.md @@ -25,6 +25,7 @@ Extension specifies how to configure an extension that runs the endpoint picker. _Appears in:_ + - [InferencePoolSpec](#inferencepoolspec) | Field | Description | Default | Validation | @@ -44,9 +45,11 @@ ExtensionFailureMode defines the options for how the gateway handles the case wh responsive. _Validation:_ + - Enum: [FailOpen FailClose] _Appears in:_ + - [Extension](#extension) | Field | Description | @@ -76,10 +79,12 @@ Invalid values include: * "example.com/bar" - "/" is an invalid character _Validation:_ + - MaxLength: 253 - Pattern: `^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$` _Appears in:_ + - [Extension](#extension) - [ParentGatewayReference](#parentgatewayreference) - [PoolObjectReference](#poolobjectreference) @@ -126,6 +131,7 @@ InferenceObjectives, defined by the Inference Platform Admin. _Appears in:_ + - [InferenceObjective](#inferenceobjective) | Field | Description | Default | Validation | @@ -143,6 +149,7 @@ InferenceObjectiveStatus defines the observed state of InferenceObjective _Appears in:_ + - [InferenceObjective](#inferenceobjective) | Field | Description | Default | Validation | @@ -182,6 +189,7 @@ InferencePoolSpec defines the desired state of InferencePool _Appears in:_ + - [InferencePool](#inferencepool) | Field | Description | Default | Validation | @@ -200,6 +208,7 @@ InferencePoolStatus defines the observed state of InferencePool. _Appears in:_ + - [InferencePool](#inferencepool) | Field | Description | Default | Validation | @@ -223,11 +232,13 @@ Invalid values include: * "invalid/kind" - "/" is an invalid character _Validation:_ + - MaxLength: 63 - MinLength: 1 - Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` _Appears in:_ + - [Extension](#extension) - [ParentGatewayReference](#parentgatewayreference) - [PoolObjectReference](#poolobjectreference) @@ -258,11 +269,13 @@ Invalid values include: * example.com. - can not start or end with "." _Validation:_ + - MaxLength: 253 - MinLength: 1 - Pattern: `^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$` _Appears in:_ + - [InferencePoolSpec](#inferencepoolspec) @@ -284,11 +297,13 @@ Valid values include: * 123-my-value _Validation:_ + - MaxLength: 63 - MinLength: 0 - Pattern: `^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$` _Appears in:_ + - [InferencePoolSpec](#inferencepoolspec) @@ -314,11 +329,13 @@ Invalid values include: * "example.com" - "." is an invalid character _Validation:_ + - MaxLength: 63 - MinLength: 1 - Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$` _Appears in:_ + - [ParentGatewayReference](#parentgatewayreference) @@ -332,10 +349,12 @@ Object names can have a variety of forms, including RFC 1123 subdomains, RFC 1123 labels, or RFC 1035 labels. _Validation:_ + - MaxLength: 253 - MinLength: 1 _Appears in:_ + - [Extension](#extension) - [ParentGatewayReference](#parentgatewayreference) - [PoolObjectReference](#poolobjectreference) @@ -352,6 +371,7 @@ defaulting to Gateway. _Appears in:_ + - [PoolStatus](#poolstatus) | Field | Description | Default | Validation | @@ -372,6 +392,7 @@ referrer. _Appears in:_ + - [InferenceObjectiveSpec](#inferenceobjectivespec) | Field | Description | Default | Validation | @@ -390,6 +411,7 @@ PoolStatus defines the observed state of InferencePool from a Gateway. _Appears in:_ + - [InferencePoolStatus](#inferencepoolstatus) | Field | Description | Default | Validation | @@ -405,11 +427,10 @@ _Underlying type:_ _integer_ PortNumber defines a network port. _Validation:_ + - Maximum: 65535 - Minimum: 1 _Appears in:_ -- [Extension](#extension) - - +- [Extension](#extension)