20 changes: 10 additions & 10 deletions site-src/concepts/roles-and-personas.md
@@ -6,21 +6,21 @@ Before diving into the details of the API, descriptions of the personas these AP

The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:

- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration
- etc

## Inference Workload Owner

An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM focused *currently*). This includes:

- Defining priority
- Managing fine-tunes
    - LoRA Adapters
    - System Prompts
    - Prompt Cache
    - etc.
- Managing rollout of adapters
34 changes: 17 additions & 17 deletions site-src/guides/epp-configuration/config-text.md
@@ -74,9 +74,9 @@ The fields in a schedulingProfile entry are:
- *name* specifies the scheduling profile's name.
- *plugins* specifies the set of plugins to be used when this scheduling profile is chosen for a request.
Each entry in the schedulingProfile's plugins section has the following fields:
    - *pluginRef* is a reference to the name of the plugin instance to be used
    - *weight* is the weight to be used if the referenced plugin is a scorer. If omitted, a weight of one
      will be used.
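
As a rough illustration of how a per-scorer *weight* could enter the final score of a pod, the sketch below computes a weighted sum of scorer outputs. The weighted-sum rule, the function name `combine_scores`, and the plugin instance names `prefix-scorer` and `queue-scorer` are assumptions made for this example; only the default weight of one comes from the text above.

```python
from typing import Dict, Optional

DEFAULT_WEIGHT = 1  # from the text above: an omitted weight is treated as one


def combine_scores(
    scores: Dict[str, float],
    weights: Dict[str, Optional[int]],
) -> float:
    """Weighted sum of scorer outputs for one pod (assumed combination rule)."""
    total = 0.0
    for plugin_ref, score in scores.items():
        weight = weights.get(plugin_ref)
        if weight is None:
            # No weight configured for this pluginRef: fall back to the default.
            weight = DEFAULT_WEIGHT
        total += weight * score
    return total


# Hypothetical scorer outputs for a single pod, keyed by pluginRef.
pod_scores = {"prefix-scorer": 0.5, "queue-scorer": 0.25}
profile_weights = {"prefix-scorer": 2, "queue-scorer": None}
print(combine_scores(pod_scores, profile_weights))  # 2 * 0.5 + 1 * 0.25 = 1.25
```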

A complete configuration might look like this:
```yaml
# (example body not shown in this diff hunk)
```
@@ -201,12 +201,12 @@ Scores pods based on the amount of the prompt is believed to be in the pod's KvC

- *Type*: prefix-cache-scorer
- *Parameters*:
    - `blockSize` specifies the size of the blocks to break up the input prompt when
      calculating the block hashes. If not specified defaults to `64`
    - `maxPrefixBlocksToMatch` specifies the maximum number of prefix blocks to match. If
      not specified defaults to `256`
    - `lruCapacityPerServer` specifies the capacity of the LRU indexer in number of entries
      per server (pod). If not specified defaults to `31250`
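
To make the role of `blockSize` and `maxPrefixBlocksToMatch` more concrete, here is a minimal sketch of prefix block hashing. It is not the scorer's actual implementation: measuring blocks in UTF-8 bytes, ignoring a trailing partial block, chaining each hash with the previous one, and using SHA-256 are all assumptions made for illustration, as is the function name `prefix_block_hashes`.

```python
import hashlib
from typing import List


def prefix_block_hashes(
    prompt: str,
    block_size: int = 64,          # default taken from the parameter list above
    max_prefix_blocks: int = 256,  # default taken from the parameter list above
) -> List[str]:
    """Return chained hashes of successive full blocks of the prompt."""
    data = prompt.encode("utf-8")
    hashes: List[str] = []
    previous = b""
    # Only full blocks are hashed; a trailing partial block is ignored (assumption).
    for start in range(0, len(data) - block_size + 1, block_size):
        if len(hashes) >= max_prefix_blocks:
            break
        block = data[start:start + block_size]
        # Chaining means a block hash matches only if every earlier block matched too.
        digest = hashlib.sha256(previous + block).digest()
        hashes.append(digest.hex())
        previous = digest
    return hashes


a = prefix_block_hashes("x" * 200)
b = prefix_block_hashes("x" * 128 + "y" * 72)
shared = sum(1 for h1, h2 in zip(a, b) if h1 == h2)
print(len(a), shared)  # 3 full blocks each, the first 2 are shared
```

Two prompts that share a long prefix produce identical leading hashes, which is what lets a scorer estimate how much of a prompt may already be cached on a pod.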

#### **LoRAAffinityScorer**

@@ -222,27 +222,27 @@ Picks the pod with the maximum score from the list of candidates. This is the de
if not specified.

- *Type*: max-score-picker
- *Parameters*:
    - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates, based on
      the scores of those endpoints. If not specified defaults to `1`.
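
The selection rule described above can be sketched as a top-k pick over scored candidates. The function name `pick_max_score` and the pod names are hypothetical, and breaking ties by input order is an assumption; the real picker may handle ties differently.

```python
from typing import Dict, List


def pick_max_score(scores: Dict[str, float], max_num_of_endpoints: int = 1) -> List[str]:
    """Return up to max_num_of_endpoints candidates, highest score first."""
    # sorted() is stable, so equally scored candidates keep their input order.
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:max_num_of_endpoints]


candidates = {"pod-a": 0.2, "pod-b": 0.9, "pod-c": 0.5}
print(pick_max_score(candidates))     # ['pod-b']
print(pick_max_score(candidates, 2))  # ['pod-b', 'pod-c']
```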

#### **RandomPicker**

Picks a random pod from the list of candidates.

- *Type*: random-picker
- *Parameters*:
    - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
      specified defaults to `1`.

#### **WeightedRandomPicker**

Picks pod(s) from the list of candidates based on weighted random sampling using the A-Res algorithm.

- *Type*: weighted-random-picker
- *Parameters*:
    - `maxNumOfEndpoints`: Maximum number of endpoints to pick from the list of candidates. If not
      specified defaults to `1`.
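
For readers unfamiliar with A-Res (weighted reservoir sampling without replacement), the sketch below shows the general technique: each candidate draws a key `u ** (1 / weight)` with `u` uniform in [0, 1), and the `k` candidates with the largest keys are kept. Treating pod scores as sampling weights, skipping non-positive weights, and the function name `a_res_sample` are assumptions for illustration, not details confirmed by the picker's source.

```python
import heapq
import random
from typing import Dict, List, Optional, Tuple


def a_res_sample(
    weights: Dict[str, float],
    k: int = 1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample up to k distinct names, favouring larger weights (A-Res)."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    reservoir: List[Tuple[float, str]] = []  # min-heap ordered by key
    for name, weight in weights.items():
        if weight <= 0:
            continue  # non-positive weights are never selected (assumption)
        key = rng.random() ** (1.0 / weight)
        if len(reservoir) < k:
            heapq.heappush(reservoir, (key, name))
        elif key > reservoir[0][0]:
            # Replace the current smallest key with the larger one.
            heapq.heapreplace(reservoir, (key, name))
    # Largest key first.
    return [name for _, name in sorted(reservoir, reverse=True)]


picked = a_res_sample({"pod-a": 1.0, "pod-b": 2.0, "pod-c": 3.0}, k=2)
print(picked)  # two distinct pods; the result varies from run to run
```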

#### **KvCacheScorer**

12 changes: 6 additions & 6 deletions site-src/implementations/gateways.md
@@ -3,12 +3,12 @@
This project has several implementations that are planned or in progress:

- [Gateway Implementations](#gateway-implementations)
    - [Alibaba Cloud Container Service for Kubernetes](#alibaba-cloud-container-service-for-kubernetes)
    - [Envoy AI Gateway](#envoy-ai-gateway)
    - [Google Kubernetes Engine](#google-kubernetes-engine)
    - [Istio](#istio)
    - [Kgateway](#kgateway)
    - [Kubvernor](#kubvernor)

[1]:#alibaba-cloud-container-service-for-kubernetes
[2]:#envoy-ai-gateway
10 changes: 5 additions & 5 deletions site-src/performance/regression-testing/index.md
@@ -60,9 +60,9 @@ Refer to example manifest:
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical)
- **Traffic Distribution:**
    - 60 % on first 5 adapters (12 % each)
    - 30 % on next 5 adapters (6 % each)
    - 10 % on last 5 adapters (2 % each)
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 QPS (increments of 20)
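
As a quick sanity check of the numbers in this list, the following snippet expands the traffic tiers into per-adapter shares and enumerates the request rates. The variable names are illustrative and not part of the benchmark tooling.

```python
# (adapter count, percent of traffic per adapter), from the Traffic Distribution list
tiers = [(5, 12), (5, 6), (5, 2)]

# One entry per adapter, in tier order.
adapter_shares = [share for count, share in tiers for _ in range(count)]
assert len(adapter_shares) == 15   # matches "15 adapters"
assert sum(adapter_shares) == 100  # 60 % + 30 % + 10 %

# 20 to 200 QPS in increments of 20 gives ten load levels.
request_rates = list(range(20, 201, 20))
print(request_rates)  # [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
```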
@@ -99,8 +99,8 @@ Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analy
- Update benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput metrics, performing regression analysis.
- Check R² values specifically:
    - **Prompts Attempted/Succeeded:** Expect R² ≈ 1
    - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance).
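
The notebook's exact calculation is not shown here. As one plausible reading of the R² check, the sketch below computes the coefficient of determination of a simple linear fit between a metric's "before" and "after" series, which equals the squared Pearson correlation. The function name `r_squared` and the sample values are hypothetical.

```python
from typing import Sequence


def r_squared(before: Sequence[float], after: Sequence[float]) -> float:
    """R² of a simple linear fit of `after` on `before` (squared Pearson correlation)."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two series of equal length with at least two points")
    n = len(before)
    mean_x = sum(before) / n
    mean_y = sum(after) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(before, after))
    sxx = sum((x - mean_x) ** 2 for x in before)
    syy = sum((y - mean_y) ** 2 for y in after)
    if sxx == 0 or syy == 0:
        raise ValueError("R² is undefined for a constant series")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical metric values at four request rates, before and after a change.
before_run = [1.0, 2.0, 3.0, 4.0]
after_run = [2.0, 4.0, 6.0, 8.0]
print(r_squared(before_run, after_run))  # close to 1 for a perfectly linear relationship
```

Note that under this reading a value near 1 indicates that the two runs track each other linearly, not that they are identical, so absolute latency and throughput values still need to be compared directly.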

Identify significant deviations, investigate causes, and confirm performance meets expected standards.

28 changes: 25 additions & 3 deletions site-src/reference/spec.md
@@ -32,11 +32,13 @@ Invalid values include:
* "foo.example.com" - must include path

_Validation:_

- MaxLength: 253
- MinLength: 1
- Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$`

_Appears in:_

- [ParentStatus](#parentstatus)
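
The validation rules listed above can be exercised with a few lines of Python. This is a sketch of how the length limits and pattern combine, not the validator used by the API server; the function name `is_valid_name` is hypothetical, and the pattern is copied verbatim from the list above.

```python
import re

# Length limits and pattern copied from the Validation list above.
MIN_LENGTH = 1
MAX_LENGTH = 253
PATTERN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$"
)


def is_valid_name(value: str) -> bool:
    """Check a domain-prefixed path such as "example.com/bar"."""
    if not MIN_LENGTH <= len(value) <= MAX_LENGTH:
        return False
    return PATTERN.fullmatch(value) is not None


print(is_valid_name("example.com/bar"))  # True
print(is_valid_name("foo.example.com"))  # False: must include path
```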


@@ -49,9 +51,11 @@ EndpointPickerFailureMode defines the options for how the parent handles the cas
Endpoint Picker extension is non-responsive.

_Validation:_

- Enum: [FailOpen FailClose]

_Appears in:_

- [EndpointPickerRef](#endpointpickerref)

| Field | Description |
@@ -70,6 +74,7 @@ associated configuration.


_Appears in:_

- [InferencePoolSpec](#inferencepoolspec)

| Field | Description | Default | Validation |
@@ -102,11 +107,13 @@ Invalid values include:
* "example.com/bar" - "/" is an invalid character

_Validation:_

- MaxLength: 253
- MinLength: 0
- Pattern: `^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`

_Appears in:_

- [EndpointPickerRef](#endpointpickerref)
- [ParentReference](#parentreference)

@@ -145,6 +152,7 @@ InferencePoolSpec defines the desired state of the InferencePool.


_Appears in:_

- [InferencePool](#inferencepool)

| Field | Description | Default | Validation |
@@ -163,6 +171,7 @@ InferencePoolStatus defines the observed state of the InferencePool.


_Appears in:_

- [InferencePool](#inferencepool)

| Field | Description | Default | Validation |
@@ -186,11 +195,13 @@ Invalid values include:
* "invalid/kind" - "/" is an invalid character

_Validation:_

- MaxLength: 63
- MinLength: 1
- Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`

_Appears in:_

- [EndpointPickerRef](#endpointpickerref)
- [ParentReference](#parentreference)

@@ -220,11 +231,13 @@ Invalid values include:
* example.com. - can not start or end with "."

_Validation:_

- MaxLength: 253
- MinLength: 1
- Pattern: `^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$`

_Appears in:_

- [LabelSelector](#labelselector)


@@ -239,6 +252,7 @@ This simplified version uses only the matchLabels field.


_Appears in:_

- [InferencePoolSpec](#inferencepoolspec)

| Field | Description | Default | Validation |
@@ -263,11 +277,13 @@ Valid values include:
* 123-my-value

_Validation:_

- MaxLength: 63
- MinLength: 0
- Pattern: `^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`

_Appears in:_

- [LabelSelector](#labelselector)
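
A similar check can be sketched for this label-value style pattern. Because MinLength is 0 and the whole pattern is optional, the empty string is accepted. The function name `is_valid_label_value` is hypothetical, and the snippet only illustrates the listed rules rather than the actual server-side validation.

```python
import re

# Length limit and pattern copied from the Validation list above.
MAX_LENGTH = 63
LABEL_VALUE_PATTERN = re.compile(r"^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$")


def is_valid_label_value(value: str) -> bool:
    """Return True if value satisfies the length limit and the pattern."""
    return len(value) <= MAX_LENGTH and LABEL_VALUE_PATTERN.fullmatch(value) is not None


print(is_valid_label_value("123-my-value"))  # True
print(is_valid_label_value(""))              # True: empty values are allowed
print(is_valid_label_value("-leading"))      # False: must start with an alphanumeric
```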


@@ -293,11 +309,13 @@ Invalid values include:
* "example.com" - "." is an invalid character

_Validation:_

- MaxLength: 63
- MinLength: 1
- Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`

_Appears in:_

- [ParentReference](#parentreference)


@@ -311,10 +329,12 @@ Object names can have a variety of forms, including RFC 1123 subdomains,
RFC 1123 labels, or RFC 1035 labels.

_Validation:_

- MaxLength: 253
- MinLength: 1

_Appears in:_

- [EndpointPickerRef](#endpointpickerref)
- [ParentReference](#parentreference)

@@ -330,6 +350,7 @@ parent resource, such as a Gateway.


_Appears in:_

- [ParentStatus](#parentstatus)

| Field | Description | Default | Validation |
@@ -349,6 +370,7 @@ ParentStatus defines the observed state of InferencePool from a Parent, i.e. Gat


_Appears in:_

- [InferencePoolStatus](#inferencepoolstatus)

| Field | Description | Default | Validation |
@@ -367,6 +389,7 @@ Port defines the network port that will be exposed by this InferencePool.


_Appears in:_

- [EndpointPickerRef](#endpointpickerref)
- [InferencePoolSpec](#inferencepoolspec)

@@ -382,11 +405,10 @@ _Underlying type:_ _integer_
PortNumber defines a network port.

_Validation:_

- Maximum: 65535
- Minimum: 1

_Appears in:_

- [Port](#port)
15 changes: 10 additions & 5 deletions site-src/reference/x-v1a1-spec.md
@@ -22,10 +22,12 @@ _Underlying type:_ _string_
ClusterName is the name of a cluster that exported the InferencePool.

_Validation:_

- MaxLength: 253
- MinLength: 1

_Appears in:_

- [ExportingCluster](#exportingcluster)


@@ -38,19 +40,21 @@ ControllerName is the name of a controller that manages a resource. It must be a

Valid values include:

- "example.com/bar"

Invalid values include:

- "example.com" - must include path
- "foo.example.com" - must include path

_Validation:_

- MaxLength: 253
- MinLength: 1
- Pattern: `^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$`

_Appears in:_

- [ImportController](#importcontroller)


@@ -64,6 +68,7 @@ ExportingCluster defines a cluster that exported the InferencePool that backs th


_Appears in:_

- [ImportController](#importcontroller)

| Field | Description | Default | Validation |
@@ -80,6 +85,7 @@ ImportController defines a controller that is responsible for managing the Infer


_Appears in:_

- [InferencePoolImportStatus](#inferencepoolimportstatus)

| Field | Description | Default | Validation |
@@ -117,10 +123,9 @@ InferencePoolImportStatus defines the observed state of the InferencePoolImport.


_Appears in:_

- [InferencePoolImport](#inferencepoolimport)

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `controllers` _[ImportController](#importcontroller) array_ | Controllers is a list of controllers that are responsible for managing the InferencePoolImport. | | MaxItems: 8 <br />Required: \{\} <br /> |

