Ensure EPP flags are configurable via Helm chart #1302

Open · wants to merge 3 commits into `main`
18 changes: 15 additions & 3 deletions config/charts/inferencepool/README.md
@@ -103,9 +103,21 @@ The following table lists the configurable parameters of the chart.
| `inferenceExtension.image.pullPolicy` | Image pull policy for the container. Possible values: `Always`, `IfNotPresent`, or `Never`. Defaults to `Always`. |
| `inferenceExtension.extProcPort` | Port where the endpoint picker service is served for external processing. Defaults to `9002`. |
| `inferenceExtension.env` | List of environment variables to set in the endpoint picker container as free-form YAML. Defaults to `[]`. |
| `inferenceExtension.extraContainerPorts` | List of additional container ports to expose. Defaults to `[]`. |
| `inferenceExtension.extraServicePorts` | List of additional service ports to expose. Defaults to `[]`. |
| `inferenceExtension.logVerbosity` | Logging verbosity level for the endpoint picker. Defaults to `"3"`. |
| `inferenceExtension.enablePprof` | Enables pprof handlers for profiling and debugging. Defaults to `true`. |
| `inferenceExtension.modelServerMetricsPath` | Path on the model server from which metrics are scraped. Defaults to `/metrics`. |
> **Collaborator:** These descriptions (modelServerMetricsScheme/Path/Port) are a little vague, could we add more detail?

| `inferenceExtension.modelServerMetricsScheme` | Scheme (`http` or `https`) used to scrape metrics from the model server. Defaults to `http`. |
| `inferenceExtension.modelServerMetricsPort` | Port used to scrape metrics from the model server. If unset, it defaults to the target port specified on the InferencePool. |
> **Collaborator:** It's worth specifying that if this port is unset, it defaults to the target port specified on the InferencePool.

| `inferenceExtension.modelServerMetricsHttpsInsecureSkipVerify` | When the metrics scheme is `https`, whether to skip TLS certificate verification (`InsecureSkipVerify`). Defaults to `true`. |
| `inferenceExtension.secureServing` | Enables secure serving. Defaults to `true`. |
| `inferenceExtension.healthChecking` | Enables health checking for the endpoint picker. Defaults to `false`. |
> **Collaborator:** Specify what the default is.

| `inferenceExtension.certPath` | The path to the certificate for secure serving. The certificate and private key files are assumed to be named tls.crt and tls.key, respectively. If not set, and secureServing is enabled, then a self-signed certificate is used. |
| `inferenceExtension.refreshMetricsInterval` | Interval at which model server metrics are refreshed. Defaults to `50ms`. |
| `inferenceExtension.refreshPrometheusMetricsInterval` | Interval at which Prometheus metrics are flushed. Defaults to `5s`. |
| `inferenceExtension.metricsStalenessThreshold` | Duration after which metrics are considered stale; a pod's metrics older than this threshold are treated as invalid. Defaults to `2s`. |
> **Collaborator:** "This is used to determine if a pod's metrics are fresh enough." Consider rewording to something like: "metrics staleness above the configured threshold will be considered invalid".

| `inferenceExtension.totalQueuedRequestsMetric` | Prometheus metric exported by the model server for the number of queued requests. Defaults to `vllm:num_requests_waiting`. |
| `inferenceExtension.extraContainerPorts` | List of additional container ports to expose on the EPP container. Defaults to `[]`. |
> **Collaborator:** Let's clarify that these extra ports are for the EPP itself.

| `inferenceExtension.extraServicePorts` | List of additional ports to expose on the EPP service. Defaults to `[]`. |
| `inferenceExtension.logVerbosity` | Logging verbosity level for the endpoint picker. Defaults to `"3"`. |
> **Collaborator:** Suggested change (the chart default is `"1"`, not `"3"`):
> | `inferenceExtension.logVerbosity` | Logging verbosity level for the endpoint picker. Defaults to `"1"`. |

| `provider.name` | Name of the Inference Gateway implementation being used. Possible values: `gke`. Defaults to `none`. |

## Notes
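For illustration, a minimal override file exercising a few of the parameters above might look like the following sketch. The file name and the chosen values are hypothetical; the keys and their defaults come from the chart's `values.yaml` shown further down.

```yaml
# epp-values.yaml -- hypothetical override file for the inferencepool chart
inferenceExtension:
  logVerbosity: 4                               # raise EPP log verbosity above the chart default
  healthChecking: true                          # chart default is false
  modelServerMetricsScheme: "https"             # scrape model server metrics over TLS
  modelServerMetricsHttpsInsecureSkipVerify: false
  metricsStalenessThreshold: "5s"               # chart default is 2s
```

From a checkout of the repo, such a file could be applied with `helm install <release-name> ./config/charts/inferencepool -f epp-values.yaml` (the release name is a placeholder).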
10 changes: 10 additions & 0 deletions config/charts/inferencepool/templates/epp-deployment.yaml
@@ -44,6 +44,16 @@ spec:
- "--model-server-metrics-path={{ .Values.inferenceExtension.modelServerMetricsPath }}"
- "--model-server-metrics-scheme={{ .Values.inferenceExtension.modelServerMetricsScheme }}"
- "--model-server-metrics-https-insecure-skip-verify={{ .Values.inferenceExtension.modelServerMetricsHttpsInsecureSkipVerify }}"
- "--model-server-metrics-port={{ .Values.inferenceExtension.modelServerMetricsPort }}"
- "--secure-serving={{ .Values.inferenceExtension.secureServing }}"
- "--health-checking={{ .Values.inferenceExtension.healthChecking }}"
- "--cert-path={{ .Values.inferenceExtension.certPath }}"
- "--total-queued-requests-metric={{ .Values.inferenceExtension.totalQueuedRequestsMetric }}"
- "--kv-cache-usage-percentage-metric={{ .Values.inferenceExtension.kvCacheUsagePercentageMetric }}"
- "--lora-info-metric={{ .Values.inferenceExtension.loraInfoMetric }}"
- "--refresh-metrics-interval={{ .Values.inferenceExtension.refreshMetricsInterval }}"
- "--refresh-prometheus-metrics-interval={{ .Values.inferenceExtension.refreshPrometheusMetricsInterval }}"
- "--metrics-staleness-threshold={{ .Values.inferenceExtension.metricsStalenessThreshold }}"
{{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
- --total-queued-requests-metric
- "nv_trt_llm_request_metrics{request_type=waiting}"
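For reference, with the chart defaults shown in `values.yaml` below, the new template lines would render to roughly the following container args. This is a sketch of the expected `helm template` output, not captured from an actual run.

```yaml
# Approximate rendered args for the EPP container with the chart's default values
args:
  - "--model-server-metrics-path=/metrics"
  - "--model-server-metrics-scheme=http"
  - "--model-server-metrics-https-insecure-skip-verify=true"
  - "--model-server-metrics-port=0"
  - "--secure-serving=true"
  - "--health-checking=false"
  - "--cert-path="
  - "--total-queued-requests-metric=vllm:num_requests_waiting"
  - "--kv-cache-usage-percentage-metric=vllm:gpu_cache_usage_perc"
  - "--lora-info-metric=vllm:lora_requests_info"
  - "--refresh-metrics-interval=50ms"
  - "--refresh-prometheus-metrics-interval=5s"
  - "--metrics-staleness-threshold=2s"
```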
30 changes: 24 additions & 6 deletions config/charts/inferencepool/values.yaml
@@ -1,4 +1,5 @@
inferenceExtension:
# Number of replicas
replicas: 1
image:
name: epp
@@ -7,12 +8,29 @@ inferenceExtension:
pullPolicy: Always
extProcPort: 9002
env: []
enablePprof: true # Enable pprof handlers for profiling and debugging
enablePprof: true # Enable pprof handlers for profiling and debugging
modelServerMetricsPath: "/metrics"
modelServerMetricsScheme: "http"
modelServerMetricsHttpsInsecureSkipVerify: true
# This is the plugins configuration file.
grpcPort: 9002
grpcHealthPort: 9003
metricsPort: 9090
poolName: ""
poolNamespace: "default"
refreshMetricsInterval: "50ms"
refreshPrometheusMetricsInterval: "5s"
secureServing: true
healthChecking: false
totalQueuedRequestsMetric: "vllm:num_requests_waiting"
kvCacheUsagePercentageMetric: "vllm:gpu_cache_usage_perc"
loraInfoMetric: "vllm:lora_requests_info"
certPath: ""
metricsStalenessThreshold: "2s"

pluginsConfigFile: "default-plugins.yaml"
logVerbosity: 1

# This is the plugins configuration file.
# pluginsCustomConfig:
# custom-plugins.yaml: |
# apiVersion: inference.networking.x-k8s.io/v1alpha1
@@ -34,18 +52,18 @@ inferenceExtension:
# Example environment variables:
# env:
# KV_CACHE_SCORE_WEIGHT: "1"

# Define additional container ports
modelServerMetricsPort: 0
extraContainerPorts: []
# Define additional service ports
extraServicePorts: []

inferencePool:
targetPortNumber: 8000
modelServerType: vllm # vllm, triton-tensorrt-llm
# modelServers: # REQUIRED
> **Contributor:** Revert this change please, we should not default this; it should be explicitly set.
>
> **Collaborator:** Was this comment addressed?

# matchLabels:
# app: vllm-llama3-8b-instruct
# modelServers:
# matchLabels:
# app: vllm-llama3-8b-instruct

provider:
name: none
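To tie the pieces together, the `inferencePool.modelServerType` value drives the conditional at the end of the deployment template above: selecting `triton-tensorrt-llm` makes the chart append `--total-queued-requests-metric "nv_trt_llm_request_metrics{request_type=waiting}"`. A minimal override sketch follows; the `matchLabels` value is hypothetical.

```yaml
# Hypothetical override selecting the Triton TensorRT-LLM backend
inferencePool:
  targetPortNumber: 8000
  modelServerType: triton-tensorrt-llm  # template then appends the Triton queued-requests metric flag
  modelServers:
    matchLabels:
      app: triton-llama3-8b-instruct    # illustrative label selector for the model server pods
```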