Commit a96a1e1

Add benchmarking folder with common config set ups - prefix cache aware example and chart
1 parent 5a5f552 commit a96a1e1

8 files changed: +363 -0 lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
apiVersion: v2
name: precise-prefix-cache-aware
description: A Helm chart for precise-prefix-cache-aware benchmarking
version: 0.1.0
appVersion: "1.0"
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Precise Prefix Cache Aware Benchmarking Helm Chart

This Helm chart deploys the `inference-perf` benchmarking tool in one of two configurations: a high-cache scenario or a low-cache scenario. Both use the **shared prefix dataset**. This guide walks you through deploying each one.

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster.
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.

**Hugging Face Token Secret**

The benchmark requires a Hugging Face token to download the model's tokenizer. Create a Kubernetes Secret named `hf-token` (or a custom name you provide) in your target namespace, containing your Hugging Face token.

To create this secret:
```bash
export _HF_TOKEN='<YOUR_HF_TOKEN>'
kubectl create secret generic hf-token --from-literal=token="$_HF_TOKEN"
```

## Shared Prefix Dataset Configuration

The chart uses the `shared_prefix` dataset type, which is designed to test caching efficiency. Its parameters are located under `config.data.shared_prefix`:

* `num_groups`: The number of shared prefix groups.
* `num_prompts_per_group`: The number of prompts within each shared prefix group.
* `system_prompt_len`: The length of the system prompt.
* `question_len`: The length of the question part of the prompt.
* `output_len`: The desired length of the model's output.

The default values for the dataset are defined in `high-cache-values.yaml` and `low-cache-values.yaml`, but you can override them using `--set config.data.shared_prefix.<parameter>` flags.

Example:

```bash
helm install my-release . -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```

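To make the difference between the two scenarios concrete, the following minimal Python sketch computes how many prompts each configuration generates and what share of each prompt is the system prompt. The numbers are taken from the two values files in this commit. The class and method names are hypothetical, and the calculation assumes that only the system prompt is shared within a group, so the fraction is an upper bound on what a prefix cache could reuse rather than a measured hit rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedPrefixConfig:
    """Mirrors the keys under config.data.shared_prefix (hypothetical helper)."""

    num_groups: int
    num_prompts_per_group: int
    system_prompt_len: int
    question_len: int
    output_len: int

    def total_prompts(self) -> int:
        # Every group contributes num_prompts_per_group prompts.
        return self.num_groups * self.num_prompts_per_group

    def shared_fraction(self) -> float:
        # Assumption: only the system prompt is shared within a group.
        return self.system_prompt_len / (self.system_prompt_len + self.question_len)


# Values copied from high-cache-values.yaml and low-cache-values.yaml.
high_cache = SharedPrefixConfig(256, 16, 2048, 256, 256)
low_cache = SharedPrefixConfig(256, 16, 256, 2048, 256)

for name, cfg in (("high-cache", high_cache), ("low-cache", low_cache)):
    print(
        f"{name}: {cfg.total_prompts()} prompts, "
        f"{cfg.shared_fraction():.1%} of each prompt is the system prompt"
    )
```

Both configurations generate the same number of prompts; they differ only in how the prompt length is split between the shared system prompt and the per-prompt question.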
## Deployment

This chart supports two main configurations, defined in `high-cache-values.yaml` and `low-cache-values.yaml`.

### 1. Deploying the High-Cache Configuration

This configuration targets a high cache hit rate: each prompt pairs a long shared system prompt (`system_prompt_len: 2048`) with a short question (`question_len: 256`). It uses the `high-cache-values.yaml` file.

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
helm install high-cache . -f high-cache-values.yaml \
  --set hfTokenSecret.name=hf-token \
  --set hfTokenSecret.key=token \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `high-cache`: The Helm release name; it must be unique within the namespace.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the high-cache scenario.

### 2. Deploying the Low-Cache Configuration

This configuration targets a lower cache hit rate: each prompt pairs a short shared system prompt (`system_prompt_len: 256`) with a long question (`question_len: 2048`). It uses the `low-cache-values.yaml` file.

```bash
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
helm install low-cache . -f low-cache-values.yaml \
  --set hfTokenSecret.name=hf-token \
  --set hfTokenSecret.key=token \
  --set "config.server.base_url=http://${IP}:${PORT}"
```

**Parameters to customize:**

* `low-cache`: The Helm release name; it must be unique within the namespace.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the low-cache scenario.

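Helm layers values in order: files passed with `-f` are merged left to right, so a later file overrides an earlier one, and `--set` flags override both. The Python sketch below imitates that layering with a recursive dictionary merge. It is an illustration only, not Helm's implementation; `merge_values` and the server address are hypothetical.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Only the keys that differ between the two values files are shown.
low_cache_file = {
    "config": {"data": {"shared_prefix": {"system_prompt_len": 256, "question_len": 2048}}}
}
high_cache_file = {
    "config": {"data": {"shared_prefix": {"system_prompt_len": 2048, "question_len": 256}}}
}
# Hypothetical address standing in for --set config.server.base_url=...
set_flags = {"config": {"server": {"base_url": "http://10.0.0.1:8000"}}}

# Same ordering as: -f low-cache-values.yaml -f high-cache-values.yaml --set ...
effective = merge_values(merge_values(low_cache_file, high_cache_file), set_flags)
print(effective["config"]["data"]["shared_prefix"])
print(effective["config"]["server"]["base_url"])
```

Because the last file wins, passing both values files to one release leaves the lengths from the second file in effect. Each release should therefore receive only the values file for its own scenario.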
## Uninstalling the Charts

To uninstall the releases created above:

```bash
helm uninstall high-cache
helm uninstall low-cache
```
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# High-Cache Configuration
job:
  image: "quay.io/inference-perf/inference-perf:latest"
  memory: "8G"

logLevel: DEBUG

hfTokenSecret:
  name: hf-token
  key: token

config:
  load:
    type: constant
    interval: 15
    stages:
    - rate: 100
      duration: 30
    - rate: 200
      duration: 30
    - rate: 300
      duration: 30
    - rate: 400
      duration: 30
    - rate: 500
      duration: 30
    - rate: 600
      duration: 30
    - rate: 700
      duration: 30
    - rate: 800
      duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 2048 # High-cache setting
      question_len: 256 # High-cache setting
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
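As a rough sizing aid, the sketch below totals the load that the `stages` list would generate. It assumes that `rate` is in requests per second and `duration` is in seconds; those units are an assumption here and should be checked against the `inference-perf` documentation. The variable names are hypothetical.

```python
# Stage list copied from the values file: rates 100..800 in steps of 100, 30 each.
stages = [{"rate": rate, "duration": 30} for rate in range(100, 900, 100)]

# Assumption: rate is requests per second and duration is seconds.
total_requests = sum(stage["rate"] * stage["duration"] for stage in stages)
total_seconds = sum(stage["duration"] for stage in stages)

# Dataset size from config.data.shared_prefix: num_groups * num_prompts_per_group.
unique_prompts = 256 * 16

print(f"stages: {len(stages)}")
print(f"total duration: {total_seconds} s")
print(f"total requests: {total_requests}")
print(f"requests per unique prompt: {total_requests / unique_prompts:.1f}")
```

Under those assumptions the run issues far more requests than there are distinct prompts in the dataset, which is worth keeping in mind when interpreting cache hit rates.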
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Low-Cache Configuration
job:
  image: "quay.io/inference-perf/inference-perf:latest"
  memory: "8G"

logLevel: INFO

hfTokenSecret:
  name: hf-token
  key: token

config:
  load:
    type: constant
    interval: 15
    stages:
    - rate: 100
      duration: 30
    - rate: 200
      duration: 30
    - rate: 300
      duration: 30
    - rate: 400
      duration: 30
    - rate: 500
      duration: 30
    - rate: 600
      duration: 30
    - rate: 700
      duration: 30
    - rate: 800
      duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 256 # Low-cache setting
      question_len: 2048 # Low-cache setting
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
{{/*
Expand the name of the chart.
*/}}
{{- define "precise-prefix-cache-aware.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "precise-prefix-cache-aware.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "precise-prefix-cache-aware.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "precise-prefix-cache-aware.labels" -}}
helm.sh/chart: {{ include "precise-prefix-cache-aware.chart" . }}
{{ include "precise-prefix-cache-aware.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "precise-prefix-cache-aware.selectorLabels" -}}
app.kubernetes.io/name: {{ include "precise-prefix-cache-aware.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Config Mount Path
*/}}
{{- define "precise-prefix-cache-aware.configMount" -}}
{{- print "/etc/inference-perf" -}}
{{- end }}

{{/*
Hugging Face Secret Name
*/}}
{{- define "precise-prefix-cache-aware.hfSecret" -}}
{{- printf "%s-hf-secret" (include "precise-prefix-cache-aware.fullname" .) -}}
{{- end }}

{{/*
Hugging Face Secret Key
*/}}
{{- define "precise-prefix-cache-aware.hfKey" -}}
{{- print "token" -}}
{{- end }}
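The naming logic of the `precise-prefix-cache-aware.fullname` helper can be hard to follow in template syntax. The Python sketch below restates it so the resulting resource names are easy to predict. The function names are hypothetical, and the sketch mirrors the template's branches rather than calling Helm.

```python
def trunc63(name: str) -> str:
    """Mimic `trunc 63 | trimSuffix "-"`: cut to 63 chars, then drop one trailing hyphen."""
    name = name[:63]
    return name[:-1] if name.endswith("-") else name


def fullname(release: str, chart: str, name_override: str = "", fullname_override: str = "") -> str:
    """Restate the fullname helper: override, else release, else release-chart."""
    if fullname_override:
        return trunc63(fullname_override)
    name = name_override or chart  # default .Chart.Name .Values.nameOverride
    if name in release:            # contains $name .Release.Name
        return trunc63(release)
    return trunc63(f"{release}-{name}")


chart_name = "precise-prefix-cache-aware"
print(fullname("high-cache", chart_name))
print(fullname("precise-prefix-cache-aware-run1", chart_name))
print(len(fullname("a" * 62, "b")))
```

With the release name `high-cache`, the ConfigMap and Job templates would append `-config` and `-job` to `high-cache-precise-prefix-cache-aware`.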
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "precise-prefix-cache-aware.fullname" . }}-config
  labels:
    {{- include "precise-prefix-cache-aware.labels" . | nindent 4 }}
data:
  config.yaml: |
    {{- $config := .Values.config | deepCopy -}}
    {{- $secretToken := index (lookup "v1" "Secret" .Release.Namespace .Values.hfTokenSecret.name).data .Values.hfTokenSecret.key | b64dec -}}
    {{- $_ := set $config.tokenizer "token" $secretToken -}}
    {{- toYaml $config | nindent 4 }}
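The ConfigMap template copies `.Values.config`, reads the token from the existing Secret, decodes it, and writes it under `tokenizer.token` before serializing. The Python sketch below walks through the same steps on plain dictionaries. `render_config` and the sample token are hypothetical, and JSON stands in for `toYaml` so the example needs only the standard library.

```python
import base64
import json
from copy import deepcopy


def render_config(config: dict, secret_data: dict, key: str) -> str:
    """Imitate the template: deep-copy, decode the token, inject it, serialize."""
    rendered = deepcopy(config)  # deepCopy: leave the original values untouched
    token = base64.b64decode(secret_data[key]).decode("utf-8")  # b64dec
    rendered["tokenizer"]["token"] = token  # set $config.tokenizer "token" ...
    return json.dumps(rendered, indent=2)  # the chart uses toYaml instead


values_config = {
    "tokenizer": {"pretrained_model_name_or_path": "meta-llama/Llama-3.1-8B-Instruct"}
}
# Secret data is stored base64-encoded; "hf_example" is a made-up token.
secret_data = {"token": base64.b64encode(b"hf_example").decode("ascii")}

output = render_config(values_config, secret_data, "token")
print(output)
```

In the sketch a missing Secret key raises `KeyError`. The template has a similar dependency: `lookup` needs a live cluster and returns an empty result under `helm template` or a client-side dry run, so the Secret must exist before installation. The decoded token also ends up in plain text in the rendered ConfigMap.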
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "precise-prefix-cache-aware.fullname" . }}-job
  labels:
    {{- include "precise-prefix-cache-aware.labels" . | nindent 4 }}
    app: inference-perf
spec:
  template:
    metadata:
      labels:
        {{- include "precise-prefix-cache-aware.selectorLabels" . | nindent 8 }}
        app: inference-perf
    spec:
      restartPolicy: Never
      containers:
        - name: inference-perf-container
          image: {{ .Values.job.image }}
          command: ["inference-perf"]
          args:
            - "--config_file"
            - "{{ include "precise-prefix-cache-aware.configMount" . }}/config.yaml"
            - "--log-level"
            - {{ .Values.logLevel }}
          env:
          {{- if .Values.hfToken }}
            # Reference the Secret this chart renders when hfToken is set,
            # using the hfSecret and hfKey helpers that the chart defines.
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: {{ include "precise-prefix-cache-aware.hfSecret" . }}
                  key: {{ include "precise-prefix-cache-aware.hfKey" . }}
          {{- end }}
          volumeMounts:
            - name: config-volume
              mountPath: {{ include "precise-prefix-cache-aware.configMount" . }}
              readOnly: true
          resources:
            requests:
              memory: {{ .Values.job.memory }}
      volumes:
        - name: config-volume
          configMap:
            name: {{ include "precise-prefix-cache-aware.fullname" . }}-config
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{{- if .Values.hfToken -}}
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "precise-prefix-cache-aware.hfSecret" . }}
  labels:
    {{- include "precise-prefix-cache-aware.labels" . | nindent 4 }}
type: Opaque
data:
  token: {{ .Values.hfToken | b64enc }}
{{- end }}
