Commit 0691d5f

Add batch inference example (#141)
1 parent 2d4f9a8 commit 0691d5f


setup.KubeConEU25/README.md

Lines changed: 130 additions & 13 deletions
@@ -57,7 +57,9 @@ Ethernet)](https://medium.com/@sunyanan.choochotkaew1/unlocking-gpudirect-rdma-o
 we will not cover advanced networking topics in this tutorial and disable this
 feature.
 
-## Storage Setup
+## MLBatch Setup
+
+### Storage Setup
 
 We assume storage is available by means of preconfigured
 [NFS](https://en.wikipedia.org/wiki/Network_File_System) servers. We configure
@@ -66,8 +68,7 @@ Provisioner](https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner)
 ```sh
 helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
 helm repo update
-```
-```
+
 helm install -n nfs-provisioner simplenfs nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
   --create-namespace \
   --set nfs.server=192.168.95.253 --set nfs.path=/var/repo/root/nfs \
@@ -78,10 +79,10 @@ helm install -n nfs-provisioner pokprod nfs-subdir-external-provisioner/nfs-subd
   --set nfs.server=192.168.98.96 --set nfs.path=/gpfs/fs_ec/pokprod002 \
   --set storageClass.name=nfs-client-pokprod --set storageClass.provisionerName=k8s-sigs.io/pokprod-nfs-subdir-external-provisioner
 ```
-Make sure to replace the server ips and paths above with the right one for your
-environment. While we make use of both storage classes in the remainder of the
-tutorial for the sake of demonstration, everything could be done with a single
-class.
+Make sure to replace the server IPs and paths above with the right values for
+your environment. While we make use of both storage classes in the remainder of
+the tutorial for the sake of demonstration, everything could be done with a
+single class.
 ```sh
 kubectl get storageclasses
 ```
@@ -91,11 +92,11 @@ nfs-client-pokprod k8s-sigs.io/pokprod-nfs-subdir-external-provisioner D
 nfs-client-simplenfs   k8s-sigs.io/simplenfs-nfs-subdir-external-provisioner   Delete   Immediate   true   15s
 ```
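Either class can then back a regular claim. As a quick smoke test of the provisioner (the claim name and size below are illustrative, not part of the tutorial):

```yaml
kubectl apply -f- << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: provisioner-smoke-test
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-client-simplenfs
EOF
```

The claim can be removed afterwards with `kubectl delete pvc provisioner-smoke-test`.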
 
-## Prometheus Setup
+### Prometheus Setup
 
 TODO
 
-## MLBatch Cluster Setup
+### MLBatch Cluster Setup
 
 We follow instructions from [CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md).
 
@@ -181,11 +182,11 @@ EOF
 ```
 We reserve 8 GPUs out of 24 for MLBatch's slack queue.
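MLBatch realizes this reservation as Kueue quota. Assuming a standard MLBatch install, the queues and their quotas can be inspected as follows (the slack queue's name varies by cluster and is illustrative here):

```sh
# list cluster-level queues managed by Kueue
kubectl get clusterqueues
# show the quota carried by the slack queue
kubectl describe clusterqueue slack-cluster-queue
```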
 
-# Autopilot Extended Setup
+### Autopilot Extended Setup
 
 TODO
 
-## MLBatch Teams Setup
+### MLBatch Teams Setup
 
 We configure team `blue` with user `alice` and `red` with user `bob` following
 the [team setup](../setup.k8s/TEAM-SETUP.md). Each team has a nominal quota of
@@ -308,9 +309,125 @@ portable. In this tutorial, we will rely on [user
 impersonation](https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation)
 with `kubectl` to run as a specific user.
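For example, to act as `alice` in the `blue` namespace, any command in this tutorial can carry the standard `kubectl` impersonation flag:

```sh
# run a query as user alice against the blue namespace
kubectl get pods --as alice -n blue
```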
 
-## Batch Inference with vLLM
+## Example Workloads
 
-TODO
+Each example workload below is submitted as an
+[AppWrapper](https://project-codeflare.github.io/appwrapper/). See
+[USAGE.md](../USAGE.md) for a detailed discussion of queues and workloads in an
+MLBatch cluster.
+
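Once applied, a submission can be tracked both as an AppWrapper and as the Kueue `Workload` it generates; a sketch, reusing the `blue` namespace from above:

```sh
kubectl get appwrappers --as alice -n blue
kubectl get workloads --as alice -n blue
```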
+### Batch Inference with vLLM
+
+In this example, `alice` runs a batch inference workload using
+[vLLM](https://docs.vllm.ai/en/latest/) to serve IBM's
+[granite-3.2-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct)
+model.
+
+First, `alice` creates a persistent volume claim to cache the model weights on
+first invocation so that subsequent instantiations of the model reuse the
+cached data.
+```yaml
+kubectl apply --as alice -n blue -f- << EOF
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: granite-3.2-8b-instruct
+spec:
+  accessModes:
+  - ReadWriteMany
+  resources:
+    requests:
+      storage: 50Gi
+  storageClassName: nfs-client-pokprod
+EOF
+```
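Before launching the job, it is worth checking that the claim binds; with the `nfs-client-pokprod` class above, whose volume binding mode is `Immediate`, this should only take a moment:

```sh
kubectl get pvc granite-3.2-8b-instruct --as alice -n blue
```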
+The workload wraps a Kubernetes Job in an AppWrapper. The Job consists of one
+Pod with two containers using an upstream `vllm-openai` image. The `vllm`
+container runs the inference runtime. The `load-generator` container submits a
+random series of requests to the inference runtime and reports a number of
+metrics such as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
+```yaml
+kubectl apply --as alice -n blue -f- << EOF
+apiVersion: workload.codeflare.dev/v1beta2
+kind: AppWrapper
+metadata:
+  name: batch-inference
+spec:
+  components:
+  - template:
+      apiVersion: batch/v1
+      kind: Job
+      metadata:
+        name: batch-inference
+      spec:
+        template:
+          metadata:
+            annotations:
+              kubectl.kubernetes.io/default-container: load-generator
+            labels:
+              app: batch-inference
+          spec:
+            terminationGracePeriodSeconds: 0
+            restartPolicy: Never
+            containers:
+            - name: vllm
+              image: quay.io/tardieu/vllm-openai:v0.7.3 # mirror of vllm/vllm-openai:v0.7.3
+              command:
+              # serve model and wait for halt signal
+              - sh
+              - -c
+              - |
+                vllm serve ibm-granite/granite-3.2-8b-instruct &
+                until [ -f /.config/halt ]; do sleep 1; done
+              ports:
+              - containerPort: 8000
+              resources:
+                requests:
+                  cpu: 4
+                  memory: 64Gi
+                  nvidia.com/gpu: 1
+                limits:
+                  cpu: 4
+                  memory: 64Gi
+                  nvidia.com/gpu: 1
+              volumeMounts:
+              - name: cache
+                mountPath: /.cache
+              - name: config
+                mountPath: /.config
+            - name: load-generator
+              image: quay.io/tardieu/vllm-benchmarks:v0.7.3
+              command:
+              # wait for vllm, submit batch of inference requests, send halt signal
+              - sh
+              - -c
+              - |
+                until nc -zv localhost 8000; do sleep 1; done;
+                python3 benchmark_serving.py \
+                  --model=ibm-granite/granite-3.2-8b-instruct \
+                  --backend=vllm \
+                  --dataset-name=random \
+                  --random-input-len=128 \
+                  --random-output-len=128 \
+                  --max-concurrency=16 \
+                  --num-prompts=512;
+                touch /.config/halt
+              volumeMounts:
+              - name: cache
+                mountPath: /.cache
+              - name: config
+                mountPath: /.config
+            volumes:
+            - name: cache
+              persistentVolumeClaim:
+                claimName: granite-3.2-8b-instruct
+            - name: config
+              emptyDir: {}
+EOF
+```
+The two containers are synchronized as follows: `load-generator` waits for
+`vllm` to be ready to accept requests and, upon completion of the batch, signals
+`vllm` to quit.
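The benchmark's progress, including the final TTFT and TPOT figures, can be followed from the `load-generator` container; `kubectl logs` accepts the Job reference and picks one of its pods:

```sh
kubectl logs --as alice -n blue job/batch-inference -c load-generator -f
```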
 
 ## Pre-Training with PyTorch
 