@@ -57,7 +57,9 @@ Ethernet)](https://medium.com/@sunyanan.choochotkaew1/unlocking-gpudirect-rdma-o
 we will not cover advanced networking topics in this tutorial and disable this
 feature.
 
-## Storage Setup
+## MLBatch Setup
+
+### Storage Setup
 
 We assume storage is available by means of preconfigured
 [NFS](https://en.wikipedia.org/wiki/Network_File_System) servers. We configure
@@ -66,8 +68,7 @@ Provisioner](https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner)
 ```sh
 helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
 helm repo update
-```
-```
+
 helm install -n nfs-provisioner simplenfs nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
   --create-namespace \
   --set nfs.server=192.168.95.253 --set nfs.path=/var/repo/root/nfs \
@@ -78,10 +79,10 @@ helm install -n nfs-provisioner pokprod nfs-subdir-external-provisioner/nfs-subd
   --set nfs.server=192.168.98.96 --set nfs.path=/gpfs/fs_ec/pokprod002 \
   --set storageClass.name=nfs-client-pokprod --set storageClass.provisionerName=k8s-sigs.io/pokprod-nfs-subdir-external-provisioner
 ```
-Make sure to replace the server ips and paths above with the right one for your
-environment. While we make use of both storage classes in the remainder of the
-tutorial for the sake of demonstration, everything could be done with a single
-class.
+Make sure to replace the server IPs and paths above with the right values for
+your environment. While we make use of both storage classes in the remainder of
+the tutorial for the sake of demonstration, everything could be done with a
+single class.
 ```sh
 kubectl get storageclasses
 ```
@@ -91,11 +92,11 @@ nfs-client-pokprod k8s-sigs.io/pokprod-nfs-subdir-external-provisioner D
 nfs-client-simplenfs   k8s-sigs.io/simplenfs-nfs-subdir-external-provisioner   Delete   Immediate   true   15s
 ```
 
-## Prometheus Setup
+### Prometheus Setup
 
 TODO
 
-## MLBatch Cluster Setup
+### MLBatch Cluster Setup
 
 We follow instructions from [CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md).
 
@@ -181,11 +182,11 @@ EOF
 ```
 We reserve 8 GPUs out of 24 for MLBatch's slack queue.
 
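+To verify the reservation, one can inspect the nominal quotas of the slack
+cluster queue; the queue name below is an assumption, adjust it to match your
+setup.
+```sh
+# assumes the slack ClusterQueue is named "slack-cluster-queue"; adjust as needed
+kubectl get clusterqueue slack-cluster-queue -o yaml
+```
+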
-# Autopilot Extended Setup
+### Autopilot Extended Setup
 
 TODO
 
-## MLBatch Teams Setup
+### MLBatch Teams Setup
 
 We configure team `blue` with user `alice` and `red` with user `bob` following
 the [team setup](../setup.k8s/TEAM-SETUP.md). Each team has a nominal quota of
@@ -308,9 +309,125 @@ portable. In this tutorial, we will rely on [user
 impersonation](https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation)
 with `kubectl` to run as a specific user.
 
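+For example, to list the pods in the `blue` namespace as user `alice`:
+```sh
+kubectl get pods --as alice -n blue
+```
+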
-## Batch Inference with vLLM
+## Example Workloads
 
-TODO
+Each example workload below is submitted as an
+[AppWrapper](https://project-codeflare.github.io/appwrapper/). See
+[USAGE.md](../USAGE.md) for a detailed discussion of queues and workloads in an
+MLBatch cluster.
+
+### Batch Inference with vLLM
+
+In this example, `alice` runs a batch inference workload using
+[vLLM](https://docs.vllm.ai/en/latest/) to serve IBM's
+[granite-3.2-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct)
+model.
+
+First, `alice` creates a persistent volume claim to cache the model weights on
+first invocation so that subsequent instantiations of the model reuse the
+cached data.
+```yaml
+kubectl apply --as alice -n blue -f- <<EOF
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: granite-3.2-8b-instruct
+spec:
+  accessModes:
+  - ReadWriteMany
+  resources:
+    requests:
+      storage: 50Gi
+  storageClassName: nfs-client-pokprod
+EOF
+```
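+The claim should report a `Bound` status once the provisioner has created the
+backing volume:
+```sh
+kubectl get pvc granite-3.2-8b-instruct --as alice -n blue
+```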
+The workload wraps a Kubernetes Job in an AppWrapper. The Job consists of one
+Pod with two containers using an upstream `vllm-openai` image. The `vllm`
+container runs the inference runtime. The `load-generator` container submits a
+random series of requests to the inference runtime and reports a number of
+metrics such as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
+```yaml
+kubectl apply --as alice -n blue -f- <<EOF
+apiVersion: workload.codeflare.dev/v1beta2
+kind: AppWrapper
+metadata:
+  name: batch-inference
+spec:
+  components:
+  - template:
+      apiVersion: batch/v1
+      kind: Job
+      metadata:
+        name: batch-inference
+      spec:
+        template:
+          metadata:
+            annotations:
+              kubectl.kubernetes.io/default-container: load-generator
+            labels:
+              app: batch-inference
+          spec:
+            terminationGracePeriodSeconds: 0
+            restartPolicy: Never
+            containers:
+            - name: vllm
+              image: quay.io/tardieu/vllm-openai:v0.7.3 # mirror of vllm/vllm-openai:v0.7.3
+              command:
+              # serve model and wait for halt signal
+              - sh
+              - -c
+              - |
+                vllm serve ibm-granite/granite-3.2-8b-instruct &
+                until [ -f /.config/halt ]; do sleep 1; done
+              ports:
+              - containerPort: 8000
+              resources:
+                requests:
+                  cpu: 4
+                  memory: 64Gi
+                  nvidia.com/gpu: 1
+                limits:
+                  cpu: 4
+                  memory: 64Gi
+                  nvidia.com/gpu: 1
+              volumeMounts:
+              - name: cache
+                mountPath: /.cache
+              - name: config
+                mountPath: /.config
+            - name: load-generator
+              image: quay.io/tardieu/vllm-benchmarks:v0.7.3
+              command:
+              # wait for vllm, submit batch of inference requests, send halt signal
+              - sh
+              - -c
+              - |
+                until nc -zv localhost 8000; do sleep 1; done;
+                python3 benchmark_serving.py \
+                  --model=ibm-granite/granite-3.2-8b-instruct \
+                  --backend=vllm \
+                  --dataset-name=random \
+                  --random-input-len=128 \
+                  --random-output-len=128 \
+                  --max-concurrency=16 \
+                  --num-prompts=512;
+                touch /.config/halt
+              volumeMounts:
+              - name: cache
+                mountPath: /.cache
+              - name: config
+                mountPath: /.config
+            volumes:
+            - name: cache
+              persistentVolumeClaim:
+                claimName: granite-3.2-8b-instruct
+            - name: config
+              emptyDir: {}
+EOF
+```
+The two containers are synchronized as follows: `load-generator` waits for
+`vllm` to be ready to accept requests and, upon completion of the batch,
+signals `vllm` to exit.
 
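+Once the workload is admitted, its progress can be followed, for example, by
+streaming the logs of the `load-generator` container:
+```sh
+kubectl logs --as alice -n blue job/batch-inference -c load-generator -f
+```
+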
-## Pre-Training with PyTorch
+### Pre-Training with PyTorch
 