@@ -561,8 +561,8 @@ model.
 <details>
 
 First, `alice` creates a persistent volume claim to cache the model weights on
-first invocation so that subsequent instantiation of the model will reuse the
-cached data.
+first invocation so that subsequent instantiations of the model will reuse the
+cached model weights.
 ```yaml
 kubectl apply --as alice -n blue -f- << EOF
 apiVersion: v1
@@ -579,10 +579,10 @@ spec:
 EOF
 ```
 The workload wraps a Kubernetes Job in an AppWrapper. The Job consists of one
-Pod with two containers using an upstream `vllm-openai` image. The `vllm`
-container runs the inference runtime. The `load-generator` container submits a
-random series of requests to the inference runtime and reports a number of
-metrics such as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
+Pod with two containers. The `vllm` container runs the inference runtime using
+an upstream `vllm-openai` image. The `load-generator` container submits a random
+series of requests to the inference runtime and reports a number of metrics such
+as _Time to First Token_ (TTFT) and _Time per Output Token_ (TPOT).
 ```yaml
 kubectl apply --as alice -n blue -f- << EOF
 apiVersion: workload.codeflare.dev/v1beta2
@@ -599,16 +599,13 @@ spec:
       spec:
         template:
           metadata:
-            annotations:
-              kubectl.kubernetes.io/default-container: load-generator
             labels:
               app: batch-inference
           spec:
-            terminationGracePeriodSeconds: 0
             restartPolicy: Never
             containers:
             - name: vllm
-              image: quay.io/tardieu/vllm-openai:v0.7.3 # mirror of vllm/vllm-openai:v0.7.3
+              image: quay.io/tardieu/vllm-openai:v0.7.3 # vllm/vllm-openai:v0.7.3
               command:
               # serve model and wait for halt signal
               - sh
@@ -635,7 +632,7 @@ spec:
             - name: load-generator
               image: quay.io/tardieu/vllm-benchmarks:v0.7.3
               command:
-              # wait for vllm, submit batch of inference requests, send halt signal
+              # wait for vllm, submit batch of requests, send halt signal
               - sh
               - -c
               - |
@@ -666,6 +663,15 @@ The two containers are synchronized as follows: `load-generator` waits for
 `vllm` to be ready to accept requests and, upon completion of the batch, signals
 `vllm` to make it quit.
 
+Stream the logs of the `vllm` container with:
+```sh
+kubectl logs --as alice -n blue -l app=batch-inference -c vllm -f
+```
+Stream the logs of the `load-generator` container with:
+```sh
+kubectl logs --as alice -n blue -l app=batch-inference -c load-generator -f
+```
+
 </details>
 
 ### Pre-Training with PyTorch
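
A note on the batch-inference diff above: the halt-signal handshake between the two containers is described in prose but its command blocks are elided by the hunks. The following is a minimal sketch of the idea only, not the actual manifest; the shared volume path `/shared`, the halt file name, the model name, and the benchmark invocation are all assumptions.
```sh
# Illustrative sketch of the two-container handshake (paths and names are assumed).
# Both containers are presumed to mount a shared volume at /shared.

# --- vllm container ---
# Serve the model in the background, then block until the halt file appears.
vllm serve "$MODEL" &                              # $MODEL is a placeholder
while [ ! -f /shared/halt ]; do sleep 5; done

# --- load-generator container ---
# Poll the OpenAI-compatible endpoint until it responds, run the benchmark
# (invocation elided here), then create the halt file so the vllm container exits.
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 5; done
# ...benchmark run elided...
touch /shared/halt
```
Because both containers run in the same Pod, they share the Pod's network namespace (hence `localhost`) and can coordinate through a shared volume.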