Merged
16 changes: 6 additions & 10 deletions docs/tutorials/demo.md
@@ -1,10 +1,10 @@
# vLLM with WVA autoscaler


Notes:
1. The experiments were run on an OpenShift cluster with H100 GPUs.
2. To set up `vLLM` on OpenShift, refer to [vllm-samples.md](vllm-samples.md).
3. We use `guidellm` as the load generator. Refer to [guidellm-sample.md](guidellm-sample.md) for a quick tutorial on building the `guidellm` image that is used in a `Job` resource.
4. The WVA autoscaler is assumed to be deployed in the `workload-variant-autoscaler-system` namespace.


@@ -96,19 +96,19 @@ spec:
- "--data"
- "prompt_tokens=128,output_tokens=512"
- "--output-path"
- "/tmp/benchmarks.json"
restartPolicy: Never
backoffLimit: 4
```

In each job, fill in `image: <image-repo>:<tag>` with your `guidellm` image repo and tag. The `<rate>` and `<max-seconds>` values are set as follows.

- In `guidellm-job-1.yaml`, we set `<rate>` and `<max-seconds>` to `8` and `1800` respectively. This makes the `guidellm` client send requests at a rate of `8` requests per second (480 req/min) for `30` minutes.
- In `guidellm-job-2.yaml`, we set `<rate>` and `<max-seconds>` to `8` and `1200` respectively. We start this job about 5 minutes after starting `guidellm-job-1`. While both jobs are running, the effective request rate is `8+8 = 16` requests per second (960 req/min).
- In `guidellm-job-3.yaml`, we set `<rate>` and `<max-seconds>` to `8` and `720` respectively. We start this job about 5 minutes after starting `guidellm-job-2`. While all three jobs are running, the effective request rate is `8+8+8 = 24` requests per second (1440 req/min) for 12 minutes.
- With this setup, `guidellm-job-3` completes first, bringing the effective request rate back down to `16` req/sec. `guidellm-job-2` completes next, bringing the rate down to `8` req/sec. Finally, `guidellm-job-1` completes, after which no further requests are sent.
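
The staggered schedule above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the tutorial's tooling: the durations and per-job rates come from the text, while the 300-second stagger is an assumption taken from the "about 5 minutes" wait in the summary, and the names `JOBS` and `rate_timeline` are hypothetical.

```python
from typing import List, Tuple

# (name, start offset in s, duration in s, rate in req/s).
# Durations and rates are from the tutorial; the 300 s offsets are an
# assumption based on the approximate 5-minute wait between jobs.
JOBS = [
    ("guidellm-job-1", 0, 1800, 8),
    ("guidellm-job-2", 300, 1200, 8),
    ("guidellm-job-3", 600, 720, 8),
]


def rate_timeline(jobs) -> List[Tuple[int, int]]:
    """Return (time_s, aggregate_rate) at every instant the rate changes."""
    events = []
    for _name, start, duration, rate in jobs:
        events.append((start, rate))               # job starts: rate goes up
        events.append((start + duration, -rate))   # job ends: rate goes down

    timeline: List[Tuple[int, int]] = []
    current = 0
    for t, delta in sorted(events):
        current += delta
        if timeline and timeline[-1][0] == t:
            # Several jobs changed state at the same instant: merge them.
            timeline[-1] = (t, current)
        else:
            timeline.append((t, current))
    return timeline


if __name__ == "__main__":
    for t, rate in rate_timeline(JOBS):
        print(f"t={t / 60:5.1f} min  rate={rate:2d} req/s ({rate * 60} req/min)")
```

Under these assumptions the aggregate rate steps through 8, 16 and 24 req/s, then back down through 16 and 8 to 0 as `guidellm-job-3`, `guidellm-job-2` and `guidellm-job-1` finish in that order. The exact switch-over times shift with the actual wait between jobs.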

**Dynamic Load Generation Summary:**
- Step 1: `oc apply -f guidellm-job-1.yaml`. Wait about 5 minutes before continuing to step 2.
- Step 2: `oc apply -f guidellm-job-2.yaml`. Wait about 5 minutes before continuing to step 3.
- Step 3: `oc apply -f guidellm-job-3.yaml`
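
If you would rather not time the steps by hand, they can be scripted. The sketch below is one possible approach, assuming `oc` is on your `PATH`, you are logged in to the cluster, and the three job files sit in the current directory. The helper names `run_oc_apply` and `launch_staggered` are hypothetical, and the 300-second wait is the approximate 5 minutes from the steps above.

```python
import subprocess
import time
from typing import Callable, Sequence

JOB_FILES = ["guidellm-job-1.yaml", "guidellm-job-2.yaml", "guidellm-job-3.yaml"]


def run_oc_apply(path: str) -> None:
    """Run `oc apply -f <path>` and raise if the command fails."""
    subprocess.run(["oc", "apply", "-f", path], check=True)


def launch_staggered(
    files: Sequence[str],
    wait_s: float = 300,
    apply: Callable[[str], None] = run_oc_apply,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Apply each job file in order, pausing wait_s seconds between jobs."""
    for i, path in enumerate(files):
        apply(path)
        if i < len(files) - 1:
            # No wait is needed after the last job has been submitted.
            sleep(wait_s)


if __name__ == "__main__":
    launch_staggered(JOB_FILES)
```

The `apply` and `sleep` parameters are injectable so the ordering can be dry-run without a cluster, for example by passing `apply=print` and `sleep=lambda s: None`.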
@@ -118,8 +118,4 @@ In each job, fill in `image: <image-repo>:<tag>` with your `guidellm` image repo
## WVA Performance
The following figure shows the behaviour observed from the controller logs.

-![Autoscaler Diagram](../../docs/diagrams/autoscaler-demo.png)
-
-
-
-
+![Autoscaler Diagram](../design/diagrams/autoscaler-demo.png)