doc: consolidate installation steps #762

Open
shuynh2017 wants to merge 1 commit into llm-d:main from shuynh2017:doc-install-consolidate

@shuynh2017
Contributor

This PR:

  • consolidates all installation steps to one place - docs/user-guide/installation.md
  • removes install instructions from README.md and points to docs/user-guide/installation.md
  • removes install instructions from charts/workload-variant-autoscaler/README.md and points to docs/user-guide/installation.md
  • provides separate installation steps for wva controller and scale target models in docs/user-guide/installation.md
  • highlights the importance of updating the global prometheus-adapter and shows how to append to an existing prometheus-adapter config

@shuynh2017
Contributor Author

@lionelvillard pls review. Thanks.

Collaborator

@mamy-CS left a comment

Some comments. Thank you for the cleanup @shuynh2017

wva:
  controllerInstance: "my-unique-instance-id"
```
When running multiple WVA controllers in the same cluster (e.g., for parallel e2e tests or multi-tenant environments), use the `controllerInstance` configuration to prevent metrics conflicts between controllers. See [Multi-Controller Isolation](../../docs/user-guide/multi-controller-isolation.md) for details configuration.
Collaborator

Suggested change
- When running multiple WVA controllers in the same cluster (e.g., for parallel e2e tests or multi-tenant environments), use the `controllerInstance` configuration to prevent metrics conflicts between controllers. See [Multi-Controller Isolation](../../docs/user-guide/multi-controller-isolation.md) for details configuration.
+ When running multiple WVA controllers in the same cluster (e.g., for parallel e2e tests or multi-tenant environments), use the `controllerInstance` configuration to prevent metrics conflicts between controllers. See [Multi-Controller Isolation](../../docs/user-guide/multi-controller-isolation.md) for detailed configuration.

# - Prometheus and monitoring stack
# - vLLM emulator for testing
# See deploy/kind-emulator/README.md for detailed instructions
make deploy-llm-d-wva-emulated-on-kind
Collaborator

The chart README had a cleanup section that was removed. Consider adding cleanup instructions here.

Comment on lines +74 to +82
```
helm upgrade -i wva-model-a ./workload-variant-autoscaler \
  -n $WVA_NS \
  --set controller.enabled=false \
  --set va.enabled=true \
  --set hpa.enabled=true \
  --set llmd.namespace=team-a \
  --set llmd.modelName=my-model-a \
  --set llmd.modelID="meta-llama/Llama-3.1-8"
Collaborator

This is missing some important configuration options that were in the original chart README, such as va.accelerator. Please add the complete example:

helm upgrade -i wva-model-a ./workload-variant-autoscaler \
  -n $WVA_NS \
  --set controller.enabled=false \
  --set va.enabled=true \
  --set hpa.enabled=true \
  --set va.accelerator=L40S \
  --set llmd.namespace=team-a \
  --set llmd.modelName=my-model-a \
  --set llmd.modelID="meta-llama/Llama-3.1-8" \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000

export WVA_PROJECT=$PWD
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```
Collaborator

Maybe add a validation step after step 1 to verify that the CA cert file was created, to avoid issues?

  --set va.enabled=false \
  --set hpa.enabled=false \
  --set vllmService.enabled=false
```
Collaborator

Maybe add a verification step after step 3 to confirm the controller is running, before moving further?
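
Both verification steps suggested in this review could be sketched as small shell helpers. This is only an illustration under stated assumptions: the function names are hypothetical, the certificate is assumed to be PEM-encoded at the `/tmp/prometheus-ca.crt` path passed to `--set-file` in this PR, and the controller Deployment name is not given in this thread, so it must be supplied as an argument (it can be looked up with `kubectl get deployments -n $WVA_NS`).

```shell
#!/bin/sh
# Hypothetical helpers; names and checks are illustrative, not part of the PR.

# Succeeds only if the CA cert file exists, is non-empty, and looks like PEM.
# Assumption: the cert extracted in step 1 is PEM-encoded.
check_ca_cert() {
  cert_path="${1:-/tmp/prometheus-ca.crt}"
  if [ ! -s "$cert_path" ]; then
    echo "CA cert missing or empty: $cert_path" >&2
    return 1
  fi
  if ! grep -q "BEGIN CERTIFICATE" "$cert_path"; then
    echo "CA cert is not PEM-encoded: $cert_path" >&2
    return 1
  fi
  echo "CA cert OK: $cert_path"
}

# Waits for the controller Deployment rollout to complete.
# Both arguments are required; the Deployment name is an assumption
# the caller must supply, since it is not stated in this thread.
check_controller() {
  ns="$1"
  deploy="$2"
  if [ -z "$ns" ] || [ -z "$deploy" ]; then
    echo "usage: check_controller <namespace> <deployment>" >&2
    return 2
  fi
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}
```

Chaining the two, for example `check_ca_cert /tmp/prometheus-ca.crt && check_controller "$WVA_NS" <deployment-name>`, would stop before the controller check if the certificate is missing.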

## Installation Methods

- ### Option 1: Helm Installation (Recommended)
+ ### Option 1: Helm Installation (Recommended, on OpenShift)
Collaborator

The heading says "Recommended, on OpenShift", but the content is entirely OpenShift-specific, which might confuse non-OpenShift users. Please clarify here.

helm upgrade -i workload-variant-autoscaler ./workload-variant-autoscaler \
  -n $WVA_NS \
  --set-file wva.prometheus.caCert=/tmp/prometheus-ca.crt \
  --set controller.enabled=true \
Collaborator

The --set controller.enabled=true is explicit but redundant (it's the default). This is fine for clarity, but you could also remove it to reduce verbosity. Either approach works; I guess keeping it makes the intent clear.

@shuynh2017
Contributor Author

@mamy-CS thank you for your comments. @lionelvillard also provided ideas for further organization. I will update the PR.
