diff --git a/docs/proposals/NNNN-template/README.md b/docs/proposals/NNNN-template/README.md
new file mode 100644
index 00000000..90d23c8d
--- /dev/null
+++ b/docs/proposals/NNNN-template/README.md
@@ -0,0 +1,267 @@
# Proposal-NNNN: Your short, descriptive title

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

## Summary

## Motivation

### Goals

### Non-Goals

## Proposal

### User Stories (Optional)

#### Story 1

#### Story 2

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

## Design Details

### Test Plan

[ ] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `<package>`: `<date>` - `<test coverage>`

##### Integration tests

- <test>: <link to test coverage>

##### e2e tests

- <test>: <link to test coverage>

### Graduation Criteria

## Implementation History

## Drawbacks

## Alternatives

diff --git a/docs/proposals/NNNN-template/proposal.yaml b/docs/proposals/NNNN-template/proposal.yaml
new file mode 100644
index 00000000..8d76bee0
--- /dev/null
+++ b/docs/proposals/NNNN-template/proposal.yaml
@@ -0,0 +1,40 @@
title: Proposal Template
proposal-number: NNNN
authors:
  - TBD
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
creation-date: yyyy-mm-dd
reviewers:
  - TBD
approvers:
  - TBD

see-also:
  - "/proposals/1234-we-heard-you-like-proposals"
  - "/proposals/2345-everyone-gets-a-proposal"
replaces:
  - "/proposals/3456-replaced-proposal"

# The target maturity stage in the current dev cycle for this proposal.
stage: alpha|beta|stable

# The most recent milestone for which work toward delivery of this proposal has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.2"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v0.2"
  beta: "v0.3"
  stable: "v0.5"

# The following PRR answers are required at alpha release.
# List the feature gate name and the components for which it must be enabled.
feature-gates:
  - name: MyFeature
disable-supported: true

# The following PRR answers are required at beta release.
metrics:
  - my_feature_metric

diff --git a/docs/proposals/lora-autoscaler/README.md b/docs/proposals/lora-autoscaler/README.md
new file mode 100644
index 00000000..4b742663
--- /dev/null
+++ b/docs/proposals/lora-autoscaler/README.md
@@ -0,0 +1,352 @@
# Proposal-287: LoRA Autoscaler

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

## Summary

## Motivation

GenAI foundation models keep growing in size, which leads to high latency when autoscaling new model servers, and even more so when new nodes must be provisioned. A LoRA adapter, on the other hand, is a lightweight way to serve different scenarios with lower training cost and resource requirements. The combination of **Foundation Model** + **Multi LoRA Adapter** would be a dense serving solution for both cost saving and latency reduction.

### Goals

- Support serving LoRA models via both the Playground and the Inference Service
- Support swapping LoRA models at runtime
- Support autoscaling LoRAs based on load
- Keep the autoscaling framework easy to extend with other metrics
- Integrate with vLLM as the first step, since it supports loading/unloading LoRAs at runtime
- Route LoRA requests to the specific server hosting that LoRA

### Non-Goals

- Efficient LoRA model loading; this should be designed in another proposal
- Implementing different scaling policies; this will be designed in another proposal
- More fine-grained LoRA request routing policies, which should be designed in another proposal, such as:
  - spread scheduling
  - binpack scheduling
  - latency-aware scheduling
  - throughput-aware scheduling
- Supporting other inference engines like SGLang
- More fine-grained LoRA replica dispatching policies; right now we simply spread the LoRAs across replicas as evenly as we can

## Proposal

### User Stories (Optional)

#### Story 1

I want to serve several LoRAs on top of the same foundation model, and I want each LoRA request routed to the specific server hosting that LoRA, rather than routed randomly or round-robin.
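
For concreteness, here is a minimal sketch of such a request, assuming an OpenAI-compatible completions endpoint where the target LoRA is named in the `model` field; the gateway URL and adapter name below are hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical endpoint and adapter name; the point is that the "model"
	// field carries the LoRA adapter name, so the gateway can route this
	// request to a replica that already has lora-1 loaded.
	body := []byte(`{"model": "lora-1", "prompt": "Hello", "max_tokens": 16}`)
	resp, err := http.Post("http://gateway.example/v1/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```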

#### Story 2

I want the LoRAs to be autoscaled dynamically based on requests and load. For example, if model server A is under high load with lora-1 while model server B receives no traffic for lora-2, then server B should unload lora-2 and load lora-1 for better traffic load balancing. This is similar to how HPA works for Pods.

### Notes/Constraints/Caveats (Optional)

- The number of replicas serving a LoRA should not drop to 0 on the cluster side, to avoid cold starts.
- We may need to integrate with inference engines to obtain more precise metrics.

### Risks and Mitigations

The metric is a reactive indicator, which means some latency is unavoidable, just as with HPA. We'll try to mitigate that latency by offering different configurable policies. Scaling down has the same problem with respect to cost.

## Design Details

### The LoRA Autoscaler

Right now, vLLM has one LoRA metric, `vllm:lora_requests_info`, containing three labels:

- running_lora_adapters: a per-adapter count of the number of requests currently running with that adapter, formatted as a comma-separated string
- waiting_lora_adapters: similar, except counting requests that are waiting to be scheduled
- max_lora: the static "max number of LoRAs in a single batch" configuration

We will leverage `waiting_lora_adapters` as the dominant metric for the autoscaling decision.
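
For illustration, here is a minimal sketch of consuming that label, assuming the comma-separated format described above; the sample metric line in the comment and its adapter names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// A scraped sample might look roughly like this (values are illustrative only):
//
//	vllm:lora_requests_info{running_lora_adapters="lora-1,lora-3",waiting_lora_adapters="lora-2",max_lora="4"} 1.0
//
// splitAdapters turns a comma-separated label value into adapter names.
func splitAdapters(label string) []string {
	if label == "" {
		return nil
	}
	adapters := strings.Split(label, ",")
	for i := range adapters {
		adapters[i] = strings.TrimSpace(adapters[i])
	}
	return adapters
}

func main() {
	fmt.Println(splitAdapters("lora-1,lora-3")) // [lora-1 lora-3]
}
```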

At a high level, the workflow looks like this:

- Create a Playground or Inference Service with the LoRAs configured.
- Dispatch the LoRAs across the instances as evenly as we can (see the sketch after this list), for instance:
  - if we have 2 replicas with 3 LoRAs, we may dispatch the LoRAs to the replicas as follows:
    - Replica 1: lora-1, lora-3
    - Replica 2: lora-2
  - if we have 7 replicas with 3 LoRAs, we may dispatch the LoRAs to the replicas as follows:
    - Replica 1: lora-1
    - Replica 2: lora-2
    - Replica 3: lora-3
    - Replica 4: lora-1
    - Replica 5: lora-2
    - Replica 6: lora-3
    - Replica 7: lora-1

  Make sure every LoRA exists on **at least one replica**, to avoid LoRA loading overhead at runtime.
- Once a LoRA model is loaded successfully, the gateway will update the route table for the LoRA requests.
- The LoRA autoscaler will monitor the `waiting_lora_adapters` metric:

  - Once the target threshold is exceeded, the **LoRA autoscaler**, a separate controller, will step in. It will trigger loading of the hot LoRAs, but never beyond the max_lora configuration, and it will never load the same LoRA twice on one replica, which would be meaningless.
  - Likewise, once a LoRA is **under low load**, the LoRA autoscaler will first cut the corresponding traffic to that server and then offload the LoRA model. Note that the offload threshold should be lower than the loading threshold to avoid frequent loading/unloading overhead; both thresholds should be configurable.
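
Below is a minimal sketch of the even-dispatch step, assuming a plain round-robin walk; the function name and signature are illustrative rather than a final API, but it reproduces both replica layouts from the examples above:

```go
package main

import "fmt"

// dispatchLoRAs spreads adapters across replicas round-robin, so that every
// adapter is resident on at least one replica and no replica holds duplicates.
func dispatchLoRAs(replicas int, adapters []string) [][]string {
	assignment := make([][]string, replicas)
	if replicas == 0 || len(adapters) == 0 {
		return assignment
	}
	// Walk max(replicas, len(adapters)) slots so every replica gets at least
	// one adapter and every adapter lands on at least one replica.
	slots := replicas
	if len(adapters) > slots {
		slots = len(adapters)
	}
	for i := 0; i < slots; i++ {
		assignment[i%replicas] = append(assignment[i%replicas], adapters[i%len(adapters)])
	}
	return assignment
}

func main() {
	loras := []string{"lora-1", "lora-2", "lora-3"}
	fmt.Println(dispatchLoRAs(2, loras)) // [[lora-1 lora-3] [lora-2]]
	fmt.Println(dispatchLoRAs(7, loras)) // [[lora-1] [lora-2] [lora-3] [lora-1] [lora-2] [lora-3] [lora-1]]
}
```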

There are several concerns here about the LoRA autoscaling:

- The metric algorithm: right now, we'll use `waiting_lora_adapters` as the dominant metric for the autoscaling decision.
- How the LoRA server knows when to load/offload a LoRA: we'll introduce a new CRD to track the LoRA loading status.
- The load dispatching policy: we'll make decisions based on the number of LoRAs and the waiting requests. The policy should be configurable for future extension.
- The boundary with Pod autoscaling: basically we'll autoscale the LoRAs first; once the LoRA autoscaler can't handle the load, for example when max_lora is reached on all instances, the Pod autoscaler will step in. But since Pod autoscaling may also depend on waiting requests, we may need to tune the metrics.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- function tests in the gateway
- function tests in LoRA dispatching

##### Integration tests

- webhook tests for LoRA validation
- controller tests to make sure the Playground or Service runs successfully

##### e2e tests

- e2e tests to make sure the LoRA service runs successfully
- e2e tests to make sure the LoRA autoscaling works as expected, both scaling up and down

### Graduation Criteria

## Implementation History

- 2025-03-13: Proposal submitted

## Drawbacks

TODO.

## Alternatives

None.

diff --git a/docs/proposals/lora-autoscaler/proposal.yaml b/docs/proposals/lora-autoscaler/proposal.yaml
new file mode 100644
index 00000000..3d2f4a22
--- /dev/null
+++ b/docs/proposals/lora-autoscaler/proposal.yaml
@@ -0,0 +1,28 @@
title: LoRA Autoscaler
proposal-number: 287
authors:
  - "@kerthcet"
status: implementable
creation-date: 2025-02-28
reviewers:
  - TBD
approvers:
  - TBD

# The target maturity stage in the current dev cycle for this proposal.
stage: alpha

# The most recent milestone for which work toward delivery of this proposal has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v0.2"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v0.2"
  beta: TBD
  stable: TBD

# The following PRR answers are required at beta release.
metrics:
  - TBD