The current autoscaler considers KV-cache utilization and request queue length as indicators of saturation, and accordingly scales the number of replicas up or down, without consideration of user-perceived performance metrics such as time to first token (TTFT) and inter-token latency (ITL), or of target specifications (SLOs) for such metrics.
This issue addresses the need for SLO-driven, model-based autoscaling. The reason for a performance model is to assess the number of replicas required to attain given SLOs, based on the observed/predicted load on the servers.
Several tasks are envisioned to address this issue.
- Add some basic functionality.
- Add a queueing model to analyze and size a replica of an inference server. The model may be called to (1) analyze the steady state (server state over a period of time, such as a control cycle, as opposed to a given current state), producing TTFT and ITL values for a given load (request rate and input/output token characteristics), and (2) size a server, providing the maximum request rate that attains the TTFT and ITL SLOs for a given workload. The queueing model may have a handful of internal parameters, which may be obtained off-line or on-line, preferably the latter. (PR Enhance queueing model used by queue analyzer #727)
- Add an on-line model tuner to continuously estimate the queueing model's internal parameters from observations, using standard filtering techniques. (PR Basic model tuner functionality #743)
- Conduct experiments to validate the queueing model and on-line model tuner.
- Add means to collect, from the Scheduler, metrics on the offered request rate, as opposed to the rate of successfully completed requests. The former should be used to size the server, whereas the latter should be used to tune/calibrate the queueing model.
- Add a mechanism to specify SLOs. Two alternatives are possible: (1) explicit, modifying the APIs to allow the user/admin to specify absolute/relative values for TTFT and ITL SLO targets, and (2) implicit, with the WVA automatically setting target values based on observations and optimizing an internal objective.
- Optionally, add a predictor of load (request rate and token distributions) for the next control cycle, given the statistical history, potentially using time-series techniques.
- Add a new model-based scaling engine that employs the above queueing model and on-line model tuner.
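To make the analyze/size split concrete, here is a minimal sketch of the two calls a steady-state queueing analyzer could expose, assuming an M/M/1 approximation of a single replica (the model in PR #727 may well differ); the function names `mean_ttft`, `max_rate_for_slo`, and `replicas_needed` are illustrative, not part of the codebase:

```python
import math

def mean_ttft(arrival_rate: float, service_rate: float) -> float:
    """Steady-state mean sojourn time of an M/M/1 queue, used here
    as a crude proxy for TTFT (rates in requests per second)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

def max_rate_for_slo(service_rate: float, ttft_slo: float) -> float:
    """Sizing call: largest arrival rate one replica can sustain while
    keeping mean TTFT within the SLO (inverts mean_ttft for the rate)."""
    return max(0.0, service_rate - 1.0 / ttft_slo)

def replicas_needed(offered_rate: float, service_rate: float,
                    ttft_slo: float) -> int:
    """Replicas required for the offered load, assuming traffic is
    split evenly across replicas."""
    per_replica = max_rate_for_slo(service_rate, ttft_slo)
    if per_replica <= 0.0:
        raise ValueError("SLO unattainable at this service rate")
    return math.ceil(offered_rate / per_replica)
```

For example, with a service rate of 10 req/s and a TTFT SLO of 0.5 s, each replica sustains up to 8 req/s, so an offered rate of 20 req/s requires 3 replicas. A more faithful model would account for prompt/decode phases and batching, but as noted below, the model only needs to be accurate enough to estimate the maximum sustainable load.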
Notes:
- The TTFT measure should be expressed relative to the number of prompt tokens in the request, e.g., TTFT per 100 prompt tokens.
- The queueing model need not be exact, as it is used solely to estimate the maximum load that attains given SLOs, as opposed to exact evaluation of the performance metrics themselves.
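Because the model need not be exact, the on-line tuner from the task list above can keep its internal parameters calibrated from observations of completed requests. A minimal sketch, assuming exponential smoothing as the filtering technique and a single service-rate parameter (the class and method names are hypothetical, not the PR #743 implementation):

```python
class ServiceRateTuner:
    """Exponentially-weighted on-line estimate of a replica's effective
    service rate, one of the queueing model's internal parameters."""

    def __init__(self, alpha: float = 0.2, initial_rate: float = 1.0):
        self.alpha = alpha          # smoothing factor in (0, 1]
        self.rate = initial_rate    # current estimate (requests/s)

    def observe(self, completed: int, interval_s: float,
                busy_fraction: float) -> float:
        """Fold one control cycle of observations into the estimate.

        completed / interval_s is the completed-request throughput, which
        is what should calibrate the model (the offered rate is what sizes
        the server); dividing by the busy fraction recovers the rate the
        server achieves while actually serving.
        """
        if busy_fraction > 0.0 and interval_s > 0.0:
            instantaneous = completed / interval_s / busy_fraction
            self.rate = ((1.0 - self.alpha) * self.rate
                         + self.alpha * instantaneous)
        return self.rate
```

Exponential smoothing is the simplest of the standard filters mentioned; a Kalman filter or recursive least squares would additionally track the estimate's uncertainty, which could inform how aggressively the scaling engine acts on it.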