Experimental SLO-Aware Routing and Latency Prediction (#1568)
* add latency predictor
* add cv in model and update epp deployment
* bug fix
* track mape for predictions
* add running queue size to metrics
* add xgboost regressor and update tpot sampling logic
* emit predicted and actual ttft tpot in body
* separate servers for training and prediction
* add latency predictor
* put the predictor functions in director in a helper function
* add scores to reqctx
* record prediction duration metrics
* add prefix cache score to model input
* slo based routing changes
* retrieve request priority queue from the datastore
* update scoring logic
* better initial implementation
* Add scheduling profile, working state
* remove latencypredictor from director
* Move all latency prediction logic out of the director and into the scheduling profile. Make all Request/Response plugins take in RequestContext
* progress towards fixing up merge conflicts from latency predictor merge
* More refactor progress, fixing and adding tests
* working state, latency prediction
* Clean up changes, remove unneeded files, working functionality without latency flag and scheduling plugins
* Rebase cleanup, remove duplicate lines
* Integrate new alpha-beta slo scoring into scoring plugin
* Fix prefix cache scoring for slo-aware routing
* Add pycache for latency predictor to gitignore
* Rebase with main
* Fix prefix cache scoring being piped to latencyprediction_helper
* add dependencies in scorer
* change to single profile
* restore two profiles
* update admit request to shed based on predictions
* add TODOs for future changes
* Change artifact registry references to personal compiled images
* Fix existing non-slo aware routing unit tests
* update latency predictor with better eval metrics
* Fix saturation detector unit test
* Change naming of SLO headers and prediction based routing header
* Remove port 9002 service on InferencePool causing make test to fail
* Fix epp hermetic integration test to expect ProcessingMode Send in response header
---------
Co-authored-by: kaushikmitr <[email protected]>
```diff
 	totalQueuedRequestsMetric = flag.String("total-queued-requests-metric", runserver.DefaultTotalQueuedRequestsMetric, "Prometheus metric for the number of queued requests.")
+	totalRunningRequestsMetric = flag.String("total-running-requests-metric", runserver.DefaultTotalRunningRequestsMetric, "Prometheus metric for the number of running requests.")
 	kvCacheUsagePercentageMetric = flag.String("kv-cache-usage-percentage-metric", runserver.DefaultKvCacheUsagePercentageMetric, "Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).")
 	// LoRA metrics
 	loraInfoMetric = flag.String("lora-info-metric", runserver.DefaultLoraInfoMetric, "Prometheus metric for the LoRA info metrics (must be in vLLM label format).")
@@ -107,6 +111,9 @@ var (
 	modelServerMetricsHttpsInsecureSkipVerify = flag.Bool("model-server-metrics-https-insecure-skip-verify", true, "When using 'https' scheme for 'model-server-metrics-scheme', configure 'InsecureSkipVerify' (default to true)")
 	haEnableLeaderElection = flag.Bool("ha-enable-leader-election", false, "Enables leader election for high availability. When enabled, readiness probes will only pass on the leader.")
+
+	// Latency Predictor Flag
+	enableLatencyPredictor = flag.Bool("enable-latency-predictor", false, "Enable the regression-based latency predictor and scheduler scorer.")
```
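The new flags are plain standard-library `flag` definitions, so they are wired up like any other EPP flag and passed on the command line. A self-contained sketch of the pattern (the binary name in the comment and the surrounding setup are assumptions, not the project's actual code):

```go
package main

import (
	"flag"
	"fmt"
)

// Mirrors the flag added in the diff above. The real code defaults this
// through the runserver package; false here is a placeholder.
var enableLatencyPredictor = flag.Bool("enable-latency-predictor", false,
	"Enable the regression-based latency predictor and scheduler scorer.")

func main() {
	// Hypothetical invocation: ./epp --enable-latency-predictor=true
	flag.Parse()
	fmt.Println("latency predictor enabled:", *enableLatencyPredictor)
}
```

Because the feature is gated behind a boolean flag that defaults to false, existing deployments keep the non-SLO-aware routing path unless they opt in.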