
Commit 17824ba

Update dynamic-lora-sidecar to expose metrics to track loaded adapters (#980)
* Add a metric to track loaded adapters
* Update the sample manifests
* Add explanation of metrics from dynamic LoRA adapter sidecar
* Add explanation of metrics from dynamic LoRA adapter sidecar (take 2)
* Update metrics.md based on feedback
1 parent 5ff1e27 commit 17824ba

6 files changed, +121 -24 lines changed


site-src/guides/metrics.md

Lines changed: 35 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -4,23 +4,32 @@ This guide describes the current state of exposed metrics and how to scrape them
44

55
## Requirements
66

7-
To have response metrics, ensure the body mode is set to `Buffered` or `Streamed` (this should be the default behavior for all implementations).
7+
=== "EPP"
88

9-
If you want to include usage metrics for vLLM model server streaming request, send the request with `include_usage`:
9+
To have response metrics, ensure the body mode is set to `Buffered` or `Streamed` (this should be the default behavior for all implementations).
10+
11+
To include usage metrics for streaming requests to a vLLM model server, send the request with `include_usage` set in `stream_options`:
12+
13+
```
14+
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
15+
"model": "food-review",
16+
"prompt": "whats your fav movie?",
17+
"max_tokens": 10,
18+
"temperature": 0,
19+
"stream": true,
20+
"stream_options": {"include_usage": "true"}
21+
}'
22+
```
23+
24+
=== "Dynamic LoRA Adapter Sidecar"
25+
26+
To have response metrics, ensure the vLLM model server is configured with the dynamic LoRA adapter as a sidecar container and a ConfigMap to configure which models to load/unload. See [this doc](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/tools/dynamic-lora-sidecar#example-configuration) for an example.
1027

11-
```
12-
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
13-
"model": "food-review",
14-
"prompt": "whats your fav movie?",
15-
"max_tokens": 10,
16-
"temperature": 0,
17-
"stream": true,
18-
"stream_options": {"include_usage": "true"}
19-
}'
20-
```
2128

2229
## Exposed metrics
2330

31+
### EPP
32+
2433
| **Metric name** | **Metric Type** | <div style="width:200px">**Description**</div> | <div style="width:250px">**Labels**</div> | **Status** |
2534
|:---------------------------------------------|:-----------------|:------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:------------|
2635
| inference_model_request_total | Counter | The counter of requests broken out for each model. | `model_name`=&lt;model-name&gt; <br> `target_model_name`=&lt;target-model-name&gt; | ALPHA |
@@ -38,10 +47,20 @@ curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
3847
| inference_pool_ready_pods | Gauge | The number of ready pods for an inference server pool. | `name`=&lt;inference-pool-name&gt; | ALPHA |
3948
| inference_extension_info | Gauge | The general information of the current build. | `commit`=&lt;hash-of-the-build&gt; <br> `build_ref`=&lt;ref-to-the-build&gt; | ALPHA |
4049

50+
### Dynamic LoRA Adapter Sidecar
51+
52+
| **Metric name** | **Metric Type** | <div style="width:200px">**Description**</div> | <div style="width:250px">**Labels**</div> | **Status** |
53+
|:---------------------------|:-----------------|:-------------------------------------------------|:------------------------------------------|:------------|
54+
| lora_syncer_adapter_status | Gauge | Status of LoRA adapters (1=loaded, 0=not_loaded) | `adapter_name`=&lt;adapter-id&gt; | ALPHA |
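As a rough illustration of how a client might consume this gauge, the sketch below parses scraped text in the Prometheus exposition format and reports which adapters are loaded. The function name `adapter_status`, the sample scrape output, and the adapter names are assumptions made up for this example; only the metric name, its `adapter_name` label, and the 1/0 meaning come from the table above.

```python
import re

# Matches one sample line such as:
#   lora_syncer_adapter_status{adapter_name="my-adapter"} 1.0
_SAMPLE = re.compile(
    r'^lora_syncer_adapter_status\{adapter_name="(?P<name>[^"]*)"\}\s+(?P<value>\S+)'
)


def adapter_status(metrics_text: str) -> dict[str, bool]:
    """Map each adapter name to True (loaded) or False (not loaded)."""
    status = {}
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line)
        if match:
            # The gauge is 1 for a loaded adapter and 0 otherwise.
            status[match.group("name")] = float(match.group("value")) == 1.0
    return status


# Hypothetical scrape output; the adapter names are invented for illustration.
SAMPLE_OUTPUT = """\
# HELP lora_syncer_adapter_status Status of LoRA adapters (1=loaded, 0=not_loaded)
# TYPE lora_syncer_adapter_status gauge
lora_syncer_adapter_status{adapter_name="food-review-1"} 1.0
lora_syncer_adapter_status{adapter_name="food-review-2"} 0.0
"""

print(adapter_status(SAMPLE_OUTPUT))
```

Comment lines (`# HELP`, `# TYPE`) and unrelated metrics are skipped because they do not match the pattern. A full parser would also need to handle escaped characters in label values, which this sketch ignores.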
4155

4256
## Scrape Metrics
4357

44-
Metrics endpoint is exposed at port 9090 by default. To scrape metrics, the client needs a ClusterRole with the following rule:
58+
The metrics endpoints are exposed on different ports by default:
59+
60+
- EPP exposes the metrics endpoint at port 9090
61+
- Dynamic LoRA adapter sidecar exposes the metrics endpoint at port 8080
62+
63+
To scrape metrics, the client needs a ClusterRole with the following rule:
4564
`nonResourceURLs: "/metrics", verbs: get`.
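The port selection and authorization header described above could be sketched as follows, using only the Python standard library. The helper `build_scrape_request`, the component keys, the plain `http` scheme, and the example host and token are assumptions for illustration. The bearer token is optional here because the guide describes the ClusterRole requirement without saying whether the sidecar endpoint enforces it.

```python
import urllib.request
from typing import Optional

# Default metrics ports as stated in this guide.
DEFAULT_PORTS = {"epp": 9090, "lora-sidecar": 8080}


def build_scrape_request(
    host: str, component: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a GET request for /metrics on the component's default port."""
    port = DEFAULT_PORTS[component]
    url = f"http://{host}:{port}/metrics"  # assumes plain HTTP
    headers = {}
    if token is not None:
        # Service account token for the metrics-reader ClusterRole.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Hypothetical host and token, for illustration only.
epp_request = build_scrape_request("10.0.0.5", "epp", token="example-token")
sidecar_request = build_scrape_request("10.0.0.5", "lora-sidecar")
print(epp_request.full_url)
print(sidecar_request.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the scrape; that call is left out so the sketch runs without a live cluster.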
4665

4766
Here is an example for a client that needs to mount the secret to act as the service account:
@@ -86,7 +105,9 @@ metadata:
86105
kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
87106
type: kubernetes.io/service-account-token
88107
```
89-
Then, you can curl the 9090 port like following
108+
109+
Then curl the appropriate port as follows. For EPP (port 9090):
110+
90111
```
91112
TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret -o jsonpath='{.data.token}' | base64 --decode)
92113

tools/dynamic-lora-sidecar/README.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -121,6 +121,9 @@ spec:
121121
- name: reconciler
122122
image: your-image:tag
123123
command: ["python", "sidecar.py", "--health-check-timeout", "600", "--health-check-interval", "5", "--reconcile-trigger", "10"] #optional if overriding default values
124+
ports:
125+
- containerPort: 8080
126+
name: metrics
124127
volumeMounts:
125128
- name: config-volume
126129
mountPath: /config

tools/dynamic-lora-sidecar/deployment.yaml

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,10 @@ spec:
6969
image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
7070
restartPolicy: Always
7171
imagePullPolicy: Always
72-
env:
72+
ports:
73+
- containerPort: 8080
74+
name: metrics
75+
env:
7376
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
7477
value: "/config/configmap.yaml"
7578
volumeMounts: # DO NOT USE subPath

tools/dynamic-lora-sidecar/requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
aiohttp==3.12.12
22
jsonschema==4.24.0
3+
prometheus_client==0.22.1
34
PyYAML==6.0.2
45
requests==2.32.4
56
watchfiles==1.0.5

tools/dynamic-lora-sidecar/sidecar/sidecar.py

Lines changed: 33 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,17 @@
2424
import datetime
2525
import os
2626
import sys
27+
from prometheus_client import Gauge, start_http_server
2728
from watchdog.observers.polling import PollingObserver as Observer
2829
from watchdog.events import FileSystemEventHandler
2930

31+
# Initialize Prometheus metrics
32+
ADAPTER_STATUS_METRICS = Gauge(
33+
'lora_syncer_adapter_status',
34+
'Status of LoRA adapters (1=loaded, 0=not_loaded)',
35+
['adapter_name']
36+
)
37+
3038
CONFIG_MAP_FILE = os.environ.get(
3139
"DYNAMIC_LORA_ROLLOUT_CONFIG", "/config/configmap.yaml"
3240
)
@@ -58,6 +66,8 @@ def parse_arguments():
5866
help=f'Path to config map file (default: {CONFIG_MAP_FILE})')
5967
parser.add_argument('--config-validation', action='store_true', default=True,
6068
help='Enable config validation (default: True)')
69+
parser.add_argument('--metrics-port', type=int, default=8080,
70+
help='Port to listen for Prometheus metrics (default: 8080)')
6171
return parser.parse_args()
6272

6373

@@ -226,7 +236,7 @@ def check_health() -> bool:
226236
time.sleep(self.health_check_interval.seconds)
227237
return False
228238

229-
def load_adapter(self, adapter: LoraAdapter):
239+
def load_adapter(self, adapter: LoraAdapter) -> None | str:
230240
"""Sends a request to load the specified model."""
231241
if adapter in self.registered_adapters:
232242
logging.info(
@@ -243,10 +253,12 @@ def load_adapter(self, adapter: LoraAdapter):
243253
response = requests.post(url, json=payload)
244254
response.raise_for_status()
245255
logging.info(f"loaded model {adapter.id}")
256+
return None
246257
except requests.exceptions.RequestException as e:
247258
logging.error(f"error loading model {adapter.id}: {e}")
259+
return f"error loading model {adapter.id}: {e}"
248260

249-
def unload_adapter(self, adapter: LoraAdapter):
261+
def unload_adapter(self, adapter: LoraAdapter) -> None | str:
250262
"""Sends a request to unload the specified model."""
251263
if adapter not in self.registered_adapters:
252264
logging.info(
@@ -284,28 +296,42 @@ def reconcile(self):
284296
adapters_to_load_id = ", ".join(str(a.id) for a in adapters_to_load)
285297
logging.info(f"adapter to load {adapters_to_load_id}")
286298
for adapter in adapters_to_load:
287-
self.load_adapter(adapter)
299+
err = self.load_adapter(adapter)
300+
if err is None:
301+
self.update_adapter_status_metrics(adapter.id, is_loaded=True)
288302
adapters_to_unload = self.ensure_not_exist_adapters - self.ensure_exist_adapters
289303
adapters_to_unload_id = ", ".join(str(a.id) for a in adapters_to_unload)
290304
logging.info(f"adapters to unload {adapters_to_unload_id}")
291305
for adapter in adapters_to_unload:
292-
self.unload_adapter(adapter)
306+
err = self.unload_adapter(adapter)
307+
if err is None:
308+
self.update_adapter_status_metrics(adapter.id, is_loaded=False)
309+
310+
def update_adapter_status_metrics(self, adapter_id: str, is_loaded: bool):
311+
"""Update adapter status metrics"""
312+
status = 1 if is_loaded else 0
313+
ADAPTER_STATUS_METRICS.labels(adapter_name=adapter_id).set(status)
314+
293315

294316

295317
async def main():
296318
args = parse_arguments()
297-
319+
298320
# Update CONFIG_MAP_FILE with argument value
299321
config_file = args.config
300-
322+
301323
reconciler_instance = LoraReconciler(
302324
config_file=config_file,
303325
health_check_timeout=args.health_check_timeout,
304326
health_check_interval=args.health_check_interval,
305327
reconcile_trigger_seconds=args.reconcile_trigger,
306328
config_validation=args.config_validation
307329
)
308-
330+
331+
# Start metrics server
332+
logging.info(f"Starting metrics server on port {args.metrics_port}")
333+
start_http_server(args.metrics_port)
334+
309335
logging.info(f"Running initial reconcile for config map {config_file}")
310336
reconciler_instance.reconcile()
311337

tools/dynamic-lora-sidecar/sidecar/test_sidecar.py

Lines changed: 45 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@
1717
import yaml
1818
import os
1919
import datetime
20-
from sidecar import LoraReconciler, LoraAdapter, CONFIG_MAP_FILE, BASE_FIELD
20+
from sidecar import LoraReconciler, LoraAdapter, CONFIG_MAP_FILE, BASE_FIELD, ADAPTER_STATUS_METRICS
2121

2222
# Update TEST_CONFIG_DATA to include the new configuration parameters
2323
TEST_CONFIG_DATA = {
@@ -227,12 +227,55 @@ def test_health_check_settings(self):
227227
reconcile_trigger_seconds=45,
228228
config_validation=False
229229
)
230-
230+
231231
# Check that values are properly set
232232
self.assertEqual(reconciler.health_check_timeout, datetime.timedelta(seconds=240))
233233
self.assertEqual(reconciler.health_check_interval, datetime.timedelta(seconds=15))
234234
self.assertEqual(reconciler.reconcile_trigger_seconds, 45)
235235

236+
def test_update_adapter_status_metrics(self):
237+
"""Test that update_adapter_status_metrics method works correctly"""
238+
# Clear any existing metrics
239+
ADAPTER_STATUS_METRICS.clear()
240+
241+
# Create reconciler
242+
reconciler = LoraReconciler(
243+
config_file=CONFIG_MAP_FILE,
244+
health_check_timeout=180,
245+
health_check_interval=10,
246+
reconcile_trigger_seconds=30,
247+
config_validation=False
248+
)
249+
250+
# Test setting loaded status
251+
reconciler.update_adapter_status_metrics("test-adapter-1", is_loaded=True)
252+
reconciler.update_adapter_status_metrics("test-adapter-2", is_loaded=False)
253+
254+
# Get all metric samples
255+
metric_samples = list(ADAPTER_STATUS_METRICS.collect())[0].samples
256+
257+
# Check that metrics were set correctly
258+
adapter_metrics = {}
259+
for sample in metric_samples:
260+
adapter_name = sample.labels['adapter_name']
261+
adapter_metrics[adapter_name] = sample.value
262+
263+
self.assertEqual(adapter_metrics.get('test-adapter-1'), 1.0, "test-adapter-1 should be marked as loaded")
264+
self.assertEqual(adapter_metrics.get('test-adapter-2'), 0.0, "test-adapter-2 should be marked as not loaded")
265+
266+
def test_metrics_endpoint(self):
267+
"""Test that Prometheus metrics can be collected"""
268+
from prometheus_client import generate_latest
269+
270+
# Clear metrics and set a test value
271+
ADAPTER_STATUS_METRICS.clear()
272+
ADAPTER_STATUS_METRICS.labels(adapter_name='test-adapter').set(1)
273+
274+
# Test that generate_latest produces valid output
275+
metrics_bytes = generate_latest()
276+
metrics = metrics_bytes.decode('utf-8')
277+
self.assertIn('lora_syncer_adapter_status{adapter_name="test-adapter"} 1.0', metrics)
278+
236279

237280
if __name__ == "__main__":
238281
unittest.main()
