Implement periodic plugin metrics reporting for the Launcher Plugin API
v3.7.0. Plugins can now report uptime and cluster interaction latency
histograms, which the Launcher exposes on its Prometheus /metrics endpoint.
- Add MetricsPlugin optional interface and PluginMetrics type
- Add thread-safe Histogram with swap-on-drain pattern for lock-free reads
- Add metricsLoop with non-blocking sends, panic recovery, and context
cancellation
- Add MetricsResponse (message type 203) to the wire protocol
- Add RunMetrics conformance scenario for plugin validation
- Add --plugin-metrics-interval-seconds flag to DefaultOptions
- Bump API version from 3.6.0 to 3.7.0
- Update docs (GUIDE, API, ARCHITECTURE, CHANGELOG) and examples
Unlike all other protocol messages, the plugin initiates the metrics response. The plugin sends it periodically on a timer (controlled by `--plugin-metrics-interval-seconds`) without any corresponding request from the Launcher. Both `requestId` and `responseId` are zero.

```json
{
  "messageType": 203,
  "requestId": 0,
  "responseId": 0,
  "uptimeSeconds": 3600,
  "clusterInteractionLatencySample": {
    "buckets": [0, 2, 3, 0, 0, 0, 0, 0, 0, 0],
    "sum": 1.52
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `uptimeSeconds` | uint64 | Seconds since the plugin started. Always present. |
| `clusterInteractionLatencySample.sum` | float64 | Sum of all observed values. |

### Stream responses
- `w` - ResponseWriter to send cluster info
- `user` - Username requesting info

### Type: MetricsPlugin (optional interface)

```go
type MetricsPlugin interface {
	Plugin

	Metrics(ctx context.Context) PluginMetrics
}
```

Plugins that want to report custom metrics to the Launcher implement this interface. The `Metrics` method is called periodically (controlled by `--plugin-metrics-interval-seconds`). All plugins automatically report `uptimeSeconds`; implement this interface only for additional plugin-specific metrics like cluster interaction latency.
Implementations should return quickly and avoid blocking I/O.
### Type: Histogram

A thread-safe histogram that accumulates observations locally and can be drained into a portable snapshot for sending to the Launcher. Use `NewHistogram(ClusterInteractionLatencyBuckets)` to create one with the correct bucket boundaries.
- `Observe` records a single observation (e.g., a latency measurement in seconds). Safe for concurrent use.
- `Drain` collects all accumulated observations since the last drain, resets the histogram, and returns a portable snapshot. Returns nil if no observations have been recorded.
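
A minimal mutex-based sketch of these semantics (the SDK's real implementation differs — it uses swap-on-drain — and the names here are assumptions):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Snapshot is a portable, drained view of the histogram.
type Snapshot struct {
	Buckets []uint64 // per-bucket counts, one per upper bound
	Sum     float64
}

// Histogram sketches the documented Observe/Drain semantics with a mutex.
type Histogram struct {
	mu      sync.Mutex
	bounds  []float64 // ascending upper bounds, e.g. ClusterInteractionLatencyBuckets
	buckets []uint64
	sum     float64
	count   uint64
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{bounds: bounds, buckets: make([]uint64, len(bounds))}
}

// Observe records one value; safe for concurrent use. Values above the
// largest bound are dropped in this sketch (a production histogram would
// keep an overflow bucket).
func (h *Histogram) Observe(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	i := sort.SearchFloat64s(h.bounds, v) // index of first bound >= v
	if i < len(h.buckets) {
		h.buckets[i]++
	}
	h.sum += v
	h.count++
}

// Drain returns the accumulated snapshot and resets, or nil if empty.
func (h *Histogram) Drain() *Snapshot {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.count == 0 {
		return nil
	}
	s := &Snapshot{Buckets: h.buckets, Sum: h.sum}
	h.buckets = make([]uint64, len(h.bounds))
	h.sum, h.count = 0, 0
	return s
}

func main() {
	h := NewHistogram([]float64{0.1, 0.5, 1, 5})
	h.Observe(0.3)
	h.Observe(0.4)
	fmt.Println(h.Drain().Buckets) // both observations land in the 0.5 bucket
	fmt.Println(h.Drain())         // nothing new since last drain
}
```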
The histogram bucket upper bounds (in seconds) for cluster interaction latency. These must match the Launcher's bucket boundaries so histogram data can be replayed correctly.
### Type: ResponseWriter

### Type: Runtime

```go
type Runtime struct {
	MaxMessageSize  int
	MetricsInterval time.Duration
}
```

The Runtime handles the request/response protocol and dispatches to plugin methods.

| Field | Type | Description |
|-------|------|-------------|
| `MaxMessageSize` | `int` | Upper limit on message size for requests and responses. |
| `MetricsInterval` | `time.Duration` | Interval between periodic metrics reports. Zero disables. Typically set from `DefaultOptions.MetricsInterval`. |

**docs/ARCHITECTURE.md** (22 additions, 0 deletions)

- Better than string parsing
- Follows Launcher API specification

### Metrics collection
The SDK supports periodic metrics reporting to the Launcher (API v3.7.0+). Unlike all other protocol messages, the plugin initiates metrics — it sends `MetricsResponse` messages on a timer without any corresponding request.
```
Bootstrap completes → Start metrics goroutine → Every N seconds:
  drain histogram → send MetricsResponse (203) → Launcher replays into its registry
```
The plugin uses a local prometheus histogram as a cache, accumulating observations on the hot path (e.g., timing each Slurm command). On each metrics tick, the framework drains the histogram (collecting and resetting it) and sends the snapshot to the Launcher, which replays the data into its own Prometheus registry.
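
On the Launcher side, replaying a snapshot means converting the per-bucket counts into the cumulative form Prometheus expects; here is a sketch of that conversion (with client_golang, the result could feed `prometheus.MustNewConstHistogram` from a custom collector — the helper name below is hypothetical):

```go
package main

import "fmt"

// toCumulative converts per-bucket counts plus their upper bounds into the
// cumulative count/bucket form expected by, e.g.,
// prometheus.MustNewConstHistogram(desc, count, sum, buckets).
func toCumulative(bounds []float64, perBucket []uint64) (uint64, map[float64]uint64) {
	buckets := make(map[float64]uint64, len(bounds))
	var cum uint64
	for i, ub := range bounds {
		cum += perBucket[i] // cumulative: each bound covers everything below it
		buckets[ub] = cum
	}
	return cum, buckets // total observation count, cumulative buckets
}

func main() {
	bounds := []float64{0.1, 0.5, 1, 5} // must match the plugin's bounds
	count, buckets := toCumulative(bounds, []uint64{0, 2, 3, 0})
	fmt.Println(count, buckets[0.5], buckets[5])
}
```

This is why the bucket boundaries must match on both sides: the replay keys cumulative counts by upper bound.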
**Why push-based?** The Launcher-plugin IPC channel has no QoS. Requesting metrics on-demand could delay time-sensitive messages (job status updates, control operations). Push-based metrics use the existing response channel and are inherently non-blocking from the Launcher's perspective.
**Why swap-on-drain?** The Go prometheus client does not expose a `Reset()` method on individual histograms. The SDK works around this by atomically swapping the current histogram for a fresh one on each drain, then collecting from the old instance.
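
The swap can be modeled with `sync/atomic` (a simplified sketch, not the SDK's code: it swaps a plain counter struct rather than a prometheus histogram, and an observation racing the swap may land in either window):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// window accumulates observations between drains.
type window struct {
	count atomic.Uint64
	sum   atomic.Uint64 // microseconds, to stay integral in this sketch
}

// swapHistogram models swap-on-drain: writers Observe into the current
// window; Drain atomically installs a fresh window, then collects from
// the old one without ever blocking writers.
type swapHistogram struct {
	cur atomic.Pointer[window]
}

func newSwapHistogram() *swapHistogram {
	h := &swapHistogram{}
	h.cur.Store(&window{})
	return h
}

func (h *swapHistogram) ObserveMicros(us uint64) {
	w := h.cur.Load()
	w.count.Add(1)
	w.sum.Add(us)
}

func (h *swapHistogram) Drain() (count, sumMicros uint64) {
	old := h.cur.Swap(&window{}) // old window is now (mostly) quiescent
	return old.count.Load(), old.sum.Load()
}

func main() {
	h := newSwapHistogram()
	h.ObserveMicros(300_000) // 300 ms
	h.ObserveMicros(450_000) // 450 ms
	c, s := h.Drain()
	fmt.Println(c, s)
	c, s = h.Drain() // fresh window: nothing accumulated yet
	fmt.Println(c, s)
}
```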
The Launcher collects periodic metrics from plugins and exposes them on its Prometheus `/metrics` endpoint. All plugins automatically report `uptimeSeconds`. Plugins that interact with external schedulers can report additional metrics by implementing the `MetricsPlugin` interface.
The Launcher passes `--plugin-metrics-interval-seconds <N>` at startup (default: 60, 0 to disable). The SDK handles the timer and IPC automatically.
#### Reporting cluster interaction latency
If your plugin runs CLI commands or makes API calls to a scheduler, you can measure their latency and report it as a histogram. A cluster interaction is any individual call to the external scheduler — a CLI command invocation, an HTTP/gRPC API request, or an SDK method call. Measure the wall-clock duration of the external call itself, from invocation to response.
**What to measure:**
- Time every external scheduler call: job submission, control operations (stop/kill/cancel), status queries, output retrieval, resource usage queries, etc.
- For batch operations (e.g., a single `squeue` call that returns status for many jobs), record one observation for the entire call, not one per job.
- Measure only the external call duration. Don't include internal cache lookups, response serialization, or in-process logic.
The `Histogram` type is thread-safe. Call `Observe` from any goroutine. The framework calls `Drain` on each metrics tick, which collects all accumulated observations and resets the histogram.
#### Wiring metrics into the runtime
Pass the metrics interval from `DefaultOptions` to the `Runtime`:
```go
opts := &launcher.DefaultOptions{}
launcher.MustLoadOptions(opts, "myplugin")

rt := launcher.NewRuntime(lgr, plugin)
rt.MetricsInterval = opts.MetricsInterval
```

#### How it works
The plugin accumulates metrics locally (using a prometheus histogram as a cache). On each metrics tick, the framework drains the accumulated data and sends a `MetricsResponse` (message type 203) to the Launcher over the IPC channel. The Launcher replays the histogram data into its own Prometheus registry, which API clients can then query.
This design avoids adding request/response overhead to the Launcher-plugin connection for metrics collection. The team rejected the on-demand alternative because there is no QoS on the IPC channel and metrics requests could delay more important messages.
### User profiles
System administrators may want to set default or maximum values for certain features on a per-user or per-group basis. For example, different groups of users could have different memory limits or CPU counts.