
Conversation

@ilopezluna (Contributor) commented on Jun 12, 2025

This PR uses the llama.cpp metrics endpoint to collect and aggregate the metrics of all active runners.

No active runners:

curl http://localhost:13434/metrics
# No active runners

An active runner in completion mode:

curl http://localhost:13434/metrics                                
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 285.714

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.047

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1000

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 47

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.007

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

An active runner in embedding mode:

curl http://localhost:13434/metrics
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0
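
Taken together, the examples above boil down to: answer with "# No active runners" when nothing is scheduled, otherwise fetch each active runner's llama.cpp /metrics endpoint and emit its samples tagged with backend, mode, and model. Here is a minimal sketch of that flow; the Runner type, MetricsHandler, and activeRunners callback are hypothetical stand-ins rather than the PR's actual scheduler code, and the label rewriting is elided here (see the expfmt discussion in the review below).

```go
package metrics

import (
	"fmt"
	"io"
	"net/http"
)

// Runner is a hypothetical view of an active backend runner; the real
// scheduler tracks this state differently.
type Runner struct {
	Backend    string // e.g. "llama.cpp"
	Mode       string // "completion" or "embedding"
	Model      string // e.g. "ai/llama3.2"
	MetricsURL string // the runner's llama.cpp /metrics endpoint
}

// MetricsHandler serves the aggregated /metrics output of all active runners.
func MetricsHandler(activeRunners func() []Runner) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		runners := activeRunners()
		if len(runners) == 0 {
			fmt.Fprintln(w, "# No active runners")
			return
		}
		for _, runner := range runners {
			resp, err := http.Get(runner.MetricsURL)
			if err != nil {
				continue // skip runners that cannot be reached
			}
			// The PR additionally parses this text and attaches the
			// backend/mode/model labels before writing it out (see the
			// expfmt sketch later in the thread).
			io.Copy(w, resp.Body)
			resp.Body.Close()
		}
	}
}
```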

@ilopezluna changed the title from "[WIP] Adds metrics endpoint" to "Adds metrics endpoint" on Jun 12, 2025
@ilopezluna requested a review from a team on June 12, 2025 12:58
@ilopezluna marked this pull request as ready for review on June 12, 2025 12:58
}

// NewPrometheusParser creates a new Prometheus metrics parser
func NewPrometheusParser() *PrometheusParser {
Contributor Author:

I get a 500 when I visit the link, but I assume it's temporary. I'll take a look tomorrow, thanks!

Contributor:

Yeah, looks like pkg.go.dev was offline for a bit; it seems back up now. I'd also advocate for less code we need to manage and test.

Contributor Author:

I've changed it to use https://pkg.go.dev/github.com/prometheus/common/expfmt so I can use families, err := parser.TextToMetricFamilies(strings.NewReader(string(body))), and then iterate over the metrics in each family and add our labels.
Let me know what you think 🙏
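
A minimal sketch of that parse-and-relabel step, assuming a hypothetical helper name and signature (this is not the PR's actual code):

```go
package metrics

import (
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// addRunnerLabels parses a runner's /metrics text and attaches the given
// labels (backend, mode, model) to every sample in every metric family.
func addRunnerLabels(body []byte, labels map[string]string) (map[string]*dto.MetricFamily, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(string(body)))
	if err != nil {
		return nil, err
	}
	for _, family := range families {
		for _, metric := range family.Metric {
			for name, value := range labels {
				name, value := name, value // copy: LabelPair stores pointers
				metric.Label = append(metric.Label, &dto.LabelPair{Name: &name, Value: &value})
			}
		}
	}
	return families, nil
}
```

Each runner's output would then be parsed and labeled once, e.g. addRunnerLabels(body, map[string]string{"backend": "llama.cpp", "mode": "completion", "model": "ai/llama3.2"}).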


- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output
# Conflicts:
#	go.mod
#	pkg/inference/scheduling/scheduler.go
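
For completeness, a minimal sketch of the output side named in the commit message above (expfmt.NewEncoder); writeFamilies is a hypothetical helper, not the code in this PR:

```go
package metrics

import (
	"io"
	"sort"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// writeFamilies encodes parsed metric families back into the Prometheus text
// exposition format; families are sorted so the aggregated output stays
// deterministic.
func writeFamilies(w io.Writer, families map[string]*dto.MetricFamily) error {
	names := make([]string, 0, len(families))
	for name := range families {
		names = append(names, name)
	}
	sort.Strings(names)

	// expfmt.FmtText is the classic text format; newer releases of
	// prometheus/common also expose expfmt.NewFormat(expfmt.TypeTextPlain).
	encoder := expfmt.NewEncoder(w, expfmt.FmtText)
	for _, name := range names {
		if err := encoder.Encode(families[name]); err != nil {
			return err
		}
	}
	return nil
}
```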
@xenoscopic (Contributor) left a review:

LGTM, just a few minor suggestions.

@ilopezluna merged commit 9933b7d into main on Jun 16, 2025
4 checks passed
@ilopezluna deleted the add-metrics branch on June 16, 2025 08:18
ericcurtin referenced this pull request in ericcurtin/model-runner Sep 21, 2025
doringeman added a commit to doringeman/model-runner that referenced this pull request Sep 23, 2025
doringeman added a commit to doringeman/model-runner that referenced this pull request Sep 24, 2025
doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025
docs: update link to avoid redirect