
Conversation

@ilopezluna (Contributor) commented on Jun 12, 2025

This PR uses the llama.cpp metrics endpoint to collect and aggregate the metrics of all active runners.

No active runners:

curl http://localhost:13434/metrics
# No active runners

An active runner in completion mode:

curl http://localhost:13434/metrics                                
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 285.714

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.047

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 1000

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 47

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 0.007

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="completion",model="ai/llama3.2"} 2

An active runner in embedding mode:

curl http://localhost:13434/metrics
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode counter
llamacpp:n_busy_slots_per_decode{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 1

# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0

# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{backend="llama.cpp",mode="embedding",model="ai/mxbai-embed-large"} 0
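
Taken together, the examples above boil down to: answer with "# No active runners" when nothing is scheduled, otherwise fetch each active runner's llama.cpp /metrics endpoint and emit its samples tagged with backend, mode, and model. Here is a minimal sketch of that flow; the Runner type, MetricsHandler, and activeRunners callback are hypothetical stand-ins rather than the PR's actual scheduler code, and the label rewriting is elided here (see the expfmt discussion in the review below).

```go
package metrics

import (
	"fmt"
	"io"
	"net/http"
)

// Runner is a hypothetical view of an active backend runner; the real
// scheduler tracks this state differently.
type Runner struct {
	Backend    string // e.g. "llama.cpp"
	Mode       string // "completion" or "embedding"
	Model      string // e.g. "ai/llama3.2"
	MetricsURL string // the runner's llama.cpp /metrics endpoint
}

// MetricsHandler serves the aggregated /metrics output of all active runners.
func MetricsHandler(activeRunners func() []Runner) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		runners := activeRunners()
		if len(runners) == 0 {
			fmt.Fprintln(w, "# No active runners")
			return
		}
		for _, runner := range runners {
			resp, err := http.Get(runner.MetricsURL)
			if err != nil {
				continue // skip runners that cannot be reached
			}
			// The PR additionally parses this text and attaches the
			// backend/mode/model labels before writing it out (see the
			// expfmt sketch later in the thread).
			io.Copy(w, resp.Body)
			resp.Body.Close()
		}
	}
}
```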

@ilopezluna changed the title from "[WIP] Adds metrics endpoint" to "Adds metrics endpoint" on Jun 12, 2025
@ilopezluna requested a review from a team on June 12, 2025 12:58
@ilopezluna marked this pull request as ready for review on June 12, 2025 12:58
}

// NewPrometheusParser creates a new Prometheus metrics parser
func NewPrometheusParser() *PrometheusParser {
Contributor Author:

I get a 500 when I visit the link, but I assume it's temporary. I'll take a look tomorrow, thanks!

Contributor:

Yeah, looks like pkg.go.dev was offline for a bit; it seems back up now. I'd also advocate for less code we need to manage and test.

Contributor Author:

I've changed it to use https://pkg.go.dev/github.com/prometheus/common/expfmt so I can use families, err := parser.TextToMetricFamilies(strings.NewReader(string(body))), and then iterate over the metrics in each family and add our labels.
Let me know what you think 🙏
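
A minimal sketch of that parse-and-relabel step, assuming a hypothetical helper name and signature (this is not the PR's actual code):

```go
package metrics

import (
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// addRunnerLabels parses a runner's /metrics text and attaches the given
// labels (backend, mode, model) to every sample in every metric family.
func addRunnerLabels(body []byte, labels map[string]string) (map[string]*dto.MetricFamily, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(string(body)))
	if err != nil {
		return nil, err
	}
	for _, family := range families {
		for _, metric := range family.Metric {
			for name, value := range labels {
				name, value := name, value // copy: LabelPair stores pointers
				metric.Label = append(metric.Label, &dto.LabelPair{Name: &name, Value: &value})
			}
		}
	}
	return families, nil
}
```

Each runner's output would then be parsed and labeled once, e.g. addRunnerLabels(body, map[string]string{"backend": "llama.cpp", "mode": "completion", "model": "ai/llama3.2"}).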


- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output
# Conflicts:
#	go.mod
#	pkg/inference/scheduling/scheduler.go
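
For completeness, a minimal sketch of the output side named in the commit message above (expfmt.NewEncoder); writeFamilies is a hypothetical helper, not the code in this PR:

```go
package metrics

import (
	"io"
	"sort"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// writeFamilies encodes parsed metric families back into the Prometheus text
// exposition format; families are sorted so the aggregated output stays
// deterministic.
func writeFamilies(w io.Writer, families map[string]*dto.MetricFamily) error {
	names := make([]string, 0, len(families))
	for name := range families {
		names = append(names, name)
	}
	sort.Strings(names)

	// expfmt.FmtText is the classic text format; newer releases of
	// prometheus/common also expose expfmt.NewFormat(expfmt.TypeTextPlain).
	encoder := expfmt.NewEncoder(w, expfmt.FmtText)
	for _, name := range names {
		if err := encoder.Encode(families[name]); err != nil {
			return err
		}
	}
	return nil
}
```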
@xenoscopic (Contributor) left a review:

LGTM, just a few minor suggestions.

@ilopezluna merged commit 9933b7d into main on Jun 16, 2025
4 checks passed
@ilopezluna deleted the add-metrics branch on June 16, 2025 08:18
ericcurtin referenced this pull request in ericcurtin/model-runner Sep 21, 2025
doringeman added a commit to doringeman/model-runner that referenced this pull request Sep 23, 2025
doringeman added a commit to doringeman/model-runner that referenced this pull request Sep 24, 2025
doringeman pushed a commit to doringeman/model-runner that referenced this pull request Oct 2, 2025
docs: update link to avoid redirect