-
Notifications
You must be signed in to change notification settings - Fork 27
add support for exposing prometheus metrics for function runs #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for exposing prometheus metrics for function runs #229
Conversation
5376b82 to
f58d513
Compare
ddeedaf to
9325768
Compare
negz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! The implementation looks good to me. I'd like to slim down the tests a bit though.
Once you're done, please squash this down to a single commit. Please also update the PR description to reflect the latest state of the code.
WalkthroughAdds Prometheus metrics support to the gRPC server: new Prometheus-related dependencies, new ServeOptions fields and public options for metrics configuration, wiring of a Prometheus unary interceptor and registry, a separate HTTP /metrics endpoint, and tests verifying default/custom registry and port behavior. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client as gRPC Client
participant Server as gRPC Server
participant Interceptor as Prometheus Interceptor
participant Handler as Service Handler
participant Registry as Prometheus Registry
participant MetricsHTTP as HTTP /metrics
rect rgb(230,245,255)
note right of Server: Startup
Server->>Registry: register gRPC metrics
Server->>MetricsHTTP: start HTTP server on MetricsAddress
end
Client->>Server: Unary RPC
Server->>Interceptor: invoke chained unary interceptors
Interceptor->>Registry: record start/labels
Interceptor->>Handler: forward request
Handler-->>Interceptor: response
Interceptor->>Registry: record result/latency
Interceptor-->>Server: return response
rect rgb(240,255,240)
note right of MetricsHTTP: Scrape
participant Scraper as Prometheus
Scraper->>MetricsHTTP: GET /metrics
MetricsHTTP->>Registry: gather metrics
Registry-->>MetricsHTTP: exposition text
MetricsHTTP-->>Scraper: 200 OK + metrics
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (5)
sdk_test.go (2)
280-359: Test flakes: binding to :8080 by default risks port conflicts; avoid sleeps; fix typo
- Binding metrics to 8080 can be in use on CI or dev machines; this test will be flaky. Either:
- Probe 8080 and skip if occupied, or
- Override the default via WithMetricsServer on a free ephemeral port while still asserting the default behavior elsewhere (doc/units).
- Replace fixed sleeps with polling the endpoint for readiness.
- Typo: “DefaultRegisrty” → “DefaultRegistry”.
Example minimal change to skip if 8080 is busy:
@@ - // Should use default metrics port 8080 - metricsPort := 8080 + metricsPort := 8080 + l, err := net.Listen("tcp", fmt.Sprintf(":%d", metricsPort)) + if err != nil { + t.Skipf("port %d in use; skipping default-port metrics test: %v", metricsPort, err) + } + l.Close() @@ - // Wait for server to start - time.Sleep(3 * time.Second) + // Wait for server/metrics readiness with polling + deadline := time.Now().Add(10 * time.Second) + for { + if time.Now().After(deadline) { + t.Fatalf("server not ready in time") + } + resp, err := http.Get(fmt.Sprintf("http://localhost:%d/metrics", metricsPort)) + if err == nil { + resp.Body.Close() + break + } + time.Sleep(100 * time.Millisecond) + } @@ - t.Run("MetricsServerTest On DefaultPort With DefaultRegisrty", func(t *testing.T) { + t.Run("MetricsServerTest On DefaultPort With DefaultRegistry", func(t *testing.T) {Also consider offering a shutdown hook from Serve so tests can cleanly stop servers instead of leaking goroutines.
199-278: Solid E2E for custom registry/port; consider readiness polling and cleanupLooks good for custom registry and ephemeral port. To reduce flakes, prefer polling the /metrics endpoint over fixed sleeps, and consider adding a way to stop the server after the test.
sdk.go (3)
51-52: Defaulting to :8080 enables metrics by default; ensure docs/PR text match and security posture is acceptableMetrics are enabled by default on 0.0.0.0:8080. This contradicts the PR summary saying metrics are opt-in. Please:
- Update docs/PR summary to reflect “enabled by default; WithMetricsServer("") disables.”
- Confirm exposing an unauthenticated HTTP endpoint in all Function processes is acceptable for your environments.
179-186: Avoid brittle type assertion on DefaultRegistererprometheus.DefaultRegisterer is an interface and may not always be a *prometheus.Registry (it can be replaced). The forced type assertion risks a panic. Prefer a safer default:
- Use prometheus.NewRegistry() explicitly, or
- Store interfaces (Registerer/Gatherer) instead of a concrete *Registry, or
- Fall back to a new registry if the assertion fails.
Example safe fallback:
- //nolint:forcetypeassert // prometheus.DefaultRegisterer is always *prometheus.Registry - so := &ServeOptions{ + so := &ServeOptions{ @@ - MetricsRegistry: prometheus.DefaultRegisterer.(*prometheus.Registry), // Use default registry + MetricsRegistry: func() *prometheus.Registry { + if r, ok := prometheus.DefaultRegisterer.(*prometheus.Registry); ok { + return r + } + return prometheus.NewRegistry() + }(),
234-253: Don’t silently swallow metrics server errorsListenAndServe errors (e.g., port in use) are ignored, leaving users unaware metrics aren’t exposed. Log non-closed errors or surface them via a channel.
Minimal improvement:
- go func() { - _ = metricsServer.ListenAndServe() // Ignore errors - }() + go func() { + if err := metricsServer.ListenAndServe(); err != nil && err != http.ErrServerClosed { + // TODO: plumb a logger; for now consider stderr or a hook + // log.Printf("metrics server error: %v", err) + } + }()Also consider graceful shutdown semantics for the metrics server when the gRPC server stops.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (3)
go.mod(4 hunks)sdk.go(6 hunks)sdk_test.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
sdk.go (2)
proto/v1/run_function_grpc.pb.go (1)
FunctionRunnerServiceServer(80-84)proto/v1beta1/run_function_grpc.pb.go (1)
FunctionRunnerServiceServer(82-86)
sdk_test.go (2)
sdk.go (5)
Serve(178-256)Listen(73-79)Insecure(126-133)WithMetricsServer(160-165)WithMetricsRegistry(169-174)proto/v1/run_function_grpc.pb.go (1)
NewFunctionRunnerServiceClient(61-63)
🪛 ast-grep (0.39.6)
sdk_test.go
[warning] 364-364: "Detected a network listener listening on 0.0.0.0 or an empty string.
This could unexpectedly expose the server publicly as it binds to all
available interfaces. Instead, specify another IP address that is not
0.0.0.0 nor the empty string."
Context: net.Listen("tcp", ":0")
Note: [CWE-200] Exposure of Sensitive Information to an Unauthorized Actor [REFERENCES]
- https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-bind-to-all-interfaces-go)
🔇 Additional comments (4)
go.mod (3)
12-15: Deps align with new metrics feature; verify version choices and exposure intentThe added modules (grpc-ecosystem/providers/prometheus v1.1.0 and client_golang v1.19.1) are appropriate for gRPC metrics + promhttp. Please confirm:
- Version constraints are intentional (no known CVEs or breaking changes for your supported Go/toolchain).
- Public exposure of an HTTP metrics endpoint is acceptable for all consumers, given metrics are now enabled by default in sdk.go.
40-45: Confirm transitive metrics deps are expectedbeorn7/perks and cespare/xxhash/v2 are standard transitive deps from Prometheus. Nothing to change; just confirm they’re acceptable under your license/compliance policy.
117-119: Pinning prometheus common/procfs/client_model is fine; keep them in syncThese versions look consistent with client_golang v1.19.1. If you upgrade client_golang later, update these as needed to avoid mismatches.
sdk_test.go (1)
361-371: False-positive analyzer warning; helper is fineUsing net.Listen(":0") in tests to find a free port is standard and safe here. No change needed.
5e958fa to
d59be6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (3)
sdk_test.go (1)
269-276: Tests verify library behavior rather than SDK functionality.These tests check for Prometheus format markers (
"# HELP") and specific gRPC middleware metrics ("grpc_server_started_total"), which validates that the Prometheus and gRPC libraries work correctly, not the SDK's integration logic. Per the maintainer's earlier feedback, tests should focus on SDK code and trust that upstream libraries function correctly.Consider simplifying to verify only:
- That the metrics endpoint responds with HTTP 200
- That metrics collection is enabled/disabled based on configuration
- Optionally, that custom registries are used when provided
Avoid asserting specific metric names or Prometheus output format.
Also applies to: 350-357
sdk.go (2)
209-225: Critical bug: Custom interceptors are ignored when metrics are disabled.The interceptor chain is only built inside the
if so.MetricsAddress != ""block (line 214). This means custom interceptors provided viaso.UnaryInterceptorsare only appended when metrics are enabled (line 220). If a user disables metrics by passingWithMetricsServer(""), their custom interceptors won't be applied to the gRPC server.Apply this fix to ensure custom interceptors are always applied:
// Build interceptors based on options var interceptors []grpc.UnaryServerInterceptor var metrics *grpcprometheus.ServerMetrics - // Add metrics interceptor if metrics address is provided if so.MetricsAddress != "" { - // Use Prometheus metrics metrics = grpcprometheus.NewServerMetrics() - - // Apply metrics interceptor and custom interceptors interceptors = append(interceptors, metrics.UnaryServerInterceptor()) - interceptors = append(interceptors, so.UnaryInterceptors...) - serverOpts = append(serverOpts, grpc.ChainUnaryInterceptor(interceptors...)) - // Register the metrics with the registry so.MetricsRegistry.MustRegister(metrics) } + // Always apply custom interceptors + interceptors = append(interceptors, so.UnaryInterceptors...) + if len(interceptors) > 0 { + serverOpts = append(serverOpts, grpc.ChainUnaryInterceptor(interceptors...)) + } + srv := grpc.NewServer(serverOpts...)
167-174: Misleading documentation: WithMetricsRegistry doesn't auto-enable metrics.The comment on line 168 states "metrics collection is automatically enabled" when providing a custom registry. However, metrics are only enabled when
MetricsAddress != ""(checked on line 214). A user callingWithMetricsRegistry(customRegistry)without setting a metrics address will not collect metrics.Update the comment to clarify:
// WithMetricsRegistry configures a custom Prometheus registry for metrics. -// If a custom registry is provided, metrics collection is automatically enabled. +// The provided registry will be used when metrics are enabled via a non-empty MetricsAddress. +// Metrics collection is enabled only when MetricsAddress is non-empty.
🧹 Nitpick comments (4)
sdk_test.go (2)
214-222: Add error handling for server startup failures.The tests start servers in goroutines but don't verify they started successfully before proceeding. If
Servefails immediately (e.g., credentials issue, port conflict), the test would sleep, then fail when attempting to connect, without indicating the root cause.Use a channel to signal server readiness or errors:
serverReady := make(chan struct{}) serverDone := make(chan error, 1) go func() { // Signal readiness once listener is established close(serverReady) err := Serve(mockServer, ...) serverDone <- err }() select { case <-serverReady: // Proceed with test case err := <-serverDone: t.Fatalf("Server failed to start: %v", err) case <-time.After(5 * time.Second): t.Fatal("Server did not become ready") }Also applies to: 295-301
200-278: Add cleanup to stop servers after tests complete.Both test functions start servers in goroutines but don't stop them when the test finishes. This can leave goroutines and open ports lingering, potentially causing resource leaks or interference with subsequent tests.
Use
t.Cleanup()or context cancellation to ensure servers are gracefully shut down:ctx, cancel := context.WithCancel(context.Background()) t.Cleanup(cancel) go func() { // Pass ctx to Serve if supported, or implement shutdown logic _ = Serve(mockServer, ...) }()Alternatively, track the gRPC server instance and call
GracefulStop()int.Cleanup().Also applies to: 281-359
sdk.go (2)
237-239: Redundant nil check.The nil check
if metrics != nilon line 237 is redundant. At this point, we're inside theif so.MetricsAddress != ""block (line 235), wheremetricswas unconditionally assigned on line 216. The variable cannot be nil here.Remove the redundant check:
// Initialize metrics for the gRPC server - if metrics != nil { - metrics.InitializeMetrics(srv) - } + metrics.InitializeMetrics(srv)
157-165: Consider documenting the behavior when address is empty.The
WithMetricsServerfunction allows passing an empty string to disable metrics (per maintainer comments and the default-enabled behavior). However, this isn't documented in the function comment.Clarify the behavior:
// WithMetricsServer configures the metrics server address and starts an HTTP server // to expose Prometheus metrics on /metrics endpoint. If address is non-empty, // metrics collection is automatically enabled. +// Pass an empty string to disable metrics collection and the metrics server.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (3)
go.mod(4 hunks)sdk.go(6 hunks)sdk_test.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sdk_test.go (3)
sdk.go (5)
Serve(178-256)Listen(73-79)Insecure(126-133)WithMetricsServer(160-165)WithMetricsRegistry(169-174)proto/v1beta1/run_function_grpc.pb.go (1)
NewFunctionRunnerServiceClient(63-65)proto/v1/run_function_grpc.pb.go (1)
NewFunctionRunnerServiceClient(61-63)
🪛 ast-grep (0.39.6)
sdk_test.go
[warning] 364-364: "Detected a network listener listening on 0.0.0.0 or an empty string.
This could unexpectedly expose the server publicly as it binds to all
available interfaces. Instead, specify another IP address that is not
0.0.0.0 nor the empty string."
Context: net.Listen("tcp", ":0")
Note: [CWE-200] Exposure of Sensitive Information to an Unauthorized Actor [REFERENCES]
- https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-bind-to-all-interfaces-go)
🔇 Additional comments (1)
go.mod (1)
12-12: No update required for providers/prometheus version.
providers/prometheus v1.1.0 is the latest released tag and is compatible with the main v2 middleware; no changes needed.
Signed-off-by: Siddharth <[email protected]>
85b20ad to
32739b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (9)
sdk_test.go (5)
225-225: Replace hard-coded sleep with proper synchronization.Hard-coded
time.Sleep(3 * time.Second)is brittle and can fail in slow CI environments or succeed despite failures in fast environments.This issue was already raised in a previous review. Consider implementing a retry/polling mechanism that waits for the server to become ready.
247-247: Replace hard-coded sleep with proper synchronization.Hard-coded
time.Sleep(2 * time.Second)is brittle and can fail in slow CI environments.This issue was already raised in a previous review. Consider implementing a retry/polling mechanism.
292-292: Use ephemeral port instead of hardcoded 8080.Line 292 hardcodes
metricsPort := 8080, which can cause port conflicts in CI or when running tests in parallel.This issue was already raised in a previous review. Use
getAvailablePort(t)consistently.
304-304: Replace hard-coded sleep with proper synchronization.Hard-coded
time.Sleep(3 * time.Second)is brittle.This issue was already raised in a previous review.
328-328: Replace hard-coded sleep with proper synchronization.Hard-coded
time.Sleep(2 * time.Second)is brittle.This issue was already raised in a previous review.
sdk.go (4)
179-185: Unsafe type assertion could panic.This issue was raised in a previous review. The forced type assertion on line 185 could panic if
prometheus.DefaultRegistereris not a*prometheus.Registry. While this is unlikely in practice, it violates Go best practices for type assertions.
209-224: Custom interceptors are ignored when metrics are disabled.This issue was raised in a previous review. When
MetricsAddressis empty,so.UnaryInterceptorsare completely ignored. While the author indicated this is intentional since interceptors are new in this PR, the coupling should be documented (see comment on lines 62-65).
250-251: Metrics server startup errors are silently discarded.This issue was raised in a previous review. While the author indicated they don't want to panic, completely ignoring errors means users won't know if the metrics server failed to start (e.g., port already in use).
Per previous review suggestions, consider at least logging the error rather than fully discarding it.
62-65: Document the coupling between UnaryInterceptors and metrics.The
UnaryInterceptorsfield is only applied whenMetricsAddressis non-empty (see lines 213-224). This coupling is not obvious from the struct definition and could confuse users who expect interceptors to work independently.Add a comment to clarify:
// Metrics options MetricsAddress string MetricsRegistry *prometheus.Registry + // UnaryInterceptors are custom gRPC interceptors. Note: These are only + // applied when metrics are enabled (i.e., when MetricsAddress is non-empty). UnaryInterceptors []grpc.UnaryServerInterceptor
🧹 Nitpick comments (4)
sdk_test.go (2)
199-278: Consider simplifying test to focus on SDK behavior.While it's valuable to verify that metrics are exposed when configured, the test assertions on lines 269-276 are primarily testing the Prometheus and gRPC middleware libraries rather than the SDK's integration logic. Per reviewer feedback, tests should focus on SDK code.
Consider simplifying to:
- Verify the metrics HTTP endpoint is reachable
- Verify at least one expected metric type is present
- Remove detailed assertions about Prometheus format and specific metric names
This reduces maintenance burden while still validating the core SDK integration.
280-359: Consider simplifying test to focus on SDK behavior.Similar to the custom registry test, this test could be simplified to focus on SDK behavior rather than testing Prometheus library functionality.
Given that negz commented on test verbosity, consider consolidating or simplifying this test to verify only that:
- The metrics server starts with default settings
- The endpoint is accessible
sdk.go (1)
237-239: Remove redundant nil check.The nil check for
metricson line 237 is redundant sincemetricscan only be non-nil when assigned on line 216, which is inside the sameso.MetricsAddress != ""block.Apply this diff:
// Start metrics server if address is provided if so.MetricsAddress != "" { // Initialize metrics for the gRPC server - if metrics != nil { - metrics.InitializeMetrics(srv) - } + metrics.InitializeMetrics(srv) // Use the registry for metrics handler handler := promhttp.HandlerFor(so.MetricsRegistry, promhttp.HandlerOpts{})go.mod (1)
12-14: Bump Prometheus and gRPC middleware to latest versions.
– client_golang: v1.19.1 → v1.23.2
– go-grpc-middleware: v1.1.0 → v2.3.2
No known security issues affect the current client_golang version (patched in v1.11.1).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (3)
go.mod(4 hunks)sdk.go(6 hunks)sdk_test.go(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sdk_test.go (2)
sdk.go (5)
Serve(178-256)Listen(73-79)Insecure(126-133)WithMetricsServer(160-165)WithMetricsRegistry(169-174)proto/v1/run_function_grpc.pb.go (1)
NewFunctionRunnerServiceClient(61-63)
🪛 ast-grep (0.39.6)
sdk_test.go
[warning] 364-364: "Detected a network listener listening on 0.0.0.0 or an empty string.
This could unexpectedly expose the server publicly as it binds to all
available interfaces. Instead, specify another IP address that is not
0.0.0.0 nor the empty string."
Context: net.Listen("tcp", ":0")
Note: [CWE-200] Exposure of Sensitive Information to an Unauthorized Actor [REFERENCES]
- https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-bind-to-all-interfaces-go)
🔇 Additional comments (4)
sdk_test.go (1)
361-371: LGTM! Good use of ephemeral ports for testing.The
getAvailablePorthelper properly uses ephemeral ports to avoid conflicts in parallel test execution.Note: The static analysis warning about binding to 0.0.0.0 is a false positive in this context since this is test-only code using ephemeral ports.
sdk.go (3)
51-51: Verify default port 8080 is appropriate for production use.Port 8080 is commonly used and may conflict with other services. Consider whether this default is appropriate or if it should be documented more prominently.
Since metrics are now enabled by default (per the conversation in past reviews), ensure that:
- Documentation clearly states port 8080 is used by default
- Users are aware they can override with
WithMetricsServer- The chosen port doesn't conflict with common Crossplane deployment patterns
157-164: LGTM! Clear documentation for metrics server option.The documentation correctly states that metrics collection is enabled when the address is non-empty.
167-173: LGTM! Documentation correctly clarified.The documentation now correctly notes that metrics are only enabled when
MetricsAddressis non-empty, addressing previous feedback.
negz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for persisting with this @sagarwal16-atlassian!
Description of your changes
This PR adds comprehensive Prometheus metrics collection and exposure capabilities to the Crossplane Function SDK, enabling observability into function performance and behaviour.
Changes
Metrics Collection: Added gRPC interceptors using grpc-ecosystem/go-grpc-middleware for automatic request/response metrics
Metrics Server: Added HTTP server to expose metrics on configurable endpoint (default: :8080/metrics)
Configuration Options:
WithMetricsServer(address) - Configure metrics server address
WithMetricsRegistry(registry) - Use custom Prometheus registry
Default Metrics: Automatic collection of request duration, count, size, and error rates
Backward Compatibility: All metrics features are opt-in via serve options
Fixes #20
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
TestMetricsServer_WithDefaultRegistryAndDefaultPortHow It Works:
WithMetricsServer("")(metrics should be enabled by default):8080and default prometheus metrics registryhttp://localhost:8080/metricsreturns response with prometheus metricsTestMetricsServer_WithCustomRegistryAndCustomPortHow It Works:
WithMetricsServer(<custom-port>)http://localhost:<custom-port>/metricsreturns response with prometheus metricsSummary by CodeRabbit