feat: add Prometheus Pushgateway support for CLI apps #3176

Open
coolwednesday wants to merge 1 commit into gofr-dev:development from coolwednesday:feature/metrics-pushgateway-cli

@coolwednesday commented Mar 17, 2026

Summary

CLI applications are short-lived — they exit before Prometheus can scrape /metrics. This PR adds push-based metrics export via Prometheus Pushgateway for GoFr CLI apps, along with automatic CLI command metrics.

Closes #2232

What's included

  • Pushgateway integration (pkg/gofr/metrics/exporters/pushgateway.go): Wraps push.Pusher with prometheus.DefaultGatherer to push all collected metrics on shutdown
  • MeterProvider lifecycle fix (exporters.Prometheus() now returns both Meter and MeterProvider): Ensures buffered metrics are flushed on Container.Close()
  • Auto CLI metrics in cmd.go following the existing cron metrics pattern:
    • app_cmd_duration_seconds (histogram)
    • app_cmd_success_total (counter)
    • app_cmd_errors_total (counter)
  • CLI shutdown path in run.go: Calls Shutdown() after cmd.Run() to flush metrics and close resources
  • Config-driven: Set METRICS_PUSH_GATEWAY_URL env var to enable (CLI only, not HTTP apps)
  • Docker observability stack for sample-cmd example: pushgateway + prometheus + grafana
  • CMD Metrics panels added to the existing Grafana dashboard in http-server example (duration p95, success rate, error rate)

Design decisions

  • Pushgateway is wired in NewCMD() only — HTTP apps continue using pull-based scraping
  • Container owns the pushgateway and flushes on Close(), keeping cmd struct clean
  • No new interfaces — uses concrete *exporters.PushGateway type directly
  • Uses prometheus.DefaultGatherer which reads from the same default registry the OTel Prometheus exporter writes to

Test plan

  • go build ./... compiles
  • go test ./pkg/gofr/... ./pkg/gofr/metrics/... ./pkg/gofr/container/... passes
  • golangci-lint run clean (no new issues)
  • cd examples/sample-cmd/docker && docker-compose up — verify pushgateway receives app_cmd_* metrics
  • Check Grafana at localhost:3000 for new CMD panels

CLI apps are short-lived and exit before Prometheus can scrape /metrics.
This adds push-based metrics export via Pushgateway, configured through
METRICS_PUSH_GATEWAY_URL env var, along with auto CLI metrics tracking
(duration, success/error counters) and observability infrastructure.

Closes gofr-dev#2232

@Umang01-hash left a comment

  1. Issue #2232 explicitly listed "Support cleanup (optional) so old metrics don't pile up" as a requirement. Every CronJob run permanently adds a job group to the Pushgateway. Please add a Delete(ctx context.Context) error method on PushGateway using pusher.DeleteContext(ctx), plus a METRICS_PUSH_GATEWAY_DELETE_ON_FINISH=true env var to opt in.

  2. All apps without APP_NAME set push under the same job group and silently overwrite each other. Change the fallback to filepath.Base(os.Args[0]) or add a dedicated METRICS_PUSH_GATEWAY_JOB env var override.

  3. The current maximum bucket for app_cmd_duration_seconds is 60s, while the cron buckets extend to 3600s. A 5-minute batch job therefore lands only in the +Inf bucket. Align the upper boundary with app_cron_duration_seconds.

  4. Metric naming is inconsistent with cron:
    app_cmd_errors_total → app_cmd_failures (match cron's _failures)
    app_cmd_success_total → app_cmd_success (match cron's no-_total naming)
    Add app_cmd_total (match cron's app_cron_job_total)

  5. Move metricServer.Shutdown(ctx) before container.Close() in Shutdown() so the Prometheus scrape endpoint stops accepting requests before the OTel meter provider is shut down.
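
To make point 3 concrete, here is a minimal sketch of how an observation is assigned to a histogram bucket. The boundary slices are hypothetical and chosen only for illustration (the actual GoFr boundaries are not quoted in this thread), and bucketLabel is an illustrative helper, not part of GoFr or the Prometheus client.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketLabel returns the "le" label of the first bucket whose upper bound is
// >= v, or "+Inf" when v exceeds every finite bound.
// bounds must be sorted in ascending order.
func bucketLabel(bounds []float64, v float64) string {
	i := sort.SearchFloat64s(bounds, v) // smallest i with bounds[i] >= v
	if i == len(bounds) {
		return "+Inf"
	}
	return fmt.Sprintf("%g", bounds[i])
}

func main() {
	// Hypothetical boundaries in seconds, for illustration only.
	cmdBounds := []float64{0.1, 0.5, 1, 5, 10, 30, 60}
	cronLike := []float64{0.1, 0.5, 1, 5, 10, 30, 60, 300, 600, 1800, 3600}

	const fiveMinutes = 300.0
	fmt.Println(bucketLabel(cmdBounds, fiveMinutes)) // +Inf
	fmt.Println(bucketLabel(cronLike, fiveMinutes))  // 300
}
```

When every long run lands in +Inf, quantile estimates such as the dashboard's p95 cannot resolve anything above the last finite boundary, which is why the wider range matters for batch-style commands.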

}

if c.pushGateway != nil {
err = errors.Join(err, c.pushGateway.Push(context.Background()))

The Push call has no timeout because it is given context.Background().

If the Pushgateway is unreachable because a firewall silently drops the traffic (a black hole), the CLI hangs indefinitely at exit.

File file.FileSystem

meterProvider meterProviderShutdowner
pushGateway *exporters.PushGateway

pushGateway is stored as a concrete type. GoFr's convention is "take interfaces, return concrete types", and this same PR correctly introduces meterProviderShutdowner as an interface for meterProvider. Please apply the same pattern:

  type metricsFlusher interface {
      Push(context.Context) error
  }

  pushGateway metricsFlusher

Without this, Container.Close()'s pushgateway path can only be unit-tested with a real HTTP endpoint (or httptest.Server), not with a simple mock.

Comment on lines +79 to +83
if url := app.Config.Get("METRICS_PUSH_GATEWAY_URL"); url != "" {
jobName := app.Config.GetOrDefault("APP_NAME", "gofr-app")
app.container.SetPushGateway(exporters.NewPushGateway(url, jobName, app.container.Logger))
}


This code has no test coverage. Let's add tests for it.
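
One way to make the wiring cheap to test, sketched under assumptions: pull the config reads into a small pure function that takes a lookup callback. resolvePushConfig and pushConfig are hypothetical names, the URL is a placeholder, and the job-name precedence (a METRICS_PUSH_GATEWAY_JOB override, then APP_NAME, then the binary's base name) follows the suggestion in point 2 of the review rather than the code currently in the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pushConfig is a hypothetical value type holding the resolved settings.
type pushConfig struct {
	URL string
	Job string
}

// resolvePushConfig reads settings through get (any key lookup, such as a
// config accessor) and reports whether Pushgateway export is enabled.
// Job name precedence: METRICS_PUSH_GATEWAY_JOB, then APP_NAME, then the
// base name of the binary path argv0.
func resolvePushConfig(get func(string) string, argv0 string) (pushConfig, bool) {
	url := get("METRICS_PUSH_GATEWAY_URL")
	if url == "" {
		return pushConfig{}, false
	}

	job := get("METRICS_PUSH_GATEWAY_JOB")
	if job == "" {
		job = get("APP_NAME")
	}
	if job == "" {
		job = filepath.Base(argv0)
	}

	return pushConfig{URL: url, Job: job}, true
}

func main() {
	// A plain map is enough to drive the function; no config mocking needed.
	env := map[string]string{"METRICS_PUSH_GATEWAY_URL": "http://localhost:9091"}
	get := func(k string) string { return env[k] }

	cfg, ok := resolvePushConfig(get, "/usr/local/bin/report-gen")
	fmt.Println(ok, cfg.Job) // true report-gen
}
```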

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutDownTimeout)
defer cancel()

if err := a.Shutdown(shutdownCtx); err != nil {

a.Shutdown() is called sequentially after a.cmd.Run(). If the handler panics, the stack unwinds and Shutdown() is never reached — metrics are not pushed and no container cleanup happens.

Maybe we can defer the shutdown?

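  defer func() {
      // Start the timeout only when shutdown begins, so a long-running
      // command does not use it up before Shutdown() is called.
      shutdownCtx, cancel := context.WithTimeout(context.Background(), shutDownTimeout)
      defer cancel()

      if err := a.Shutdown(shutdownCtx); err != nil {
          a.Logger().Errorf("CLI shutdown error: %v", err)
      }
  }()

  a.cmd.Run(a.container)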

// PushGateway pushes metrics from the default Prometheus registry to a Pushgateway.
type PushGateway struct {
pusher *push.Pusher
logger logger

Why can't we use logging.Logger directly here? Why do we need a new logger interface?

@coolwednesday

Regarding Comment 1 (Delete support / METRICS_PUSH_GATEWAY_DELETE_ON_FINISH):

The Pushgateway documentation explicitly states that the Pushgateway is designed as a metric cache — the standard recommendation is to not delete pushed metrics, and instead use job and instance labels to distinguish runs.

If you push and immediately delete, Prometheus may not have scraped yet (typical scrape interval is 15–30s), and the metrics are lost forever. There's no reliable way for the CLI to know whether Prometheus has completed its scrape before issuing a delete.

For users who need cleanup of stale metrics, this is best handled at the Pushgateway operational level (e.g., an external cron job that prunes old job groups through the Pushgateway's delete API), not at the framework level. Baking delete into the framework adds a footgun that's hard to use safely by default.

This can always be revisited in a follow-up if users explicitly request it, but for v1 the "push and leave" approach is the correct and safe default.

@coolwednesday

Regarding Comment 5 (Shutdown order — move metricServer.Shutdown before container.Close):

The current shutdown order is actually correct:

httpServer.Shutdown → grpcServer.Shutdown → container.Close() → metricServer.Shutdown

The /metrics HTTP endpoint should stay alive as long as possible so Prometheus can scrape final metrics. Shutting it down earlier would mean Prometheus misses the last scrape window.

For the Pushgateway path specifically, the push happens inside container.Close() before the meter provider shuts down — which is the right sequence (push metrics first, then tear down the provider).

@coolwednesday

Regarding Comment 8 (Factory.go test coverage):

The new pushgateway wiring in factory.go is 4 lines of config-read + constructor call. The core logic (NewPushGateway, Push) is already covered in pushgateway_test.go. Writing a proper test for the factory wiring requires heavy config mocking for minimal additional coverage. Deferring this to a follow-up PR.

Development

Successfully merging this pull request may close these issues.

Support emitting metrics from CLI applications

2 participants