Feature Request: Subscription Health Monitoring Endpoint #128

@sonathaj

Problem Statement

We need to monitor subscription lag in Grafana dashboards but cannot calculate lag from existing endpoints because:

  1. /api/resources/v1/subscriptions returns status.stream.ackedOffset but NOT partition length
  2. Getting partition metadata requires calling separate EventStore/partition APIs
  3. Grafana cannot reliably join data from multiple API calls or compute derived values such as lag across them
  4. This forces us to use external ETL processes, adding complexity and delay

We would appreciate having a single endpoint that returns subscription health with pre-calculated lag.

What We Need

A monitoring-optimized endpoint that returns subscription health metrics in a format Grafana can consume directly.

Proposed Endpoint

HTTP Request

GET /api/resources/v1/subscriptions/health

Query Parameters

Parameter      Type    Required  Description
namespace      string  No        Filter subscriptions by namespace
labelSelector  string  No        Comma-separated label selectors (e.g., app=content,env=prod)

Response Model

The endpoint MUST return a JSON array at the root level (not wrapped in an object) for Grafana compatibility.

[
  {
    "name": "string",              // Subscription name (required)
    "namespace": "string",         // Subscription namespace (required)
    "phase": "string",             // active | inactive | failed (required)
    "partitionId": "string",       // Partition identifier (nullable)
    "ackedOffset": 51523,          // Last acknowledged offset (nullable, number)
    "partitionLength": 51650,      // Current partition length (nullable, number)
    "lag": 127,                    // Calculated lag: partitionLength - ackedOffset (nullable, number)
    "subscriberState": "string",   // reachable | unreachable (nullable)
    "subscriberReason": "string"   // Error message if unreachable (nullable)
  }
]
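
The lag field in the model above is a plain difference that has to stay null whenever either input is null. A minimal Python sketch of that rule follows; `compute_lag` is a hypothetical helper name used for illustration, not part of any existing API.

```python
from typing import Optional


def compute_lag(acked_offset: Optional[int],
                partition_length: Optional[int]) -> Optional[int]:
    """Return partitionLength - ackedOffset, or None if either value is unknown."""
    if acked_offset is None or partition_length is None:
        return None
    return partition_length - acked_offset


# Values taken from the response model above
print(compute_lag(51523, 51650))  # 127
print(compute_lag(None, 51650))   # None
```

Returning None rather than 0 for unknown inputs keeps "no data" distinguishable from "fully caught up" in dashboards.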

Sample Response

Example 1: Multiple Healthy Subscriptions

[
  {
    "name": "content-subscription",
    "namespace": "mozart",
    "phase": "active",
    "partitionId": "https://test1.com",
    "ackedOffset": 51523,
    "partitionLength": 51650,
    "lag": 127,
    "subscriberState": "reachable",
    "subscriberReason": null
  },
  {
    "name": "desktop-events-subscription",
    "namespace": "mozart",
    "phase": "active",
    "partitionId": "https://test2.com",
    "ackedOffset": 98234,
    "partitionLength": 98650,
    "lag": 416,
    "subscriberState": "reachable",
    "subscriberReason": null
  }
]

Example 2: Subscription with Problems

[
  {
    "name": "failing-subscription",
    "namespace": "mozart",
    "phase": "active",
    "partitionId": "https://some-adapter.com",
    "ackedOffset": 45000,
    "partitionLength": 46523,
    "lag": 1523,
    "subscriberState": "unreachable",
    "subscriberReason": "HTTP 503 Service Unavailable: Connection refused"
  },
  {
    "name": "inactive-subscription",
    "namespace": "mozart",
    "phase": "inactive",
    "partitionId": "https://disabled-adapter.com",
    "ackedOffset": null,
    "partitionLength": null,
    "lag": null,
    "subscriberState": null,
    "subscriberReason": null
  }
]

Example 3: Filtered by Namespace

GET /api/resources/v1/subscriptions/health?namespace=mozart

Returns only subscriptions in the "mozart" namespace.

Example 4: Filtered by Label

GET /api/resources/v1/subscriptions/health?labelSelector=app=content,critical=true

Returns only subscriptions matching both labels.
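
One possible reading of the labelSelector semantics, sketched in Python: each comma-separated term is a key=value equality check, and a subscription matches only if all terms match. This assumes equality-only selectors; the request does not say whether richer selector syntax should be supported, and the function names below are hypothetical.

```python
from typing import Dict


def parse_label_selector(selector: str) -> Dict[str, str]:
    """Parse 'app=content,critical=true' into a dict of required labels."""
    required: Dict[str, str] = {}
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"invalid label selector term: {part!r}")
        required[key.strip()] = value.strip()
    return required


def matches(labels: Dict[str, str], required: Dict[str, str]) -> bool:
    """True only if every required key=value pair is present (AND semantics)."""
    return all(labels.get(key) == value for key, value in required.items())


required = parse_label_selector("app=content,critical=true")
print(matches({"app": "content", "critical": "true", "env": "prod"}, required))  # True
print(matches({"app": "content"}, required))  # False
```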

Example 5: No Matching Subscriptions

[]

The endpoint returns an empty array when no subscriptions match the filters.

Implementation Requirements

Critical Requirements for Grafana Integration

  1. Root-Level Array: Response MUST be a JSON array at root level, NOT wrapped in an object:

    // ✅ CORRECT
    [{"name": "sub1", ...}, {"name": "sub2", ...}]
    
    // ❌ WRONG - Grafana won't parse this
    {"items": [...], "count": 2}

  2. Data Types:

    • lag, ackedOffset, partitionLength MUST be numeric types (not strings)
    • Null values are acceptable for nullable fields

  3. Consistent Fields:

    • phase values: active, inactive, failed
    • subscriberState values: reachable, unreachable, null
    • All fields present in every array item (use null for missing data)

  4. Error Handling:

    • If partition metadata is unavailable: include the subscription with partitionLength=null, lag=null
    • Never filter out subscriptions due to missing data
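
The error-handling requirement can be sketched server-side as follows. This is an illustration under assumptions, not the service's actual implementation: the flattened input dicts and the `get_partition_length` callback are hypothetical stand-ins for however the server reads subscriptions and event store metadata.

```python
from typing import Any, Callable, Dict, List, Optional


def build_health(
    subscriptions: List[Dict[str, Any]],
    get_partition_length: Callable[[str], Optional[int]],  # hypothetical lookup
) -> List[Dict[str, Any]]:
    """Build one health record per subscription, never dropping any."""
    records: List[Dict[str, Any]] = []
    for sub in subscriptions:
        acked = sub.get("ackedOffset")
        partition_id = sub.get("partitionId")
        try:
            length = get_partition_length(partition_id) if partition_id else None
        except Exception:
            # Partition metadata unavailable: keep the subscription, report nulls.
            length = None
        known = length is not None and acked is not None
        records.append({
            "name": sub["name"],
            "namespace": sub["namespace"],
            "phase": sub["phase"],
            "partitionId": partition_id,
            "ackedOffset": acked,
            "partitionLength": length,
            "lag": length - acked if known else None,
            "subscriberState": sub.get("subscriberState"),
            "subscriberReason": sub.get("subscriberReason"),
        })
    return records


def fake_lookup(partition_id: str) -> Optional[int]:
    """Stand-in for the event store: knows one partition, fails for the rest."""
    if partition_id == "https://test1.com":
        return 51650
    raise LookupError("partition metadata unavailable")


subs = [
    {"name": "content-subscription", "namespace": "mozart", "phase": "active",
     "partitionId": "https://test1.com", "ackedOffset": 51523},
    {"name": "other-subscription", "namespace": "mozart", "phase": "active",
     "partitionId": "https://unknown.example", "ackedOffset": 10},
]
for record in build_health(subs, fake_lookup):
    print(record["name"], record["partitionLength"], record["lag"])
```

Both subscriptions appear in the output; the second one carries null for partitionLength and lag instead of being filtered out.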

Grafana Use Cases

What we need to build:

  • Table showing subscription lag with color thresholds (red if lag > 1000)
  • Time series graph of lag over time (one line per subscription)
  • Alerts: trigger when lag > threshold for 5 minutes
  • Status panels: count of active/inactive/failed subscriptions

Critical for Grafana:

  • Root-level JSON array (not wrapped in object)
  • Numeric types for lag, ackedOffset, partitionLength (not strings)
  • Consistent field names across all responses
  • Empty array [] when no results (not null or missing)
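
These constraints are easy to check mechanically, for example in a contract test. The sketch below is one way to do that in Python, assuming the field list from the proposed response model; `check_health_payload` is a hypothetical helper name.

```python
import json
from typing import Any, List

REQUIRED_FIELDS = (
    "name", "namespace", "phase", "partitionId", "ackedOffset",
    "partitionLength", "lag", "subscriberState", "subscriberReason",
)
NUMERIC_FIELDS = ("ackedOffset", "partitionLength", "lag")


def check_health_payload(body: str) -> List[str]:
    """Return a list of problems found; an empty list means the payload passes."""
    data: Any = json.loads(body)
    if not isinstance(data, list):
        return ["root is not a JSON array"]
    problems: List[str] = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in item:
                problems.append(f"item {i}: missing field {field}")
        for field in NUMERIC_FIELDS:
            value = item.get(field)
            # bool is a subclass of int in Python, so exclude it explicitly
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if value is not None and not is_number:
                problems.append(f"item {i}: {field} is not numeric")
    return problems


print(check_health_payload("[]"))               # []
print(check_health_payload('{"items": []}'))    # ['root is not a JSON array']
```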

Why Not Use Existing /api/resources/v1/subscriptions Endpoint?

The existing endpoint doesn't have partition length data:

// Current /subscriptions response
{
  "items": [{
    "status": {
      "stream": {
        "ackedOffset": 51523  // ✅ We have this
      }
    }
  }]
}
// ❌ No partition length to calculate lag!

To calculate lag we would need to:

  1. Call /api/resources/v1/subscriptions to get ackedOffset
  2. Call another endpoint (partition metadata API) to get partitionLength
  3. Use Grafana transformations to join the data and calculate lag = partitionLength - ackedOffset

This is impractical because:

  • Grafana's transformation capabilities are limited
  • Cannot reliably join data from multiple datasources
  • Performance issues from multiple API calls every 15-30 seconds
  • Complex transformations make dashboards fragile and unmaintainable

The health endpoint solves this by:

  • Fetching both ackedOffset (from subscription) and partitionLength (from event store) server-side
  • Pre-calculating lag
  • Returning flat, monitoring-optimized structure
  • Serving everything in a single API call, with a target response time under 500 ms

Labels: enhancement (New feature or request)
