New monitoring metrics #198
Replies: 4 comments
-
Adjusted plan for memory-stable monitoring

Final specification for production-ready metrics with bounded cardinality and lifecycle cleanup.

1. Metric Definitions

1.1 Common Metrics (All Components)

Gauges (Snapshot Metrics)
Derived metrics (via PromQL):
Cleanup: None needed - gauges are reset and repopulated on every scrape.

Counters (Event-Driven)
Cleanup: None needed - labels are bounded enums.

Histograms (Latency Metrics)
Bucket configuration:

```rust
// Reduced from 14 to 7 buckets (50% memory savings)
const LATENCY_BUCKETS: &[f64] = &[
0.001, // 1ms
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
0.5, // 500ms
1.0, // 1s
5.0, // 5s
];
```

Cleanup: None needed - labels are bounded.
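As a sketch of how the shared bucket constant above could be wired into a labelled histogram (assuming the `prometheus` crate used elsewhere in this plan; the metric name is one of the latency metrics proposed in this discussion, and the `message_type` label is an assumption, not final):

```rust
use prometheus::{register_histogram_vec, HistogramVec};

// Registers one of the common latency histograms with the shared bucket layout.
// 7 buckets keep the per-label-value cost at 7 time series plus _sum and _count.
fn register_message_latency() -> Result<HistogramVec, prometheus::Error> {
    register_histogram_vec!(
        "sv2_message_processing_latency_seconds",
        "Time spent processing one inbound SV2 message",
        &["message_type"],       // bounded: SV2 message types are a fixed set
        LATENCY_BUCKETS.to_vec() // the 7-bucket constant defined above
    )
}
```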
1.2 Pool-Specific Metrics

Gauges

Counters
Histograms
Bucket configuration for block operations:

```rust
const BLOCK_LATENCY_BUCKETS: &[f64] = &[
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
0.5, // 500ms
1.0, // 1s
5.0, // 5s
10.0, // 10s
];
```

1.3 JDC-Specific Metrics

Gauges
Counters
Histograms
1.4 Tproxy-Specific Metrics

Gauges
Counters
Histograms
Fast operation buckets (microsecond scale):

```rust
const TRANSLATION_LATENCY_BUCKETS: &[f64] = &[
0.0001, // 100μs
0.001, // 1ms
0.005, // 5ms
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
];
```

2. Lifecycle Cleanup Logic

2.1 Gauge Cleanup (Current Implementation)

Strategy: Reset and repopulate on every scrape.

```rust
async fn handle_prometheus_metrics(State(state): State<ServerState>) -> Response {
// Update system metrics
let uptime_secs = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_secs() - state.start_time;
state.metrics.sv2_uptime_seconds.set(uptime_secs as f64);
// Aggregate metrics are set directly (no labels to clean up)
state.metrics.channels_active.set(get_active_channel_count() as f64);
state.metrics.hashrate_total.set(get_total_hashrate() as f64);
// Encode and return metrics
let encoder = TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = vec![];
encoder.encode(&metric_families, &mut buffer).unwrap();
Response::builder()
.status(200)
.header(CONTENT_TYPE, encoder.format_type())
.body(Body::from(buffer))
.unwrap()
}
```

Key point: No label cleanup needed - all gauges use aggregate values or no labels.

2.2 Counter/Histogram Cleanup (Not Needed)

Strategy: Use bounded labels only - no cleanup required. All counters and histograms use:
Example - bounded enum ensures no leak:

```rust
// This will never create more than 5 time series
metrics.shares_rejected_total
.with_label_values(&[reason.as_str()]) // reason is enum with 5 variants
.inc();
// This will never create more than 180 time series
// 3 components × 20 error types × 3 severities = 180
metrics.errors_total
.with_label_values(&[
component.as_str(), // 3 variants
error_type.as_str(), // ~20 variants per component
severity.as_str(), // 3 variants
])
    .inc();
```
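For illustration, a bounded reason enum might look like this (the variant names are hypothetical; the point is that `as_str()` can only ever produce five label values):

```rust
// Hypothetical enum backing the `reason` label above. Because the set of
// variants is fixed at compile time, the label can never grow unbounded.
#[derive(Clone, Copy, Debug)]
pub enum RejectReason {
    DifficultyTooLow,
    StaleJob,
    DuplicateShare,
    InvalidJobId,
    Other,
}

impl RejectReason {
    pub fn as_str(&self) -> &'static str {
        match self {
            RejectReason::DifficultyTooLow => "difficulty-too-low",
            RejectReason::StaleJob => "stale-job",
            RejectReason::DuplicateShare => "duplicate-share",
            RejectReason::InvalidJobId => "invalid-job-id",
            RejectReason::Other => "other",
        }
    }
}
```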
2.3 Optional: User-Level Metrics (Feature-Gated)

Only for private mining operations with stable user sets.

```toml
# Cargo.toml
[features]
user-level-metrics = []
```

```rust
// Code
#[cfg(feature = "user-level-metrics")]
pub struct UserMetrics {
pub shares_accepted_by_user: IntCounterVec,
pub hashrate_by_user: GaugeVec,
}
#[cfg(feature = "user-level-metrics")]
impl UserMetrics {
pub fn new() -> Result<Self, prometheus::Error> {
tracing::warn!(
"user-level-metrics enabled: unbounded memory growth if used in public pool"
);
Ok(Self {
shares_accepted_by_user: register_int_counter_vec!(
"sv2_shares_accepted_by_user_total",
"Shares accepted per user (PRIVATE OPS ONLY)",
&["user_identity"]
)?,
hashrate_by_user: register_gauge_vec!(
"sv2_hashrate_by_user",
"Hashrate per user (PRIVATE OPS ONLY)",
&["user_identity"]
)?,
})
}
}
// Lifecycle cleanup for user metrics (with grace period)
struct UserCleanupTracker {
disconnected_users: HashMap<String, Instant>,
grace_period: Duration,
}
impl UserCleanupTracker {
async fn cleanup_expired(&mut self, metrics: &UserMetrics) {
let now = Instant::now();
let expired: Vec<_> = self.disconnected_users
.iter()
.filter(|(_, disconnected_at)| {
now.duration_since(**disconnected_at) > self.grace_period
})
.map(|(user, _)| user.clone())
.collect();
for user in &expired {
let _ = metrics.shares_accepted_by_user.remove_label_values(&[user]);
let _ = metrics.hashrate_by_user.remove_label_values(&[user]);
self.disconnected_users.remove(user);
tracing::debug!("Cleaned up metrics for user {} after grace period", user);
}
}
}
```
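A sketch of how the tracker could be driven, assuming a tokio runtime and that disconnect events can reach it; `run_user_metric_cleanup`, `on_user_disconnected`, and the 60-second tick are illustrative choices, not part of the spec (and would sit behind the same `user-level-metrics` feature gate):

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

// Periodic sweep: every minute, drop metrics for users whose grace period expired.
async fn run_user_metric_cleanup(
    tracker: Arc<Mutex<UserCleanupTracker>>,
    user_metrics: Arc<UserMetrics>,
) {
    let mut tick = tokio::time::interval(Duration::from_secs(60));
    loop {
        tick.tick().await;
        tracker.lock().await.cleanup_expired(&user_metrics).await;
    }
}

// Hook called from the connection teardown path; starts the grace period.
async fn on_user_disconnected(tracker: &Mutex<UserCleanupTracker>, user_identity: &str) {
    tracker
        .lock()
        .await
        .disconnected_users
        .insert(user_identity.to_string(), Instant::now());
}
```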
3. Memory Budget

3.1 Baseline (Without User-Level Metrics)

Breakdown:
3.2 With User-Level Metrics (Private Ops)
Total for 50-user private operation: ~40 KB (baseline 30 KB + user metrics 10 KB).

4. Prometheus Recording Rules

Pre-compute common aggregations for faster dashboard queries.

```yaml
# /etc/prometheus/rules/sv2_rules.yml
groups:
  - name: sv2_derived_metrics
    interval: 15s
    rules:
      # Derived totals from granular metrics
      - record: sv2:server_channels:total
        expr: sv2_server_channels_extended + sv2_server_channels_standard
      - record: sv2:client_channels:total
        expr: sv2_client_channels_extended + sv2_client_channels_standard
      - record: sv2:channels:active
        expr: |
          sv2_server_channels_extended + sv2_server_channels_standard +
          sv2_client_channels_extended + sv2_client_channels_standard
      - record: sv2:hashrate:total
        expr: sv2_server_hashrate + sv2_client_hashrate
  - name: sv2_traffic
    interval: 30s
    rules:
      # Share rates
      - record: sv2:shares_accepted:rate5m
        expr: rate(sv2_shares_accepted_total[5m])
      - record: sv2:shares_rejected:rate5m
        expr: rate(sv2_shares_rejected_total[5m])
      # Share rejection ratio
      - record: sv2:shares:rejection_ratio
        expr: |
          rate(sv2_shares_rejected_total[5m])
          /
          (rate(sv2_shares_accepted_total[5m]) + rate(sv2_shares_rejected_total[5m]))
  - name: sv2_latency
    interval: 30s
    rules:
      # Pre-compute percentiles (expensive queries)
      - record: sv2:share_validation_latency:p50
        expr: histogram_quantile(0.50, rate(sv2_share_validation_latency_seconds_bucket[5m]))
      - record: sv2:share_validation_latency:p95
        expr: histogram_quantile(0.95, rate(sv2_share_validation_latency_seconds_bucket[5m]))
      - record: sv2:share_validation_latency:p99
        expr: histogram_quantile(0.99, rate(sv2_share_validation_latency_seconds_bucket[5m]))
  - name: sv2_errors
    interval: 30s
    rules:
      # Error rate by severity
      - record: sv2:errors:rate5m:by_severity
        expr: sum by (severity) (rate(sv2_errors_total[5m]))
      # Error rate by component
      - record: sv2:errors:rate5m:by_component
        expr: sum by (component) (rate(sv2_errors_total[5m]))
      # Total error rate
      - record: sv2:errors:rate5m
        expr: sum(rate(sv2_errors_total[5m]))
  - name: sv2_saturation
    interval: 30s
    rules:
      # Channel utilization
      - record: sv2:channels:utilization
        expr: sv2:channels:active / sv2_channels_max
      # Memory utilization (if limit is configured)
      - record: sv2:memory:utilization
        expr: sv2_memory_used_bytes / sv2_memory_limit_bytes
```

Load into Prometheus:

```yaml
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/sv2_rules.yml
```

5. Example Alert Rules

```yaml
# /etc/prometheus/alerts/sv2_alerts.yml
groups:
  - name: sv2_critical
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sv2:errors:rate5m{severity="error"} > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.component }}"
          description: "Error rate is {{ $value }} errors/sec"
      # High share validation latency
      - alert: HighShareValidationLatency
        expr: sv2:share_validation_latency:p95 > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High share validation latency"
          description: "p95 latency is {{ $value }}s"
      # Channel saturation
      - alert: ChannelSaturation
        expr: sv2:channels:utilization > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Channel capacity nearly exhausted"
          description: "{{ $value | humanizePercentage }} of channels in use"
      # Template provider down
      - alert: TemplateProviderDown
        expr: rate(sv2_errors_total{component="pool", error_type="template_provider_timeout"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Template provider is timing out"
```
-
Here are some thoughts on how all these tools can be used together:
-
Hi @gimballock, thanks for opening this discussion and structuring it with all this information. It took me quite a bit to read everything, and I have to admit that there is so much here that it is a bit hard to get the juice out of it. That said, I think this is a good comprehensive overview of many possible ways in which we can expand the current monitoring capabilities of our applications. From a practical perspective, my take is that we should improve (and expand) the current
In addition to this, I think that we have a real need for all the type of

While I believe the other types of signals are something really valuable as well, I also think that the real need for them is yet to be defined. For example: what do we want to achieve from the instrumentation of our code-base needed for the latency metrics? Would it be to have the user analyze this data at runtime, or would it be for us to run some kind of "benches" on CI and avoid possible regressions in our apps?
-
Thanks for the thoughtful reply. I agree we should be strategic—maintenance burden is real for FOSS.

I agree that traffic is the most important signal. Traffic is foundational (you need to know what's flowing), errors are the next most actionable, latency helps diagnose root causes, and saturation matters for capacity planning. So focusing on traffic first makes sense.

Regarding latency metrics, the primary use case I see is operational debugging, not CI benchmarking. When something goes wrong in production (shares rejected, hashrate drops, miners disconnecting), latency data helps pinpoint where the problem is:
That said, I agree we don't need to build all of this upfront. Traffic metrics are the right starting point because they're the most immediately useful and easiest to reason about.

Regarding changes to the current traffic metrics: I have consolidated several of the channel metrics into one with multiple labels:
Similar labels were used to combine other metrics also: hashrate, connections, shares.

I have also added two new traffic metrics: bytes and messages transmitted.

One way to think about traffic metrics is as layers of the mining stack, like the OSI model:
The current metrics cover the middle two layers well (channels, connections, hashrate, shares). Bytes and messages would complete the picture at the bottom; blocks at the top.

One thing that I have found difficult, though, is enumeration of SV2-native miners, namely because you cannot easily distinguish between client channels that are proxies and those that are actual miners. But I might have a non-code fix for that.

Does this framing work? I'm happy to focus PRs on solidifying traffic metrics first and tackle the other signals incrementally as real needs emerge.
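To make the consolidation concrete, here is a rough sketch of what a single labelled channel gauge could look like (the metric and label names below are placeholders, not necessarily what ends up in the PR):

```rust
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// One gauge family instead of four separate channel gauges. Both labels are
// bounded (side: server|client, channel_type: extended|standard), so this adds
// at most four time series.
fn register_channel_gauge() -> Result<IntGaugeVec, prometheus::Error> {
    register_int_gauge_vec!(
        "sv2_channels",
        "Open channels by side and channel type",
        &["side", "channel_type"]
    )
}

// Usage sketch:
// channels.with_label_values(&["server", "extended"]).set(n as i64);
```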
-
Discussion: Expanding SV2 Metrics with the Four Golden Signals
Overview
I'd like to open a discussion about expanding our monitoring capabilities using the Four Golden Signals framework from the Google SRE book: Traffic, Saturation, Latency, and Errors.
To be clear: I'm not proposing we implement all of these metrics. Rather, I want to map out the full space of potentially valuable metrics we could add. The key insight is that by measuring different dimensions of the system, you don't just add information—you multiply diagnostic clarity, like going from 2D to 3D. Seeing both an error signal and a growing buffer depth provides far more confidence about what's happening than either one individually.
Imagine the diagnostic power of simultaneously observing:
- `mining.submit` latency spikes from 50ms to 800ms at p99

This correlated view immediately reveals: a network issue caused device reconnections, which saturated the translator's channel capacity, backing up the jobs buffer and degrading submit latency—ultimately causing miners to fail over to backup pools.
This multi-dimensional view enables sophisticated alerting based on combinations of signals—correlating error rates with response times, traffic spikes, and resource consumption.
Current State: Mostly Traffic Metrics
Most existing metrics fall into the Traffic category, measuring client interactions:
- `sv2_uptime_seconds`
- `sv2_server_channels_total`, `sv2_server_hashrate_total`, per-channel hashrate/shares
- `sv2_clients_total`, `sv2_client_hashrate_total`, per-channel metrics
- `sv1_clients_total`, `sv1_hashrate_total`

The Full Landscape: Potential Metrics by Signal Type
1. Additional Traffic Metrics (Counters)
- `pool_blocks_found_total`
- `jdc_jobs_declared_total`
- `tproxy_sv1_messages_total`
- `sv2_shares_rejected_total` / `sv1_shares_rejected_total`

Why Counters? These are monotonically increasing values. Using Gauges for counter-like data causes issues—if a connection resets and the count restarts at 1, a Gauge would interpret this as a negative change. Counters automatically handle session resets correctly, which is critical for reliable alerting on traffic patterns.
Note: The existing `*_shares_accepted_total` metrics may be vulnerable to this issue.
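A minimal sketch of the difference at the instrumentation level, using one of the proposed counters (the gauge variant is shown only as the anti-pattern; its name and `session_block_count` are hypothetical):

```rust
use prometheus::{register_gauge, register_int_counter, Gauge, IntCounter};

fn register_block_metrics() -> Result<(IntCounter, Gauge), prometheus::Error> {
    // Counter: only ever increments, so rate()/increase() stay correct even if a
    // session or connection resets its own bookkeeping.
    let blocks_found = register_int_counter!(
        "pool_blocks_found_total",
        "Blocks found by the pool"
    )?;
    // Anti-pattern: exporting a per-session count as a gauge. After a reconnect the
    // value drops back toward 1, which looks like a negative change.
    let blocks_gauge = register_gauge!(
        "pool_blocks_found_session",
        "Blocks found in the current session (do not do this)"
    )?;
    Ok((blocks_found, blocks_gauge))
}

// On a block-found event:
// blocks_found.inc();                           // cumulative, alert-safe
// blocks_gauge.set(session_block_count as f64); // resets with the session
```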
2. Saturation Metrics (Gauges)

Resource utilization for fixed-capacity systems:
- `sv2_memory_used_bytes`
- `pool_template_store_size` / `jdc_template_store_size`
- `pool_share_queue_depth`
- `jdc_pending_job_declarations`
- `tproxy_job_cache_size`
- `tproxy_vardiff_states_count`

These help detect gradual resource accumulation (e.g., memory leaks) and enable threshold-based alerting.
3. Latency Metrics (Histograms)
- `sv2_share_validation_latency_seconds`
- `sv2_job_distribution_latency_seconds`
- `sv2_message_processing_latency_seconds`
- `pool_block_submission_latency_seconds`
- `pool_tp_response_latency_seconds`
- `jdc_upstream_rtt_seconds` / `jdc_job_declaration_latency_seconds`
- `tproxy_upstream_rtt_seconds` / `tproxy_share_translation_latency_seconds`

Why Histograms? Latency only exists as discrete events—snapshotting when no event occurs yields meaningless data. Histograms enable percentile-based alerting:
This quantifies responsiveness degradation during CPU exhaustion or traffic spikes.
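As a sketch of the instrumentation side (assuming the `prometheus` crate; the bucket values here are just an example, the actual layout would follow whatever we settle on):

```rust
use prometheus::{register_histogram, Histogram};

// Register once at startup.
fn register_share_validation_latency() -> Result<Histogram, prometheus::Error> {
    register_histogram!(
        "sv2_share_validation_latency_seconds",
        "Time spent validating a submitted share",
        vec![0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
    )
}

// At the call site: one observation per validation event. The timer records the
// elapsed wall-clock time into the histogram when observe_duration() runs.
fn validate_with_timing(latency: &Histogram) {
    let timer = latency.start_timer();
    // ... run the actual share validation here ...
    timer.observe_duration();
}
```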
4. Error Metrics (Counters)
- `sv2_errors_total` (by type)
- `sv2_connection_errors_total` / `sv2_protocol_errors_total`
- `pool_tp_errors_total`
- `jdc_job_declaration_failures_total` / `jdc_upstream_errors_total`
- `tproxy_translation_errors_total` / `tproxy_sv1_protocol_errors_total`

While we have log streams, dedicated error counters surface rare issues in real-time alongside other signals. Protocol errors can indicate revenue-impacting issues like stale shares or difficulty misconfigurations—correlating these with specific miners or resource constraints accelerates resolution.
Implementation Considerations
Event-based metrics require passing the metrics context through the component hierarchy to message handlers and job processors—more invasive but necessary for accurate rate/percentile functions and complete data records.
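A rough sketch of what that threading could look like (all type and field names here are placeholders, not existing SRI types):

```rust
use std::sync::Arc;
use prometheus::{IntCounter, IntCounterVec};

// Placeholder container; the real struct would live in each app's metrics module.
struct Sv2Metrics {
    shares_accepted_total: IntCounter,
    shares_rejected_total: IntCounterVec, // labelled by a bounded reject reason
}

// The handler owns a cheap Arc clone, so every event site can record directly.
struct ShareHandler {
    metrics: Arc<Sv2Metrics>,
}

impl ShareHandler {
    fn handle_submit(&self, accepted: bool, reject_reason: &str) {
        if accepted {
            self.metrics.shares_accepted_total.inc();
        } else {
            self.metrics
                .shares_rejected_total
                .with_label_values(&[reject_reason])
                .inc();
        }
    }
}
```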
Questions for Discussion
Again, this is about identifying which subset of these metrics would give us the best multi-dimensional view of system health. The goal is strategic instrumentation, not comprehensive coverage.
Looking forward to your thoughts!