New monitoring metrics #198
Replies: 4 comments
-
Adjusted plan for memory-stable monitoring

Final specification for production-ready metrics with bounded cardinality and lifecycle cleanup.

1. Metric Definitions

1.1 Common Metrics (All Components)

Gauges (Snapshot Metrics)
Derived metrics (via PromQL):
Cleanup: None needed - gauges are reset and repopulated on every scrape.

Counters (Event-Driven)
Cleanup: None needed - labels are bounded enums.

Histograms (Latency Metrics)
Bucket configuration:

```rust
// Reduced from 14 to 7 buckets (50% memory savings)
const LATENCY_BUCKETS: &[f64] = &[
0.001, // 1ms
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
0.5, // 500ms
1.0, // 1s
5.0, // 5s
];
```

Cleanup: None needed - labels are bounded.
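As a sketch of how the shared bucket constant above could be wired into a labelled histogram (assuming the `prometheus` crate used elsewhere in this plan; the metric name is one of the latency metrics proposed in this discussion, and the `message_type` label is an assumption, not final):

```rust
use prometheus::{register_histogram_vec, HistogramVec};

// Registers one of the common latency histograms with the shared bucket layout.
// 7 buckets keep the per-label-value cost at 7 time series plus _sum and _count.
fn register_message_latency() -> Result<HistogramVec, prometheus::Error> {
    register_histogram_vec!(
        "sv2_message_processing_latency_seconds",
        "Time spent processing one inbound SV2 message",
        &["message_type"],       // bounded: SV2 message types are a fixed set
        LATENCY_BUCKETS.to_vec() // the 7-bucket constant defined above
    )
}
```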
1.2 Pool-Specific Metrics

Gauges

Counters
Histograms
Bucket configuration for block operations:

```rust
const BLOCK_LATENCY_BUCKETS: &[f64] = &[
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
0.5, // 500ms
1.0, // 1s
5.0, // 5s
10.0, // 10s
];
```

1.3 JDC-Specific Metrics

Gauges
Counters
Histograms
1.4 Tproxy-Specific Metrics

Gauges
Counters
Histograms
Fast operation buckets (microsecond scale):

```rust
const TRANSLATION_LATENCY_BUCKETS: &[f64] = &[
0.0001, // 100μs
0.001, // 1ms
0.005, // 5ms
0.01, // 10ms
0.05, // 50ms
0.1, // 100ms
];
```

2. Lifecycle Cleanup Logic

2.1 Gauge Cleanup (Current Implementation)

Strategy: Reset and repopulate on every scrape.

```rust
async fn handle_prometheus_metrics(State(state): State<ServerState>) -> Response {
// Update system metrics
let uptime_secs = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap_or_default()
.as_secs() - state.start_time;
state.metrics.sv2_uptime_seconds.set(uptime_secs as f64);
// Aggregate metrics are set directly (no labels to clean up)
state.metrics.channels_active.set(get_active_channel_count() as f64);
state.metrics.hashrate_total.set(get_total_hashrate() as f64);
// Encode and return metrics
let encoder = TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = vec![];
encoder.encode(&metric_families, &mut buffer).unwrap();
Response::builder()
.status(200)
.header(CONTENT_TYPE, encoder.format_type())
.body(Body::from(buffer))
.unwrap()
}
```

Key point: No label cleanup needed - all gauges use aggregate values or no labels.

2.2 Counter/Histogram Cleanup (Not Needed)

Strategy: Use bounded labels only - no cleanup required. All counters and histograms use:
Example - bounded enum ensures no leak:

```rust
// This will never create more than 5 time series
metrics.shares_rejected_total
.with_label_values(&[reason.as_str()]) // reason is enum with 5 variants
.inc();
// This will never create more than 180 time series
// 3 components × 20 error types × 3 severities = 180
metrics.errors_total
.with_label_values(&[
component.as_str(), // 3 variants
error_type.as_str(), // ~20 variants per component
severity.as_str(), // 3 variants
])
    .inc();
```
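For illustration, a bounded reason enum might look like this (the variant names are hypothetical; the point is that `as_str()` can only ever produce five label values):

```rust
// Hypothetical enum backing the `reason` label above. Because the set of
// variants is fixed at compile time, the label can never grow unbounded.
#[derive(Clone, Copy, Debug)]
pub enum RejectReason {
    DifficultyTooLow,
    StaleJob,
    DuplicateShare,
    InvalidJobId,
    Other,
}

impl RejectReason {
    pub fn as_str(&self) -> &'static str {
        match self {
            RejectReason::DifficultyTooLow => "difficulty-too-low",
            RejectReason::StaleJob => "stale-job",
            RejectReason::DuplicateShare => "duplicate-share",
            RejectReason::InvalidJobId => "invalid-job-id",
            RejectReason::Other => "other",
        }
    }
}
```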
2.3 Optional: User-Level Metrics (Feature-Gated)

Only for private mining operations with stable user sets.

```toml
# Cargo.toml
[features]
user-level-metrics = []
```

```rust
// Code
#[cfg(feature = "user-level-metrics")]
pub struct UserMetrics {
pub shares_accepted_by_user: IntCounterVec,
pub hashrate_by_user: GaugeVec,
}
#[cfg(feature = "user-level-metrics")]
impl UserMetrics {
pub fn new() -> Result<Self, prometheus::Error> {
tracing::warn!(
"user-level-metrics enabled: unbounded memory growth if used in public pool"
);
Ok(Self {
shares_accepted_by_user: register_int_counter_vec!(
"sv2_shares_accepted_by_user_total",
"Shares accepted per user (PRIVATE OPS ONLY)",
&["user_identity"]
)?,
hashrate_by_user: register_gauge_vec!(
"sv2_hashrate_by_user",
"Hashrate per user (PRIVATE OPS ONLY)",
&["user_identity"]
)?,
})
}
}
// Lifecycle cleanup for user metrics (with grace period)
struct UserCleanupTracker {
disconnected_users: HashMap<String, Instant>,
grace_period: Duration,
}
impl UserCleanupTracker {
async fn cleanup_expired(&mut self, metrics: &UserMetrics) {
let now = Instant::now();
let expired: Vec<_> = self.disconnected_users
.iter()
.filter(|(_, disconnected_at)| {
now.duration_since(**disconnected_at) > self.grace_period
})
.map(|(user, _)| user.clone())
.collect();
for user in &expired {
let _ = metrics.shares_accepted_by_user.remove_label_values(&[user]);
let _ = metrics.hashrate_by_user.remove_label_values(&[user]);
self.disconnected_users.remove(user);
tracing::debug!("Cleaned up metrics for user {} after grace period", user);
}
}
}
```
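A sketch of how the tracker could be driven, assuming a tokio runtime and that disconnect events can reach it; `run_user_metric_cleanup`, `on_user_disconnected`, and the 60-second tick are illustrative choices, not part of the spec (and would sit behind the same `user-level-metrics` feature gate):

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

// Periodic sweep: every minute, drop metrics for users whose grace period expired.
async fn run_user_metric_cleanup(
    tracker: Arc<Mutex<UserCleanupTracker>>,
    user_metrics: Arc<UserMetrics>,
) {
    let mut tick = tokio::time::interval(Duration::from_secs(60));
    loop {
        tick.tick().await;
        tracker.lock().await.cleanup_expired(&user_metrics).await;
    }
}

// Hook called from the connection teardown path; starts the grace period.
async fn on_user_disconnected(tracker: &Mutex<UserCleanupTracker>, user_identity: &str) {
    tracker
        .lock()
        .await
        .disconnected_users
        .insert(user_identity.to_string(), Instant::now());
}
```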
3. Memory Budget

3.1 Baseline (Without User-Level Metrics)

Breakdown:
3.2 With User-Level Metrics (Private Ops)
Total for 50-user private operation: ~40 KB (baseline 30 KB + user metrics 10 KB).

4. Prometheus Recording Rules

Pre-compute common aggregations for faster dashboard queries.

```yaml
# /etc/prometheus/rules/sv2_rules.yml
groups:
  - name: sv2_derived_metrics
    interval: 15s
    rules:
      # Derived totals from granular metrics
      - record: sv2:server_channels:total
        expr: sv2_server_channels_extended + sv2_server_channels_standard
      - record: sv2:client_channels:total
        expr: sv2_client_channels_extended + sv2_client_channels_standard
      - record: sv2:channels:active
        expr: |
          sv2_server_channels_extended + sv2_server_channels_standard +
          sv2_client_channels_extended + sv2_client_channels_standard
      - record: sv2:hashrate:total
        expr: sv2_server_hashrate + sv2_client_hashrate
  - name: sv2_traffic
    interval: 30s
    rules:
      # Share rates
      - record: sv2:shares_accepted:rate5m
        expr: rate(sv2_shares_accepted_total[5m])
      - record: sv2:shares_rejected:rate5m
        expr: rate(sv2_shares_rejected_total[5m])
      # Share rejection ratio
      - record: sv2:shares:rejection_ratio
        expr: |
          rate(sv2_shares_rejected_total[5m])
          /
          (rate(sv2_shares_accepted_total[5m]) + rate(sv2_shares_rejected_total[5m]))
  - name: sv2_latency
    interval: 30s
    rules:
      # Pre-compute percentiles (expensive queries)
      - record: sv2:share_validation_latency:p50
        expr: histogram_quantile(0.50, rate(sv2_share_validation_latency_seconds_bucket[5m]))
      - record: sv2:share_validation_latency:p95
        expr: histogram_quantile(0.95, rate(sv2_share_validation_latency_seconds_bucket[5m]))
      - record: sv2:share_validation_latency:p99
        expr: histogram_quantile(0.99, rate(sv2_share_validation_latency_seconds_bucket[5m]))
  - name: sv2_errors
    interval: 30s
    rules:
      # Error rate by severity
      - record: sv2:errors:rate5m:by_severity
        expr: sum by (severity) (rate(sv2_errors_total[5m]))
      # Error rate by component
      - record: sv2:errors:rate5m:by_component
        expr: sum by (component) (rate(sv2_errors_total[5m]))
      # Total error rate
      - record: sv2:errors:rate5m
        expr: sum(rate(sv2_errors_total[5m]))
  - name: sv2_saturation
    interval: 30s
    rules:
      # Channel utilization
      - record: sv2:channels:utilization
        expr: sv2:channels:active / sv2_channels_max
      # Memory utilization (if limit is configured)
      - record: sv2:memory:utilization
        expr: sv2_memory_used_bytes / sv2_memory_limit_bytes
```

Load into Prometheus:

```yaml
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/sv2_rules.yml
```

5. Example Alert Rules

```yaml
# /etc/prometheus/alerts/sv2_alerts.yml
groups:
  - name: sv2_critical
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sv2:errors:rate5m{severity="error"} > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.component }}"
          description: "Error rate is {{ $value }} errors/sec"
      # High share validation latency
      - alert: HighShareValidationLatency
        expr: sv2:share_validation_latency:p95 > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High share validation latency"
          description: "p95 latency is {{ $value }}s"
      # Channel saturation
      - alert: ChannelSaturation
        expr: sv2:channels:utilization > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Channel capacity nearly exhausted"
          description: "{{ $value | humanizePercentage }} of channels in use"
      # Template provider down
      - alert: TemplateProviderDown
        expr: rate(sv2_errors_total{component="pool", error_type="template_provider_timeout"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Template provider is timing out"
```
-
Here are some thoughts on how all these tools can be used together:
-
Hi @gimballock, thanks for opening this discussion and structuring it with all this information. It took me quite a bit to read everything, and I have to admit that there is so much here that it is a bit hard to get the juice out of it. That said, I think this is a good comprehensive overview of many possible ways in which we can expand the current monitoring capabilities of our applications. From a practical perspective, my take is that we should improve (and expand) the current
In addition to this, I think that we have a real need for all the type of

While I believe the other types of signals are something really valuable as well, I also think that the real need for them is yet to be defined. For example: what do we want to achieve from the instrumentation of our code-base needed for the latency metrics? Would it be to have the user analyze this data at runtime, or would it be for us to run some kind of "benches" on CI and avoid possible regressions in our apps?
-
Thanks for the thoughtful reply. I agree we should be strategic—maintenance burden is real for FOSS.

I agree that traffic is the most important signal. Traffic is foundational (you need to know what's flowing), errors are the next most actionable, latency helps diagnose root causes, and saturation matters for capacity planning. So focusing on traffic first makes sense.

Regarding latency metrics, the primary use case I see is operational debugging, not CI benchmarking. When something goes wrong in production (shares rejected, hashrate drops, miners disconnecting), latency data helps pinpoint where the problem is:
That said, I agree we don't need to build all of this upfront. Traffic metrics are the right starting point because they're the most immediately useful and easiest to reason about.

Regarding changes to the current traffic metrics: I have consolidated several of the channel metrics into one with multiple labels:
Similar labels were used to combine other metrics also: hashrate, connections, shares.

I have also added two new traffic metrics: bytes and messages transmitted.

One way to think about traffic metrics is as layers of the mining stack, like the OSI model:
The current metrics cover the middle two layers well (channels, connections, hashrate, shares). Bytes and messages would complete the picture at the bottom; blocks at the top.

One thing that I have found difficult, though, is enumeration of SV2-native miners, namely because you cannot easily distinguish between client channels that are proxies and those that are actual miners. But I might have a non-code fix for that.

Does this framing work? I'm happy to focus PRs on solidifying traffic metrics first and tackle the other signals incrementally as real needs emerge.
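To make the consolidation concrete, here is a rough sketch of what a single labelled channel gauge could look like (the metric and label names below are placeholders, not necessarily what ends up in the PR):

```rust
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// One gauge family instead of four separate channel gauges. Both labels are
// bounded (side: server|client, channel_type: extended|standard), so this adds
// at most four time series.
fn register_channel_gauge() -> Result<IntGaugeVec, prometheus::Error> {
    register_int_gauge_vec!(
        "sv2_channels",
        "Open channels by side and channel type",
        &["side", "channel_type"]
    )
}

// Usage sketch:
// channels.with_label_values(&["server", "extended"]).set(n as i64);
```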
-
Discussion: Expanding SV2 Metrics with the Four Golden Signals
Overview
I'd like to open a discussion about expanding our monitoring capabilities using the Four Golden Signals framework from the Google SRE book: Traffic, Saturation, Latency, and Errors.
To be clear: I'm not proposing we implement all of these metrics. Rather, I want to map out the full space of potentially valuable metrics we could add. The key insight is that by measuring different dimensions of the system, you don't just add information—you multiply diagnostic clarity, like going from 2D to 3D. Seeing both an error signal and a growing buffer depth provides far more confidence about what's happening than either one individually.
Imagine the diagnostic power of simultaneously observing:
- `mining.submit` latency spikes from 50ms to 800ms at p99

This correlated view immediately reveals: a network issue caused device reconnections, which saturated the translator's channel capacity, backing up the jobs buffer and degrading submit latency—ultimately causing miners to fail over to backup pools.
This multi-dimensional view enables sophisticated alerting based on combinations of signals—correlating error rates with response times, traffic spikes, and resource consumption.
Current State: Mostly Traffic Metrics
Most existing metrics fall into the Traffic category, measuring client interactions:
- `sv2_uptime_seconds`
- `sv2_server_channels_total`, `sv2_server_hashrate_total`, per-channel hashrate/shares
- `sv2_clients_total`, `sv2_client_hashrate_total`, per-channel metrics
- `sv1_clients_total`, `sv1_hashrate_total`

The Full Landscape: Potential Metrics by Signal Type
1. Additional Traffic Metrics (Counters)
- `pool_blocks_found_total`
- `jdc_jobs_declared_total`
- `tproxy_sv1_messages_total`
- `sv2_shares_rejected_total` / `sv1_shares_rejected_total`

Why Counters? These are monotonically increasing values. Using Gauges for counter-like data causes issues—if a connection resets and the count restarts at 1, a Gauge would interpret this as a negative change. Counters automatically handle session resets correctly, which is critical for reliable alerting on traffic patterns.
Note: The existing `*_shares_accepted_total` metrics may be vulnerable to this issue.
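A minimal sketch of the difference at the instrumentation level, using one of the proposed counters (the gauge variant is shown only as the anti-pattern; its name and `session_block_count` are hypothetical):

```rust
use prometheus::{register_gauge, register_int_counter, Gauge, IntCounter};

fn register_block_metrics() -> Result<(IntCounter, Gauge), prometheus::Error> {
    // Counter: only ever increments, so rate()/increase() stay correct even if a
    // session or connection resets its own bookkeeping.
    let blocks_found = register_int_counter!(
        "pool_blocks_found_total",
        "Blocks found by the pool"
    )?;
    // Anti-pattern: exporting a per-session count as a gauge. After a reconnect the
    // value drops back toward 1, which looks like a negative change.
    let blocks_gauge = register_gauge!(
        "pool_blocks_found_session",
        "Blocks found in the current session (do not do this)"
    )?;
    Ok((blocks_found, blocks_gauge))
}

// On a block-found event:
// blocks_found.inc();                           // cumulative, alert-safe
// blocks_gauge.set(session_block_count as f64); // resets with the session
```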
2. Saturation Metrics (Gauges)

Resource utilization for fixed-capacity systems:
- `sv2_memory_used_bytes`
- `pool_template_store_size` / `jdc_template_store_size`
- `pool_share_queue_depth`
- `jdc_pending_job_declarations`
- `tproxy_job_cache_size`
- `tproxy_vardiff_states_count`

These help detect gradual resource accumulation (e.g., memory leaks) and enable threshold-based alerting.
3. Latency Metrics (Histograms)
- `sv2_share_validation_latency_seconds`
- `sv2_job_distribution_latency_seconds`
- `sv2_message_processing_latency_seconds`
- `pool_block_submission_latency_seconds`
- `pool_tp_response_latency_seconds`
- `jdc_upstream_rtt_seconds` / `jdc_job_declaration_latency_seconds`
- `tproxy_upstream_rtt_seconds` / `tproxy_share_translation_latency_seconds`

Why Histograms? Latency only exists as discrete events—snapshotting when no event occurs yields meaningless data. Histograms enable percentile-based alerting:
This quantifies responsiveness degradation during CPU exhaustion or traffic spikes.
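As a sketch of the instrumentation side (assuming the `prometheus` crate; the bucket values here are just an example, the actual layout would follow whatever we settle on):

```rust
use prometheus::{register_histogram, Histogram};

// Register once at startup.
fn register_share_validation_latency() -> Result<Histogram, prometheus::Error> {
    register_histogram!(
        "sv2_share_validation_latency_seconds",
        "Time spent validating a submitted share",
        vec![0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
    )
}

// At the call site: one observation per validation event. The timer records the
// elapsed wall-clock time into the histogram when observe_duration() runs.
fn validate_with_timing(latency: &Histogram) {
    let timer = latency.start_timer();
    // ... run the actual share validation here ...
    timer.observe_duration();
}
```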
4. Error Metrics (Counters)
- `sv2_errors_total` (by type)
- `sv2_connection_errors_total` / `sv2_protocol_errors_total`
- `pool_tp_errors_total`
- `jdc_job_declaration_failures_total` / `jdc_upstream_errors_total`
- `tproxy_translation_errors_total` / `tproxy_sv1_protocol_errors_total`

While we have log streams, dedicated error counters surface rare issues in real-time alongside other signals. Protocol errors can indicate revenue-impacting issues like stale shares or difficulty misconfigurations—correlating these with specific miners or resource constraints accelerates resolution.
Implementation Considerations
Event-based metrics require passing the metrics context through the component hierarchy to message handlers and job processors—more invasive but necessary for accurate rate/percentile functions and complete data records.
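A rough sketch of what that threading could look like (all type and field names here are placeholders, not existing SRI types):

```rust
use std::sync::Arc;
use prometheus::{IntCounter, IntCounterVec};

// Placeholder container; the real struct would live in each app's metrics module.
struct Sv2Metrics {
    shares_accepted_total: IntCounter,
    shares_rejected_total: IntCounterVec, // labelled by a bounded reject reason
}

// The handler owns a cheap Arc clone, so every event site can record directly.
struct ShareHandler {
    metrics: Arc<Sv2Metrics>,
}

impl ShareHandler {
    fn handle_submit(&self, accepted: bool, reject_reason: &str) {
        if accepted {
            self.metrics.shares_accepted_total.inc();
        } else {
            self.metrics
                .shares_rejected_total
                .with_label_values(&[reject_reason])
                .inc();
        }
    }
}
```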
Questions for Discussion
Again, this is about identifying which subset of these metrics would give us the best multi-dimensional view of system health. The goal is strategic instrumentation, not comprehensive coverage.
Looking forward to your thoughts!