Flag metrics are binary indicators (raised/lowered) that represent whether a specific condition is met. They form the foundation for health metric expressions.
```yaml
flag_metrics:
  - name: "slow_response"
    service: "api-gateway"
    template:
      name: "api_latency"
    environments:
      - name: "production"
      - name: "staging"
        threshold: 1000
```

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Unique metric identifier within the service |
| `service` | string | Yes | Service this metric belongs to |
| `template` | object | Yes | Reference to a metric template |
| `environments` | array | Yes | Environment-specific configurations |
| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Template name (key in `metric_templates`) |
| `vars` | object | No | Additional template variables |
| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Environment name (must exist in `environments`) |
| `threshold` | number | No | Override threshold for this environment |
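The required/optional rules in the tables above can be checked mechanically. Below is a minimal, hypothetical validator sketch; the function name and error messages are illustrative, not part of the tool:

```python
def validate_flag_metric(flag: dict) -> list[str]:
    """Collect schema errors for one flag_metrics entry (illustrative only)."""
    errors = []
    # Top-level required properties
    for field in ("name", "service", "template", "environments"):
        if field not in flag:
            errors.append(f"missing required property: {field}")
    # template.name is required; template.vars is optional
    if "name" not in flag.get("template", {}):
        errors.append("template.name is required")
    # Each environment needs a name; threshold is an optional override
    for env in flag.get("environments", []):
        if "name" not in env:
            errors.append("each environment entry needs a name")
    return errors

flag = {
    "name": "slow_response",
    "service": "api-gateway",
    "template": {"name": "api_latency"},
    "environments": [{"name": "production"}, {"name": "staging", "threshold": 1000}],
}
print(validate_flag_metric(flag))  # → []
```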
- Query Execution: The template query is executed with variable substitution
- Value Comparison: The result is compared to the threshold using the operator
- Flag State: The flag is raised (`true`) or lowered (`false`) based on the comparison
```
Query Result: 750ms
Threshold: 500ms
Operator: gt (greater than)
Result: 750 > 500 = true → Flag RAISED
```
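The three evaluation steps can be sketched in Python. The `$environment`/`$service` placeholders match `string.Template` syntax; the function names and the hard-coded query result are assumptions for illustration, not the tool's actual implementation:

```python
from string import Template

# Step 1: Query Execution - render the template query with variables.
# (The backend call is faked here; a real system would run the rendered query.)
query = Template("stats.timers.api.$environment.$service.mean")
rendered = query.substitute(environment="production", service="api-gateway")

OPERATORS = {
    "gt": lambda value, threshold: value > threshold,
    "lt": lambda value, threshold: value < threshold,
    "eq": lambda value, threshold: value == threshold,
}

def evaluate_flag(value: float, op: str, threshold: float) -> bool:
    # Steps 2 and 3: compare the result to the threshold; True means RAISED.
    return OPERATORS[op](value, threshold)

query_result = 750  # pretend the rendered query returned 750ms

print(rendered)                                # → stats.timers.api.production.api-gateway.mean
print(evaluate_flag(query_result, "gt", 500))  # → True: flag RAISED
```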
Operators are defined in the template but affect flag behavior:
| Operator | Flag Raised When | Example |
|---|---|---|
| `gt` | value > threshold | Latency 750ms > 500ms threshold |
| `lt` | value < threshold | Success rate 85% < 90% threshold |
| `eq` | value == threshold | Error rate 100% == 100% threshold |
The threshold can be customized per environment:
```yaml
metric_templates:
  api_latency:
    query: "stats.timers.api.$environment.$service.mean"
    op: "gt"
    threshold: 500  # Default threshold

flag_metrics:
  - name: "slow_response"
    service: "api-gateway"
    template:
      name: "api_latency"
    environments:
      - name: "production"
        # Uses default 500ms threshold
      - name: "staging"
        threshold: 1000  # Override: staging is more lenient
      - name: "development"
        threshold: 2000  # Override: dev is even more lenient
```

Production vs Non-Production:
```yaml
environments:
  - name: "production"
    # Strict: 500ms
  - name: "staging"
    threshold: 750  # 50% more lenient
  - name: "development"
    threshold: 1500  # 3x more lenient
```

Regional Differences:

```yaml
environments:
  - name: "us-east"
    # Default threshold
  - name: "ap-southeast"
    threshold: 600  # Account for network latency
```

Flag metrics are internally referenced using the format: `{service}.{name}`
```yaml
flag_metrics:
  - name: "slow_response"     # → api-gateway.slow_response
    service: "api-gateway"
  - name: "error_rate_high"   # → api-gateway.error_rate_high
    service: "api-gateway"
```

This naming is important when referencing flags in health metric expressions.
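How an environment-level override falls back to the template default, and how the `{service}.{name}` key is built, can be sketched as follows (the function names are illustrative assumptions, not the tool's API):

```python
def resolve_threshold(template: dict, env: dict) -> float:
    # An environment-level threshold wins; otherwise use the template default.
    return env.get("threshold", template["threshold"])

def flag_key(flag: dict) -> str:
    # Flags are referenced as {service}.{name} in health metric expressions.
    return f"{flag['service']}.{flag['name']}"

template = {"query": "stats.timers.api.$environment.$service.mean",
            "op": "gt", "threshold": 500}

print(resolve_threshold(template, {"name": "production"}))                  # → 500
print(resolve_threshold(template, {"name": "staging", "threshold": 1000}))  # → 1000
print(flag_key({"name": "slow_response", "service": "api-gateway"}))        # → api-gateway.slow_response
```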
```yaml
flag_metrics:
  - name: "api_slow"
    service: "user-service"
    template:
      name: "api_latency"
    environments:
      - name: "production"
```

```yaml
flag_metrics:
  - name: "success_rate_low"
    service: "payment-service"
    template:
      name: "api_success_rate"
    environments:
      - name: "production"
        threshold: 99.9  # Strict SLA
      - name: "staging"
        threshold: 95.0  # Relaxed for testing
      - name: "development"
        threshold: 90.0  # Very relaxed
```

```yaml
flag_metrics:
  # API performance flags
  - name: "api_slow"
    service: "compute"
    template:
      name: "api_latency"
    environments:
      - name: "production"
  - name: "api_success_rate_low"
    service: "compute"
    template:
      name: "api_success_rate"
    environments:
      - name: "production"
  # API availability flag
  - name: "api_down"
    service: "compute"
    template:
      name: "api_down"
    environments:
      - name: "production"
```

```yaml
metric_templates:
  api_latency:
    query: "stats.timers.api.$environment.$service.mean"
    op: "gt"
    threshold: 500
  api_success_rate:
    query: "asPercent(stats.counters.api.$environment.$service.2xx, stats.counters.api.$environment.$service.total)"
    op: "lt"
    threshold: 95
  api_down:
    query: "asPercent(stats.counters.api.$environment.$service.5xx, stats.counters.api.$environment.$service.total)"
    op: "eq"
    threshold: 100

flag_metrics:
  - name: "api_slow"
    service: "identity"
    template:
      name: "api_latency"
    environments:
      - name: "production"
      - name: "staging"
        threshold: 1000
  - name: "api_success_rate_low"
    service: "identity"
    template:
      name: "api_success_rate"
    environments:
      - name: "production"
      - name: "staging"
  - name: "api_down"
    service: "identity"
    template:
      name: "api_down"
    environments:
      - name: "production"
      - name: "staging"
```

For large deployments, organize flag metrics in separate files:
```
conf.d/
├── 10-compute-services.yaml
├── 20-network-services.yaml
└── 30-storage-services.yaml
```

```yaml
# conf.d/10-compute-services.yaml
flag_metrics:
  - name: "api_slow"
    service: "nova"
    template:
      name: "api_latency"
    environments:
      - name: "production"
  # ... more compute service flags
```

Use consistent naming patterns across services:
```yaml
# Good: Consistent pattern
- name: "api_slow"
- name: "api_success_rate_low"
- name: "api_down"

# Avoid: Inconsistent naming
- name: "slow"
- name: "low_success"
- name: "down"
```

Ensure every flag metric covers all required environments:
```yaml
flag_metrics:
  - name: "api_slow"
    service: "compute"
    template:
      name: "api_latency"
    environments:
      - name: "production"
      - name: "staging"
      - name: "development"  # Don't forget non-prod!
```

Add comments explaining threshold values:
```yaml
flag_metrics:
  - name: "api_slow"
    service: "compute"
    template:
      name: "api_latency"
    environments:
      - name: "production"
        # SLA: 99th percentile < 500ms
      - name: "staging"
        threshold: 1000  # Shared infrastructure, higher latency expected
```

- Metric Templates - Template configuration
- Health Metrics - Using flags in expressions
- Environments - Environment configuration
- Examples - Complete configuration samples