The current 10 ms upper bound on GoFr’s SQL query histogram quantiles is great for ultra-fast query visibility (for apps serving millions of requests per minute), but it compresses all slower queries into the same top bucket, which makes real-world latency analysis difficult — especially in production workloads where 100 ms–5 s queries are normal for analytical or transactional DBs.
⚙️ Why 10 ms was used initially
- It was designed to detect micro-latency regressions in high-throughput services (like internal GoFr benchmarks).
- Early use cases focused on microservices hitting in-memory or well-indexed SQL queries where anything >10 ms was an anomaly.
- It optimized for Prometheus cardinality — fewer histogram buckets = smaller metrics footprint.
 
❌ Why that’s limiting in real workloads
- Queries taking 20 ms, 200 ms, or 2 s all fall into the same bucket — you lose resolution.
- Makes it impossible to differentiate between “slightly slow” and “critical” queries.
- You can’t correlate query latency with API request latency effectively.
- Makes P95/P99-based alerting thresholds misleading, since quantile estimates are capped near the 10 ms top bucket.
 
✅ Recommended Fix
Adopt a wider histogram range with logarithmic or percentile-style bucket spacing.
Example bucket set for SQL histograms (a registration sketch follows the list below):

```go
[]float64{
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
}
```

This covers:
- Millisecond-level detail for fast queries
- Smooth spread for mid-range queries
- Full visibility up to 60 seconds for extreme cases
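
For illustration, here is a minimal sketch of registering such a bucket set with prometheus/client_golang directly (not GoFr's internal API); the metric name, the label, and the `timedQuery` helper are assumptions for the sake of the example, not GoFr's actual wiring:

```go
package sqlmetrics

import (
	"database/sql"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// sqlQueryDuration uses the wider bucket spread proposed above.
// Metric name and label are illustrative, not GoFr's real ones.
var sqlQueryDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "sql_query_duration_seconds",
		Help: "Duration of SQL queries in seconds.",
		Buckets: []float64{
			0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60,
		},
	},
	[]string{"query_type"},
)

func init() {
	prometheus.MustRegister(sqlQueryDuration)
}

// timedQuery is a hypothetical wrapper that records how long a query took.
func timedQuery(db *sql.DB, queryType, query string, args ...any) (*sql.Rows, error) {
	start := time.Now()
	rows, err := db.Query(query, args...)
	sqlQueryDuration.WithLabelValues(queryType).Observe(time.Since(start).Seconds())
	return rows, err
}
```

With this spread, a 200 ms query and a 2 s query land in different buckets, so histogram_quantile can actually tell them apart.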
 
🧩 Dynamic Adjustment (Could be an option)
Instead of hardcoding buckets, expose them via configuration — for example:

```go
app.Metrics.SetHistogramBuckets("sql_query_duration_seconds", []float64{
    0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60,
})
```

You could tie this to (see the sketch after this list):
- Datasource type (MySQL, Postgres, BigQuery, etc.)
- Request timeout (e.g., if `REQUEST_TIMEOUT=30s`, the upper bound could be 30 s)
- Environment (dev → short, prod → wide)
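
As a rough sketch of the configuration-driven idea, buckets could be read from an environment variable with a sensible fallback; the variable name SQL_HISTOGRAM_BUCKETS and the parsing helper below are illustrative assumptions, and SetHistogramBuckets is the API proposed above, not something GoFr exposes today:

```go
package config

import (
	"os"
	"strconv"
	"strings"
)

// defaultSQLBuckets is the wider spread suggested in this issue.
var defaultSQLBuckets = []float64{
	0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60,
}

// sqlBucketsFromEnv parses a comma-separated list of bucket boundaries
// (in seconds) from SQL_HISTOGRAM_BUCKETS (hypothetical variable name)
// and returns the default spread if the value is missing or malformed.
func sqlBucketsFromEnv() []float64 {
	raw := os.Getenv("SQL_HISTOGRAM_BUCKETS")
	if raw == "" {
		return defaultSQLBuckets
	}

	parts := strings.Split(raw, ",")
	buckets := make([]float64, 0, len(parts))
	for _, p := range parts {
		v, err := strconv.ParseFloat(strings.TrimSpace(p), 64)
		if err != nil {
			return defaultSQLBuckets // malformed entry: keep the safe default
		}
		buckets = append(buckets, v)
	}
	return buckets
}
```

At startup GoFr could then call the proposed app.Metrics.SetHistogramBuckets("sql_query_duration_seconds", sqlBucketsFromEnv()), and the same idea extends to deriving the top bucket from REQUEST_TIMEOUT or the environment.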
 
📊 Impact
- Slightly higher Prometheus storage footprint (more buckets)
- Far more actionable insights:
  - 95th percentile query time per DB
  - Breakdown of fast vs slow queries
  - Easier detection of query regressions after deployments
 
 