Description
Is your feature request related to a problem? Please describe
Related to: [RFC] Search Performance and Resiliency #8879, and [META] Query Insights Framework #11522
Current State
Query Insights (#11522) successfully identifies which queries are slow through:
- Top N queries by latency (Top N Queries by latency implementation #11904) ✅
- Query-level CPU and memory tracking (Query-level resource usages tracking #13172) ✅
- Export capabilities for external monitoring (Query insights exporters implementation #12982) ✅
What's tracked today: Query phase, Fetch phase, Can_match phase durations
The Problem: Operators at times observe that total search latency significantly exceeds the sum of the reported phase durations.
Production Example:
Total search latency: 150ms
├─ query_phase: 40ms (27%)
├─ fetch_phase: 15ms (10%)
├─ can_match_phase: 5ms (3%)
└─ UNACCOUNTED: 90ms (60%)
The Pain Point
In the production example above, 60% of the latency is invisible. Operators cannot answer:
- Why is my query taking 150ms when phases only account for 60ms?
- Is it queue saturation? Network delays? Coordination overhead?
- Where should I optimize first?
Current workflow (without breakdown):
- See 150ms query in Query Insights Top N
- Collect and analyze resource utilization metrics, thread dumps, flame graphs, or JFR profiles (hours of work requiring deep expertise)
- Experiment with config/sharding/setup/query changes (trial and error)
- Hope the change helps
Impact: Without clarity on which segment of the request workflow contributes most to search latency, performance optimization becomes guesswork. MTTR for latency issues is measured in days or weeks, and resolution at times requires tenured search developers to identify and address the problem.
Describe the solution you'd like
Proposed Changes
Add comprehensive latency breakdown to Query Insights showing exactly where time is spent:
What We'll Track
Phases (enhanced existing): Query, Fetch, Can_match, DFS, Suggest, Rewrite, Expand
Aggregations (new): Aggregation execution (per-shard), Aggregation reduce (coordinator), Aggregation type breakdown (optional)
Coordination (new): Request parsing, shard routing, response aggregation, reduce phase, result processing
Transport (new): Network round-trip time, request serialization/deserialization, response serialization/deserialization
Queuing (new): Search queue wait, get/fetch queue wait, thread scheduling
Cache (new): Query cache lookup/write, request cache lookup/write, shard request cache, fielddata operations
Concurrency (new): Lock acquisition, shard lock contention, circuit breaker checks
Concurrent Segment Search (new, if enabled): Slice creation and distribution, Slice execution scheduling, Per-slice execution time (min, max, avg, variance), Slice result aggregation, Thread pool coordination overhead
Cluster State & Metadata (new): Cluster state lookup, index metadata access, mapping lookup
Plugins & Extensions (new): Security plugin overhead, custom plugin processing, extension processing
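The categories above could be captured with a hierarchical per-request timer keyed by dotted segment names. A minimal sketch of that idea (the `LatencyBreakdown` class and segment names are illustrative, not the actual plugin API):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyBreakdown:
    """Hypothetical per-request recorder, keyed by dotted segment names
    matching the categories above (e.g. "queuing.search_queue_wait")."""

    def __init__(self):
        self.segments = defaultdict(float)  # segment name -> elapsed ms

    def record(self, segment, elapsed_ms):
        # Accumulate externally measured time (e.g. queue wait reported by the pool).
        self.segments[segment] += elapsed_ms

    @contextmanager
    def timed(self, segment):
        # Measure an inline code region and attribute it to one segment.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(segment, (time.perf_counter() - start) * 1000.0)

bd = LatencyBreakdown()
bd.record("queuing.search_queue_wait", 35.0)
with bd.timed("coordination.request_parsing"):
    pass  # request parsing would happen here
```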
Prioritization for Phased Implementation
When implementing in phases, the priority order can be as follows:
- Phase 1 (Must-Have): Coordination, Queuing, Concurrent Segment Search
- Phase 2 (High Value): Cache operations, Concurrency, Cluster state & metadata
- Phase 3 (Nice-to-Have): Plugins & extensions, Transport, Phases (enhanced tracking)
Query Insights API - Before and After Examples
API Request
GET /_insights/top_queries?type=latency&size=10&from=2026-02-12T00:00:00Z&to=2026-02-12T23:59:59Z
Example 1: Query WITHOUT Aggregations
BEFORE: Current Query Insights Response
What you see today (limited visibility):
{
"top_queries": [
{
"timestamp": 1707764400000,
"node_id": "node-1",
"indices": ["content"],
"search_type": "query_then_fetch",
"total_shards": 220,
"source": {
"query": {
"bool": {
"must": [
{"match": {"content": "search term"}},
{"exists": {"field": "category"}}
]
}
},
"size": 100,
"from": 0
},
"took": 150,
"phase_latency_breakdown": {
"query": 40,
"fetch": 15,
"can_match": 5
},
"resource_usages": {
"cpu": 45,
"memory": 1024000
},
"user": "admin",
"labels": {
"team": "analytics",
"priority": "high"
}
}
]
}
Problem: You see 150ms total latency with only 60ms accounted for in phases (query: 40ms, fetch: 15ms, can_match: 5ms). 90ms (60%) is unaccounted for. You know the query is slow, but not why.
The Gap: Where is the missing 90ms?
- Queue wait times❓
- Network delays❓
- Coordination overhead❓
- Cache operations❓
- Concurrent segment search overhead❓
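The gap itself is straightforward to compute from today's response; a short sketch using the values from the example above:

```python
# Values copied from the BEFORE response above.
entry = {
    "took": 150,
    "phase_latency_breakdown": {"query": 40, "fetch": 15, "can_match": 5},
}
accounted = sum(entry["phase_latency_breakdown"].values())
unaccounted = entry["took"] - accounted
pct = round(100 * unaccounted / entry["took"])
print(f"unaccounted: {unaccounted}ms ({pct}%)")  # unaccounted: 90ms (60%)
```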
AFTER: Enhanced Query Insights Response
What you'll see (improved visibility):
{
"top_queries": [
{
"timestamp": 1707764400000,
"node_id": "node-1",
"indices": ["content"],
"search_type": "query_then_fetch",
"total_shards": 220,
"source": {
"query": {
"bool": {
"must": [
{"match": {"content": "search term"}},
{"exists": {"field": "category"}}
]
}
},
"size": 100,
"from": 0
},
"took": 150,
"resource_usages": {
"cpu": 45,
"memory": 1024000
},
"user": "admin",
"labels": {
"team": "analytics",
"priority": "high"
},
"latency_breakdown": {
"phases": {
"query": 40,
"fetch": 15,
"can_match": 5,
"dfs": 0,
"suggest": 0,
"rewrite": 2,
"expand": 1
},
"aggregations": {
"enabled": false,
"aggregation_execution": 0,
"aggregation_reduce": 0
},
"coordination": {
"request_parsing": 2,
"shard_routing": 3,
"response_aggregation": 6,
"reduce_phase": 4,
"result_processing": 3
},
"transport": {
"network_roundtrip": 25,
"request_serialization": 2,
"request_deserialization": 1,
"response_serialization": 2,
"response_deserialization": 2
},
"queuing": {
"search_queue_wait": 35,
"get_fetch_queue_wait": 8,
"thread_scheduling": 2
},
"cache": {
"query_cache_lookup": 1,
"query_cache_write": 0,
"request_cache_lookup": 2,
"request_cache_write": 0,
"shard_request_cache": 1,
"fielddata_operations": 1
},
"concurrency": {
"lock_acquisition": 1,
"shard_lock_contention": 0,
"circuit_breaker_checks": 0
},
"concurrent_segment_search": {
"enabled": true,
"slice_creation_and_distribution": 2,
"slice_execution_scheduling": 3,
"per_slice_execution": {
"slice_count": 2,
"min_execution_time": 35,
"max_execution_time": 42,
"avg_execution_time": 38.5,
"variance": 7,
"individual_slices": [
{
"slice_id": 0,
"execution_time": 35,
"segments_processed": 8
},
{
"slice_id": 1,
"execution_time": 42,
"segments_processed": 10
}
]
},
"slice_result_aggregation": 5,
"thread_pool_coordination": 2
},
"cluster_state_metadata": {
"cluster_state_lookup": 1,
"index_metadata_access": 1,
"mapping_lookup": 1
},
"plugins_extensions": {
"security_plugin_overhead": 2,
"custom_plugin_processing": 0,
"extension_processing": 0
},
"other": 1
}
}
]
}
Actionable Insights:
- Queue Saturation (35ms wait, 23% of total) → Increase the search thread pool, add data nodes, or optimize the query
- Network Overhead (25ms roundtrip, 17% of total) → Investigate network topology or capacity saturation
- Concurrent Segment Search (slice count = 2) → Low slice result aggregation overhead (5ms) confirms the optimization is healthy
- No Aggregations → Confirms aggregations are not contributing to latency
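Insights like these could be derived automatically by flattening the proposed `latency_breakdown` object and ranking segments; a sketch using a subset of the values from the response above:

```python
# A subset of segments from the enhanced response above.
breakdown = {
    "phases": {"query": 40, "fetch": 15, "can_match": 5, "rewrite": 2, "expand": 1},
    "queuing": {"search_queue_wait": 35, "get_fetch_queue_wait": 8, "thread_scheduling": 2},
    "transport": {"network_roundtrip": 25, "request_serialization": 2,
                  "response_serialization": 2},
    "coordination": {"request_parsing": 2, "shard_routing": 3,
                     "response_aggregation": 6, "reduce_phase": 4},
}
# Flatten nested categories into "category.segment" keys, then rank by time.
flat = {f"{cat}.{seg}": ms
        for cat, segs in breakdown.items()
        for seg, ms in segs.items()}
top = sorted(flat.items(), key=lambda kv: kv[1], reverse=True)[:3]
# top[0] is the query phase, followed by queue wait and network roundtrip
```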
Example 2: Query WITH Aggregations
BEFORE: Current Query Insights Response
What you see today (limited visibility):
{
"top_queries": [
{
"timestamp": 1707764500000,
"node_id": "node-2",
"indices": ["logs-2026.02.*"],
"search_type": "query_then_fetch",
"total_shards": 120,
"source": {
"query": {
"bool": {
"filter": [
{"range": {"@timestamp": {"gte": "2026-02-12T00:00:00", "lte": "2026-02-12T23:59:59"}}},
{"term": {"status": "error"}}
]
}
},
"aggs": {
"error_types": {
"terms": {
"field": "error_type.keyword",
"size": 100
},
"aggs": {
"hourly_distribution": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "1h"
},
"aggs": {
"avg_response_time": {
"avg": {"field": "response_time"}
}
}
}
}
}
},
"size": 0
},
"took": 320,
"phase_latency_breakdown": {
"query": 120,
"fetch": 25,
"can_match": 8
},
"resource_usages": {
"cpu": 180,
"memory": 2048000
},
"user": "analytics_team",
"labels": {
"team": "observability",
"priority": "medium"
}
}
]
}
Problem: You see 320ms total latency with only 153ms accounted for in phases (query: 120ms, fetch: 25ms, can_match: 8ms). 167ms (52%) is unaccounted for. The query phase shows 120ms, but you don't know how much of that is aggregation execution vs. query/filter execution.
The Gap: Where is the missing 167ms? And within the 120ms query phase, how much is aggregation overhead?
AFTER: Enhanced Query Insights Response
What you'll see (complete visibility with aggregation breakdown):
{
"top_queries": [
{
"timestamp": 1707764500000,
"node_id": "node-2",
"indices": ["logs-2026.02.*"],
"search_type": "query_then_fetch",
"total_shards": 120,
"source": {
"query": {
"bool": {
"filter": [
{"range": {"@timestamp": {"gte": "2026-02-12T00:00:00", "lte": "2026-02-12T23:59:59"}}},
{"term": {"status": "error"}}
]
}
},
"aggs": {
"error_types": {
"terms": {
"field": "error_type.keyword",
"size": 100
},
"aggs": {
"hourly_distribution": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "1h"
},
"aggs": {
"avg_response_time": {
"avg": {"field": "response_time"}
}
}
}
}
}
},
"size": 0
},
"took": 320,
"resource_usages": {
"cpu": 180,
"memory": 2048000
},
"user": "analytics_team",
"labels": {
"team": "observability",
"priority": "medium"
},
"latency_breakdown": {
"phases": {
"query": 25,
"fetch": 25,
"can_match": 8,
"dfs": 0,
"suggest": 0,
"rewrite": 3,
"expand": 0
},
"aggregations": {
"enabled": true,
"aggregation_execution": 180,
"aggregation_reduce": 35,
"type_breakdown": {
"terms": 120,
"date_histogram": 45,
"metrics": 15
}
},
"coordination": {
"request_parsing": 3,
"shard_routing": 4,
"response_aggregation": 8,
"reduce_phase": 35,
"result_processing": 5
},
"transport": {
"network_roundtrip": 28,
"request_serialization": 3,
"request_deserialization": 2,
"response_serialization": 4,
"response_deserialization": 3
},
"queuing": {
"search_queue_wait": 15,
"get_fetch_queue_wait": 5,
"thread_scheduling": 2
},
"cache": {
"query_cache_lookup": 2,
"query_cache_write": 0,
"request_cache_lookup": 0,
"request_cache_write": 0,
"shard_request_cache": 1,
"fielddata_operations": 8
},
"concurrency": {
"lock_acquisition": 1,
"shard_lock_contention": 0,
"circuit_breaker_checks": 1
},
"concurrent_segment_search": {
"enabled": true,
"slice_creation_and_distribution": 2,
"slice_execution_scheduling": 3,
"per_slice_execution": {
"slice_count": 2,
"min_execution_time": 22,
"max_execution_time": 28,
"avg_execution_time": 25,
"variance": 6,
"individual_slices": [
{
"slice_id": 0,
"execution_time": 22,
"segments_processed": 12
},
{
"slice_id": 1,
"execution_time": 28,
"segments_processed": 15
}
]
},
"slice_result_aggregation": 4,
"thread_pool_coordination": 2
},
"cluster_state_metadata": {
"cluster_state_lookup": 1,
"index_metadata_access": 2,
"mapping_lookup": 2
},
"plugins_extensions": {
"security_plugin_overhead": 3,
"custom_plugin_processing": 0,
"extension_processing": 0
},
"other": 2
}
}
]
}
Actionable Insights:
- Aggregations Dominate (215ms, 67% of total took time) → Primary optimization target
- Terms Aggregation (120ms, 38% of total) → Action: Reduce cardinality (evaluate composite aggregation, add filters, or reduce size from 100)
- Date Histogram (45ms, 14% of total) → Action: Consider larger fixed_interval (e.g., 4h instead of 1h)
- Aggregation Reduce (35ms, 11% of total) → Action: Reduce shard count or optimize aggregation structure
- Fielddata Operations (8ms) → Status: Acceptable for terms aggregation on keyword field
Before vs. After:
- Before: "Query is slow at 320ms, query phase shows 120ms but unclear why"
- After: "Aggregations account for 215ms (67%); terms aggregation is the bottleneck at 120ms"
- Time to Resolution: Minutes instead of hours with targeted aggregation optimization
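The headline percentages in the summary can be verified directly from the enhanced response values:

```python
# Values copied from the enhanced response above (took = 320 ms).
took = 320
aggregations = {"aggregation_execution": 180, "aggregation_reduce": 35}
agg_total = sum(aggregations.values())  # total aggregation time in ms
share = round(100 * agg_total / took)   # aggregation share of took, in percent
```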
Configuration to Enable Breakdown
# Enable breakdown tracking
plugins.query_insights.latency_breakdown.enabled: true
# Detail level: basic (8 timers) | detailed (16 timers) | debug (24 timers)
plugins.query_insights.latency_breakdown.level: detailed
# Only track queries exceeding threshold
plugins.query_insights.latency_breakdown.threshold: 100 # ms
Dynamic Update:
PUT /_cluster/settings
{
"persistent": {
"plugins.query_insights.latency_breakdown.enabled": true,
"plugins.query_insights.latency_breakdown.level": "detailed",
"plugins.query_insights.latency_breakdown.threshold": 100
}
}
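Internally, the threshold setting would gate breakdown collection per request; a hypothetical sketch of that check (the function name and settings handling are illustrative, not the plugin's actual internals):

```python
# Hypothetical settings snapshot mirroring the keys above.
SETTINGS = {
    "plugins.query_insights.latency_breakdown.enabled": True,
    "plugins.query_insights.latency_breakdown.level": "detailed",
    "plugins.query_insights.latency_breakdown.threshold": 100,  # ms
}

def should_record_breakdown(took_ms, settings=SETTINGS):
    # Record a breakdown only when the feature is enabled and the
    # request's latency crosses the configured threshold.
    return (settings["plugins.query_insights.latency_breakdown.enabled"]
            and took_ms >= settings["plugins.query_insights.latency_breakdown.threshold"])

print(should_record_breakdown(150))  # True
print(should_record_breakdown(60))   # False
```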
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
Impact and Benefits
This framework transforms root cause analysis from a days-or-weeks-long process of guesswork and complex diagnostics (flame graphs, thread dumps) into a data-driven operation completed in minutes or hours. By providing precise visibility, it enables teams to meet stringent performance requirements (e.g. p99 SLAs) and to optimize costs by identifying configuration-based fixes, like cache tuning and query optimization, over expensive cluster scaling. It directly implements the Visibility theme of #8879.
Integration and Future Work
Built on the existing Query Insights (#11522) and Top N (#11904) infrastructure, this backward-compatible update complements current resource tracking and exporters without breaking API changes. It also lays the foundation for automated recommendations (#12292), such as thread pool adjustments, and for upcoming Dashboard visualizations, ensuring that deep latency breakdowns are accessible to both exporters and end users.
Proposed Next Steps
- Community Feedback: Validate approach and components
- Detailed Design: HLD with instrumentation strategy
- POC: Implement subset to validate overhead
- Phased Implementation: Roll out following the priority order above (Coordination, Queuing, and Concurrent Segment Search first)