Closed
Labels
:Analytics/ES|QL, >bug, Team:Analytics, blocker, v8.16.0
Description
Elasticsearch Version
8.16.0, 9.0.0
Installed Plugins
No response
Java Version
bundled
OS Version
Any
Problem Description
The per-cluster took time introduced in ES|QL by this commit can be incorrect. It can even be negative, and a negative took time causes a fatal exception when it is added to a TimeValue. ES|QL CCS is unstable until this bug is fixed.
Example incorrect took time:
"details": {
  "remote_cluster": {
    "status": "successful",
    "indices": "logs-apm.error-default",
    "took": 27205351062,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    }
  }
}
Stack trace of the fatal exception when the took time is negative:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "duration cannot be negative, was given [-32521808046272586]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "duration cannot be negative, was given [-32521808046272586]",
    "suppressed": [
      {
        "type": "task_cancelled_exception",
        "reason": "cancelled on failure",
        "suppressed": [
          {
            "type": "exception",
            "reason": "4 further exceptions were dropped"
          }
        ]
      },
      {
        "type": "task_cancelled_exception",
        "reason": "cancelled on failure"
      },
      {
        "type": "task_cancelled_exception",
        "reason": "cancelled on failure",
        "suppressed": [
          {
            "type": "exception",
            "reason": "1 further exceptions were dropped"
          },
          {
            "type": "task_cancelled_exception",
            "reason": "cancelled on failure"
          }
        ]
      },
      {
        "type": "task_cancelled_exception",
        "reason": "cancelled on failure"
      }
    ]
  },
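The took time model in ES|QL is currently flawed: it relies on System.nanoTime values being consistent across separate servers, which is an invalid assumption. A nanoTime reading is only meaningful relative to another reading taken in the same JVM, because each JVM uses its own arbitrary origin. The took time calculation model therefore needs to be redesigned.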
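One possible direction, shown only as a minimal sketch and not as the actual Elasticsearch fix, is to take both nanoTime readings on the same node and send only the resulting duration across the wire. The class and method names here (`TookTimeSketch`, `localTookNanos`) are hypothetical and do not exist in the Elasticsearch code base.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Elasticsearch code: all names are illustrative only.
final class TookTimeSketch {

    // Both readings come from the same clock (the same JVM), so the difference
    // is a valid duration. Only this duration would need to leave the node.
    static long localTookNanos(LongSupplier localClock, Runnable work) {
        long start = localClock.getAsLong();
        work.run();
        // Defensive clamp so a misbehaving clock can never yield a negative duration.
        return Math.max(0L, localClock.getAsLong() - start);
    }

    public static void main(String[] args) {
        long took = localTookNanos(System::nanoTime, () -> {});
        System.out.println("local took (ms): " + TimeUnit.NANOSECONDS.toMillis(took));
    }
}
```

Passing the clock in as a `LongSupplier` is a design choice made for this sketch so that the timing logic can be exercised with a fake clock in tests.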
Steps to Reproduce
None of the ES|QL tests, including the BWC and mixed-cluster tests, caught this. It happens only when the clusters run on different servers, because of how System.nanoTime works in Java: each JVM measures from its own arbitrary origin.
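Because the problem needs clocks with different origins, a single-JVM test could in principle reproduce it by injecting fake clocks. The following is a rough sketch under that assumption. `SkewedClockRepro`, `FakeNanoClock`, and `tookAcrossServers` are hypothetical names invented for illustration, and the origins are arbitrary values.

```java
import java.util.function.LongSupplier;

// Hypothetical test sketch: FakeNanoClock stands in for System.nanoTime on one server.
final class SkewedClockRepro {

    // A manually advanced clock with an arbitrary origin, like nanoTime in a separate JVM.
    static final class FakeNanoClock implements LongSupplier {
        private long now;
        FakeNanoClock(long origin) { this.now = origin; }
        void advance(long nanos) { now += nanos; }
        @Override public long getAsLong() { return now; }
    }

    // Mimics the flawed calculation: start read from one clock, end read from another.
    static long tookAcrossServers(LongSupplier startClock, LongSupplier endClock) {
        long start = startClock.getAsLong();
        return endClock.getAsLong() - start;
    }

    public static void main(String[] args) {
        FakeNanoClock coordinator = new FakeNanoClock(9_000_000_000_000L); // arbitrary origin
        FakeNanoClock remote = new FakeNanoClock(1_000_000L);              // unrelated origin
        remote.advance(5_000_000L); // 5 ms of simulated work on the remote
        long took = tookAcrossServers(coordinator, remote);
        System.out.println("took nanos = " + took + ", negative = " + (took < 0));
    }
}
```

With the origins swapped, the same calculation produces a large positive value instead, which matches the other symptom reported above: a took time that is wrong without being negative.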
Logs (if relevant)
No response