Commit ade7f7c
authored
Backport bugfix esql per cluster took time (#115168)
* ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017)
The model for calculating per-cluster `took` times from remote clusters in #112595 was flawed.
It attempted to use Java's System.nanoTime between the local and remote clusters,
which is not safe. This results in per-cluster took times that have arbitrary (invalid) values
including negative values which cause exceptions to be thrown by the `TimeValue` constructor.
(Note: the overall took time calculation was done correctly, so it was the remote per-cluster
took times that were flawed.)
In this PR, I've done a redesign to address this. A key decision of this re-design was whether
to always calculate took times only on the querying cluster (bypassing this whole problem) or
to continue to allow the remote clusters to calculate their own took times for the remote processing
and report that back to the querying cluster via the `ComputeResponse`.
I decided in favor of having remote clusters compute their own took times for the remote processing
and to additionally track "planning" time (encompassing field-caps and policy enrich remote calls), so
that total per-cluster took time is a combination of the two. In _search, remote cluster took times are
calculated entirely on the remote cluster, so network time is not included in the per-cluster took times.
This has been helpful in diagnosing issues on user environments because if you see an overall took time
that is significantly larger than the per cluster took times, that may indicate a network issue, which has
happened in diagnosing cross-cluster issues in _search.
I moved relative time tracking into `EsqlExecutionInfo`.
The "planning time" marker is currently only used in cross-cluster searches, so it will conflict with
the INLINESTATS 2 phase model (where planning can be done twice). We will improve this design
to handle a 2 phase model in a later ticket, as part of the INLINESTATS work. I tested the
current overall took time calculation model with local-only INLINESTATS queries and they
work correctly.
I also fixed another secondary bug in this PR. If the remote cluster is an older version that does
not return took time (and shard info) in the ComputeResponse, the per-cluster took time is then
calculated on the querying cluster as a fallback.
Finally, I fixed some minor inconsistencies about whether the `_shards` info is shown in the response.
The rule now is that `_shards` is always shown with 0 shards for SKIPPED clusters, with actual
counts for SUCCESSFUL clusters and for remotes running an older version that doesn't report
shard stats, the `_shards` field is left out of the XContent response.
Fixes #115022
* Update execution info at end of planning before kicking off execution phase (#115127)
The revised took time model bug fix #115017
introduced a new bug that allows a race condition between updating the execution info with
"end of planning" timestamp and using that timestamp during execution.
This one line fix reverses the order to ensure the planning phase execution update occurs
before starting the ESQL query execution phase.1 parent 39eb0b7 commit ade7f7c
File tree
7 files changed
+331
-168
lines changed- x-pack/plugin/esql/src
- internalClusterTest/java/org/elasticsearch/xpack/esql/action
- main/java/org/elasticsearch/xpack/esql
- action
- plugin
- session
- test/java/org/elasticsearch/xpack/esql
- plugin
- session
7 files changed
+331
-168
lines changedLines changed: 48 additions & 26 deletions
Large diffs are not rendered by default.
Lines changed: 42 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
| 36 | + | |
36 | 37 | | |
37 | 38 | | |
38 | 39 | | |
| |||
55 | 56 | | |
56 | 57 | | |
57 | 58 | | |
58 | | - | |
59 | | - | |
60 | | - | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
61 | 62 | | |
62 | 63 | | |
63 | | - | |
64 | | - | |
65 | 64 | | |
66 | | - | |
67 | 65 | | |
68 | 66 | | |
69 | 67 | | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
70 | 73 | | |
71 | 74 | | |
72 | 75 | | |
73 | 76 | | |
74 | 77 | | |
75 | 78 | | |
| 79 | + | |
76 | 80 | | |
77 | 81 | | |
78 | 82 | | |
79 | 83 | | |
80 | 84 | | |
| 85 | + | |
81 | 86 | | |
82 | 87 | | |
83 | 88 | | |
| |||
88 | 93 | | |
89 | 94 | | |
90 | 95 | | |
| 96 | + | |
91 | 97 | | |
92 | 98 | | |
93 | 99 | | |
| |||
106 | 112 | | |
107 | 113 | | |
108 | 114 | | |
| 115 | + | |
109 | 116 | | |
110 | 117 | | |
111 | 118 | | |
| |||
125 | 132 | | |
126 | 133 | | |
127 | 134 | | |
128 | | - | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
129 | 164 | | |
130 | 165 | | |
131 | 166 | | |
| |||
Lines changed: 55 additions & 29 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
47 | 47 | | |
48 | 48 | | |
49 | 49 | | |
50 | | - | |
51 | 50 | | |
52 | 51 | | |
53 | 52 | | |
| |||
61 | 60 | | |
62 | 61 | | |
63 | 62 | | |
64 | | - | |
| 63 | + | |
65 | 64 | | |
66 | 65 | | |
67 | 66 | | |
| |||
75 | 74 | | |
76 | 75 | | |
77 | 76 | | |
78 | | - | |
79 | 77 | | |
80 | 78 | | |
81 | 79 | | |
82 | 80 | | |
83 | 81 | | |
84 | 82 | | |
85 | 83 | | |
86 | | - | |
87 | 84 | | |
88 | 85 | | |
89 | | - | |
| 86 | + | |
90 | 87 | | |
91 | 88 | | |
92 | 89 | | |
93 | 90 | | |
94 | 91 | | |
95 | 92 | | |
96 | 93 | | |
97 | | - | |
98 | 94 | | |
99 | 95 | | |
100 | 96 | | |
101 | 97 | | |
102 | 98 | | |
103 | 99 | | |
104 | 100 | | |
105 | | - | |
106 | 101 | | |
107 | 102 | | |
108 | 103 | | |
| |||
129 | 124 | | |
130 | 125 | | |
131 | 126 | | |
132 | | - | |
133 | | - | |
134 | | - | |
135 | | - | |
136 | | - | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
137 | 136 | | |
138 | 137 | | |
139 | 138 | | |
| |||
196 | 195 | | |
197 | 196 | | |
198 | 197 | | |
199 | | - | |
200 | | - | |
| 198 | + | |
| 199 | + | |
201 | 200 | | |
202 | 201 | | |
203 | 202 | | |
| |||
209 | 208 | | |
210 | 209 | | |
211 | 210 | | |
212 | | - | |
213 | | - | |
214 | | - | |
215 | | - | |
216 | | - | |
217 | | - | |
218 | | - | |
219 | | - | |
220 | | - | |
221 | | - | |
222 | | - | |
223 | | - | |
224 | | - | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
225 | 214 | | |
| 215 | + | |
226 | 216 | | |
227 | | - | |
| 217 | + | |
| 218 | + | |
228 | 219 | | |
229 | | - | |
| 220 | + | |
| 221 | + | |
230 | 222 | | |
231 | 223 | | |
232 | 224 | | |
| |||
237 | 229 | | |
238 | 230 | | |
239 | 231 | | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
240 | 266 | | |
241 | 267 | | |
242 | 268 | | |
| |||
0 commit comments