
Conversation

quux00
Contributor

@quux00 quux00 commented Oct 17, 2024

The model for calculating per-cluster took times from remote clusters in #112595 was flawed.
It attempted to compare Java's System.nanoTime readings between the local and remote clusters,
which is not safe: nanoTime has an arbitrary origin that differs per JVM. This resulted in
per-cluster took times with arbitrary (invalid) values, including negative values, which cause
exceptions to be thrown by the TimeValue constructor.
(Note: the overall took time calculation was done correctly; only the remote per-cluster
took times were flawed.)
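To make the clock-domain issue concrete, here is a minimal sketch (hypothetical names, not the actual Elasticsearch code) of why only same-JVM nanoTime differences are meaningful:

```java
// System.nanoTime has an arbitrary, per-JVM origin: it is only valid for
// measuring elapsed time between two readings taken in the SAME JVM.
public class RelativeTimeSketch {

    // Safe: both readings come from the same clock source (one JVM).
    static long elapsedNanos(long startNanos, long endNanos) {
        return endNanos - startNanos;
    }

    // The shape of the original bug: subtracting a nanoTime reading taken on a
    // remote JVM from one taken locally produces an arbitrary value, possibly
    // negative, which a duration constructor like TimeValue's rejects.
    static long brokenCrossJvmElapsed(long remoteNanos, long localNanos) {
        return localNanos - remoteNanos; // meaningless across JVMs
    }
}
```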

In this PR, I've done a redesign to address this. A key decision of this re-design was whether
to always calculate took times only on the querying cluster (bypassing this whole problem) or
to continue to allow the remote clusters to calculate their own took times for the remote processing
and report that back to the querying cluster via the ComputeResponse.

I decided in favor of having remote clusters compute their own took times for the remote processing
and to additionally track "planning" time (encompassing field-caps and policy enrich remote calls), so
that total per-cluster took time is a combination of the two. In _search, remote cluster took times are
calculated entirely on the remote cluster, so network time is not included in the per-cluster took times.
This has been helpful in diagnosing issues in user environments: an overall took time
significantly larger than the per-cluster took times may indicate a network issue, which has
come up when diagnosing cross-cluster issues in _search.
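As I read the design above, it can be sketched roughly like this (hypothetical names; the real accounting lives in EsqlExecutionInfo and the compute layer): the reported per-cluster took is planning time plus the execution time the remote cluster measured itself, so timestamps are never compared across JVMs.

```java
// Sketch only: each component duration is measured entirely within one JVM,
// then the two durations (plain longs, no clock origin) are summed.
public class PerClusterTookSketch {
    static long totalTookNanos(long planningNanos, long remoteExecutionNanos) {
        return planningNanos + remoteExecutionNanos;
    }
}
```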

I moved relative time tracking into EsqlExecutionInfo.

The "planning time" marker is currently only used in cross-cluster searches, so it will conflict with
the INLINESTATS 2 phase model (where planning can be done twice). We will improve this design
to handle a 2 phase model in a later ticket, as part of the INLINESTATS work. I tested the
current overall took time calculation model with local-only INLINESTATS queries and they
work correctly.

I also fixed another secondary bug in this PR. If the remote cluster is an older version that does
not return took time (and shard info) in the ComputeResponse, the per-cluster took time is then
calculated on the querying cluster as a fallback.
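The fallback can be sketched as follows (hypothetical types; the real code works with ComputeResponse and TimeValue): prefer the remote-reported took, and only measure on the querying cluster when the remote did not report one.

```java
import java.util.OptionalLong;

// Sketch of the described fallback, not the actual Elasticsearch code.
public class TookFallbackSketch {
    static long resolveTookNanos(OptionalLong remoteTookNanos, long relativeStartNanos, long nowNanos) {
        // Prefer the remote-reported took; otherwise fall back to a duration
        // measured entirely on the querying cluster.
        return remoteTookNanos.orElseGet(() -> nowNanos - relativeStartNanos);
    }
}
```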

Finally, I fixed some minor inconsistencies about whether the _shards info is shown in the response.
The rule now is that _shards is always shown with 0 shards for SKIPPED clusters and with actual
counts for SUCCESSFUL clusters; for remotes running an older version that doesn't report
shard stats, the _shards field is left out of the XContent response.
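The _shards rule above can be sketched with hypothetical types (the real logic lives in the XContent serialization of the response):

```java
import java.util.Optional;

// Sketch only: SKIPPED clusters render zero shards, SUCCESSFUL clusters render
// real counts, and when an older remote reported no shard stats the field is
// omitted entirely (modeled here as an empty Optional).
public class ShardsFieldSketch {
    enum Status { SKIPPED, SUCCESSFUL }

    static Optional<Integer> shardsToRender(Status status, Integer reportedShardCount) {
        if (status == Status.SKIPPED) {
            return Optional.of(0);                      // always shown, zeroed out
        }
        return Optional.ofNullable(reportedShardCount); // omit if remote didn't report
    }
}
```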

Fixes #115022

@quux00 quux00 force-pushed the bug/esql-cluster-took-time branch 2 times, most recently from ab705d5 to 4eff900 Compare October 17, 2024 17:47
…ed on coordinator but not yet on remote clusters
…onse from remote cluster (older version).

In that case, we set took time on the coordinator.

Other changes:
- changed where markEndPlanning is called since it was not compatible with the 2 phase execution model for INLINESTATS
- removed assert to only call markEndQuery once since it was not compatible with the 2 phase execution model for INLINESTATS
Removed shard counts in EsqlExecutionInfo for remotes where cluster is older.
@quux00 quux00 force-pushed the bug/esql-cluster-took-time branch from 4eff900 to e6c23c0 Compare October 17, 2024 19:38
@quux00 quux00 changed the title DRAFT - bug fix for per-cluster took time ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions Oct 17, 2024
@quux00 quux00 added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged labels Oct 17, 2024
@quux00 quux00 marked this pull request as ready for review October 17, 2024 20:24
@quux00 quux00 requested review from astefan, dnhatn and nik9000 October 17, 2024 20:24
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@quux00 quux00 requested a review from smalyshev October 17, 2024 20:25
@quux00 quux00 force-pushed the bug/esql-cluster-took-time branch from e6c23c0 to c43060f Compare October 17, 2024 21:07
```java
public void markEndQuery() {
    assert relativeStartNanos != null : "Relative start time must be set when markEndQuery is called";
    overallTook = new TimeValue(System.nanoTime() - relativeStartNanos, TimeUnit.NANOSECONDS);
}
```
Contributor:

Maybe we can use timeValueNanos for brevity?

Contributor Author:

Do you mean for relativeStartNanos? I prefer the term I used because it emphasizes that this is for relative time (duration) calculations only, not wall-clock time, which is exactly the mental-model mistake that resulted in this bug.

And I took the term from TransportSearchAction.SearchTimeProvider to indicate it has the same function as that field. I considered using SearchTimeProvider here as well, but I haven't found a need for the absoluteStartMillis field in ESQL yet, so I just went with the single field option here.

Contributor:

No, I mean using the method `TimeValue.timeValueNanos`.


```java
// visible for testing
static void updateExecutionInfoWithUnavailableClusters(EsqlExecutionInfo executionInfo, Set<String> unavailableClusters) {
```
Contributor:

I am wondering: is there a reason why these update methods aren't methods of EsqlExecutionInfo instead of statics?

Contributor Author:

The basic model of EsqlExecutionInfo (like SearchResponse.Clusters, which it is patterned after) is that it is basically a data-holder class. It has little to no behavior. The behavior (change of state) is done by outside entities that update the execution info as it flows through the various execution stages. This follows law-of-Demeter style principles. For example, one of these update methods in EsqlSession gets passed an IndexResolution object, which is probably not something EsqlExecutionInfo should have knowledge of and know how to pull data from.
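A minimal sketch of that data-holder-plus-external-updater shape (hypothetical names, loosely mirroring swapCluster's (alias, remapper) signature; not the actual Elasticsearch class):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// The holder exposes a swap hook; the CALLER decides how state changes,
// so the holder never needs to know about IndexResolution and friends.
public class ExecutionInfoSketch {
    record Cluster(String alias, String status) {}

    private final Map<String, Cluster> clusters = new ConcurrentHashMap<>();

    void putCluster(Cluster c) { clusters.put(c.alias(), c); }

    void swapCluster(String alias, BiFunction<String, Cluster, Cluster> remapper) {
        clusters.compute(alias, remapper); // atomic per-key update
    }

    Cluster get(String alias) { return clusters.get(alias); }
}
```

A caller marks a cluster skipped with something like `info.swapCluster("remote1", (k, v) -> new ExecutionInfoSketch.Cluster(k, "SKIPPED"))`.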

```java
for (String clusterAlias : unavailableClusters) {
    executionInfo.swapCluster(
        clusterAlias,
        (k, v) -> new EsqlExecutionInfo.Cluster.Builder(v).setStatus(EsqlExecutionInfo.Cluster.Status.SKIPPED).build()
    );
}
```
Contributor:

We zero-out other values in updateExecutionInfoWithClustersWithNoMatchingIndices but not here, why is that?

Contributor Author:

There is another place where the shard info gets filled in for this pathway. I plan to take another look at this in the next ticket, where this code gets modified based on the skip_unavailable setting. Thanks for flagging.

```java
request.profile(),
request.tables(),
System.nanoTime() // MP TODO: remove this
```
Contributor:

Should this be checked in?

Contributor Author:

Thanks. I had another commit after this one to remove queryStartTimeNanos from Configuration, but that caused bwc tests to fail, so I had to revert it and this was left behind. I'll remove it now.

@smalyshev
Contributor

LGTM in general, some nits & questions.

@quux00 quux00 requested a review from smalyshev October 18, 2024 16:16
@quux00 quux00 merged commit 156ba2c into elastic:main Oct 18, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.16

quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 18, 2024
…l exceptions (elastic#115017)

(Commit message duplicates the PR description above.)
@astefan astefan left a comment (Contributor):

LGTM

elasticsearchmachine pushed a commit that referenced this pull request Oct 19, 2024
…es fatal exceptions (#115017) (#115124)

* ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017)

(Commit message duplicates the PR description above.)
* Added fix for #115127 into this since can't get the build to pass
quux00 added a commit that referenced this pull request Oct 19, 2024
… phase (#115127)

The revised took time model bug fix #115017
introduced a new bug that allows a race condition between updating the execution info with
"end of planning" timestamp and using that timestamp during execution.

This one line fix reverses the order to ensure the planning phase execution update occurs
before starting the ESQL query execution phase.
quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 19, 2024
…l exceptions (elastic#115017)

(Commit message duplicates the PR description above.)
quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 19, 2024
… phase (elastic#115127)

(Commit message duplicates the #115127 description above.)
quux00 added a commit that referenced this pull request Oct 19, 2024
* ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017)

(Commit message duplicates the PR description above.)

* Update execution info at end of planning before kicking off execution phase (#115127)

(Commit message duplicates the #115127 description above.)
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…l exceptions (elastic#115017)

(Commit message duplicates the PR description above.)
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
… phase (elastic#115127)

(Commit message duplicates the #115127 description above.)
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…l exceptions (elastic#115017)

(Commit message duplicates the PR description above.)
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
… phase (elastic#115127)

(Commit message duplicates the #115127 description above.)
Labels: :Analytics/ES|QL, auto-backport, >non-issue, Team:Analytics, v8.16.0, v9.0.0

Successfully merging this pull request may close these issues:
ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions

4 participants