
Conversation

Contributor

@quux00 quux00 commented Sep 6, 2024

Enhance ES|QL responses to include information about took time (search latency), shards, and clusters against which the query was executed.

The goal of this PR is to begin to provide parity between the metadata displayed for cross-cluster searches in _search and ES|QL.

This PR adds the following features:

  • add overall took time to all ES|QL query responses; "all" here means async search, sync search, local-only, and cross-cluster searches, so it goes beyond just CCS.
  • add _clusters metadata to the final response for cross-cluster searches, for both async and sync search (see example below)
  • tracking/reporting counts of skipped shards from the can_match (SearchShards API) phase of ES|QL processing
  • marking clusters as skipped if they cannot be connected to (during the field-caps phase of processing)

Out of scope for this PR:

  • honoring the skip_unavailable cluster setting
  • showing _clusters metadata in the async response while the search is still running
  • showing any shard failure messages (since any shard search failure in ES|QL is automatically fatal and _clusters/details is not shown in 4xx/5xx error responses). Note that this also means the failed shard count is always 0 in the ES|QL _clusters section.

Things changed with respect to behavior in _search:

  • the timed_out field in _clusters/details/mycluster was removed in the ESQL response, since ESQL does not support timeouts. It could be added back later if/when ESQL supports timeouts.
  • the failures array in _clusters/details/mycluster/_shards was removed in the ESQL response, since any shard failure causes the whole query to fail.

Example output from ES|QL CCS:

```es
POST /_query
{
  "query": "from blogs,remote2:blo*,remote1:blogs|\nkeep authors.first_name,publish_date|\n limit 5"
}
```

```json
{
  "took": 49,
  "columns": [
    {
      "name": "authors.first_name",
      "type": "text"
    },
    {
      "name": "publish_date",
      "type": "date"
    }
  ],
  "values": [
    [
      "Tammy",
      "2009-11-04T04:08:07.000Z"
    ],
    [
      "Theresa",
      "2019-05-10T21:22:32.000Z"
    ],
    [
      "Jason",
      "2021-11-23T00:57:30.000Z"
    ],
    [
      "Craig",
      "2019-12-14T21:24:29.000Z"
    ],
    [
      "Alexandra",
      "2013-02-15T18:13:24.000Z"
    ]
  ],
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "blogs",
        "took": 43,
        "_shards": {
          "total": 13,
          "successful": 13,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "skipped",  // remote2 was offline when this query was run
        "indices": "remote2:blo*",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "successful",
        "indices": "remote1:blogs",
        "took": 47,
        "_shards": {
          "total": 13,
          "successful": 13,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  }
}
```
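
To make the new fields concrete, here is a hedged sketch of how a client might read the new took and _clusters fields from a response like the one above. It assumes Jackson for JSON parsing, and the class and variable names are illustrative only, not part of this PR:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EsqlClustersExample {
    public static void main(String[] args) throws Exception {
        // Abbreviated response body; in practice this would be the full ES|QL JSON response.
        String body = """
            { "took": 49,
              "_clusters": { "total": 3, "successful": 2, "skipped": 1,
                "details": { "remote2": { "status": "skipped" } } } }""";
        JsonNode root = new ObjectMapper().readTree(body);
        System.out.println("overall took (ms): " + root.path("took").asLong());
        JsonNode clusters = root.path("_clusters");
        if (clusters.path("skipped").asInt() > 0) {
            // List the clusters that were skipped (e.g. unreachable remotes).
            clusters.path("details").fields().forEachRemaining(entry -> {
                if ("skipped".equals(entry.getValue().path("status").asText())) {
                    System.out.println("skipped cluster: " + entry.getKey());
                }
            });
        }
    }
}
```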

Fixes #112402 and #110935

@quux00 quux00 force-pushed the esql/ccs-execution-info2 branch 9 times, most recently from 7ca990c to c8f1841 Compare September 11, 2024 14:42
@quux00 quux00 force-pushed the esql/ccs-execution-info2 branch 5 times, most recently from 0195ac2 to 9ff909d Compare September 13, 2024 13:14
Contributor Author

quux00 commented Sep 13, 2024

Question for reviewers / Product Managers:

Do we want took time in millis or nanos?

The argument for using millis:

took time will be consistent with _search, which uses millis. I've gone with millis, since my goal was to keep the metadata as close to _search as possible unless we decide to change some parts.

This potentially makes it simpler for Kibana as well if they already have code that parses and displays took time.

The argument for using nanos:

ESQL Driver Profiles use took_nanos. So should we also use nanos for the overall search latency? Will using millis feel inconsistent with the profiles to end users?

And apparently ESQL already issues a took-nanos HTTP header, based on this issue: #110935. I tracked that down - the took-nanos for the header is added here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlResponseListener.java.

Based on #110935, I wonder if the EsqlResponseListener implementation of tracking total took time is flawed. It doesn't work for async searches that are still running when the initial response is returned. Should we remove/change this implementation? Should it instead grab the overall took time out of the EsqlQueryResponse that I've now added? I also asked that question in the code in this commit.
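
For reference, the choice boils down to which unit the same monotonic measurement is reported in. A minimal, hypothetical sketch (not the actual EsqlResponseListener code) of the millisecond-based approach that matches _search:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the actual implementation: the same monotonic measurement
// can back either a millisecond "took" field (matching _search) or a nanosecond header,
// so the decision is purely about the reported unit.
final class TookTimeSketch {
    private final long startNanos = System.nanoTime(); // captured when the query is accepted

    long tookMillis() { // what the "took" field in this PR would report
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    long tookNanos() { // what a took-nanos HTTP header would carry
        return System.nanoTime() - startNanos;
    }
}
```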

Contributor Author

quux00 commented Sep 13, 2024

Question for reviewers / Product Managers:

Do we want to have the partial status in the ESQL metadata? Is that possible to have now? Will it ever be possible?

I've kept all the cluster statuses that we use on the _search side. These are the meanings of those statuses in _search:

successful: All shards searched returned successfully
running: Search is still running on that cluster
partial: At least one shard search failed or the search timed out on the server side (and thus may not have searched all index segments), so we are returning partial data.
skipped: No shards were searched because the cluster was offline and marked as skip_unavailable=true
failed: The cluster was offline and marked as skip_unavailable=false. This causes the whole search to fail (return 4xx/5xx)

Do we want to keep these same status meanings in ES|QL?

UPDATE: Based on reviewer feedback and a discussion with Product, we will keep the same meanings, but not document the partial state as a possible status in the ES|QL API docs, since it currently can never be set. We will keep it in the code in case we need that state in the future, if ES|QL ever starts returning partial data (either due to some shard searches failing or to ES|QL timeouts, as we have in _search).
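
For readers less familiar with the _search side, here is an illustrative sketch of the status set being carried over. The enum name and comments are mine, not the actual class in the codebase:

```java
// Illustrative only: the per-cluster statuses and their meanings as described above.
enum ClusterStatus {
    SUCCESSFUL, // all shards searched on that cluster returned successfully
    RUNNING,    // the search is still running on that cluster
    PARTIAL,    // some shards failed or the search timed out; partial data returned
                // (kept in the ES|QL code but undocumented, since it cannot currently be set)
    SKIPPED,    // the cluster was unreachable and marked skip_unavailable=true
    FAILED      // the cluster was unreachable and marked skip_unavailable=false (whole search fails)
}
```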

Contributor Author

quux00 commented Sep 13, 2024

Edge case question for reviewers / Product Managers:

What should the status of the cluster be in the following scenario?

The user makes a cross-cluster query where the index expression on a cluster matches no indices:

```es
POST /_query
{
  "query": "from blogs,remote2:no_such_index,remote1:blogs|\nkeep authors.first_name,publish_date|\n limit 5"
}
```

Currently, in the code in this PR, I mark this as "SUCCESSFUL" (see example below), but with total shards = 0, indicating that nothing was searched. Is that the behavior we want? Or should we mark it as "SKIPPED"?

**ES|QL response when a cluster is specified with an index expression matching no indices:**

```json
{
  "took": 25,
  "columns": [
    {
      "name": "authors.first_name",
      "type": "text"
    },
    {
      "name": "publish_date",
      "type": "date"
    }
  ],
  "values": [
    [
      "Theresa",
      "2020-09-14T23:13:55.000Z"
    ],
  ...
  ],
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "blogs",
        "took": 24,
        "_shards": {
          "total": 13,
          "successful": 13,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote2": {
        "status": "successful",
        "indices": "remote2:blogs",
        "took": 24,
        "_shards": {
          "total": 4,
          "successful": 4,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote1": {
        "status": "successful",
        "indices": "remote1:blogs",
        "took": 0,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  }
}
```

Comparison to behavior in _search

In _search, the handling of this differs depending on whether the user used a wildcard or not. With a wildcard that matches nothing, you get status=successful and _shards.total=0:

        "remote1": {
          "status": "successful",
          "indices": "x*",
          "took": 0,
          "timed_out": false,
          "_shards": {
            "total": 0,
            "successful": 0,
            "skipped": 0,
            "failed": 0
          }
        }

With no wildcard, you get status=skipped and a no_such_index failure entry:

        "remote1": {
          "status": "skipped",
          "indices": "my_index",
          "timed_out": false,
          "failures": [
            {
              "shard": -1,
              "index": null,
              "reason": {
                "type": "index_not_found_exception",
                "reason": "no such index [my_index]",
                "index_uuid": "_na_",
                "resource.type": "index_or_alias",
                "resource.id": "my_index",
                "index": "my_index"
              }
            }
          ]
        }

@quux00 quux00 added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Sep 13, 2024
@elasticsearchmachine
Collaborator

Hi @quux00, I've created a changelog YAML for you.

@quux00 quux00 marked this pull request as ready for review September 13, 2024 16:28
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@quux00 quux00 requested a review from a team as a code owner September 13, 2024 16:28
@quux00 quux00 requested a review from nik9000 September 13, 2024 16:28
Member

@dnhatn dnhatn left a comment


I reviewed the production changes, and they look good to me. Thanks, Michael!

for (SearchShardsGroup group : resp.getGroups()) {
    var shardId = group.shardId();
    if (group.skipped()) {
        totalShards++;
Member


nit: can you move totalShards++ outside and remove line 566?

Contributor Author


Fixed.

@Override
public void messageReceived(ClusterComputeRequest request, TransportChannel channel, Task task) {
    ChannelActionListener<ComputeResponse> listener = new ChannelActionListener<>(channel);
    ChannelActionListener<ComputeResponse> lnr = new ChannelActionListener<>(channel);
Member


I think the renaming is a leftover?

Contributor Author


Fixed. Thanks.

/**
* @return true if the "local" querying/coordinator cluster is being searched in a cross-cluster search
*/
private boolean coordinatingClusterIsSearchedInCCS() {
Member


I think we can remove the methods coordinatingClusterIsSearchedInCCS, runningOnRemoteCluster, isCCSListener, and shouldRecordTookTime, as we should be able to collect execution info similarly to how we handle profiles and warnings. These methods require careful consideration of when and how to pass the clusterAlias, which I think increases the risk of errors. That said, it's entirely up to you whether to keep them as they are or remove them.

Contributor Author


I'm not sure that's true. I find the ComputeListener very hard to reason about, so I could be wrong. The core problem with the profiles/warnings approach is that they accumulate with the intent of being shown only in the final response, but the EsqlExecutionInfo needs to be updated in real time as the search is running, so that when we add _clusters to the async-search response of still-running searches (in a later PR) it will have all the current data in the EsqlExecutionInfo, not some possibly inaccessible rollup inside the ComputeListener.

And I don't see how we can remove runningOnRemoteCluster() since we need to know in the refs Listener whether to populate the ComputeResponse with execution info metadata or not. The remote and local cluster refs listeners have to act differently.

I also don't see how we can remove isCCSListener since the acquireCompute handler needs to know whether it is a listener for a ComputeResponse with remote metadata or not, as again it needs to behave differently from a data-node handler listener. That's why I split them into separate methods in my earlier iteration of the PR, making the context or use case clear.

Finally, another reason context is needed is that the point at which the Status of a cluster gets set to SUCCESSFUL differs between remote and local clusters. For remote clusters, we mark it done when the remote ComputeResponse comes back. But for local, we have to wait until the coordinator is done (in case it has to run operators that aggregate/merge data from the remotes), so those get set in different places, and context/state is needed to know what code to execute in these methods that are shared across the 4 use cases for which ComputeListener is used.

Again, I could be wrong and there's a cleaner way to do this, but I was not able to figure it out. I'd need some detailed guidance on how we solve the above problems and do local rollup accumulations like we do for profiles and warnings.
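
To make the distinction concrete, here is a hedged sketch of the "update in real time" model described above, where per-cluster state is mutated as each cluster finishes rather than rolled up only when the final response is built. Class and method names are hypothetical, not the actual EsqlExecutionInfo code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of real-time per-cluster bookkeeping; not the actual implementation.
final class ExecutionInfoSketch {
    enum Status { RUNNING, SUCCESSFUL, SKIPPED }

    record Cluster(String alias, Status status, int totalShards, long tookMillis) {}

    private final Map<String, Cluster> clusters = new ConcurrentHashMap<>();

    void markRunning(String clusterAlias) {
        clusters.put(clusterAlias, new Cluster(clusterAlias, Status.RUNNING, 0, 0L));
    }

    // Called as soon as a remote ComputeResponse arrives (or when the coordinator finishes,
    // for the local cluster), so an in-flight async response can expose current state at any time.
    void markSuccessful(String clusterAlias, int totalShards, long tookMillis) {
        clusters.compute(clusterAlias,
            (alias, prev) -> new Cluster(alias, Status.SUCCESSFUL, totalShards, tookMillis));
    }

    Map<String, Cluster> snapshot() {
        return Map.copyOf(clusters);
    }
}
```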

Member


Let's discuss this later.

@quux00 quux00 merged commit ddba474 into elastic:main Sep 30, 2024
16 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.x

quux00 added a commit to quux00/elasticsearch that referenced this pull request Sep 30, 2024
…es (elastic#112595)
elasticsearchmachine pushed a commit that referenced this pull request Sep 30, 2024
…es (#112595) (#113820)
stratoula added a commit to elastic/kibana that referenced this pull request Oct 4, 2024
## Summary

Now that this PR elastic/elasticsearch#112595
landed in our snapshot, we can have the query time in the inspector.

<img width="798" alt="image"
src="https://github.com/user-attachments/assets/1ba1c59e-f094-4a56-964d-d76bdc1db8b3">


<img width="1017" alt="image"
src="https://github.com/user-attachments/assets/48464d7c-60c0-4924-bfcb-85f82b7caa40">
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 4, 2024
(cherry picked from commit 2621bb7)
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
…es (elastic#112595)
tiansivive pushed a commit to tiansivive/kibana that referenced this pull request Oct 7, 2024
quux00 added a commit that referenced this pull request Oct 18, 2024
…l exceptions (#115017)

The model for calculating per-cluster `took` times from remote clusters in #112595 was flawed. 
It attempted to use Java's System.nanoTime between the local and remote clusters,
which is not safe. This results in per-cluster took times that have arbitrary (invalid) values
including negative values which cause exceptions to be thrown by the `TimeValue` constructor.
(Note: the overall took time calculation was done correctly, so it was the remote per-cluster
took times that were flawed.)

In this PR, I've done a redesign to address this. A key decision of this re-design was whether
to always calculate took times only on the querying cluster (bypassing this whole problem) or
to continue to allow the remote clusters to calculate their own took times for the remote processing
and report that back to the querying cluster via the `ComputeResponse`.

I decided in favor of having remote clusters compute their own took times for the remote processing
and to additionally track "planning" time (encompassing field-caps and policy enrich remote calls), so
that total per-cluster took time is a combination of the two. In _search, remote cluster took times are
calculated entirely on the remote cluster, so network time is not included in the per-cluster took times.
This has been helpful in diagnosing issues on user environments because if you see an overall took time
that is significantly larger than the per cluster took times, that may indicate a network issue, which has
happened in diagnosing cross-cluster issues in _search.

I moved relative time tracking into `EsqlExecutionInfo`. 

The "planning time" marker is currently only used in cross-cluster searches, so it will conflict with
the INLINESTATS 2 phase model (where planning can be done twice). We will improve this design
to handle a 2 phase model in a later ticket, as part of the INLINESTATS work. I tested the 
current overall took time calculation model with local-only INLINESTATS queries and they
work correctly.

I also fixed another secondary bug in this PR. If the remote cluster is an older version that does
not return took time (and shard info) in the ComputeResponse, the per-cluster took time is then
calculated on the querying cluster as a fallback.

Finally, I fixed some minor inconsistencies about whether the `_shards` info is shown in the response.
The rule now is that `_shards` is always shown with 0 shards for SKIPPED clusters, with actual
counts for SUCCESSFUL clusters and for remotes running an older version that doesn't report
shard stats, the `_shards` field is left out of the XContent response.

Fixes #115022
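
A hedged illustration of the corrected measurement model described in this commit message (names are hypothetical, not the actual code): each cluster measures elapsed time against its own monotonic clock, and only durations, never raw System.nanoTime values, cross cluster boundaries:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the per-cluster took model: durations are measured locally
// on each cluster and combined on the querying cluster.
final class PerClusterTookSketch {

    // On the remote cluster: elapsed remote processing time, measured against the
    // remote's own monotonic clock and returned as a duration in the ComputeResponse.
    static long remoteProcessingMillis(long remoteStartNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - remoteStartNanos);
    }

    // On the querying cluster: planning time (field-caps and enrich policy resolution)
    // measured locally, plus the processing duration reported back by the remote.
    static long perClusterTookMillis(long planningMillis, long remoteProcessingMillis) {
        return planningMillis + remoteProcessingMillis;
    }
}
```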
quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 18, 2024
…l exceptions (elastic#115017)
elasticsearchmachine pushed a commit that referenced this pull request Oct 19, 2024
…es fatal exceptions (#115017) (#115124)

* ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017)
* Added fix for #115127 into this since can't get the build to pass
quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 19, 2024
…l exceptions (elastic#115017)
quux00 added a commit that referenced this pull request Oct 19, 2024
* ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017)

* Update execution info at end of planning before kicking off execution phase (#115127)

The revised took time model bug fix #115017
introduced a new bug that allows a race condition between updating the execution info with
"end of planning" timestamp and using that timestamp during execution.

This one line fix reverses the order to ensure the planning phase execution update occurs
before starting the ESQL query execution phase.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…l exceptions (elastic#115017)
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…l exceptions (elastic#115017)

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0 v9.0.0


Development

Successfully merging this pull request may close these issues.

Collect and display execution metadata for ES|QL cross cluster searches
