Commit ddba474

Collect and display execution metadata for ES|QL cross cluster searches (elastic#112595)
Enhance ES|QL responses to include information about `took` time (search latency), shards, and clusters against which the query was executed.

The goal of this PR is to begin to provide parity between the metadata displayed for cross-cluster searches in _search and ES|QL.

This PR adds the following features:

- add overall `took` time to all ES|QL query responses. And to emphasize: "all" here means: async search, sync search, local-only and cross-cluster searches, so it goes beyond just CCS.
- add `_clusters` metadata to the final response for cross-cluster searches, for both async and sync search (see example below)
- tracking/reporting counts of skipped shards from the can_match (SearchShards API) phase of ES|QL processing
- marking clusters as skipped if they cannot be connected to (during the field-caps phase of processing)

Out of scope for this PR:

- honoring the `skip_unavailable` cluster setting
- showing `_clusters` metadata in the async response **while** the search is still running
- showing any shard failure messages (since any shard search failures in ES|QL are automatically fatal and _cluster/details is not shown in 4xx/5xx error responses). Note that this also means that the `failed` shard count is always 0 in ES|QL `_clusters` section.

Things changed with respect to behavior in `_search`:

- the `timed_out` field in `_clusters/details/mycluster` was removed in the ESQL response, since ESQL does not support timeouts. It could be added back later if/when ESQL supports timeouts.
- the `failures` array in `_clusters/details/mycluster/_shards` was removed in the ESQL response, since any shard failure causes the whole query to fail.

Example output from ES|QL CCS:

```es
POST /_query
{
  "query": "from blogs,remote2:bl*,remote1:blogs|\nkeep authors.first_name,publish_date|\n limit 5"
}
```

```json
{
  "took": 49,
  "columns": [
    { "name": "authors.first_name", "type": "text" },
    { "name": "publish_date", "type": "date" }
  ],
  "values": [
    [ "Tammy", "2009-11-04T04:08:07.000Z" ],
    [ "Theresa", "2019-05-10T21:22:32.000Z" ],
    [ "Jason", "2021-11-23T00:57:30.000Z" ],
    [ "Craig", "2019-12-14T21:24:29.000Z" ],
    [ "Alexandra", "2013-02-15T18:13:24.000Z" ]
  ],
  "_clusters": {
    "total": 3,
    "successful": 2,
    "running": 0,
    "skipped": 1,
    "partial": 0,
    "failed": 0,
    "details": {
      "(local)": {
        "status": "successful",
        "indices": "blogs",
        "took": 43,
        "_shards": { "total": 13, "successful": 13, "skipped": 0, "failed": 0 }
      },
      "remote2": {
        "status": "skipped", // remote2 was offline when this query was run
        "indices": "remote2:bl*",
        "took": 0,
        "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }
      },
      "remote1": {
        "status": "successful",
        "indices": "remote1:blogs",
        "took": 47,
        "_shards": { "total": 13, "successful": 13, "skipped": 0, "failed": 0 }
      }
    }
  }
}
```

Fixes elastic#112402 and elastic#110935
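
As an illustration of how a client might consume the `_clusters` section shown in the example output, here is a minimal Python sketch. It is not part of this PR: the helper names `summarize_clusters` and `slowest_cluster` are hypothetical, and `SAMPLE` is a trimmed copy of the example response above (without the inline comment, which is not valid JSON). The sketch assumes that a local-only response carries no `_clusters` key, as the feature list above suggests.

```python
from typing import Any, Optional

# Trimmed copy of the example response from this commit message.
SAMPLE: dict[str, Any] = {
    "took": 49,
    "_clusters": {
        "total": 3,
        "successful": 2,
        "skipped": 1,
        "details": {
            "(local)": {"status": "successful", "indices": "blogs", "took": 43},
            "remote2": {"status": "skipped", "indices": "remote2:bl*", "took": 0},
            "remote1": {"status": "successful", "indices": "remote1:blogs", "took": 47},
        },
    },
}


def summarize_clusters(response: dict[str, Any]) -> dict[str, list[str]]:
    """Group cluster aliases by their reported status (hypothetical helper)."""
    # Assumption: local-only queries have no `_clusters` section, so default to empty.
    details = response.get("_clusters", {}).get("details", {})
    by_status: dict[str, list[str]] = {}
    for alias, info in details.items():
        by_status.setdefault(info.get("status", "unknown"), []).append(alias)
    return by_status


def slowest_cluster(response: dict[str, Any]) -> Optional[str]:
    """Return the alias with the largest per-cluster `took`, or None (hypothetical helper)."""
    details = response.get("_clusters", {}).get("details", {})
    if not details:
        return None
    return max(details, key=lambda alias: details[alias].get("took", 0))


if __name__ == "__main__":
    print(summarize_clusters(SAMPLE))
    print(slowest_cluster(SAMPLE))
```

Grouping by `status` rather than relying only on the top-level counters lets a caller report which remote was skipped, which matters here because a skipped cluster does not cause the query to fail.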
1 parent 22c770b commit ddba474

File tree

52 files changed

+3047
-316
lines changed

docs/changelog/112595.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
pr: 112595
2+
summary: Collect and display execution metadata for ES|QL cross cluster searches
3+
area: ES|QL
4+
type: enhancement
5+
issues:
6+
- 112402

docs/reference/esql/esql-across-clusters.asciidoc

Lines changed: 203 additions & 11 deletions
@@ -85,7 +85,7 @@ POST /_security/role/remote1
8585
"privileges": [ "read","read_cross_cluster" ], <4>
8686
"clusters" : ["my_remote_cluster"] <5>
8787
}
88-
],
88+
],
8989
"remote_cluster": [ <6>
9090
{
9191
"privileges": [
@@ -100,15 +100,23 @@ POST /_security/role/remote1
100100
----
101101

102102
<1> The `cross_cluster_search` cluster privilege is required for the _local_ cluster.
103-
<2> Typically, users will have permissions to read both local and remote indices. However, for cases where the role is intended to ONLY search the remote cluster, the `read` permission is still required for the local cluster. To provide read access to the local cluster, but disallow reading any indices in the local cluster, the `names` field may be an empty string.
104-
<3> The indices allowed read access to the remote cluster. The configured <<security-api-create-cross-cluster-api-key,cross-cluster API key>> must also allow this index to be read.
105-
<4> The `read_cross_cluster` privilege is always required when using {esql} across clusters with the API key based security model.
103+
<2> Typically, users will have permissions to read both local and remote indices. However, for cases where the role
104+
is intended to ONLY search the remote cluster, the `read` permission is still required for the local cluster.
105+
To provide read access to the local cluster, but disallow reading any indices in the local cluster, the `names`
106+
field may be an empty string.
107+
<3> The indices allowed read access to the remote cluster. The configured
108+
<<security-api-create-cross-cluster-api-key,cross-cluster API key>> must also allow this index to be read.
109+
<4> The `read_cross_cluster` privilege is always required when using {esql} across clusters with the API key based
110+
security model.
106111
<5> The remote clusters to which these privileges apply.
107-
This remote cluster must be configured with a <<security-api-create-cross-cluster-api-key,cross-cluster API key>> and connected to the remote cluster before the remote index can be queried.
112+
This remote cluster must be configured with a <<security-api-create-cross-cluster-api-key,cross-cluster API key>>
113+
and connected to the remote cluster before the remote index can be queried.
108114
Verify connection using the <<cluster-remote-info, Remote cluster info>> API.
109-
<6> Required to allow remote enrichment. Without this, the user cannot read from the `.enrich` indices on the remote cluster. The `remote_cluster` security privilege was introduced in version *8.15.0*.
115+
<6> Required to allow remote enrichment. Without this, the user cannot read from the `.enrich` indices on the
116+
remote cluster. The `remote_cluster` security privilege was introduced in version *8.15.0*.
110117

111-
You will then need a user or API key with the permissions you created above. The following example API call creates a user with the `remote1` role.
118+
You will then need a user or API key with the permissions you created above. The following example API call creates
119+
a user with the `remote1` role.
112120

113121
[source,console]
114122
----
@@ -119,11 +127,13 @@ POST /_security/user/remote_user
119127
}
120128
----
121129

122-
Remember that all cross-cluster requests from the local cluster are bound by the cross cluster API key’s privileges, which are controlled by the remote cluster's administrator.
130+
Remember that all cross-cluster requests from the local cluster are bound by the cross cluster API key’s privileges,
131+
which are controlled by the remote cluster's administrator.
123132

124133
[TIP]
125134
====
126-
Cross cluster API keys created in versions prior to 8.15.0 will need to replaced or updated to add the new permissions required for {esql} with ENRICH.
135+
Cross cluster API keys created in versions prior to 8.15.0 will need to replaced or updated to add the new permissions
136+
required for {esql} with ENRICH.
127137
====
128138

129139
[discrete]
@@ -174,6 +184,189 @@ FROM *:my-index-000001
174184
| LIMIT 10
175185
----
176186

187+
[discrete]
188+
[[ccq-cluster-details]]
189+
==== Cross-cluster metadata
190+
191+
ES|QL {ccs} responses include metadata about the search on each cluster when the response format is JSON.
192+
Here we show an example using the async search endpoint. {ccs-cap} metadata is also present in the synchronous
193+
search endpoint.
194+
195+
[source,console]
196+
----
197+
POST /_query/async?format=json
198+
{
199+
"query": """
200+
FROM my-index-000001,cluster_one:my-index-000001,cluster_two:my-index*
201+
| STATS COUNT(http.response.status_code) BY user.id
202+
| LIMIT 2
203+
"""
204+
}
205+
----
206+
// TEST[setup:my_index]
207+
// TEST[s/cluster_one:my-index-000001,cluster_two:my-index//]
208+
209+
Which returns:
210+
211+
[source,console-result]
212+
----
213+
{
214+
"is_running": false,
215+
"took": 42, <1>
216+
"columns" : [
217+
{
218+
"name" : "COUNT(http.response.status_code)",
219+
"type" : "long"
220+
},
221+
{
222+
"name" : "user.id",
223+
"type" : "keyword"
224+
}
225+
],
226+
"values" : [
227+
[4, "elkbee"],
228+
[1, "kimchy"]
229+
],
230+
"_clusters": { <2>
231+
"total": 3,
232+
"successful": 3,
233+
"running": 0,
234+
"skipped": 0,
235+
"partial": 0,
236+
"failed": 0,
237+
"details": { <3>
238+
"(local)": { <4>
239+
"status": "successful",
240+
"indices": "blogs",
241+
"took": 36, <5>
242+
"_shards": { <6>
243+
"total": 13,
244+
"successful": 13,
245+
"skipped": 0,
246+
"failed": 0
247+
}
248+
},
249+
"cluster_one": {
250+
"status": "successful",
251+
"indices": "cluster_one:my-index-000001",
252+
"took": 38,
253+
"_shards": {
254+
"total": 4,
255+
"successful": 4,
256+
"skipped": 0,
257+
"failed": 0
258+
}
259+
},
260+
"cluster_two": {
261+
"status": "successful",
262+
"indices": "cluster_two:my-index*",
263+
"took": 41,
264+
"_shards": {
265+
"total": 18,
266+
"successful": 18,
267+
"skipped": 1,
268+
"failed": 0
269+
}
270+
}
271+
}
272+
}
273+
}
274+
----
275+
// TEST[skip: cross-cluster testing env not set up]
276+
277+
<1> How long the entire search (across all clusters) took, in milliseconds.
278+
<2> This section of counters shows all possible cluster search states and how many cluster
279+
searches are currently in that state. The clusters can have one of the following statuses: *running*,
280+
*successful* (searches on all shards were successful), *skipped* (the search
281+
failed on a cluster marked with `skip_unavailable`=`true`) or *failed* (the search
282+
failed on a cluster marked with `skip_unavailable`=`false`).
283+
<3> The `_clusters/details` section shows metadata about the search on each cluster.
284+
<4> If you included indices from the local cluster you sent the request to in your {ccs},
285+
it is identified as "(local)".
286+
<5> How long (in milliseconds) the search took on each cluster. This can be useful to determine
287+
which clusters have slower response times than others.
288+
<6> The shard details for the search on that cluster, including a count of shards that were
289+
skipped due to the can-match phase. Shards are skipped when they cannot have any matching data
290+
and therefore are not included in the full ES|QL query.
291+
292+
293+
The cross-cluster metadata can be used to determine whether any data came back from a cluster.
294+
For instance, in the query below, the wildcard expression for `cluster-two` did not resolve
295+
to a concrete index (or indices). The cluster is, therefore, marked as 'skipped' and the total
296+
number of shards searched is set to zero.
297+
Since the other cluster did have a matching index, the search did not return an error, but
298+
instead returned all the matching data it could find.
299+
300+
301+
[source,console]
302+
----
303+
POST /_query/async?format=json
304+
{
305+
"query": """
306+
FROM cluster_one:my-index*,cluster_two:logs*
307+
| STATS COUNT(http.response.status_code) BY user.id
308+
| LIMIT 2
309+
"""
310+
}
311+
----
312+
// TEST[continued]
313+
// TEST[s/cluster_one:my-index\*,cluster_two:logs\*/my-index-000001/]
314+
315+
Which returns:
316+
317+
[source,console-result]
318+
----
319+
{
320+
"is_running": false,
321+
"took": 55,
322+
"columns": [
323+
... // not shown
324+
],
325+
"values": [
326+
... // not shown
327+
],
328+
"_clusters": {
329+
"total": 2,
330+
"successful": 2,
331+
"running": 0,
332+
"skipped": 0,
333+
"partial": 0,
334+
"failed": 0,
335+
"details": {
336+
"cluster_one": {
337+
"status": "successful",
338+
"indices": "cluster_one:my-index*",
339+
"took": 38,
340+
"_shards": {
341+
"total": 4,
342+
"successful": 4,
343+
"skipped": 0,
344+
"failed": 0
345+
}
346+
},
347+
"cluster_two": {
348+
"status": "skipped", <1>
349+
"indices": "cluster_two:logs*",
350+
"took": 0,
351+
"_shards": {
352+
"total": 0, <2>
353+
"successful": 0,
354+
"skipped": 0,
355+
"failed": 0
356+
}
357+
}
358+
}
359+
}
360+
}
361+
----
362+
// TEST[skip: cross-cluster testing env not set up]
363+
364+
<1> This cluster is marked as 'skipped', since there were no matching indices on that cluster.
365+
<2> Indicates that no shards were searched (due to not having any matching indices).
366+
367+
368+
369+
177370
[discrete]
178371
[[ccq-enrich]]
179372
==== Enrich across clusters
@@ -331,8 +524,7 @@ setting. As a result, if a remote cluster specified in the request is
331524
unavailable or failed, {ccs} for {esql} queries will fail regardless of the setting.
332525

333526
We are actively working to align the behavior of {ccs} for {esql} with other
334-
{ccs} APIs. This includes providing detailed execution information for each cluster
335-
in the response, such as execution time, selected target indices, and shards.
527+
{ccs} APIs.
336528

337529
[discrete]
338530
[[ccq-during-upgrade]]

docs/reference/esql/esql-rest.asciidoc

Lines changed: 4 additions & 1 deletion
@@ -192,6 +192,7 @@ Which returns:
192192
[source,console-result]
193193
----
194194
{
195+
"took": 28,
195196
"columns": [
196197
{"name": "author", "type": "text"},
197198
{"name": "name", "type": "text"},
@@ -206,6 +207,7 @@ Which returns:
206207
]
207208
}
208209
----
210+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
209211

210212
[discrete]
211213
[[esql-locale-param]]
@@ -384,12 +386,13 @@ GET /_query/async/FmNJRUZ1YWZCU3dHY1BIOUhaenVSRkEaaXFlZ3h4c1RTWFNocDdnY2FSaERnUT
384386
// TEST[skip: no access to query ID - may return response values]
385387

386388
If the response's `is_running` value is `false`, the query has finished
387-
and the results are returned.
389+
and the results are returned, along with the `took` time for the query.
388390

389391
[source,console-result]
390392
----
391393
{
392394
"is_running": false,
395+
"took": 48,
393396
"columns": ...
394397
}
395398
----

docs/reference/esql/multivalued-fields.asciidoc

Lines changed: 15 additions & 1 deletion
@@ -26,6 +26,7 @@ Multivalued fields come back as a JSON array:
2626
[source,console-result]
2727
----
2828
{
29+
"took": 28,
2930
"columns": [
3031
{ "name": "a", "type": "long"},
3132
{ "name": "b", "type": "long"}
@@ -36,6 +37,8 @@ Multivalued fields come back as a JSON array:
3637
]
3738
}
3839
----
40+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
41+
3942

4043
The relative order of values in a multivalued field is undefined. They'll frequently be in
4144
ascending order but don't rely on that.
@@ -74,6 +77,7 @@ And {esql} sees that removal:
7477
[source,console-result]
7578
----
7679
{
80+
"took": 28,
7781
"columns": [
7882
{ "name": "a", "type": "long"},
7983
{ "name": "b", "type": "keyword"}
@@ -84,6 +88,8 @@ And {esql} sees that removal:
8488
]
8589
}
8690
----
91+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
92+
8793

8894
But other types, like `long` don't remove duplicates.
8995

@@ -115,6 +121,7 @@ And {esql} also sees that:
115121
[source,console-result]
116122
----
117123
{
124+
"took": 28,
118125
"columns": [
119126
{ "name": "a", "type": "long"},
120127
{ "name": "b", "type": "long"}
@@ -125,6 +132,8 @@ And {esql} also sees that:
125132
]
126133
}
127134
----
135+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
136+
128137

129138
This is all at the storage layer. If you store duplicate `long`s and then
130139
convert them to strings the duplicates will stay:
@@ -155,6 +164,7 @@ POST /_query
155164
[source,console-result]
156165
----
157166
{
167+
"took": 28,
158168
"columns": [
159169
{ "name": "a", "type": "long"},
160170
{ "name": "b", "type": "keyword"}
@@ -165,6 +175,7 @@ POST /_query
165175
]
166176
}
167177
----
178+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
168179

169180
[discrete]
170181
[[esql-multivalued-fields-functions]]
@@ -198,6 +209,7 @@ POST /_query
198209
[source,console-result]
199210
----
200211
{
212+
"took": 28,
201213
"columns": [
202214
{ "name": "a", "type": "long"},
203215
{ "name": "b", "type": "long"},
@@ -210,6 +222,7 @@ POST /_query
210222
]
211223
}
212224
----
225+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
213226

214227
Work around this limitation by converting the field to single value with one of:
215228

@@ -233,6 +246,7 @@ POST /_query
233246
[source,console-result]
234247
----
235248
{
249+
"took": 28,
236250
"columns": [
237251
{ "name": "a", "type": "long"},
238252
{ "name": "b", "type": "long"},
@@ -245,4 +259,4 @@ POST /_query
245259
]
246260
}
247261
----
248-
262+
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
