Skip to content

Commit 9081a95

Browse files
authored
Implement CCS telemetry export as part of _cluster/stats (#112310)
* Implement CCS telemetry export as part of _cluster/stats
1 parent 7784d4f commit 9081a95

File tree

10 files changed

+248
-10
lines changed

10 files changed

+248
-10
lines changed

docs/reference/cluster/stats.asciidoc

Lines changed: 171 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1307,6 +1307,142 @@ Each repository type may also include other statistics about the repositories of
13071307
13081308
====
13091309

1310+
`ccs`::
1311+
(object) Contains information relating to <<modules-cross-cluster-search, {ccs}>> settings and activity in the cluster.
1312+
+
1313+
.Properties of `ccs`
1314+
[%collapsible%open]
1315+
=====
1316+
1317+
1318+
`_search`:::
1319+
(object) Contains the telemetry information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
1320+
+
1321+
.Properties of `_search`
1322+
[%collapsible%open]
1323+
======
1324+
`total`:::
1325+
(integer) The total number of {ccs} requests that have been executed by the cluster.
1326+
1327+
`success`:::
1328+
(integer) The total number of {ccs} requests that have been successfully executed by the cluster.
1329+
1330+
`skipped`:::
1331+
(integer) The total number of {ccs} requests (successful or failed) that had at least one remote cluster skipped.
1332+
1333+
`took`:::
1334+
(object) Contains statistics about the time taken to execute {ccs} requests.
1335+
+
1336+
.Properties of `took`
1337+
[%collapsible%open]
1338+
=======
1339+
`max`:::
1340+
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.
1341+
1342+
`avg`:::
1343+
(integer) The median time taken to execute a {ccs} request, in milliseconds.
1344+
1345+
`p90`:::
1346+
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
1347+
=======
1348+
1349+
`took_mrt_true`::
1350+
(object) Contains statistics about the time taken to execute {ccs} requests for which the
1351+
<<ccs-minimize-roundtrips,`ccs_minimize_roundtrips`>> setting was set to `true`.
1352+
+
1353+
.Properties of `took_mrt_true`
1354+
[%collapsible%open]
1355+
=======
1356+
`max`:::
1357+
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.
1358+
1359+
`avg`:::
1360+
(integer) The median time taken to execute a {ccs} request, in milliseconds.
1361+
1362+
`p90`:::
1363+
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
1364+
=======
1365+
1366+
`took_mrt_false`::
1367+
(object) Contains statistics about the time taken to execute {ccs} requests for which the
1368+
<<ccs-minimize-roundtrips,`ccs_minimize_roundtrips`>> setting was set to `false`.
1369+
+
1370+
.Properties of `took_mrt_false`
1371+
[%collapsible%open]
1372+
=======
1373+
`max`:::
1374+
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.
1375+
1376+
`avg`:::
1377+
(integer) The median time taken to execute a {ccs} request, in milliseconds.
1378+
1379+
`p90`:::
1380+
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
1381+
=======
1382+
1383+
`remotes_per_search_max`::
1384+
(integer) The maximum number of remote clusters that were queried in a single {ccs} request.
1385+
1386+
`remotes_per_search_avg`::
1387+
(float) The average number of remote clusters that were queried in a single {ccs} request.
1388+
1389+
`failure_reasons`::
1390+
(object) Contains statistics about the reasons for {ccs} request failures.
1391+
The keys are the failure reason names and the values are the number of requests that failed for that reason.
1392+
1393+
`features`::
1394+
(object) Contains statistics about the features used in {ccs} requests. The keys are the names of the search feature,
1395+
and the values are the number of requests that used that feature. Single request can use more than one feature
1396+
(e.g. both `async` and `wildcard`). Known features are:
1397+
1398+
* `async` - <<async-search, Async search>>
1399+
1400+
* `mrt` - <<ccs-minimize-roundtrips,`ccs_minimize_roundtrips`>> setting was set to `true`.
1401+
1402+
* `wildcard` - <<api-multi-index,Multi-target syntax>> for indices with wildcards was used in the search request.
1403+
1404+
`clients`::
1405+
(object) Contains statistics about the clients that executed {ccs} requests.
1406+
The keys are the names of the clients, and the values are the number of requests that were executed by that client.
1407+
Only known clients (such as `kibana` or `elasticsearch`) are counted.
1408+
1409+
`clusters`::
1410+
(object) Contains statistics about the clusters that were queried in {ccs} requests.
1411+
The keys are cluster names, and the values are per-cluster telemetry data.
1412+
This also includes the local cluster itself, which uses the name `(local)`.
1413+
+
1414+
.Properties of per-cluster data:
1415+
[%collapsible%open]
1416+
=======
1417+
`total`:::
1418+
(integer) The total number of successful (not skipped) {ccs} requests that were executed against this cluster.
1419+
This may include requests where partial results were returned, but not requests in which the cluster has been skipped entirely.
1420+
1421+
`skipped`:::
1422+
(integer) The total number of {ccs} requests for which this cluster was skipped.
1423+
1424+
`took`:::
1425+
(object) Contains statistics about the time taken to execute requests against this cluster.
1426+
+
1427+
.Properties of `took`
1428+
[%collapsible%open]
1429+
========
1430+
`max`:::
1431+
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.
1432+
1433+
`avg`:::
1434+
(integer) The median time taken to execute a {ccs} request, in milliseconds.
1435+
1436+
`p90`:::
1437+
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
1438+
========
1439+
1440+
=======
1441+
1442+
======
1443+
1444+
=====
1445+
13101446
[[cluster-stats-api-example]]
13111447
==== {api-examples-title}
13121448

@@ -1607,7 +1743,35 @@ The API returns the following response:
16071743
},
16081744
"repositories": {
16091745
...
1610-
}
1746+
},
1747+
"ccs": {
1748+
"_search": {
1749+
"total": 7,
1750+
"success": 7,
1751+
"skipped": 0,
1752+
"took": {
1753+
"max": 36,
1754+
"avg": 20,
1755+
"p90": 33
1756+
},
1757+
"took_mrt_true": {
1758+
"max": 33,
1759+
"avg": 15,
1760+
"p90": 33
1761+
},
1762+
"took_mrt_false": {
1763+
"max": 36,
1764+
"avg": 26,
1765+
"p90": 36
1766+
},
1767+
"remotes_per_search_max": 3,
1768+
"remotes_per_search_avg": 2.0,
1769+
"failure_reasons": { ... },
1770+
"features": { ... },
1771+
"clients": { ... },
1772+
"clusters": { ... }
1773+
}
1774+
}
16111775
}
16121776
--------------------------------------------------
16131777
// TESTRESPONSE[s/"plugins": \[[^\]]*\]/"plugins": $body.$_path/]
@@ -1618,10 +1782,15 @@ The API returns the following response:
16181782
// TESTRESPONSE[s/"packaging_types": \[[^\]]*\]/"packaging_types": $body.$_path/]
16191783
// TESTRESPONSE[s/"snapshots": \{[^\}]*\}/"snapshots": $body.$_path/]
16201784
// TESTRESPONSE[s/"repositories": \{[^\}]*\}/"repositories": $body.$_path/]
1785+
// TESTRESPONSE[s/"clusters": \{[^\}]*\}/"clusters": $body.$_path/]
1786+
// TESTRESPONSE[s/"features": \{[^\}]*\}/"features": $body.$_path/]
1787+
// TESTRESPONSE[s/"clients": \{[^\}]*\}/"clients": $body.$_path/]
1788+
// TESTRESPONSE[s/"failure_reasons": \{[^\}]*\}/"failure_reasons": $body.$_path/]
16211789
// TESTRESPONSE[s/"field_types": \[[^\]]*\]/"field_types": $body.$_path/]
16221790
// TESTRESPONSE[s/"runtime_field_types": \[[^\]]*\]/"runtime_field_types": $body.$_path/]
16231791
// TESTRESPONSE[s/"search": \{[^\}]*\}/"search": $body.$_path/]
1624-
// TESTRESPONSE[s/: true|false/: $body.$_path/]
1792+
// TESTRESPONSE[s/"remotes_per_search_avg": [.0-9]+/"remotes_per_search_avg": $body.$_path/]
1793+
// TESTRESPONSE[s/: (true|false)/: $body.$_path/]
16251794
// TESTRESPONSE[s/: (\-)?[0-9]+/: $body.$_path/]
16261795
// TESTRESPONSE[s/: "[^"]*"/: $body.$_path/]
16271796
// These replacements do a few things:

server/src/main/java/org/elasticsearch/TransportVersions.java

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -207,6 +207,7 @@ static TransportVersion def(int id) {
207207
public static final TransportVersion UNASSIGNED_PRIMARY_COUNT_ON_CLUSTER_HEALTH = def(8_737_00_0);
208208
public static final TransportVersion ESQL_AGGREGATE_EXEC_TRACKS_INTERMEDIATE_ATTRS = def(8_738_00_0);
209209

210+
public static final TransportVersion CCS_TELEMETRY_STATS = def(8_739_00_0);
210211
/*
211212
* STOP! READ THIS FIRST! No, really,
212213
* ____ _____ ___ ____ _ ____ _____ _ ____ _____ _ _ ___ ____ _____ ___ ____ ____ _____ _

server/src/main/java/org/elasticsearch/action/admin/cluster/stats/CCSTelemetrySnapshot.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -277,7 +277,7 @@ public int hashCode() {
277277
*/
278278
public void add(CCSTelemetrySnapshot stats) {
279279
// This should be called in ClusterStatsResponse ctor only, so we don't need to worry about concurrency
280-
if (stats.totalCount == 0) {
280+
if (stats == null || stats.totalCount == 0) {
281281
// Just ignore the empty stats.
282282
// This could happen if the node is brand new or if the stats are not available, e.g. because it runs an old version.
283283
return;
@@ -315,7 +315,7 @@ public void add(CCSTelemetrySnapshot stats) {
315315
* "p90": 2570
316316
* }
317317
*/
318-
public static void publishLatency(XContentBuilder builder, String name, LongMetricValue took) throws IOException {
318+
private static void publishLatency(XContentBuilder builder, String name, LongMetricValue took) throws IOException {
319319
builder.startObject(name);
320320
{
321321
builder.field("max", took.max());

server/src/main/java/org/elasticsearch/action/admin/cluster/stats/CCSUsageTelemetry.java

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -175,7 +175,7 @@ public static class PerClusterCCSTelemetry {
175175
// The number of successful (not skipped) requests to this cluster.
176176
private final LongAdder count;
177177
private final LongAdder skippedCount;
178-
// This is only over the successful requetss, skipped ones do not count here.
178+
// This is only over the successful requests, skipped ones do not count here.
179179
private final LongMetric took;
180180

181181
PerClusterCCSTelemetry(String clusterAlias) {

server/src/main/java/org/elasticsearch/action/admin/cluster/stats/ClusterStatsNodeResponse.java

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,7 @@ public class ClusterStatsNodeResponse extends BaseNodeResponse {
3030
private final ClusterHealthStatus clusterStatus;
3131
private final SearchUsageStats searchUsageStats;
3232
private final RepositoryUsageStats repositoryUsageStats;
33+
private final CCSTelemetrySnapshot ccsMetrics;
3334

3435
public ClusterStatsNodeResponse(StreamInput in) throws IOException {
3536
super(in);
@@ -47,6 +48,11 @@ public ClusterStatsNodeResponse(StreamInput in) throws IOException {
4748
} else {
4849
repositoryUsageStats = RepositoryUsageStats.EMPTY;
4950
}
51+
if (in.getTransportVersion().onOrAfter(TransportVersions.CCS_TELEMETRY_STATS)) {
52+
ccsMetrics = new CCSTelemetrySnapshot(in);
53+
} else {
54+
ccsMetrics = new CCSTelemetrySnapshot();
55+
}
5056
}
5157

5258
public ClusterStatsNodeResponse(
@@ -56,7 +62,8 @@ public ClusterStatsNodeResponse(
5662
NodeStats nodeStats,
5763
ShardStats[] shardsStats,
5864
SearchUsageStats searchUsageStats,
59-
RepositoryUsageStats repositoryUsageStats
65+
RepositoryUsageStats repositoryUsageStats,
66+
CCSTelemetrySnapshot ccsTelemetrySnapshot
6067
) {
6168
super(node);
6269
this.nodeInfo = nodeInfo;
@@ -65,6 +72,7 @@ public ClusterStatsNodeResponse(
6572
this.clusterStatus = clusterStatus;
6673
this.searchUsageStats = Objects.requireNonNull(searchUsageStats);
6774
this.repositoryUsageStats = Objects.requireNonNull(repositoryUsageStats);
75+
this.ccsMetrics = ccsTelemetrySnapshot;
6876
}
6977

7078
public NodeInfo nodeInfo() {
@@ -95,6 +103,10 @@ public RepositoryUsageStats repositoryUsageStats() {
95103
return repositoryUsageStats;
96104
}
97105

106+
public CCSTelemetrySnapshot getCcsMetrics() {
107+
return ccsMetrics;
108+
}
109+
98110
@Override
99111
public void writeTo(StreamOutput out) throws IOException {
100112
super.writeTo(out);
@@ -108,5 +120,9 @@ public void writeTo(StreamOutput out) throws IOException {
108120
if (out.getTransportVersion().onOrAfter(TransportVersions.REPOSITORIES_TELEMETRY)) {
109121
repositoryUsageStats.writeTo(out);
110122
} // else just drop these stats, ok for bwc
123+
if (out.getTransportVersion().onOrAfter(TransportVersions.CCS_TELEMETRY_STATS)) {
124+
ccsMetrics.writeTo(out);
125+
}
111126
}
127+
112128
}

server/src/main/java/org/elasticsearch/action/admin/cluster/stats/ClusterStatsResponse.java

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,13 +24,17 @@
2424
import java.util.List;
2525
import java.util.Locale;
2626

27+
import static org.elasticsearch.action.search.TransportSearchAction.CCS_TELEMETRY_FEATURE_FLAG;
28+
2729
public class ClusterStatsResponse extends BaseNodesResponse<ClusterStatsNodeResponse> implements ToXContentFragment {
2830

2931
final ClusterStatsNodes nodesStats;
3032
final ClusterStatsIndices indicesStats;
3133
final ClusterHealthStatus status;
3234
final ClusterSnapshotStats clusterSnapshotStats;
3335
final RepositoryUsageStats repositoryUsageStats;
36+
37+
final CCSTelemetrySnapshot ccsMetrics;
3438
final long timestamp;
3539
final String clusterUUID;
3640

@@ -50,6 +54,7 @@ public ClusterStatsResponse(
5054
this.timestamp = timestamp;
5155
nodesStats = new ClusterStatsNodes(nodes);
5256
indicesStats = new ClusterStatsIndices(nodes, mappingStats, analysisStats, versionStats);
57+
ccsMetrics = new CCSTelemetrySnapshot();
5358
ClusterHealthStatus status = null;
5459
for (ClusterStatsNodeResponse response : nodes) {
5560
// only the master node populates the status
@@ -58,6 +63,7 @@ public ClusterStatsResponse(
5863
break;
5964
}
6065
}
66+
nodes.forEach(node -> ccsMetrics.add(node.getCcsMetrics()));
6167
this.status = status;
6268
this.clusterSnapshotStats = clusterSnapshotStats;
6369

@@ -90,6 +96,10 @@ public ClusterStatsIndices getIndicesStats() {
9096
return indicesStats;
9197
}
9298

99+
public CCSTelemetrySnapshot getCcsMetrics() {
100+
return ccsMetrics;
101+
}
102+
93103
@Override
94104
public void writeTo(StreamOutput out) throws IOException {
95105
TransportAction.localOnly();
@@ -125,6 +135,12 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
125135
builder.field("repositories");
126136
repositoryUsageStats.toXContent(builder, params);
127137

138+
if (CCS_TELEMETRY_FEATURE_FLAG.isEnabled()) {
139+
builder.startObject("ccs");
140+
ccsMetrics.toXContent(builder, params);
141+
builder.endObject();
142+
}
143+
128144
return builder;
129145
}
130146

server/src/main/java/org/elasticsearch/action/admin/cluster/stats/TransportClusterStatsAction.java

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,7 @@ public class TransportClusterStatsAction extends TransportNodesAction<
8181
private final IndicesService indicesService;
8282
private final RepositoriesService repositoriesService;
8383
private final SearchUsageHolder searchUsageHolder;
84+
private final CCSUsageTelemetry ccsUsageHolder;
8485

8586
private final MetadataStatsCache<MappingStats> mappingStatsCache;
8687
private final MetadataStatsCache<AnalysisStats> analysisStatsCache;
@@ -108,6 +109,7 @@ public TransportClusterStatsAction(
108109
this.indicesService = indicesService;
109110
this.repositoriesService = repositoriesService;
110111
this.searchUsageHolder = usageService.getSearchUsageHolder();
112+
this.ccsUsageHolder = usageService.getCcsUsageHolder();
111113
this.mappingStatsCache = new MetadataStatsCache<>(threadPool.getThreadContext(), MappingStats::of);
112114
this.analysisStatsCache = new MetadataStatsCache<>(threadPool.getThreadContext(), AnalysisStats::of);
113115
}
@@ -249,6 +251,7 @@ protected ClusterStatsNodeResponse nodeOperation(ClusterStatsNodeRequest nodeReq
249251
final SearchUsageStats searchUsageStats = searchUsageHolder.getSearchUsageStats();
250252

251253
final RepositoryUsageStats repositoryUsageStats = repositoriesService.getUsageStats();
254+
final CCSTelemetrySnapshot ccsTelemetry = ccsUsageHolder.getCCSTelemetrySnapshot();
252255

253256
return new ClusterStatsNodeResponse(
254257
nodeInfo.getNode(),
@@ -257,7 +260,8 @@ protected ClusterStatsNodeResponse nodeOperation(ClusterStatsNodeRequest nodeReq
257260
nodeStats,
258261
shardsStats.toArray(new ShardStats[shardsStats.size()]),
259262
searchUsageStats,
260-
repositoryUsageStats
263+
repositoryUsageStats,
264+
ccsTelemetry
261265
);
262266
}
263267

server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -127,7 +127,7 @@ public class TransportSearchAction extends HandledTransportAction<SearchRequest,
127127
public static final String FROZEN_INDICES_DEPRECATION_MESSAGE = "Searching frozen indices [{}] is deprecated."
128128
+ " Consider cold or frozen tiers in place of frozen indices. The frozen feature will be removed in a feature release.";
129129

130-
private static final FeatureFlag CCS_TELEMETRY_FEATURE_FLAG = new FeatureFlag("ccs_telemetry");
130+
public static final FeatureFlag CCS_TELEMETRY_FEATURE_FLAG = new FeatureFlag("ccs_telemetry");
131131

132132
/** The maximum number of shards for a single search request. */
133133
public static final Setting<Long> SHARD_COUNT_LIMIT_SETTING = Setting.longSetting(

0 commit comments

Comments
 (0)