Skip to content
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -47,4 +47,8 @@ The following are the *expert-level* settings available for configuring an inter
`health.periodic_logger.poll_interval`
: ([Dynamic](docs-content://deploy-manage/stack-settings.md#dynamic-cluster-setting), [time unit value](/reference/elasticsearch/rest-apis/api-conventions.md#time-units)) How often {{es}} logs the health status of the cluster and of each health indicator as observed by the Health API. Defaults to `60s` (60 seconds).

`health.shard_capacity.unhealthy_threshold.yellow`
: ([Dynamic](docs-content://deploy-manage/stack-settings.md#dynamic-cluster-setting)) The minimum number of additional shards the cluster must still be able to allocate (on data or frozen nodes) for shard capacity health to remain `GREEN`. If fewer are available, health becomes `YELLOW` (unless already `RED`). Must be greater than or equal to `health.shard_capacity.unhealthy_threshold.red`. Defaults to `10`.

`health.shard_capacity.unhealthy_threshold.red`
: ([Dynamic](docs-content://deploy-manage/stack-settings.md#dynamic-cluster-setting)) The minimum number of additional shards the cluster must still be able to allocate (on data or frozen nodes) below which shard capacity health becomes `RED`. Must be less than or equal to `health.shard_capacity.unhealthy_threshold.yellow`. Defaults to `5`.
Original file line number Diff line number Diff line change
Expand Up @@ -23,3 +23,59 @@
- match: { indicators.master_is_stable.status: "green" }
- match: { indicators.master_is_stable.symptom: "The cluster has a stable master node" }
- is_false: indicators.master_is_stable.details
---
"cluster health test for shard capacity settings":
- requires:
cluster_features: [ "health.shard_capacity.unhealthy_threshold_settings" ]
reason: "these relevant settings are added in 9.3"
- do:
health_report:
feature: shards_capacity
- is_true: cluster_name
- match: { indicators.shards_capacity.status: "green" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do - match: { indicators.shards_capacity.status: "green" } here. Check out my PR and the related Serverless test failure issue: #109808.

So, I think we need to convert this to a Java REST test, which allows us to wait for the green indicator health. Maybe we could get away with changing this line from match to exist (like I did in that PR), but I'm worried the match: yellow below might still be flaky, and it'll reduce the value of the test slightly as we don't have a guaranteed starting point. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! I wonder why HealthMetadata needs to be cluster state metadata, they seem to just be a copy of these cluster settings. If we read the settings directly from cluster state, then it should eliminate the flakiness, but there are probably reason I don't understand yet. I posted in the team thread to discuss about this.


# set very large threshold to force the indicator to go yellow
- do:
cluster.put_settings:
body:
persistent:
health.shard_capacity.unhealthy_threshold.yellow: 100000000
flat_settings: true
- match: { persistent: { health.shard_capacity.unhealthy_threshold.yellow: "100000000" } }

- do:
health_report:
feature: shards_capacity
- is_true: cluster_name
- match: { indicators.shards_capacity.status: "yellow" }

# set very large threshold to force the indicator to go red
- do:
cluster.put_settings:
body:
persistent:
health.shard_capacity.unhealthy_threshold.red: 90000000
flat_settings: true
- match: { persistent: { health.shard_capacity.unhealthy_threshold.red: "90000000" } }

- do:
health_report:
feature: shards_capacity
- is_true: cluster_name
- match: { indicators.shards_capacity.status: "red" }

# set back to default
- do:
cluster.put_settings:
body:
persistent:
health.shard_capacity.unhealthy_threshold.yellow: 10
health.shard_capacity.unhealthy_threshold.red: 5
flat_settings: true
- match: { acknowledged: true }

- do:
health_report:
feature: shards_capacity
- is_true: cluster_name
- match: { indicators.shards_capacity.status: "green" }
3 changes: 2 additions & 1 deletion server/src/main/java/module-info.java
Original file line number Diff line number Diff line change
Expand Up @@ -436,7 +436,8 @@
org.elasticsearch.search.retriever.RetrieversFeatures,
org.elasticsearch.action.admin.cluster.stats.ClusterStatsFeatures,
org.elasticsearch.ingest.IngestFeatures,
org.elasticsearch.action.admin.indices.resolve.ResolveIndexFeatures;
org.elasticsearch.action.admin.indices.resolve.ResolveIndexFeatures,
org.elasticsearch.health.HealthFeatures;

uses org.elasticsearch.plugins.internal.SettingsExtension;
uses RestExtension;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,7 @@
import org.elasticsearch.gateway.PersistedClusterStateService;
import org.elasticsearch.health.HealthPeriodicLogger;
import org.elasticsearch.health.node.LocalHealthMonitor;
import org.elasticsearch.health.node.ShardsCapacityHealthIndicatorService;
import org.elasticsearch.health.node.action.TransportHealthNodeAction;
import org.elasticsearch.health.node.selection.HealthNodeTaskExecutor;
import org.elasticsearch.http.HttpTransportSettings;
Expand Down Expand Up @@ -650,6 +651,8 @@ public void apply(Settings value, Settings current, Settings previous) {
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_HIGH_UTILIZATION_THRESHOLD_SETTING,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_HIGH_UTILIZATION_DURATION_SETTING,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_QUEUE_LATENCY_THRESHOLD_SETTING,
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_REROUTE_INTERVAL_SETTING
WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_REROUTE_INTERVAL_SETTING,
ShardsCapacityHealthIndicatorService.SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW,
ShardsCapacityHealthIndicatorService.SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED
);
}
25 changes: 25 additions & 0 deletions server/src/main/java/org/elasticsearch/health/HealthFeatures.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the "Elastic License
* 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side
* Public License v 1"; you may not use this file except in compliance with, at
* your election, the "Elastic License 2.0", the "GNU Affero General Public
* License v3.0 only", or the "Server Side Public License, v 1".
*/

package org.elasticsearch.health;

import org.elasticsearch.features.FeatureSpecification;
import org.elasticsearch.features.NodeFeature;

import java.util.Set;

import static org.elasticsearch.health.node.ShardsCapacityHealthIndicatorService.SHARD_CAPACITY_UNHEALTHY_THRESHOLD_SETTINGS;

public class HealthFeatures implements FeatureSpecification {

@Override
public Set<NodeFeature> getTestFeatures() {
return Set.of(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_SETTINGS);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@
import org.elasticsearch.common.ReferenceDocs;
import org.elasticsearch.common.TriFunction;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.features.NodeFeature;
import org.elasticsearch.health.Diagnosis;
import org.elasticsearch.health.HealthIndicatorDetails;
import org.elasticsearch.health.HealthIndicatorImpact;
Expand All @@ -25,18 +27,24 @@
import org.elasticsearch.health.metadata.HealthMetadata;
import org.elasticsearch.indices.ShardLimitValidator;

import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Stream;

/**
* This indicator reports health data about the shard capacity across the cluster.
*
* <p>

Comment on lines -36 to +41
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formatting looks a bit off with an empty line in the middle of a JavaDoc. Could you revert the <p> line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the <p> tag make a difference in your IDE? The rendering looks the same in my Intellij.

* The indicator will report:
* * RED when there's room for less than 5 shards (either data or frozen nodes)
* * YELLOW when there's room for less than 10 shards (either data or frozen nodes)
* * GREEN otherwise
* </p>
* <ul>
* <li> {@code RED} when there's room for less than the configured {@code health.shard_capacity.unhealthy_threshold.red} (default 5) shards
* (either data or frozen nodes)
* <li> {@code YELLOW} when there's room for less than the configured {@code health.shard_capacity.unhealthy_threshold.yellow} (default 10)
* shards (either data or frozen nodes)
* <li> {@code GREEN} otherwise
* </ul>
*
* Although the `max_shard_per_node(.frozen)?` information is scoped by Node, we use the information from master because there is where
* the available room for new shards is checked before creating new indices.
Expand Down Expand Up @@ -89,10 +97,94 @@ public class ShardsCapacityHealthIndicatorService implements HealthIndicatorServ
"frozen"
);

public static final NodeFeature SHARD_CAPACITY_UNHEALTHY_THRESHOLD_SETTINGS = new NodeFeature(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we include _FEATURE somewhere in the field name to avoid confusion between this field and the actual settings?

"health.shard_capacity.unhealthy_threshold_setting"
);

public static final Setting<Integer> SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW = Setting.intSetting(
"health.shard_capacity.unhealthy_threshold.yellow",
10,
0,
new Setting.Validator<>() {
@Override
public void validate(Integer value) {}

@Override
public void validate(Integer value, Map<Setting<?>, Object> settings) {
Integer redThreshold = (Integer) settings.get(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED);
if (value < redThreshold) {
throw new IllegalArgumentException(
String.format(
Locale.ROOT,
"Setting [%s] (%d) must be greater than or equal to [%s] (%d)",
SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW.getKey(),
value,
SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED.getKey(),
redThreshold
)
);
}
}

@Override
public Iterator<Setting<?>> settings() {
final List<Setting<?>> settings = List.of(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED);
return settings.iterator();
}
},
Setting.Property.Dynamic,
Setting.Property.NodeScope
);

public static final Setting<Integer> SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED = Setting.intSetting(
"health.shard_capacity.unhealthy_threshold.red",
5,
0,
new Setting.Validator<>() {
@Override
public void validate(Integer value) {}

@Override
public void validate(Integer value, Map<Setting<?>, Object> settings) {
Integer yellowThreshold = (Integer) settings.get(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW);
if (value > yellowThreshold) {
throw new IllegalArgumentException(
String.format(
Locale.ROOT,
"Setting [%s] (%d) must be less than or equal to [%s] (%d)",
SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED.getKey(),
value,
SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW.getKey(),
yellowThreshold
)
);
}
}

@Override
public Iterator<Setting<?>> settings() {
final List<Setting<?>> settings = List.of(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW);
return settings.iterator();
}
},
Setting.Property.Dynamic,
Setting.Property.NodeScope
);

private final ClusterService clusterService;

public ShardsCapacityHealthIndicatorService(ClusterService clusterService) {
private int unhealthyThresholdYellow;
private int unhealthyThresholdRed;

public ShardsCapacityHealthIndicatorService(ClusterService clusterService, Settings settings) {
this.clusterService = clusterService;
this.unhealthyThresholdYellow = SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW.get(settings);
this.unhealthyThresholdRed = SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED.get(settings);

clusterService.getClusterSettings()
.addSettingsUpdateConsumer(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW, this::setUnhealthyThresholdYellow);
clusterService.getClusterSettings()
.addSettingsUpdateConsumer(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED, this::setUnhealthyThresholdRed);
}

@Override
Expand All @@ -115,13 +207,17 @@ public HealthIndicatorResult calculate(boolean verbose, int maxAffectedResources
shardLimitsMetadata.maxShardsPerNode(),
state.nodes(),
state.metadata(),
ShardLimitValidator::checkShardLimitForNormalNodes
ShardLimitValidator::checkShardLimitForNormalNodes,
unhealthyThresholdYellow,
unhealthyThresholdRed
),
calculateFrom(
shardLimitsMetadata.maxShardsPerNodeFrozen(),
state.nodes(),
state.metadata(),
ShardLimitValidator::checkShardLimitForFrozenNodes
ShardLimitValidator::checkShardLimitForFrozenNodes,
unhealthyThresholdYellow,
unhealthyThresholdRed
)
);
}
Expand Down Expand Up @@ -164,7 +260,9 @@ private HealthIndicatorResult mergeIndicators(boolean verbose, StatusResult data
return createIndicator(
finalStatus,
symptomBuilder.toString(),
verbose ? buildDetails(dataNodes.result, frozenNodes.result) : HealthIndicatorDetails.EMPTY,
verbose
? buildDetails(dataNodes.result, frozenNodes.result, unhealthyThresholdYellow, unhealthyThresholdRed)
: HealthIndicatorDetails.EMPTY,
indicatorImpacts,
verbose ? diagnoses : List.of()
);
Expand All @@ -174,22 +272,29 @@ static StatusResult calculateFrom(
int maxShardsPerNodeSetting,
DiscoveryNodes discoveryNodes,
Metadata metadata,
ShardsCapacityChecker checker
ShardsCapacityChecker checker,
int shardThresholdYellow,
int shardThresholdRed
) {
var result = checker.check(maxShardsPerNodeSetting, 5, 1, discoveryNodes, metadata);
var result = checker.check(maxShardsPerNodeSetting, shardThresholdRed, 1, discoveryNodes, metadata);
if (result.canAddShards() == false) {
return new StatusResult(HealthStatus.RED, result);
}

result = checker.check(maxShardsPerNodeSetting, 10, 1, discoveryNodes, metadata);
result = checker.check(maxShardsPerNodeSetting, shardThresholdYellow, 1, discoveryNodes, metadata);
if (result.canAddShards() == false) {
return new StatusResult(HealthStatus.YELLOW, result);
}

return new StatusResult(HealthStatus.GREEN, result);
}

static HealthIndicatorDetails buildDetails(ShardLimitValidator.Result dataNodes, ShardLimitValidator.Result frozenNodes) {
static HealthIndicatorDetails buildDetails(
ShardLimitValidator.Result dataNodes,
ShardLimitValidator.Result frozenNodes,
int unhealthyThresholdYellow,
int unhealthyThresholdRed
) {
return (builder, params) -> {
builder.startObject();
{
Expand All @@ -208,6 +313,12 @@ static HealthIndicatorDetails buildDetails(ShardLimitValidator.Result dataNodes,
}
builder.endObject();
}
{
builder.startObject("settings");
builder.field(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW.getKey(), unhealthyThresholdYellow);
builder.field(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED.getKey(), unhealthyThresholdRed);
builder.endObject();
}
Comment on lines +307 to +312
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you're including the settings in the indicator output? I don't immediately see the value tbh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's more for convenience of the health report. Since these thresholds are now dynamic settings, I thought it'd be good to report them as part of the health report. If someone's looking at previous health report logs they know what the thresholds were and don't need to guess if these settings have been changed.

builder.endObject();
return builder;
};
Expand All @@ -223,6 +334,14 @@ private HealthIndicatorResult unknownIndicator() {
);
}

private void setUnhealthyThresholdYellow(int value) {
this.unhealthyThresholdYellow = value;
}

private void setUnhealthyThresholdRed(int value) {
this.unhealthyThresholdRed = value;
}

record StatusResult(HealthStatus status, ShardLimitValidator.Result result) {}

@FunctionalInterface
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1469,7 +1469,7 @@ private Module loadDiagnosticServices(
new StableMasterHealthIndicatorService(coordinationDiagnosticsService, clusterService),
new RepositoryIntegrityHealthIndicatorService(clusterService),
new DiskHealthIndicatorService(clusterService),
new ShardsCapacityHealthIndicatorService(clusterService),
new ShardsCapacityHealthIndicatorService(clusterService, settings),
new FileSettingsHealthIndicatorService()
);
var pluginHealthIndicatorServices = pluginsService.filterPlugins(HealthPlugin.class)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,3 +21,4 @@ org.elasticsearch.cluster.routing.RoutingFeatures
org.elasticsearch.action.admin.cluster.stats.ClusterStatsFeatures
org.elasticsearch.ingest.IngestFeatures
org.elasticsearch.action.admin.indices.resolve.ResolveIndexFeatures
org.elasticsearch.health.HealthFeatures
Loading
Loading