Skip to content

Conversation

samxbr
Copy link
Contributor

@samxbr samxbr commented Oct 7, 2025

Add dynamic settings to allow user to configure the unhealthy thresholds (yellow/red) of shard capacity health indicator. They replace the current hard-coded values:

  • health.shard_capacity.unhealthy_threshold.yellow (default 10)
  • health.shard_capacity.unhealthy_threshold.red (default 5)

Behavior:

  • both thresholds need to be positive integers
  • RED threshold needs to be smaller than YELLOW threshold, so the health indicator follows the order from GREEN -> YELLOW -> RED

Closes #116697

Copy link
Contributor

github-actions bot commented Oct 8, 2025

Copy link
Contributor

github-actions bot commented Oct 8, 2025

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

@samxbr samxbr requested a review from Copilot October 8, 2025 23:43
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces configurable thresholds for the shard capacity health indicator, allowing administrators to customize when the indicator reports YELLOW and RED statuses. Previously, these thresholds were hardcoded to 10 and 5 remaining shards respectively.

  • Adds two new dynamic cluster settings for configuring YELLOW and RED thresholds
  • Updates the ShardsCapacityHealthIndicatorService to use configurable thresholds instead of hardcoded values
  • Includes comprehensive test coverage and documentation for the new settings

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ShardsCapacityHealthIndicatorService.java Adds configurable threshold settings with validation and dynamic updates
ShardsCapacityHealthIndicatorServiceTests.java Updates all tests to use the new constructor and adds threshold validation tests
NodeConstruction.java Updates service instantiation to pass settings parameter
ClusterSettings.java Registers the new settings with cluster configuration
HealthFeatures.java New feature specification for the settings
module-info.java Exports the new HealthFeatures class
META-INF/services/org.elasticsearch.features.FeatureSpecification Registers the HealthFeatures service
health-diagnostic-settings.md Documents the new configuration settings
30_feature.yml Adds integration tests for the new settings functionality

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

…ityHealthIndicatorService.java

Co-authored-by: Copilot <[email protected]>
@elasticsearchmachine
Copy link
Collaborator

Hi @samxbr, I've created a changelog YAML for you.

@samxbr samxbr marked this pull request as ready for review October 9, 2025 15:02
@samxbr samxbr requested a review from a team as a code owner October 9, 2025 15:03
@samxbr samxbr requested a review from a team October 9, 2025 15:15
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Oct 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Contributor

@seanzatzdev seanzatzdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

health_report:
feature: shards_capacity
- is_true: cluster_name
- match: { indicators.shards_capacity.status: "green" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do - match: { indicators.shards_capacity.status: "green" } here. Check out my PR and the related Serverless test failure issue: #109808.

So, I think we need to convert this to a Java REST test, which allows us to wait for the green indicator health. Maybe we could get away with changing this line from match to exist (like I did in that PR), but I'm worried the match: yellow below might still be flaky, and it'll reduce the value of the test slightly as we don't have a guaranteed starting point. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! I wonder why HealthMetadata needs to be cluster state metadata, they seem to just be a copy of these cluster settings. If we read the settings directly from cluster state, then it should eliminate the flakiness, but there are probably reason I don't understand yet. I posted in the team thread to discuss about this.

Comment on lines -36 to +41
* <p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formatting looks a bit off with an empty line in the middle of a JavaDoc. Could you revert the <p> line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the <p> tag make a difference in your IDE? The rendering looks the same in my Intellij.

"frozen"
);

public static final NodeFeature SHARD_CAPACITY_UNHEALTHY_THRESHOLD_SETTINGS = new NodeFeature(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we include _FEATURE somewhere in the field name to avoid confusion between this field and the actual settings?

Comment on lines +307 to +312
{
builder.startObject("settings");
builder.field(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_YELLOW.getKey(), unhealthyThresholdYellow);
builder.field(SHARD_CAPACITY_UNHEALTHY_THRESHOLD_RED.getKey(), unhealthyThresholdRed);
builder.endObject();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you're including the settings in the indicator output? I don't immediately see the value tbh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's more for convenience of the health report. Since these thresholds are now dynamic settings, I thought it'd be good to report them as part of the health report. If someone's looking at previous health report logs they know what the thresholds were and don't need to guess if these settings have been changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ShardCapacity Health Indicator could use dynamic or user-configurable thresholds

4 participants