@srikanthpadakanti
Contributor

@srikanthpadakanti srikanthpadakanti commented Jan 18, 2026

Description

This change fixes an issue where a node’s local term and version information could be truncated in cluster formation failure logs when host providers return very long IP addresses or host strings.

The truncation made critical coordination diagnostics difficult, especially in environments with custom or dynamic host providers that emit unusually large address values. This update ensures that the full local term and version information is preserved and logged correctly, improving observability and debuggability during cluster formation failures.

The fix includes:

  • Safer handling of long host provider address strings during log construction.
  • Adjustments to avoid truncation of local term and version fields.
  • Expanded test coverage to validate behavior with large host/IP inputs.

Related Issues

Resolves #19249

Check List

  • [x] Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@srikanthpadakanti srikanthpadakanti requested a review from a team as a code owner January 18, 2026 03:28
@coderabbitai
Contributor

coderabbitai bot commented Jan 18, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request adds a configurable limit for truncating cluster formation warning addresses in logs. A new setting DISCOVERY_CLUSTER_FORMATION_WARNING_ADDRESS_LIMIT_SETTING controls the maximum number of addresses displayed, with a utility method formatListForLog() implementing the truncation logic to prevent excessively long log messages from host providers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration & Settings: CHANGELOG.md, server/src/main/java/org/opensearch/cluster/coordination/ClusterFormationFailureHelper.java, server/src/main/java/org/opensearch/common/settings/ClusterSettings.java | Adds DISCOVERY_CLUSTER_FORMATION_WARNING_ADDRESS_LIMIT_SETTING (default: 200) to configure address truncation behavior, registers it in built-in cluster settings, and documents the change in the changelog |
| Core Implementation: server/src/main/java/org/opensearch/cluster/coordination/ClusterFormationFailureHelper.java | Introduces formatListForLog(List<?> items, int limit) utility method for truncating lists in log output; applies this method to format resolved addresses in cluster formation warnings using the new setting |
| Test Coverage: server/src/test/java/org/opensearch/cluster/coordination/ClusterFormationFailureHelperTests.java | Updates test expectations to reflect new log message structure; adds three new test methods: testFormatListForLogTruncates(), testDescriptionWithLongHostsProviderAddressesListTruncates(), and testDescriptionWithLongHostsProviderAddressesList() to validate truncation and formatting behavior |
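
To make the truncation described in this walkthrough concrete, here is a minimal sketch of what a `formatListForLog(List<?> items, int limit)` helper might look like. Only the method name and signature come from the summary above. The class name, the handling of empty lists and non-positive limits, and the `... (N more)` suffix are assumptions for illustration, not the PR's actual implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-alone class; in the PR the method lives in ClusterFormationFailureHelper.
public class FormatListForLogSketch {

    // Renders at most `limit` items and appends a count of the ones left out.
    // The output format is an assumption; the thread does not show the real one.
    public static String formatListForLog(List<?> items, int limit) {
        if (items == null || items.isEmpty()) {
            return "[]";
        }
        if (items.size() <= limit) {
            // Short enough: print the whole list unchanged.
            return items.toString();
        }
        int shown = Math.max(limit, 0);
        String head = items.stream()
            .limit(shown)
            .map(String::valueOf)
            .collect(Collectors.joining(", "));
        int remaining = items.size() - shown;
        String prefix = shown == 0 ? "" : head + ", ";
        return "[" + prefix + "... (" + remaining + " more)]";
    }

    public static void main(String[] args) {
        List<String> addresses = List.of("10.0.0.1:9300", "10.0.0.2:9300", "10.0.0.3:9300", "10.0.0.4:9300");
        // prints: [10.0.0.1:9300, 10.0.0.2:9300, ... (2 more)]
        System.out.println(formatListForLog(addresses, 2));
    }
}
```

A sample plus a remaining count keeps some context while bounding the line length, which matches the intent stated in the summary. The hard part is choosing the limit, not writing the helper.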

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main fix: addressing log truncation of node local term/version information when encountering long host provider addresses. |
| Description check | ✅ Passed | The pull request description includes all required sections: a clear description of changes, related issue reference (#19249), and a completed checklist with testing confirmed. |


@github-actions
Contributor

✅ Gradle check result for 95dfa41: SUCCESS

@codecov

codecov bot commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.19%. Comparing base (967c809) to head (1f5ceba).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20432      +/-   ##
============================================
- Coverage     73.23%   73.19%   -0.04%     
+ Complexity    71953    71938      -15     
============================================
  Files          5795     5796       +1     
  Lines        329248   329270      +22     
  Branches      47410    47415       +5     
============================================
- Hits         241122   241021     -101     
- Misses        68841    68904      +63     
- Partials      19285    19345      +60     


Setting.Property.NodeScope
);

public static final Setting<Integer> DISCOVERY_CLUSTER_FORMATION_WARNING_ADDRESS_LIMIT_SETTING = Setting.intSetting(
Member

I would really like to avoid the complexity of another setting. I think reordering the log message is probably sufficient. Log the critical data first, followed by the addresses and nodes. If the data gets truncated then at least you still have the term and version data. What do you think?
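
As a rough illustration of the reordering suggested here, the sketch below builds a message with the term and version first and the address list last, then cuts the line at a fixed length. The class, both method names, the message wording, and the 120-character cutoff are all hypothetical; the real message is assembled in ClusterFormationFailureHelper and is not shown in this thread.

```java
import java.util.Collections;
import java.util.List;

public class ReorderedLogMessageSketch {

    // Hypothetical message builder: critical fields first, unbounded list last.
    public static String describe(long currentTerm, long lastAcceptedVersion, List<String> resolvedAddresses) {
        return "node term " + currentTerm
            + ", last-accepted version " + lastAcceptedVersion
            + "; discovery will continue using " + resolvedAddresses
            + " from hosts providers";
    }

    // Stand-in for whatever downstream component cuts off long lines (an assumption).
    public static String truncate(String line, int maxChars) {
        return line.length() <= maxChars ? line : line.substring(0, maxChars);
    }

    public static void main(String[] args) {
        // Simulate a hosts provider that returns a very large address list.
        List<String> addresses = Collections.nCopies(1000, "10.0.0.1:9300");
        String line = truncate(describe(7, 42, addresses), 120);
        // The term and version survive because they come before the list.
        System.out.println(line);
    }
}
```

With this ordering, any truncation eats into the tail of the address list rather than the coordination state, so no new setting is needed to protect the critical fields.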

Contributor Author

@srikanthpadakanti srikanthpadakanti Jan 22, 2026

I agree reordering is the core fix, and I’ve already done that so term/version are logged first and remain visible even if the line truncates.

I also added bounded printing of the hosts-provider list since reordering alone can still produce very large, repeatedly emitted log lines when many addresses are present. This keeps logs readable and reduces noise while preserving context via a small sample and remaining count.
My preference is to keep this setting as is. But to your point, to keep the change minimal, I’m happy to drop the new setting and use a small fixed internal limit instead.

Let me know what you think.

Member

The problem with truncating the list is that it's really hard to know at what value to truncate. The original issue only mentioned losing important information, which is fixed by the reordering. If log noise is a problem, then I'd suggest potentially splitting the message across different log statements at different log levels. I'd be careful about bounding the list unless we know for sure that log noise is a problem. Maybe @SwethaGuptha can weigh in here?
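
For reference, splitting the message across log levels could look roughly like the sketch below. It uses `java.util.logging` only to stay self-contained; the logger used in the actual codebase, the method name `logFormationFailure`, and the message wording are assumptions.

```java
import java.util.List;
import java.util.logging.Logger;

public class SplitLevelLoggingSketch {

    // Hypothetical helper: a short, bounded summary at WARNING and the
    // potentially huge address list at FINE.
    public static void logFormationFailure(Logger logger, long term, long version, List<String> addresses) {
        // Always visible at the default level; length does not depend on the list size.
        logger.warning(
            "cluster formation incomplete: node term " + term
                + ", last-accepted version " + version
                + ", " + addresses.size() + " addresses from hosts providers"
        );
        // Only built and emitted when FINE is enabled, thanks to the Supplier overload.
        logger.fine(() -> "addresses from hosts providers: " + addresses);
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("cluster-formation-sketch");
        logFormationFailure(logger, 7, 42, List.of("10.0.0.1:9300", "10.0.0.2:9300"));
    }
}
```

This avoids picking a truncation threshold: operators who need the full list raise the log level, and everyone else sees only the bounded summary.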

Contributor Author

The original issue is addressed by reordering, and I don’t want to optimize further without clear evidence that log noise is a problem. To keep this minimal, I’ll drop the new setting and truncation logic and keep only the reordering so term and version are always logged first.

If log volume becomes an issue later, we can revisit bounded output or split logging in a followup. I’ll update the PR accordingly.

Srikanth Padakanti added 7 commits January 22, 2026 12:55
…ddress provided by host providers is huge

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
…ddress provided by host providers is huge

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
…ddress provided by host providers is huge

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti srikanthpadakanti force-pushed the bugfix/19249-term-version-log-truncation-oss branch from 0619656 to 3a95016 Compare January 22, 2026 18:57
@github-actions
Contributor

✅ Gradle check result for 3a95016: SUCCESS

Srikanth Padakanti and others added 2 commits January 23, 2026 15:12
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Contributor

✅ Gradle check result for 1f5ceba: SUCCESS

Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
@github-actions
Contributor

❌ Gradle check result for 41cec11: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

srikanthpadakanti and others added 2 commits January 26, 2026 12:31
Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti
Contributor Author

srikanthpadakanti commented Jan 26, 2026

Hello @andrross

The failure is in :distribution:docker:buildArm64DockerImage
Root cause: AlmaLinux mirror 404 during dnf install.

I re-ran the CI pipeline, but the issue persists. Any suggestions on how to address this?

@github-actions
Contributor

❌ Gradle check result for 043d82d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels

bug (Something isn't working), Cluster Manager

Development

Successfully merging this pull request may close these issues.

[BUG] Node's local term and version truncated when IP address provided by host providers is huge

2 participants