-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Fix node local term/version log truncation with long host provider addresses #20432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix node local term/version log truncation with long host provider addresses #20432
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
📝 WalkthroughWalkthroughThis pull request adds a configurable limit for truncating cluster formation warning addresses in logs. A new setting Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
6e605e6 to
95dfa41
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #20432 +/- ##
============================================
- Coverage 73.23% 73.19% -0.04%
+ Complexity 71953 71938 -15
============================================
Files 5795 5796 +1
Lines 329248 329270 +22
Branches 47410 47415 +5
============================================
- Hits 241122 241021 -101
- Misses 68841 68904 +63
- Partials 19285 19345 +60 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| Setting.Property.NodeScope | ||
| ); | ||
|
|
||
| public static final Setting<Integer> DISCOVERY_CLUSTER_FORMATION_WARNING_ADDRESS_LIMIT_SETTING = Setting.intSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really like to avoid the complexity of another setting. I think reordering the log message is probably sufficient. Log the critical data first, followed by the addresses and nodes. If the data gets truncated then at least you still have the term and version data. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree reordering is the core fix, and I’ve already done that so term/version are logged first and remain visible even if the line truncates.
I also added bounded printing of the hosts-provider list since reordering alone can still produce very large, repeatedly emitted log lines when many addresses are present. This keeps logs readable and reduces noise while preserving context via a small sample and remaining count.
If you ask me, I want to keep this setting as is. But to your point, to keep this minimal, I’m happy to drop the new setting and use a small fixed internal limit instead.
Let me know your 2cents on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with truncating the list is that it's really hard to know at what value to truncate. The original issue only mentioned loosing important information, which is fixed by the reordering. If log noise is a problem then I'd suggest potentially splitting the message across different log statements at different log levels. I'd be careful about bounding the list unless we know for sure that log noise is a problem. Maybe @SwethaGuptha can weigh in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original issue is addressed by reordering, and I don’t want to optimize further without clear evidence that log noise is a problem. To keep this minimal, I’ll drop the new setting and truncation logic and keep only the reordering so term and version are always logged first.
If log volume becomes an issue later, we can revisit bounded output or split logging in a followup. I’ll update the PR accordingly.
…ddress provided by host providers is huge Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
…ddress provided by host providers is huge Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
…ddress provided by host providers is huge Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
0619656 to
3a95016
Compare
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
…-log-truncation-oss
Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
|
❌ Gradle check result for 41cec11: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
|
Hello @andrross The failure is in :distribution:docker:buildArm64DockerImage Re-ran the CI pipeline but issue still persists. Any inputs on, how to address this? |
|
❌ Gradle check result for 043d82d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
This change fixes an issue where a node’s local term and version information could be truncated in cluster formation failure logs when host providers return very long IP addresses or host strings.
The truncation made critical coordination diagnostics difficult, especially in environments with custom or dynamic host providers that emit unusually large address values. This update ensures that the full local term and version information is preserved and logged correctly, improving observability and debuggability during cluster formation failures.
The fix includes:
Related Issues
Resolves #19249
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.