-
Notifications
You must be signed in to change notification settings - Fork 126
📖 add troubleshooting note for MTU mismatch bootstrap failure #1320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
📖 add troubleshooting note for MTU mismatch bootstrap failure #1320
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Shrushti72 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughAdds a local MTU detection helper and a pre-check in CreateSelfSubjectAccessReviews to fail when a detected MTU exceeds 1500, and adds a troubleshooting guide documenting TLS handshake timeouts caused by MTU mismatches. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @pkg/common/helpers/ssar.go:
- Around line 42-50: The MTU fail-fast in the detectLocalMTU branch (call to
detectLocalMTU) should be made configurable: add support for an opt-in skip
flag/annotation (e.g., klusterlet.io/skip-mtu-check: "true") or a function
parameter (skipMTUCheck bool) and if that flag/annotation is present return
normally (no error) and log a warning instead of failing; otherwise keep the
current behavior but change the processLogger call to emit a warning rather than
returning fmt.Errorf so false positives on loopback/jumbo interfaces don't block
bootstrap. Ensure you check the annotation/flag before the error return in the
code path that currently returns fmt.Errorf and reference detectLocalMTU and the
surrounding SSAR/validation function when implementing the flag lookup and
logging change.
- Around line 13-34: The detectLocalMTU function currently returns the maximum
MTU across all UP interfaces (leading to loopback 65536 false positives); update
detectLocalMTU to skip loopback interfaces (check iface.Flags&net.FlagLoopback)
and only consider non-loopback UP interfaces when computing MTU, and instead of
taking the max choose the MTU of the interface used to reach the hub by
resolving the hub endpoint and inspecting the route/interface for that
destination (or, if route-based detection is out of scope, make the check opt-in
via a config/flag and document the limitation); ensure you update the logic in
detectLocalMTU to return an error when no suitable non-loopback interface is
found.
In @troubleshooting/README.md:
- Around line 33-34: Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link for consistency and accessibility; update the line that currently reads the
plain URL (the "Related issue:" entry) to use a descriptive markdown link such
as "Related issue: [open-cluster-management issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)" so the
link text is meaningful while the URL remains the same.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/common/helpers/ssar.gotroubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.
Applied to files:
troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md
34-34: Bare URL used
(MD034, no-bare-urls)
🔇 Additional comments (2)
troubleshooting/README.md (1)
1-25: LGTM! Clear and helpful troubleshooting guidance.The issue description and symptoms are well-articulated. The documentation effectively explains the MTU mismatch problem and its impact on klusterlet bootstrap and certificate rotation.
pkg/common/helpers/ssar.go (1)
54-56: LGTM! Formatting improvement.The multi-line formatting improves readability without changing functionality.
| func detectLocalMTU() (int, error) { | ||
| ifaces, err := net.Interfaces() | ||
| if err != nil { | ||
| return 0, err | ||
| } | ||
|
|
||
| maxMTU := 0 | ||
| for _, iface := range ifaces { | ||
| if iface.Flags&net.FlagUp == 0 { | ||
| continue | ||
| } | ||
| if iface.MTU > maxMTU { | ||
| maxMTU = iface.MTU | ||
| } | ||
| } | ||
|
|
||
| if maxMTU == 0 { | ||
| return 0, fmt.Errorf("unable to detect MTU") | ||
| } | ||
|
|
||
| return maxMTU, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Loopback interface causes false positives.
The function returns the maximum MTU across all UP interfaces, which will almost always include the loopback interface (lo) with MTU 65536. This causes false positives for virtually all systems, blocking legitimate configurations where the actual hub-facing network interface has standard MTU.
Additionally, even excluding loopback, using the maximum MTU is incorrect—you need the MTU of the specific interface used to communicate with the hub, not the highest MTU among all interfaces.
Recommended approach:
To accurately detect MTU mismatch, you should:
- Exclude the loopback interface from detection
- Consider detecting the MTU of the route to the hub API server endpoint, or
- Document this as a known limitation and make the check opt-in via a flag/annotation
🔧 Quick fix: Exclude loopback interface
func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
if err != nil {
return 0, err
}
maxMTU := 0
for _, iface := range ifaces {
- if iface.Flags&net.FlagUp == 0 {
+ // Skip interfaces that are down or loopback
+ if iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagLoopback != 0 {
continue
}
if iface.MTU > maxMTU {
maxMTU = iface.MTU
}
}
if maxMTU == 0 {
return 0, fmt.Errorf("unable to detect MTU")
}
return maxMTU, nil
}Note: Even with this fix, using the maximum MTU across all non-loopback interfaces may still produce false positives in environments with multiple networks (e.g., a high-MTU storage network and a standard-MTU management network).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func detectLocalMTU() (int, error) { | |
| ifaces, err := net.Interfaces() | |
| if err != nil { | |
| return 0, err | |
| } | |
| maxMTU := 0 | |
| for _, iface := range ifaces { | |
| if iface.Flags&net.FlagUp == 0 { | |
| continue | |
| } | |
| if iface.MTU > maxMTU { | |
| maxMTU = iface.MTU | |
| } | |
| } | |
| if maxMTU == 0 { | |
| return 0, fmt.Errorf("unable to detect MTU") | |
| } | |
| return maxMTU, nil | |
| } | |
| func detectLocalMTU() (int, error) { | |
| ifaces, err := net.Interfaces() | |
| if err != nil { | |
| return 0, err | |
| } | |
| maxMTU := 0 | |
| for _, iface := range ifaces { | |
| // Skip interfaces that are down or loopback | |
| if iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagLoopback != 0 { | |
| continue | |
| } | |
| if iface.MTU > maxMTU { | |
| maxMTU = iface.MTU | |
| } | |
| } | |
| if maxMTU == 0 { | |
| return 0, fmt.Errorf("unable to detect MTU") | |
| } | |
| return maxMTU, nil | |
| } |
🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 13 - 34, The detectLocalMTU function
currently returns the maximum MTU across all UP interfaces (leading to loopback
65536 false positives); update detectLocalMTU to skip loopback interfaces (check
iface.Flags&net.FlagLoopback) and only consider non-loopback UP interfaces when
computing MTU, and instead of taking the max choose the MTU of the interface
used to reach the hub by resolving the hub endpoint and inspecting the
route/interface for that destination (or, if route-based detection is out of
scope, make the check opt-in via a config/flag and document the limitation);
ensure you update the logic in detectLocalMTU to return an error when no
suitable non-loopback interface is found.
| mtu, err := detectLocalMTU() | ||
| if err == nil && mtu > 1500 { | ||
| return false, nil, fmt.Errorf( | ||
| "MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+ | ||
| "This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+ | ||
| "Please align MTU values between hub and managed cluster.", | ||
| mtu, | ||
| ) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making the MTU check configurable.
While the fail-fast approach is well-intentioned, the flawed MTU detection (including loopback interfaces) will block many legitimate configurations during bootstrap. This creates a poor user experience for environments with jumbo frames on non-hub-facing interfaces or high-MTU loopback.
Consider making this check:
- Optional via an annotation/flag (e.g.,
klusterlet.io/skip-mtu-check: "true"), or - Logged as a warning instead of blocking bootstrap, allowing users to investigate if they encounter TLS timeout issues
This preserves the troubleshooting value while avoiding false-positive failures.
🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 42 - 50, The MTU fail-fast in the
detectLocalMTU branch (call to detectLocalMTU) should be made configurable: add
support for an opt-in skip flag/annotation (e.g., klusterlet.io/skip-mtu-check:
"true") or a function parameter (skipMTUCheck bool) and if that flag/annotation
is present return normally (no error) and log a warning instead of failing;
otherwise keep the current behavior but change the processLogger call to emit a
warning rather than returning fmt.Errorf so false positives on loopback/jumbo
interfaces don't block bootstrap. Ensure you check the annotation/flag before
the error return in the code path that currently returns fmt.Errorf and
reference detectLocalMTU and the surrounding SSAR/validation function when
implementing the flag lookup and logging change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @troubleshooting/README.md:
- Around line 14-17: The sentence ending with "resulting in errors such as:" is
incomplete; either add concrete example messages (e.g., "TLS handshake timeout",
"i/o timeout", "connection reset by peer", "connection refused", or Kubernetes
API errors referencing SelfSubjectAccessReview failures) after the colon to
illustrate likely symptoms, or remove the phrase "such as:" so the sentence
reads as a general statement about potential TLS handshake timeouts. Update the
README to include one or two representative error lines (plain text) following
the colon or delete the colon phrase to avoid implying missing examples, and
mention SelfSubjectAccessReview as an example of a bootstrap API that can
trigger these errors.
- Around line 34-35: Replace the bare URL after the "Related issue:" line with a
Markdown link; update the text so the line reads something like: Related issue:
[open-cluster-management-io/ocm#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)
or use [Related
issue](https://github.com/open-cluster-management-io/ocm/issues/1314) to satisfy
markdown linting and improve accessibility.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
troubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.
Applied to files:
troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md
35-35: Bare URL used
(MD034, no-bare-urls)
| In such cases, some bootstrap API calls (for example, | ||
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | ||
| the hub path MTU. These packets may be silently dropped, resulting in | ||
| errors such as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complete the error examples section.
The text states "resulting in errors such as:" but no specific error messages are listed. Either provide concrete error examples (e.g., timeout errors, connection refused messages) to help readers diagnose the issue, or remove the "such as:" phrase to clarify that these are general TLS handshake timeouts.
🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 14 - 17, The sentence ending with
"resulting in errors such as:" is incomplete; either add concrete example
messages (e.g., "TLS handshake timeout", "i/o timeout", "connection reset by
peer", "connection refused", or Kubernetes API errors referencing
SelfSubjectAccessReview failures) after the colon to illustrate likely symptoms,
or remove the phrase "such as:" so the sentence reads as a general statement
about potential TLS handshake timeouts. Update the README to include one or two
representative error lines (plain text) following the colon or delete the colon
phrase to avoid implying missing examples, and mention SelfSubjectAccessReview
as an example of a bootstrap API that can trigger these errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @troubleshooting/README.md:
- Around line 37-38: Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link (for example: [OCM issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)) so the
reference conforms to MD034 and renders/accessibility is improved; update the
line containing the URL to use that markdown link text.
- Around line 15-21: The README's paragraph ends with "resulting in errors such
as:" but no examples follow; fix by either (A) adding concrete error examples
after that phrase (e.g., "TLS handshake timeout", "i/o timeout", "connection
refused", or specific kubernetes SelfSubjectAccessReview-related errors)
formatted as a short bullet list or inline comma-separated examples to make the
statement complete, or (B) remove the trailing "such as:" so the sentence reads
without implying missing examples; update the text around that phrase in
troubleshooting/README.md accordingly.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
troubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.
Applied to files:
troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md
38-38: Bare URL used
(MD034, no-bare-urls)
| In such cases, some bootstrap API calls (for example, | ||
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | ||
| the hub path MTU. These packets may be silently dropped, resulting in | ||
| errors such as: | ||
|
|
||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complete the error examples section.
The text ends with "resulting in errors such as:" but lines 20–21 are blank—no error messages are provided. Either add concrete error examples (e.g., "TLS handshake timeout", "i/o timeout", "connection refused") or remove the "such as:" phrase to avoid implying missing examples.
💡 Suggested fix
Option 1: Provide concrete error examples:
the hub path MTU. These packets may be silently dropped, resulting in
- errors such as:
+ errors such as:
+ - `TLS handshake timeout`
+ - `context deadline exceeded`
+ - `i/o timeout`Option 2: Remove the incomplete phrase:
the hub path MTU. These packets may be silently dropped, resulting in
- errors such as:
+ TLS handshake timeouts.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| In such cases, some bootstrap API calls (for example, | |
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | |
| the hub path MTU. These packets may be silently dropped, resulting in | |
| errors such as: | |
| In such cases, some bootstrap API calls (for example, | |
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | |
| the hub path MTU. These packets may be silently dropped, resulting in | |
| errors such as: | |
| - `TLS handshake timeout` | |
| - `context deadline exceeded` | |
| - `i/o timeout` |
| In such cases, some bootstrap API calls (for example, | |
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | |
| the hub path MTU. These packets may be silently dropped, resulting in | |
| errors such as: | |
| In such cases, some bootstrap API calls (for example, | |
| `SelfSubjectAccessReview`) can generate TLS handshake packets that exceed | |
| the hub path MTU. These packets may be silently dropped, resulting in | |
| TLS handshake timeouts. |
🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 15 - 21, The README's paragraph ends
with "resulting in errors such as:" but no examples follow; fix by either (A)
adding concrete error examples after that phrase (e.g., "TLS handshake timeout",
"i/o timeout", "connection refused", or specific kubernetes
SelfSubjectAccessReview-related errors) formatted as a short bullet list or
inline comma-separated examples to make the statement complete, or (B) remove
the trailing "such as:" so the sentence reads without implying missing examples;
update the text around that phrase in troubleshooting/README.md accordingly.
| Related issue: | ||
| https://github.com/open-cluster-management-io/ocm/issues/1314 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Format the URL as a markdown link.
The bare URL violates markdown linting (MD034). Format it as a proper markdown link for consistency and accessibility.
📝 Proposed fix
- Related issue:
- https://github.com/open-cluster-management-io/ocm/issues/1314
+ **Related issue:** [#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Related issue: | |
| https://github.com/open-cluster-management-io/ocm/issues/1314 | |
| **Related issue:** [#1314](https://github.com/open-cluster-management-io/ocm/issues/1314) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
38-38: Bare URL used
(MD034, no-bare-urls)
🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 37 - 38, Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link (for example: [OCM issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)) so the
reference conforms to MD034 and renders/accessibility is improved; update the
line containing the URL to use that markdown link text.
Adds troubleshooting documentation for klusterlet bootstrap and
certificate rotation failures caused by MTU mismatch between hub and
managed clusters.
Related to #1314.
Summary by CodeRabbit
New Features
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.