Skip to content

Conversation

@Shrushti72
Copy link

@Shrushti72 Shrushti72 commented Jan 7, 2026

Adds troubleshooting documentation for klusterlet bootstrap and
certificate rotation failures caused by MTU mismatch between hub and
managed clusters.

Related to #1314.

Summary by CodeRabbit

  • New Features

    • Added MTU mismatch detection and validation between hub and managed clusters; operations now validate local MTU and surface an error when a significant mismatch is detected.
  • Documentation

    • Added a troubleshooting guide with symptoms and resolution steps for TLS handshake timeouts related to MTU configuration mismatches.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Shrushti72
Once this PR has been reviewed and has the lgtm label, please assign qiujian16 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from deads2k and qiujian16 January 7, 2026 15:27
@coderabbitai
Copy link

coderabbitai bot commented Jan 7, 2026

Walkthrough

Adds a local MTU detection helper and a pre-check in CreateSelfSubjectAccessReviews to fail when a detected MTU exceeds 1500, and adds a troubleshooting guide documenting TLS handshake timeouts caused by MTU mismatches.

Changes

Cohort / File(s) Summary
MTU detection & SSAR pre-check
pkg/common/helpers/ssar.go
Added unexported detectLocalMTU() that enumerates UP network interfaces to determine the maximum MTU and returns an error if none found. Integrated a pre-check in CreateSelfSubjectAccessReviews to call this helper and return an error when local MTU > 1500. Minor formatting change to the Kubernetes client method chaining.
Troubleshooting documentation
troubleshooting/README.md
Added new troubleshooting guide documenting TLS handshake timeout symptoms tied to MTU mismatches, resolution steps, and a related issue reference.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title accurately reflects the main change: adding troubleshooting documentation for MTU mismatch bootstrap failures.
Description check ✅ Passed The pull request description provides a clear summary and related issue reference, though it omits the category emoji from the template.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @pkg/common/helpers/ssar.go:
- Around line 42-50: The MTU fail-fast in the detectLocalMTU branch (call to
detectLocalMTU) should be made configurable: add support for an opt-in skip
flag/annotation (e.g., klusterlet.io/skip-mtu-check: "true") or a function
parameter (skipMTUCheck bool) and if that flag/annotation is present return
normally (no error) and log a warning instead of failing; otherwise keep the
current behavior but change the processLogger call to emit a warning rather than
returning fmt.Errorf so false positives on loopback/jumbo interfaces don't block
bootstrap. Ensure you check the annotation/flag before the error return in the
code path that currently returns fmt.Errorf and reference detectLocalMTU and the
surrounding SSAR/validation function when implementing the flag lookup and
logging change.
- Around line 13-34: The detectLocalMTU function currently returns the maximum
MTU across all UP interfaces (leading to loopback 65536 false positives); update
detectLocalMTU to skip loopback interfaces (check iface.Flags&net.FlagLoopback)
and only consider non-loopback UP interfaces when computing MTU, and instead of
taking the max choose the MTU of the interface used to reach the hub by
resolving the hub endpoint and inspecting the route/interface for that
destination (or, if route-based detection is out of scope, make the check opt-in
via a config/flag and document the limitation); ensure you update the logic in
detectLocalMTU to return an error when no suitable non-loopback interface is
found.

In @troubleshooting/README.md:
- Around line 33-34: Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link for consistency and accessibility; update the line that currently reads the
plain URL (the "Related issue:" entry) to use a descriptive markdown link such
as "Related issue: [open-cluster-management issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)" so the
link text is meaningful while the URL remains the same.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b40e72 and d037683.

📒 Files selected for processing (2)
  • pkg/common/helpers/ssar.go
  • troubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.

Applied to files:

  • troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md

34-34: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (2)
troubleshooting/README.md (1)

1-25: LGTM! Clear and helpful troubleshooting guidance.

The issue description and symptoms are well-articulated. The documentation effectively explains the MTU mismatch problem and its impact on klusterlet bootstrap and certificate rotation.

pkg/common/helpers/ssar.go (1)

54-56: LGTM! Formatting improvement.

The multi-line formatting improves readability without changing functionality.

Comment on lines +13 to +34
func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
if err != nil {
return 0, err
}

maxMTU := 0
for _, iface := range ifaces {
if iface.Flags&net.FlagUp == 0 {
continue
}
if iface.MTU > maxMTU {
maxMTU = iface.MTU
}
}

if maxMTU == 0 {
return 0, fmt.Errorf("unable to detect MTU")
}

return maxMTU, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Loopback interface causes false positives.

The function returns the maximum MTU across all UP interfaces, which will almost always include the loopback interface (lo) with MTU 65536. This causes false positives for virtually all systems, blocking legitimate configurations where the actual hub-facing network interface has standard MTU.

Additionally, even excluding loopback, using the maximum MTU is incorrect—you need the MTU of the specific interface used to communicate with the hub, not the highest MTU among all interfaces.

Recommended approach:

To accurately detect MTU mismatch, you should:

  1. Exclude the loopback interface from detection
  2. Consider detecting the MTU of the route to the hub API server endpoint, or
  3. Document this as a known limitation and make the check opt-in via a flag/annotation
🔧 Quick fix: Exclude loopback interface
 func detectLocalMTU() (int, error) {
 	ifaces, err := net.Interfaces()
 	if err != nil {
 		return 0, err
 	}
 
 	maxMTU := 0
 	for _, iface := range ifaces {
-		if iface.Flags&net.FlagUp == 0 {
+		// Skip interfaces that are down or loopback
+		if iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagLoopback != 0 {
 			continue
 		}
 		if iface.MTU > maxMTU {
 			maxMTU = iface.MTU
 		}
 	}
 
 	if maxMTU == 0 {
 		return 0, fmt.Errorf("unable to detect MTU")
 	}
 
 	return maxMTU, nil
 }

Note: Even with this fix, using the maximum MTU across all non-loopback interfaces may still produce false positives in environments with multiple networks (e.g., a high-MTU storage network and a standard-MTU management network).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
if err != nil {
return 0, err
}
maxMTU := 0
for _, iface := range ifaces {
if iface.Flags&net.FlagUp == 0 {
continue
}
if iface.MTU > maxMTU {
maxMTU = iface.MTU
}
}
if maxMTU == 0 {
return 0, fmt.Errorf("unable to detect MTU")
}
return maxMTU, nil
}
func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
if err != nil {
return 0, err
}
maxMTU := 0
for _, iface := range ifaces {
// Skip interfaces that are down or loopback
if iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagLoopback != 0 {
continue
}
if iface.MTU > maxMTU {
maxMTU = iface.MTU
}
}
if maxMTU == 0 {
return 0, fmt.Errorf("unable to detect MTU")
}
return maxMTU, nil
}
🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 13 - 34, The detectLocalMTU function
currently returns the maximum MTU across all UP interfaces (leading to loopback
65536 false positives); update detectLocalMTU to skip loopback interfaces (check
iface.Flags&net.FlagLoopback) and only consider non-loopback UP interfaces when
computing MTU, and instead of taking the max choose the MTU of the interface
used to reach the hub by resolving the hub endpoint and inspecting the
route/interface for that destination (or, if route-based detection is out of
scope, make the check opt-in via a config/flag and document the limitation);
ensure you update the logic in detectLocalMTU to return an error when no
suitable non-loopback interface is found.

Comment on lines +42 to +50
mtu, err := detectLocalMTU()
if err == nil && mtu > 1500 {
return false, nil, fmt.Errorf(
"MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+
"This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+
"Please align MTU values between hub and managed cluster.",
mtu,
)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Consider making the MTU check configurable.

While the fail-fast approach is well-intentioned, the flawed MTU detection (including loopback interfaces) will block many legitimate configurations during bootstrap. This creates a poor user experience for environments with jumbo frames on non-hub-facing interfaces or high-MTU loopback.

Consider making this check:

  • Optional via an annotation/flag (e.g., klusterlet.io/skip-mtu-check: "true"), or
  • Logged as a warning instead of blocking bootstrap, allowing users to investigate if they encounter TLS timeout issues

This preserves the troubleshooting value while avoiding false-positive failures.

🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 42 - 50, The MTU fail-fast in the
detectLocalMTU branch (call to detectLocalMTU) should be made configurable: add
support for an opt-in skip flag/annotation (e.g., klusterlet.io/skip-mtu-check:
"true") or a function parameter (skipMTUCheck bool) and if that flag/annotation
is present return normally (no error) and log a warning instead of failing;
otherwise keep the current behavior but change the processLogger call to emit a
warning rather than returning fmt.Errorf so false positives on loopback/jumbo
interfaces don't block bootstrap. Ensure you check the annotation/flag before
the error return in the code path that currently returns fmt.Errorf and
reference detectLocalMTU and the surrounding SSAR/validation function when
implementing the flag lookup and logging change.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @troubleshooting/README.md:
- Around line 14-17: The sentence ending with "resulting in errors such as:" is
incomplete; either add concrete example messages (e.g., "TLS handshake timeout",
"i/o timeout", "connection reset by peer", "connection refused", or Kubernetes
API errors referencing SelfSubjectAccessReview failures) after the colon to
illustrate likely symptoms, or remove the phrase "such as:" so the sentence
reads as a general statement about potential TLS handshake timeouts. Update the
README to include one or two representative error lines (plain text) following
the colon or delete the colon phrase to avoid implying missing examples, and
mention SelfSubjectAccessReview as an example of a bootstrap API that can
trigger these errors.
- Around line 34-35: Replace the bare URL after the "Related issue:" line with a
Markdown link; update the text so the line reads something like: Related issue:
[open-cluster-management-io/ocm#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)
or use [Related
issue](https://github.com/open-cluster-management-io/ocm/issues/1314) to satisfy
markdown linting and improve accessibility.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d037683 and af13db8.

📒 Files selected for processing (1)
  • troubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.

Applied to files:

  • troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md

35-35: Bare URL used

(MD034, no-bare-urls)

Comment on lines +14 to +17
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
errors such as:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Complete the error examples section.

The text states "resulting in errors such as:" but no specific error messages are listed. Either provide concrete error examples (e.g., timeout errors, connection refused messages) to help readers diagnose the issue, or remove the "such as:" phrase to clarify that these are general TLS handshake timeouts.

🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 14 - 17, The sentence ending with
"resulting in errors such as:" is incomplete; either add concrete example
messages (e.g., "TLS handshake timeout", "i/o timeout", "connection reset by
peer", "connection refused", or Kubernetes API errors referencing
SelfSubjectAccessReview failures) after the colon to illustrate likely symptoms,
or remove the phrase "such as:" so the sentence reads as a general statement
about potential TLS handshake timeouts. Update the README to include one or two
representative error lines (plain text) following the colon or delete the colon
phrase to avoid implying missing examples, and mention SelfSubjectAccessReview
as an example of a bootstrap API that can trigger these errors.

@Shrushti72 Shrushti72 changed the title docs: add troubleshooting note for MTU mismatch bootstrap failure add troubleshooting note for MTU mismatch bootstrap failure Jan 7, 2026
@Shrushti72 Shrushti72 changed the title add troubleshooting note for MTU mismatch bootstrap failure 📖 add troubleshooting note for MTU mismatch bootstrap failure Jan 7, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @troubleshooting/README.md:
- Around line 37-38: Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link (for example: [OCM issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)) so the
reference conforms to MD034 and renders/accessibility is improved; update the
line containing the URL to use that markdown link text.
- Around line 15-21: The README's paragraph ends with "resulting in errors such
as:" but no examples follow; fix by either (A) adding concrete error examples
after that phrase (e.g., "TLS handshake timeout", "i/o timeout", "connection
refused", or specific kubernetes SelfSubjectAccessReview-related errors)
formatted as a short bullet list or inline comma-separated examples to make the
statement complete, or (B) remove the trailing "such as:" so the sentence reads
without implying missing examples; update the text around that phrase in
troubleshooting/README.md accordingly.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af13db8 and 51aaa2b.

📒 Files selected for processing (1)
  • troubleshooting/README.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-24T00:18:33.339Z
Learnt from: zhiweiyin318
Repo: open-cluster-management-io/ocm PR: 1194
File: deploy/klusterlet/chart/klusterlet/templates/bootstrap_kubeconfig_secret.yaml:25-27
Timestamp: 2025-09-24T00:18:33.339Z
Learning: gRPC config in OCM klusterlet bootstrap secrets does not support multiHubBootstrapHubKubeConfigs scenarios - it is intentionally designed only for single hub bootstrap configurations.

Applied to files:

  • troubleshooting/README.md
🪛 markdownlint-cli2 (0.18.1)
troubleshooting/README.md

38-38: Bare URL used

(MD034, no-bare-urls)

Comment on lines +15 to +21
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
errors such as:



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Complete the error examples section.

The text ends with "resulting in errors such as:" but lines 20–21 are blank—no error messages are provided. Either add concrete error examples (e.g., "TLS handshake timeout", "i/o timeout", "connection refused") or remove the "such as:" phrase to avoid implying missing examples.

💡 Suggested fix

Option 1: Provide concrete error examples:

  the hub path MTU. These packets may be silently dropped, resulting in
- errors such as:
+ errors such as:
+ - `TLS handshake timeout`
+ - `context deadline exceeded`
+ - `i/o timeout`

Option 2: Remove the incomplete phrase:

  the hub path MTU. These packets may be silently dropped, resulting in
- errors such as:
+ TLS handshake timeouts.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
errors such as:
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
errors such as:
- `TLS handshake timeout`
- `context deadline exceeded`
- `i/o timeout`
Suggested change
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
errors such as:
In such cases, some bootstrap API calls (for example,
`SelfSubjectAccessReview`) can generate TLS handshake packets that exceed
the hub path MTU. These packets may be silently dropped, resulting in
TLS handshake timeouts.
🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 15 - 21, The README's paragraph ends
with "resulting in errors such as:" but no examples follow; fix by either (A)
adding concrete error examples after that phrase (e.g., "TLS handshake timeout",
"i/o timeout", "connection refused", or specific kubernetes
SelfSubjectAccessReview-related errors) formatted as a short bullet list or
inline comma-separated examples to make the statement complete, or (B) remove
the trailing "such as:" so the sentence reads without implying missing examples;
update the text around that phrase in troubleshooting/README.md accordingly.

Comment on lines +37 to +38
Related issue:
https://github.com/open-cluster-management-io/ocm/issues/1314
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Format the URL as a markdown link.

The bare URL violates markdown linting (MD034). Format it as a proper markdown link for consistency and accessibility.

📝 Proposed fix
- Related issue:
- https://github.com/open-cluster-management-io/ocm/issues/1314
+ **Related issue:** [#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Related issue:
https://github.com/open-cluster-management-io/ocm/issues/1314
**Related issue:** [#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

38-38: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In @troubleshooting/README.md around lines 37 - 38, Replace the bare URL
"https://github.com/open-cluster-management-io/ocm/issues/1314" with a markdown
link (for example: [OCM issue
#1314](https://github.com/open-cluster-management-io/ocm/issues/1314)) so the
reference conforms to MD034 and renders/accessibility is improved; update the
line containing the URL to use that markdown link text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant