Skip to content

Conversation

@Shrushti72
Copy link

@Shrushti72 Shrushti72 commented Jan 7, 2026

Fix klusterlet TLS handshake timeouts caused by MTU mismatch
Klusterlet bootstrap and certificate rotation can fail when the managed
cluster MTU exceeds the hub path MTU due to large TLS handshake packets
during SelfSubjectAccessReview calls.
This change adds a pre-flight MTU validation to fail fast with a clear,
actionable error message instead of a silent TLS handshake timeout.

Summary by CodeRabbit

  • Bug Fixes
    • Added network MTU detection and validation that surfaces errors when detected MTU exceeds expected limits ( >1500 ), preventing connectivity and operational issues caused by incompatible network settings.

✏️ Tip: You can customize this high-level summary in your review settings.

@openshift-ci openshift-ci bot requested review from jnpacker and qiujian16 January 7, 2026 15:09
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Shrushti72
Once this PR has been reviewed and has the lgtm label, please assign jnpacker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Jan 7, 2026

Walkthrough

Added a local MTU detection helper and an MTU-based validation guard in CreateSelfSubjectAccessReviews; adjusted the function parameter list formatting, reformatted a client call chain to multiple lines, and added fmt and net imports.

Changes

Cohort / File(s) Summary
MTU Detection & Validation
pkg/common/helpers/ssar.go
Added detectLocalMTU() to find the maximum MTU among active interfaces; added MTU check in CreateSelfSubjectAccessReviews that returns an error if detected MTU > 1500.
Function Signature & Call Formatting
pkg/common/helpers/ssar.go
Trailing comma added to CreateSelfSubjectAccessReviews parameter list; refactored kubeClient.AuthorizationV1().SelfSubjectAccessReviews().Create into a multi-line chained call (no behavior change).
Imports
pkg/common/helpers/ssar.go
Added fmt and net imports to support MTU detection and error formatting.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description explains the problem (MTU mismatch causing TLS timeouts) and the solution (pre-flight validation with clear error), but lacks the template structure with required sections like Summary and Related issue(s). Restructure the description to follow the template: add a Summary section, include Related issue(s) with issue number, and format with proper sections.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding MTU mismatch detection before SelfSubjectAccessReview calls to fail fast.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/common/helpers/ssar.go (1)

36-65: Enhance test coverage for new MTU validation logic.

The test file (ssar_test.go) currently has a TODO comment acknowledging it needs enhancement. While existing tests call the function with different SSAR types, they don't cover the new MTU detection path:

  • MTU detection success (mtu ≤ 1500)
  • MTU detection failure (detectLocalMTU error)
  • MTU mismatch error (mtu > 1500)

All three callers handle errors generically with fmt.Errorf wrapping, so the new error message won't break existing error handling patterns. However, add test cases for the MTU validation scenarios to ensure the detection logic works as intended.

🤖 Fix all issues with AI agents
In @pkg/common/helpers/ssar.go:
- Around line 13-34: The detectLocalMTU function wrongly returns the maximum MTU
across all "up" interfaces (often the loopback MTU), causing false positives;
update detectLocalMTU to ignore loopback and non-routable interfaces by
filtering out interfaces with net.FlagLoopback and/or interfaces lacking an
IPv4/IPv6 address, or better: determine the interface used to reach the hub
endpoint (e.g., perform a UDP/TCP dial to the hub address and inspect the local
address/route) and return the MTU for that specific interface; ensure references
to net.Interfaces(), iface.Flags, and iface.MTU in detectLocalMTU are updated
accordingly so loopback/non-routable interfaces are excluded or the correct
outbound interface is selected.
- Around line 42-50: The MTU detection error is currently ignored, so update the
SSAR MTU check to log a warning via klog when detectLocalMTU() returns a non-nil
error while preserving the fail-open behavior; specifically, after calling
detectLocalMTU() (the detectLocalMTU symbol) add klog.Warningf("failed to detect
local MTU: %v; skipping MTU safety check", err) (or equivalent) and then
continue without changing the existing control flow that proceeds when err !=
nil, but keep the existing error/return path when err == nil and mtu > 1500.
🧹 Nitpick comments (1)
pkg/common/helpers/ssar.go (1)

39-39: Unnecessary trailing comma.

The trailing comma after selfSubjectAccessReviews is a stylistic choice typically used for multi-element lists that span multiple lines. For a function signature with parameters already on separate lines, it's not strictly necessary and adds minor churn to the diff.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b40e72 and ee84616.

📒 Files selected for processing (1)
  • pkg/common/helpers/ssar.go
🔇 Additional comments (2)
pkg/common/helpers/ssar.go (2)

5-6: LGTM!

The new imports are necessary for MTU detection and error formatting.


54-56: LGTM - improved readability.

Reformatting the method chain to multiple lines makes the client-go API call pattern easier to read without changing behavior.

Comment on lines +42 to +50
mtu, err := detectLocalMTU()
if err == nil && mtu > 1500 {
return false, nil, fmt.Errorf(
"MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+
"This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+
"Please align MTU values between hub and managed cluster.",
mtu,
)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the ssar.go file to see the full context
cat -n pkg/common/helpers/ssar.go

Repository: open-cluster-management-io/ocm

Length of output: 6793


🏁 Script executed:

# Find logging patterns used in this package and broader codebase
rg -n 'klog\.|logr\.|zap\.|fmt\.Printf|log\.' --type=go pkg/common/helpers/ | head -20

Repository: open-cluster-management-io/ocm

Length of output: 355


🏁 Script executed:

# Also check for logging in the parent package
rg -n 'klog\.|logr\.|zap\.|fmt\.Printf|log\.' --type=go pkg/common/ | head -30

Repository: open-cluster-management-io/ocm

Length of output: 591


Silent error handling masks MTU detection failures.

When detectLocalMTU() returns an error, it's silently ignored and the code proceeds with the SSAR check. This "fail-open" approach may mask real issues:

  • Permission errors preventing network interface enumeration
  • Platform-specific issues where net.Interfaces() fails
  • Configuration problems that should be surfaced

Since the codebase uses klog for logging (already in use in pkg/common/helpers/), log a warning when MTU detection fails so operators can diagnose why the safety check isn't running.

💡 Suggested enhancement
 	mtu, err := detectLocalMTU()
+	if err != nil {
+		klog.Warningf("MTU detection failed: %v. Skipping MTU validation check.", err)
+	} else if mtu > 1500 {
 		return false, nil, fmt.Errorf(
 			"MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+
 				"This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+
 				"Please align MTU values between hub and managed cluster.",
 			mtu,
 		)
 	}

This maintains the fail-open behavior while providing operator visibility into why the check was skipped.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mtu, err := detectLocalMTU()
if err == nil && mtu > 1500 {
return false, nil, fmt.Errorf(
"MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+
"This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+
"Please align MTU values between hub and managed cluster.",
mtu,
)
}
mtu, err := detectLocalMTU()
if err != nil {
klog.Warningf("MTU detection failed: %v. Skipping MTU validation check.", err)
} else if mtu > 1500 {
return false, nil, fmt.Errorf(
"MTU mismatch detected: managed cluster MTU (%d) exceeds typical hub path MTU (1500). "+
"This can cause TLS handshake timeouts during bootstrap or certificate rotation. "+
"Please align MTU values between hub and managed cluster.",
mtu,
)
}
🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 42 - 50, The MTU detection error is
currently ignored, so update the SSAR MTU check to log a warning via klog when
detectLocalMTU() returns a non-nil error while preserving the fail-open
behavior; specifically, after calling detectLocalMTU() (the detectLocalMTU
symbol) add klog.Warningf("failed to detect local MTU: %v; skipping MTU safety
check", err) (or equivalent) and then continue without changing the existing
control flow that proceeds when err != nil, but keep the existing error/return
path when err == nil and mtu > 1500.

@Shrushti72 Shrushti72 changed the title Fail fast on MTU mismatch before SelfSubjectAccessReview 🐛 Fail fast on MTU mismatch before SelfSubjectAccessReview Jan 7, 2026
@Shrushti72 Shrushti72 changed the title 🐛 Fail fast on MTU mismatch before SelfSubjectAccessReview (:bug:) fail fast on MTU mismatch before SelfSubjectAccessReview Jan 7, 2026
@Shrushti72 Shrushti72 changed the title (:bug:) fail fast on MTU mismatch before SelfSubjectAccessReview 🐛 fail fast on MTU mismatch before SelfSubjectAccessReview Jan 7, 2026
@Shrushti72 Shrushti72 changed the title 🐛 fail fast on MTU mismatch before SelfSubjectAccessReview \:bug\: fail fast on MTU mismatch before SelfSubjectAccessReview Jan 7, 2026
@Shrushti72 Shrushti72 changed the title \:bug\: fail fast on MTU mismatch before SelfSubjectAccessReview 🐛 fail fast on MTU mismatch before SelfSubjectAccessReview Jan 7, 2026
@Shrushti72
Copy link
Author

actually pr was not verified ,it was creating issue so many commits was pushed because of this

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @pkg/common/helpers/ssar.go:
- Around line 1-12: The import block in package helpers (ssar.go) is not
grouped/ordered per gci rules (standard, external, local); fix by reordering and
grouping the imports accordingly (standard libs first: context, fmt, net; then
external k8s packages; then any local packages) and run the gci formatter on the
file (e.g., use "gci write" for this file) to apply the correct ordering and
formatting.
- Around line 168-170: Remove the unnecessary trailing blank lines inserted in
pkg/common/helpers/ssar.go (around the end of the affected block near lines
168–170); delete the extra empty lines so there is no trailing whitespace or
blank lines after the surrounding code (ensure file ends cleanly and run go fmt
if needed).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee84616 and d7f4072.

📒 Files selected for processing (1)
  • pkg/common/helpers/ssar.go
🧰 Additional context used
🪛 golangci-lint (2.5.0)
pkg/common/helpers/ssar.go

[error] 1-1: File is not properly formatted

(gci)

🔇 Additional comments (3)
pkg/common/helpers/ssar.go (3)

40-41: Idiomatic trailing comma for multi-line signature.

The trailing comma in the parameter list follows Go's style conventions for multi-line function signatures.


55-57: Formatting change improves readability.

Breaking the chained method call across multiple lines is a reasonable style choice with no functional impact.


43-51: Silent error handling masks MTU detection failures (as previously flagged).

When detectLocalMTU() fails, the error is ignored and execution continues. This fail-open behavior provides no visibility into why the safety check was skipped, masking permission errors, platform-specific failures, or configuration issues.

As noted in the previous review, add logging (the codebase uses klog) to warn operators when MTU detection fails.

Based on previous review feedback on this PR.

Likely an incorrect or invalid review comment.

Comment on lines +1 to 12

package helpers

import (
"context"
"fmt"
"net"

authorizationv1 "k8s.io/api/authorization/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix import ordering per gci linter.

The imports should be properly grouped and ordered according to the gci linter rules (standard library, external packages, local packages).

Run the following to fix:

#!/bin/bash
# Format imports according to gci rules
gci write pkg/common/helpers/ssar.go
🧰 Tools
🪛 golangci-lint (2.5.0)

[error] 1-1: File is not properly formatted

(gci)

🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 1 - 12, The import block in package
helpers (ssar.go) is not grouped/ordered per gci rules (standard, external,
local); fix by reordering and grouping the imports accordingly (standard libs
first: context, fmt, net; then external k8s packages; then any local packages)
and run the gci formatter on the file (e.g., use "gci write" for this file) to
apply the correct ordering and formatting.

Comment on lines +14 to +35
func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
if err != nil {
return 0, err
}

maxMTU := 0
for _, iface := range ifaces {
if iface.Flags&net.FlagUp == 0 {
continue
}
if iface.MTU > maxMTU {
maxMTU = iface.MTU
}
}

if maxMTU == 0 {
return 0, fmt.Errorf("unable to detect MTU")
}

return maxMTU, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: loopback MTU will cause false positives (as previously flagged).

This function returns the maximum MTU across all "up" interfaces, including loopback (typically 65536 on Linux). The validation at line 44 (mtu > 1500) will fail in nearly all deployments, blocking legitimate cluster registrations even when the actual hub-facing interface has the correct MTU.

As noted in the previous review, you must either:

  • Filter out loopback/non-routable interfaces (check net.FlagLoopback), or
  • Detect the specific interface used to reach the hub endpoint

Based on previous review feedback on this PR.

Comment on lines +168 to +170



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unnecessary trailing whitespace.

The added blank lines serve no purpose and should be removed.

♻️ Remove the blank lines
 		})
 	}
 	return reviews
 }
-

-

-			
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
})
}
return reviews
}
🤖 Prompt for AI Agents
In @pkg/common/helpers/ssar.go around lines 168 - 170, Remove the unnecessary
trailing blank lines inserted in pkg/common/helpers/ssar.go (around the end of
the affected block near lines 168–170); delete the extra empty lines so there is
no trailing whitespace or blank lines after the surrounding code (ensure file
ends cleanly and run go fmt if needed).

Copy link
Member

@qiujian16 qiujian16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like this detection to be opt-in and disabled by default. It might break in existing env otherwise

selfSubjectAccessReviews []authorizationv1.SelfSubjectAccessReview,
) (bool, *authorizationv1.SelfSubjectAccessReview, error) {

mtu, err := detectLocalMTU()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this mtu detection in a separated func instead of calling inside CreateSelfSubjectAccessReviews

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this mtu detection in a separated func instead of calling inside CreateSelfSubjectAccessReviews

I agree that MTU detection shouldn’t be embedded directly inside CreateSelfSubjectAccessReviews.
I’ll refactor this if wanted so MTU detection lives in a separate helper and is passed in only when the feature is enabled.

) (bool, *authorizationv1.SelfSubjectAccessReview, error) {

mtu, err := detectLocalMTU()
if err == nil && mtu > 1500 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if local MTU is larger than 1500 and remote is also larger?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if local MTU is larger than 1500 and remote is also larger?

You’re right, local MTU alone is only a heuristic.
The intent here is not to guarantee a mismatch, but to conservatively reduce SSAR payload size only when jumbo frames are detected locally and the opt-in flag is enabled.

)

func detectLocalMTU() (int, error) {
ifaces, err := net.Interfaces()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to clarify system permission it needs to execute this command.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to clarify system permission it needs to execute this command.

Good catch. net.Interfaces() relies on standard OS networking APIs and does not require elevated privileges in typical container environments, but this should be documented.

@Shrushti72
Copy link
Author

I would like this detection to be opt-in and disabled by default. It might break in existing env otherwise

I’ll gate the MTU-aware behavior behind an explicit flag e.g. env var or config option, and keep the default path unchanged when it’s disabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants