fix: Add NetworkPolicy for metrics endpoint security #49
Conversation
Caution: Review failed. The pull request is closed.

📝 Walkthrough

Adds NetworkPolicy reconciliation for TrainJob pods; the RBAC rule is extended to cover NetworkPolicies.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Controller as TrainJob Controller
    participant Rhai as Rhai.Module
    participant SA as ServiceAccount FS
    participant K8s as Kubernetes API
    Controller->>Rhai: ReconcileNetworkPolicy(ctx, client, trainJob)
    Rhai->>SA: read namespace file (if mounted)
    SA-->>Rhai: namespace (or none)
    Rhai->>Rhai: resolve controller namespace (fallback)
    Rhai->>Rhai: build desired NetworkPolicy (labels, podSelector, ingress rules, ownerRef)
    Rhai->>K8s: Get NetworkPolicy by name/namespace
    alt not found
        Rhai->>K8s: Create NetworkPolicy
        K8s-->>Rhai: Created
    else found
        Rhai->>K8s: Update NetworkPolicy (if diff)
        K8s-->>Rhai: Updated
    end
    Rhai-->>Controller: return result or error
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Pull Request Test Coverage Report for Build 20747341008

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
Actionable comments posted: 0
🧹 Nitpick comments (6)
pkg/rhai/constants/constants.go (2)
33-38: Clarify whether port range is enforced or purely advisory

The comment now specifies 1024–65535 as the valid metrics-port range, but neither `GetMetricsPort` nor the NetworkPolicy builder currently enforces it (only non-numeric values fall back to the default). Consider either:

- enforcing the documented range (e.g., clamp or fall back to default), or
- rephrasing the comment to make it clear the range is a recommendation, not a hard validation.
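A minimal sketch of what enforcing the documented range could look like. The helper name `parseMetricsPort` and the constant names are hypothetical; the 1024–65535 range and the 28080 default come from the review comment above, not from the actual constants package:

```go
package main

import (
	"fmt"
	"strconv"
)

// Illustrative values mirroring the documented range and default.
const (
	defaultMetricsPort = 28080
	validPortMin       = 1024
	validPortMax       = 65535
)

// parseMetricsPort falls back to the default for non-numeric
// or out-of-range annotation values instead of passing them through.
func parseMetricsPort(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < validPortMin || n > validPortMax {
		return defaultMetricsPort
	}
	return n
}

func main() {
	for _, p := range []string{"28080", "9090", "abc", "0", "70000"} {
		fmt.Println(p, "->", parseMetricsPort(p))
	}
}
```

This keeps invalid annotations from producing a NetworkPolicy the API server would reject.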
66-72: New NetworkPolicy constants look good; ensure deployment docs match

`NetworkPolicyNameSuffix` and `DefaultControllerNamespace` are reasonable defaults and are cleanly consumed from `networkpolicy.go`. Just make sure controller deployment manifests and docs clearly describe:

- the default controller namespace (`opendatahub`), and
- the `CONTROLLER_NAMESPACE` override behavior.

pkg/rhai/progression/progression.go (1)
697-702: Non-fatal NetworkPolicy reconciliation fits the progression flow

Calling `ReconcileNetworkPolicy` once progression tracking is enabled, and treating failures as non-fatal with V(1) logging, matches the "best-effort hardening" goal and avoids breaking metrics collection in clusters without NetworkPolicy support. If you ever want to reduce churn, you could optionally gate this on `isRunning`, but the current approach is safe.
137-168: Ensure OwnerReferences are reconciled on update and be explicit about label overwrites

`ReconcileNetworkPolicy` behaves correctly for the happy path, but two edge-case behaviors are worth tightening:

OwnerReferences on update: on create, `buildNetworkPolicy` sets an OwnerReference so the policy is garbage-collected with the `TrainJob`. On update you only copy `Spec` and `Labels`:

```go
existingPolicy.Spec = desiredPolicy.Spec
existingPolicy.Labels = desiredPolicy.Labels
```

If an older deployment or a user created a policy without OwnerReferences, it will remain orphaned even after this controller starts managing it. Consider also reconciling `OwnerReferences` (e.g., replace or at least set when empty) to guarantee cleanup semantics for all managed policies.

Label replacement semantics: replacing the entire `Labels` map is usually fine for a controller-owned object but will drop any user-added labels on existing policies. If you expect users or other tooling to attach labels (e.g., for monitoring), merging controller-owned labels into the existing map instead of wholesale replacement would be friendlier.

Both tweaks are backward-compatible and make the reconciliation behavior more predictable.
pkg/rhai/progression/networkpolicy_test.go (2)
119-307: Broader port validation and negative tests would strengthen coverage

`TestBuildNetworkPolicy` thoroughly checks metadata, selectors, owner refs, and both default/custom ports, plus the non-numeric fallback case. Given the documented valid range, you might add one more table entry for a numeric but invalid port (e.g., `"0"` or `"70000"`) to assert whatever behavior you choose in `buildNetworkPolicy` (fallback or clamp). That will lock in the semantics once you tighten the range handling.
309-461: Consider asserting OwnerReferences on update scenarios as well

`TestReconcileNetworkPolicy` covers create/no-op/update flows with a fake client, which is great. For the "updates existing NetworkPolicy" case, you currently only validate that the port changed. Once `ReconcileNetworkPolicy` starts reconciling OwnerReferences on update, it'd be useful to extend this test to assert that:

- the updated policy has exactly one OwnerReference, and
- it points at the expected `TrainJob` UID.

That will prevent regressions in cleanup behavior.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
- pkg/rhai/constants/constants.go (2 hunks)
- pkg/rhai/progression/networkpolicy.go (1 hunks)
- pkg/rhai/progression/networkpolicy_test.go (1 hunks)
- pkg/rhai/progression/progression.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
- DefaultControllerNamespace (72-72)
- NetworkPolicyNameSuffix (69-69)

pkg/rhai/progression/progression.go (2)
- GetMetricsPort (207-215)
- IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (6)
manifests/rhoai/rbac_progression_patch.yaml (1)
14-28: RBAC additions are minimal and aligned with controller behavior

Granting `get`, `create`, and `update` on `networkpolicies` matches the actual usage in `ReconcileNetworkPolicy` and keeps the permission surface tight. No additional verbs seem necessary given the current implementation.

pkg/rhai/progression/networkpolicy.go (2)
36-42: Controller namespace helper aligns with env-override design

`getControllerNamespace` correctly prefers the `CONTROLLER_NAMESPACE` env var and falls back to `DefaultControllerNamespace`. This matches the tests and RBAC patch; just ensure the operator manifest sets this env var appropriately in non-default deployments.
48-126: Tie default port to shared constant and consider enforcing valid range

`buildNetworkPolicy` looks solid overall (selectors, owner refs, ingress rules), but two small robustness points:

Default port literal: you currently fall back to `28080` via a literal:

```go
portNum, err := strconv.Atoi(metricsPort)
if err != nil {
    portNum = 28080
}
```

To avoid divergence if the default ever changes, consider deriving this from a shared constant instead of hard-coding.

Range validation: numeric but out-of-range values (e.g., `"0"`, `"65536"`, negative) will pass `Atoi` and produce a `NetworkPolicy` that the API server may reject. Consider validating `portNum` and falling back to the default if it's outside the allowed range, and/or logging a warning when clamping/falling back so misconfigurations are visible.

These changes would keep metrics port handling consistent and harden against bad annotations.
pkg/rhai/progression/networkpolicy_test.go (3)
37-84: Env override tests are sound and restore global state correctly

`TestGetControllerNamespace` nicely exercises default and env-override behavior while restoring `CONTROLLER_NAMESPACE` per subtest. Good pattern for avoiding cross-test contamination.

463-520: Controller namespace from env is well-covered

`TestReconcileNetworkPolicy_ControllerNamespaceFromEnv` correctly verifies that the `NamespaceSelector` on the controller peer uses the custom namespace when `CONTROLLER_NAMESPACE` is set. The save/restore logic around the env var is also correct.

522-604: Security-property tests match the intended policy design

`TestBuildNetworkPolicy_SecurityProperties` clearly encodes the key invariants: ingress-only, controller-only access to the metrics port, same-namespace restrictions for same-job pods, and owner-based cleanup. This is a solid safety net around the NetworkPolicy structure.
robert-bell
left a comment
lgtm - thanks for this @abhijeet-dhumal!
Have you been able to test this in a real cluster?
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/rhai/progression/networkpolicy.go (2)
59-64: Use the existing constant for the default port fallback.

The hardcoded `28080` duplicates the value defined in `constants.DefaultMetricsPort`. Using the constant ensures consistency if the default ever changes.

```diff
 metricsPort := GetMetricsPort(trainJob)
 portNum, err := strconv.Atoi(metricsPort)
 if err != nil {
-    portNum = 28080 // default
+    portNum, _ = strconv.Atoi(constants.DefaultMetricsPort) // fallback to default
 }
```

Alternatively, consider extracting a helper like `GetMetricsPortInt()` to avoid duplicate parsing logic.
171-175: Consider restoring OwnerReferences during update.

Currently, only `Spec` and `Labels` are synchronized during updates. If OwnerReferences are manually removed from the existing policy, they won't be restored, which could break automatic garbage collection.

```diff
 existingPolicy.Spec = desiredPolicy.Spec
 existingPolicy.Labels = desiredPolicy.Labels
+existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {
```

This is a minor concern since OwnerReferences are rarely modified manually, but including them ensures complete reconciliation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
- pkg/rhai/constants/constants.go (2 hunks)
- pkg/rhai/progression/networkpolicy.go (1 hunks)
- pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/rhai/constants/constants.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
- DefaultControllerNamespace (73-73)
- NetworkPolicyNameSuffix (69-69)

pkg/rhai/progression/progression.go (2)
- GetMetricsPort (207-215)
- IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (5)
manifests/rhoai/rbac_progression_patch.yaml (1)
14-30: LGTM! Well-documented RBAC permissions for NetworkPolicy management.

The permissions are appropriately scoped:

- `get`/`list`/`watch` enables controller-runtime's caching mechanism
- `create`/`update` allows reconciliation
- `delete` is correctly omitted since OwnerReference-based garbage collection handles cleanup

The inline comments clearly explain the security rationale.
pkg/rhai/progression/networkpolicy.go (1)
39-51: Solid namespace detection with proper fallback chain.

The priority order (SA file → env var → default) correctly handles both in-cluster and local development scenarios.
pkg/rhai/progression/networkpolicy_test.go (3)
62-85: Good test coverage with proper environment cleanup.

The save/restore pattern for environment variables ensures test isolation. The tests adequately cover the fallback chain when the SA namespace file isn't available.
311-463: Comprehensive reconciliation tests covering all key paths.

The tests cover:
- Policy creation when progression tracking is enabled
- No-op when tracking is disabled or missing
- Policy updates with changed configuration
Good use of the fake client with proper scheme registration.
524-606: Excellent security property tests.

These tests effectively document and verify the security invariants of the NetworkPolicy:
- Ingress-only enforcement
- Controller-restricted metrics access
- Namespace isolation for same-job communication
- Automatic cleanup via OwnerReference
This is a good practice for security-critical code.
pkg/rhai/constants/constants.go
Outdated
```go
// NetworkPolicy constants for metrics endpoint security

// NetworkPolicyNameSuffix is appended to TrainJob name to create NetworkPolicy name.
NetworkPolicyNameSuffix string = "-metrics-netpol"
```
The NWP is not specific to metrics, it also covers pod-to-pod traffic.
It could be named as the train job.
Ahh, that's an awesome catch!
The NetworkPolicy isn't just for metrics - it has two rules:
- Controller > metrics port
- Same-job pods > all ports (for NCCL/MPI/gRPC)
I will update it to use the TrainJob name, so that the TrainJob and NetworkPolicy have the same name, as you suggested.
Thanks a lot @astefanutti for reviewing the PR !
pkg/rhai/progression/progression.go
Outdated
```go
}

// Ensure NetworkPolicy exists to restrict metrics endpoint access to controller only
if err := ReconcileNetworkPolicy(ctx, c, trainJob); err != nil {
```
The NWP scope is larger than metrics / progression.
Would that make sense to reconcile it in the TrainJob controller reconcileObjects method directly or in runtime.NewObjects?
Actionable comments posted: 2
♻️ Duplicate comments (2)
pkg/rhai/progression/networkpolicy_test.go (2)
220-223: Fix inconsistent error message.

The error message says `"want manager"` but the code checks for `"controller"`. This was flagged in a previous review but appears unfixed.

Apply this diff:

```diff
 if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-	t.Errorf("Rule 1: Controller component label = %q, want manager",
+	t.Errorf("Rule 1: Controller component label = %q, want controller",
 		controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
 }
```
424-426: Fix inconsistent error message.

The error message says `"Missing manager component label requirement"` but the code checks for `"controller"`. This was flagged in a previous review but appears unfixed.

Apply this diff:

```diff
 if peer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-	t.Error("Missing manager component label requirement")
+	t.Error("Missing controller component label requirement")
 }
```
🧹 Nitpick comments (2)
pkg/rhai/constants/constants.go (1)
68-69: Consider a more generic default namespace for upstream.

The `DefaultControllerNamespace = "opendatahub"` is specific to Red Hat's RHOAI distribution. For the upstream Kubeflow project, a more generic default like `"kubeflow"` or making it configurable might be more appropriate. This fallback is rarely used (only when the service account namespace file is unavailable), but it could cause confusion for non-RHOAI deployments.

pkg/rhai/progression/networkpolicy.go (1)
134-140: Consider using standard pointer utilities.

The `boolPtr` and `protocolPtr` helpers duplicate functionality already available in `k8s.io/utils/ptr.To[T]()`. Using the standard library improves consistency across the codebase.

For example:

```go
import "k8s.io/utils/ptr"

// Replace boolPtr(true) with ptr.To(true)
// Replace protocolPtr(corev1.ProtocolTCP) with ptr.To(corev1.ProtocolTCP)
```

Note: I see that `k8s.io/utils/ptr` is already imported in `trainjob_controller.go` (line 33), suggesting it's an accepted pattern in this codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- pkg/controller/trainjob_controller.go (1 hunks)
- pkg/rhai/constants/constants.go (2 hunks)
- pkg/rhai/progression/networkpolicy.go (1 hunks)
- pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/rhai/progression/networkpolicy_test.go (3)
- pkg/rhai/constants/constants.go (2): AnnotationMetricsPort (38-38), AnnotationProgressionTracking (26-26)
- pkg/util/testing/client.go (1): NewClientBuilder (35-47)
- pkg/rhai/progression/networkpolicy.go (1): ReconcileNetworkPolicy (144-170)

pkg/controller/trainjob_controller.go (1)
- pkg/rhai/progression/networkpolicy.go (1): ReconcileNetworkPolicy (144-170)

pkg/rhai/progression/networkpolicy.go (2)
- pkg/rhai/constants/constants.go (1): DefaultControllerNamespace (69-69)
- pkg/rhai/progression/progression.go (1): GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Test
- GitHub Check: Generate
🔇 Additional comments (9)
pkg/rhai/constants/constants.go (1)
34-36: LGTM: Clear documentation of port constraints.

The updated comment properly explains the valid port range and the security rationale for avoiding privileged ports.
pkg/rhai/progression/networkpolicy.go (4)
37-47: LGTM: Robust namespace detection with fallback.

The implementation correctly reads the service account namespace file with appropriate error handling and fallback to the default constant.

49-51: LGTM: Simplified naming aligns with PR feedback.

Based on the past review comments, this was appropriately changed from metrics-specific naming to match the TrainJob name directly, since the NetworkPolicy covers both metrics and pod-to-pod traffic.
64-131: LGTM: NetworkPolicy design follows security best practices.

The policy correctly:
- Uses OwnerReferences for automatic cleanup
- Restricts metrics port access to controller pods in the controller namespace
- Allows same-job pod communication on all ports (for NCCL/MPI/gRPC)
- Prevents cross-namespace and cross-job access via appropriate selectors
- Uses Ingress-only policy type
The implementation aligns well with the multi-tenancy security goals stated in the PR.
143-170: LGTM: Standard Kubernetes reconciliation pattern.

The create-or-update logic follows best practices with:
- Proper NotFound handling
- Descriptive error wrapping
- Updating both Spec and Labels on the update path
pkg/rhai/progression/networkpolicy_test.go (4)
36-67: LGTM: Clear test coverage for name derivation.

The test appropriately verifies that NetworkPolicy names match the TrainJob name.

69-257: LGTM: Comprehensive policy construction tests.

Excellent coverage of:
- Default, custom, and invalid port handling
- Metadata, labels, and OwnerReferences
- PodSelector and ingress rules
- Both controller and same-job peer configurations
259-380: LGTM: Solid reconciliation flow tests.

The tests properly verify both create and update paths using a fake client, including ownership and spec updates.

382-464: LGTM: Excellent security property validation.

The security-focused test suite effectively validates:
- Ingress-only policy type
- Controller-restricted metrics port access
- Namespace isolation for same-job pods
- OwnerReference cleanup semantics
This demonstrates a security-conscious approach to the NetworkPolicy design.
```go
// Reconcile NetworkPolicy for pod isolation
if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
	return err
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for the networkpolicy implementation and annotation checks
rg -n -A10 -B2 'func.*ReconcileNetworkPolicy' pkg/rhai/progression/

Repository: opendatahub-io/trainer
Length of output: 2193
🏁 Script executed:
#!/bin/bash
# Also search for annotation checks in the progression package
rg -n 'AnnotationProgressionTracking' pkg/rhai/progression/

Repository: opendatahub-io/trainer
Length of output: 1095
🏁 Script executed:
#!/bin/bash
# Check if there's an annotation constant defined
rg -n 'AnnotationProgressionTracking' --type-list | head -20
rg -n 'AnnotationProgressionTracking' pkg/

Repository: opendatahub-io/trainer
Length of output: 1746
🏁 Script executed:
#!/bin/bash
# Check progression.go around line 92 to see if annotation check wraps ReconcileNetworkPolicy
sed -n '85,110p' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer
Length of output: 1048
🏁 Script executed:
#!/bin/bash
# Check the buildNetworkPolicy function
rg -n -A15 'func buildNetworkPolicy' pkg/rhai/progression/networkpolicy.go

Repository: opendatahub-io/trainer
Length of output: 649
🏁 Script executed:
#!/bin/bash
# Check the full context of how ReconcileNetworkPolicy is called in progression.go
rg -n -B5 -A5 'ReconcileNetworkPolicy' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Check the trainjob_controller.go to understand the exact call context
sed -n '155,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer
Length of output: 785
🏁 Script executed:
#!/bin/bash
# Search for all calls to ReconcileNetworkPolicy to see if they're conditional elsewhere
rg -n -B3 -A1 'ReconcileNetworkPolicy' pkg/

Repository: opendatahub-io/trainer
Length of output: 1970
🏁 Script executed:
#!/bin/bash
# Check if there are any conditions checking progression tracking before the NetworkPolicy call
sed -n '155,170p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer
Length of output: 600
🏁 Script executed:
#!/bin/bash
# Check what happens in the broader reconciliation context
sed -n '130,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer
Length of output: 1688
Add conditional check for progression-tracking annotation before NetworkPolicy reconciliation
The ReconcileNetworkPolicy call is unconditional, but the PR description states the NetworkPolicy is added "when progression-tracking is enabled." Since IsProgressionTrackingEnabled() already exists, wrap the call:
```go
if progression.IsProgressionTrackingEnabled(trainJob) {
	if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
		return err
	}
}
```

Otherwise, a NetworkPolicy is created for all TrainJobs regardless of the progression-tracking setting.
🤖 Prompt for AI Agents
In pkg/controller/trainjob_controller.go around lines 161 to 164, the call to
progression.ReconcileNetworkPolicy runs unconditionally causing NetworkPolicies
to be created for all TrainJobs; wrap that call with a conditional check using
progression.IsProgressionTrackingEnabled(trainJob) and only call
ReconcileNetworkPolicy if it returns true, preserving the existing error
handling to return the error when reconciliation fails.
nit: I think this is a good suggestion. Can we ensure that, for any existing TrainJob with progression tracking disabled, the pod behaviour does not change?
cc. @abhijeet-dhumal
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@kapil27 Actually this was an intentional design decision. The NetworkPolicy serves two purposes, as described in the PR description:

- Pod isolation: only same-job pods can communicate, cross-job traffic blocked (primary purpose, not dependent on progression)
- Progression metrics port restriction: controller-only access

These benefit all TrainJobs, regardless of progression tracking. Without it, any pod in the namespace could access training pods.
Ah, come to think of it, you are right 🤔
We should make the progression-based rule conditional on IsProgressionTrackingEnabled(), since the metrics server only runs when progression tracking is enabled 👀
Thanks @kapil27, on it!
nit: does it make sense to move this netpol code into the rhai package, rather than progression package?
I'm happy for this to be merged as is though :)
Yeah, definitely.. since this netpol is not limited to the progression scope, it would be good to refactor accordingly.
Thanks Rob, on it!!
Actionable comments posted: 0
♻️ Duplicate comments (3)
pkg/rhai/networkpolicy.go (1)

54-61: Use the shared default metrics port constant instead of a hardcoded literal

The fallback to `28080` on parse error duplicates the default already encoded in `constants.DefaultMetricsPort`. Using the constant keeps behavior centralized and avoids drift if the default ever changes.

You can reuse the constant here:

```diff
 func buildNetworkPolicy(trainJob *trainer.TrainJob) *networkingv1.NetworkPolicy {
 	metricsPort := progression.GetMetricsPort(trainJob)
 	portNum, err := strconv.Atoi(metricsPort)
 	if err != nil {
-		portNum = 28080 // default
+		// Parse default port constant as fallback to keep it in sync with progression defaults
+		portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 	}
 	port := intstr.FromInt(portNum)
```

pkg/controller/trainjob_controller.go (1)
162-165: Gate NetworkPolicy reconciliation on progression-tracking being enabled

Right now `rhai.ReconcileNetworkPolicy` runs for every `TrainJob`, which contradicts the stated behavior "when progression-tracking is enabled" and may surprise users by tightening network policy even when they don't use progression tracking.

Wrap the call with the existing progression flag check so NetworkPolicies are only managed when progression tracking is on:

```diff
 for _, object := range objects {
 	if err := r.client.Apply(ctx, object, client.FieldOwner("trainer"), client.ForceOwnership); err != nil {
 		return err
 	}
 }
-// Reconcile NetworkPolicy for pod isolation
-if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
-	return err
-}
+// Reconcile NetworkPolicy for pod isolation when progression tracking is enabled
+if progression.IsProgressionTrackingEnabled(trainJob) {
+	if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
+		return err
+	}
+}
 return nil
```

pkg/rhai/networkpolicy_test.go (1)
220-223: Fix mismatch in expected component label in test failure message

The assertion checks for `"controller"` but the error message still says `"want manager"`, which is confusing when the test fails.

Update the message to match the actual expectation:

```diff
 if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-	t.Errorf("Rule 1: Controller component label = %q, want manager",
+	t.Errorf("Rule 1: Controller component label = %q, want controller",
 		controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
 }
```
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)
164-170: Consider also reconciling OwnerReferences when updating an existing NetworkPolicy

On update, you currently copy `Spec` and `Labels` but not `OwnerReferences`. If a matching `NetworkPolicy` pre-exists (same name/namespace) without the expected owner reference, it will be functionally "adopted" by this controller (spec/labels overwritten) but still won't be garbage-collected with the `TrainJob`.

If you want full convergence toward the desired resource shape, including cleanup semantics, consider also syncing owner references:

```diff
 existingPolicy.Spec = desiredPolicy.Spec
 existingPolicy.Labels = desiredPolicy.Labels
+existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
```

This does change behavior for any manually created NP with the same name, so if that scenario matters you may want to guard reconciliation with a label/annotation check instead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- pkg/controller/trainjob_controller.go (2 hunks)
- pkg/rhai/networkpolicy.go (1 hunks)
- pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/controller/trainjob_controller.go (1)
- pkg/rhai/networkpolicy.go (1): ReconcileNetworkPolicy (145-171)

pkg/rhai/networkpolicy.go (2)
- pkg/rhai/constants/constants.go (1): DefaultControllerNamespace (69-69)
- pkg/rhai/progression/progression.go (1): GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Generate
- GitHub Check: Test
- GitHub Check: pre-commit
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/rhai/networkpolicy.go (1)
77-79: Use the constant for the default metrics port.

This hardcodes `28080` as the fallback, but `constants.DefaultMetricsPort` already defines this value. Using the constant improves maintainability and consistency with other code paths.

```diff
 if err != nil {
-	portNum = 28080 // default
+	portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 }
```
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)
38-48: Consider caching the controller namespace.

`getControllerNamespace()` reads from the filesystem on every call during reconciliation. Since the namespace is static for the controller's lifetime, caching would avoid repeated I/O.

```diff
+var cachedControllerNamespace string
+
 // getControllerNamespace returns the controller's namespace from SA mount.
 func getControllerNamespace() string {
+	if cachedControllerNamespace != "" {
+		return cachedControllerNamespace
+	}
 	if data, err := os.ReadFile(serviceAccountNamespaceFile); err == nil {
 		if ns := strings.TrimSpace(string(data)); ns != "" {
+			cachedControllerNamespace = ns
 			return ns
 		}
 	}
-	return constants.DefaultControllerNamespace
+	cachedControllerNamespace = constants.DefaultControllerNamespace
+	return cachedControllerNamespace
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pkg/rhai/networkpolicy.go (1 hunks)
- pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
- DefaultControllerNamespace (69-69)

pkg/rhai/progression/progression.go (2)
- IsProgressionTrackingEnabled (88-94)
- GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Test
- GitHub Check: Generate
🔇 Additional comments (4)
pkg/rhai/networkpolicy.go (1)
149-177: LGTM on the reconciliation logic.

The create-or-update pattern is correctly implemented with proper error handling and context wrapping. The OwnerReference ensures automatic cleanup when the TrainJob is deleted.
pkg/rhai/networkpolicy_test.go (3)
36-67: LGTM!

Table-driven tests correctly verify the network policy name derivation.
69-258: LGTM!

Comprehensive test coverage for `buildNetworkPolicy` with well-structured assertions covering metadata, labels, OwnerReference, selectors, policy types, and ingress rule structure for both progression-enabled and disabled scenarios.
372-449: LGTM!

Security-focused tests correctly verify important properties: ingress-only policy type, controller-exclusive metrics access, namespace isolation for same-job pods, and OwnerReference for automatic cleanup.
/lgtm
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)
152-180: Consider preserving OwnerReferences on update.

The update path (lines 173-177) copies `Spec` and `Labels` from the desired policy but not `OwnerReferences`. If the OwnerReference is accidentally removed (e.g., by manual edit), it won't be restored, potentially leaving orphaned NetworkPolicies after TrainJob deletion.

```diff
 existingPolicy.Spec = desiredPolicy.Spec
 existingPolicy.Labels = desiredPolicy.Labels
+existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
- pkg/rhai/constants/constants.go (3 hunks)
- pkg/rhai/networkpolicy.go (1 hunks)
- pkg/rhai/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- manifests/rhoai/rbac_progression_patch.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Generate
- GitHub Check: Test
🔇 Additional comments (10)
pkg/rhai/networkpolicy_test.go (4)
36-67: LGTM!

Clean table-driven test that verifies the network policy name derivation correctly returns the TrainJob name.

69-258: LGTM!

Comprehensive test covering all key scenarios: progression enabled/disabled, default/custom ports, and proper verification of metadata, labels, OwnerReferences, selectors, and ingress rules. The rule identification logic based on port presence is sound.

260-380: LGTM!

The reconciliation tests properly verify create and update paths using a fake client. The update test case correctly includes the progression tracking annotation (addressing the past review comment).

382-459: LGTM!

Excellent security property tests that validate critical isolation guarantees: ingress-only policy, controller-restricted metrics access, same-namespace enforcement for pod isolation, and owner reference cleanup semantics.
pkg/rhai/constants/constants.go (2)
65-87: LGTM!

Well-documented constants with clear explanations of the deployment context differences between RHOAI and upstream Kubeflow. Using standard Kubernetes label conventions (`app.kubernetes.io/`) for controller pod identification is the right approach.
24-25: LGTM!

Helpful documentation improvements clarifying the annotation values and port restrictions for non-privileged environments.
Also applies to: 34-36
pkg/rhai/networkpolicy.go (4)
41-49: LGTM!

Standard pattern for namespace discovery from the service account mount with appropriate fallback. This correctly addresses the earlier review suggestion to read from the SA namespace file.

51-53: LGTM!

Simple function that correctly implements the reviewed naming strategy: the NetworkPolicy shares the TrainJob's name for easy correlation.

55-142: LGTM!

Well-structured NetworkPolicy construction with:
- Same-job pod isolation rule (implicit same-namespace via nil NamespaceSelector)
- Conditional controller access rule with proper namespace + pod selectors
- Correct OwnerReference setup for garbage collection
- Port parsing with warning log and constant fallback
144-150: LGTM!

Standard pointer helper functions for Kubernetes resource construction.
@abhijeet-dhumal let's hold this PR now until after the 3.2 release. It's too close to code freeze and adds unnecessary risk. We can release it in 3.3.
ChughShilpa
left a comment
Tested with the latest image - http://quay.io/abdhumal/trainer:v2.1.0-jan2-netpol
and works as expected
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Force-pushed 54b3016 to bb1e230 (Compare)
RHOAIENG-39323
Implement controller-managed NetworkPolicy to provide pod isolation for TrainJob workloads, addressing security concerns.
The NetworkPolicy provides pod isolation for TrainJob pods:
Rule 1: Same-Job Pod Communication (Always)
Rule 2: Controller Access to Metrics Port (When Progression Enabled)
What gets blocked:
This addresses multi-tenancy security concerns where training metrics could be accessed by other pods.
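Under those two rules, the generated policy would look roughly like the sketch below. This is illustrative only: the names, the label keys, the owner `apiVersion`, and the 28080 port are assumptions for readability, not the exact output of the controller's `buildNetworkPolicy`:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-trainjob              # same name as the TrainJob
  ownerReferences:               # garbage-collected with the TrainJob
    - apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API group/version
      kind: TrainJob
      name: my-trainjob
spec:
  podSelector:
    matchLabels:
      trainjob-name: my-trainjob # hypothetical job label
  policyTypes: ["Ingress"]       # ingress-only; egress is untouched
  ingress:
    # Rule 1 (always): same-job pods, all ports (NCCL/MPI/gRPC).
    # Omitting namespaceSelector restricts peers to the policy's own namespace.
    - from:
        - podSelector:
            matchLabels:
              trainjob-name: my-trainjob
    # Rule 2 (progression enabled): controller pods may reach the metrics port.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: opendatahub
          podSelector:
            matchLabels:
              app.kubernetes.io/component: controller
      ports:
        - protocol: TCP
          port: 28080
```

Any ingress not matched by either rule (cross-job pods, other namespaces, non-controller access to the metrics port) is blocked, which is the multi-tenancy guarantee described above.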
Checklist:
Summary by CodeRabbit
New Features
Security / Permissions
Tests
Documentation