
fix: Add NetworkPolicy for metrics endpoint security #49

Merged
abhijeet-dhumal merged 1 commit into opendatahub-io:main from abhijeet-dhumal:add-network-policy
Jan 6, 2026

Conversation

abhijeet-dhumal (Member) commented Dec 10, 2025

RHOAIENG-39323

Implement controller-managed NetworkPolicy to provide pod isolation for TrainJob workloads, addressing security concerns.

The NetworkPolicy provides pod isolation for TrainJob pods:

Rule 1: Same-Job Pod Communication (Always)

  • Pods belonging to the same TrainJob can communicate on all ports
  • Required for distributed training (NCCL, MPI, gRPC, etc.)
  • Applied to ALL TrainJobs

Rule 2: Controller Access to Metrics Port (When Progression Enabled)

  • Only controller pods can access the metrics port (default 28080)
  • Conditional on trainer.opendatahub.io/progression-tracking: "true" annotation
  • External pods/services cannot scrape training metrics

What gets blocked:

  • Pods from other TrainJobs in the same namespace cannot access your training pods
  • Random pods in the namespace cannot probe the metrics endpoint
  • Cross-job traffic is denied

This addresses multi-tenancy security concerns where training metrics could be accessed by other pods.
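Putting the two rules together, the generated policy for a TrainJob named `my-job` would look roughly like the following sketch. The label keys, controller namespace, and selector shapes are illustrative assumptions based on this PR's description, not the exact generated manifest:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-job                # named after the TrainJob (per review feedback)
  namespace: user-namespace
spec:
  podSelector:
    matchLabels:
      jobset.sigs.k8s.io/jobset-name: my-job   # assumed label key on job pods
  policyTypes:
    - Ingress                 # ingress-only; egress is unrestricted
  ingress:
    # Rule 1: same-job pods may communicate on all ports (NCCL/MPI/gRPC)
    - from:
        - podSelector:
            matchLabels:
              jobset.sigs.k8s.io/jobset-name: my-job
    # Rule 2 (only when progression tracking is enabled):
    # controller pods may reach the metrics port
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: opendatahub   # controller namespace
          podSelector:
            matchLabels:
              app.kubernetes.io/component: controller
      ports:
        - protocol: TCP
          port: 28080         # default metrics port
```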

(Three verification screenshots attached, dated Dec 11, 2025.)

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • New Features

    • Automatic creation and reconciliation of NetworkPolicy to isolate TrainJob pods and allow controlled metrics access, with a namespace fallback.
  • Security / Permissions

    • RBAC extended to permit managing NetworkPolicy resources (get, list, watch, create, update, patch).
  • Tests

    • Added comprehensive unit tests for policy construction, reconciliation, updates, and security behaviors.
  • Documentation

    • Clarified metrics port guidance and valid non-root port ranges.


coderabbitai bot commented Dec 10, 2025

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

Adds NetworkPolicy reconciliation for TrainJob pods: the RBAC rule is extended for networking.k8s.io/networkpolicies, new controller-namespace and pod-label constants are introduced, the TrainJob controller invokes Rhai.ReconcileNetworkPolicy, and a new networkpolicy implementation plus unit tests are added.

Changes

  • RBAC Configuration (manifests/rhoai/rbac_progression_patch.yaml): Appends a second RBAC rule granting networking.k8s.io networkpolicies verbs get, list, watch, create, update, patch; retains the existing pods rule and RHAI comments.
  • Constants & Configuration (pkg/rhai/constants/constants.go): Updates annotation docs (progression tracking value and metrics port range) and adds NetworkPolicy constants: DefaultControllerNamespace, ControllerPodLabelName, ControllerPodLabelNameValue, ControllerPodLabelComponent, ControllerPodLabelComponentValue.
  • Controller Integration (pkg/controller/trainjob_controller.go): Imports rhai and calls Rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob) after applying runtime objects; propagates reconciliation errors.
  • NetworkPolicy Implementation (pkg/rhai/networkpolicy.go): New: determines the controller namespace (SA file with fallback), builds the NetworkPolicy (podSelector by jobset-name, labels, OwnerReference, PolicyTypes=Ingress), adds ingress rules (same-job pods; optional controller metrics access when progression is enabled), and implements Get/Create/Update reconciliation with wrapped errors.
  • NetworkPolicy Tests (pkg/rhai/networkpolicy_test.go): New unit tests covering policy naming, build behavior (default/custom metrics ports), metadata/labels/ownerRefs, ingress rule semantics, reconciliation create/update flows, progression enabled/disabled cases, and security properties.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Controller as TrainJob Controller
  participant Rhai as Rhai.Module
  participant SA as ServiceAccount FS
  participant K8s as Kubernetes API

  Controller->>Rhai: ReconcileNetworkPolicy(ctx, client, trainJob)
  Rhai->>SA: read namespace file (if mounted)
  SA-->>Rhai: namespace (or none)
  Rhai->>Rhai: resolve controller namespace (fallback)
  Rhai->>Rhai: build desired NetworkPolicy (labels, podSelector, ingress rules, ownerRef)
  Rhai->>K8s: Get NetworkPolicy by name/namespace
  alt not found
    Rhai->>K8s: Create NetworkPolicy
    K8s-->>Rhai: Created
  else found
    Rhai->>K8s: Update NetworkPolicy (if diff)
    K8s-->>Rhai: Updated
  end
  Rhai-->>Controller: return result or error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through YAML, Go, and tests with care,
I stitched pod gates and labeled namespace lair,
A metrics gap for the trainer to peep,
OwnerRefs tidy, secrets tucked deep,
I nibble bugs away — a soft, patchy hare 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title 'fix: Add NetworkPolicy for metrics endpoint security' directly and accurately describes the main change: implementing a NetworkPolicy for securing the metrics endpoint in TrainJob workloads.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54b3016 and bb1e230.

📒 Files selected for processing (5)
  • manifests/rhoai/rbac_progression_patch.yaml
  • pkg/controller/trainjob_controller.go
  • pkg/rhai/constants/constants.go
  • pkg/rhai/networkpolicy.go
  • pkg/rhai/networkpolicy_test.go


coveralls commented Dec 10, 2025

Pull Request Test Coverage Report for Build 20747341008

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 106 of 122 (86.89%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.3%) to 54.988%

Changes missing coverage:

  • pkg/controller/trainjob_controller.go: 0 of 3 changed/added lines covered (0.0%)
  • pkg/rhai/networkpolicy.go: 106 of 119 changed/added lines covered (89.08%)

Files with coverage reduction:

  • pkg/controller/trainjob_controller.go: 1 new missed line (0.0%)

Totals:

  • Change from base Build 20170001728: +1.3%
  • Covered Lines: 1670
  • Relevant Lines: 3037

💛 - Coveralls

abhijeet-dhumal marked this pull request as ready for review December 10, 2025 07:33
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
pkg/rhai/constants/constants.go (2)

33-38: Clarify whether port range is enforced or purely advisory

The comment now specifies 1024–65535 as the valid metrics-port range, but neither GetMetricsPort nor the NetworkPolicy builder currently enforces it (only non-numeric values fall back to the default). Consider either:

  • enforcing the documented range (e.g., clamp or fall back to default), or
  • rephrasing the comment to make it clear the range is a recommendation, not a hard validation.
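If the documented range were enforced, a minimal sketch could look like the following. The function name and the default-port constant are illustrative, not the project's actual API:

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultMetricsPort stands in for the shared constant (assumed value).
const defaultMetricsPort = 28080

// resolveMetricsPort sketches the suggested hardening: fall back to the
// default both for non-numeric values and for numeric values outside the
// documented non-root range 1024-65535.
func resolveMetricsPort(annotation string) int {
	port, err := strconv.Atoi(annotation)
	if err != nil || port < 1024 || port > 65535 {
		return defaultMetricsPort
	}
	return port
}

func main() {
	fmt.Println(resolveMetricsPort("9090"))  // 9090 (valid custom port)
	fmt.Println(resolveMetricsPort("abc"))   // 28080 (non-numeric fallback)
	fmt.Println(resolveMetricsPort("70000")) // 28080 (out-of-range fallback)
}
```

This makes the "0" and "70000" cases mentioned below testable with a single table entry each.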

66-72: New NetworkPolicy constants look good; ensure deployment docs match

NetworkPolicyNameSuffix and DefaultControllerNamespace are reasonable defaults and are cleanly consumed from networkpolicy.go. Just make sure controller deployment manifests and docs clearly describe:

  • the default controller namespace (opendatahub), and
  • the CONTROLLER_NAMESPACE override behavior.
pkg/rhai/progression/progression.go (1)

697-702: Non‑fatal NetworkPolicy reconciliation fits the progression flow

Calling ReconcileNetworkPolicy once progression tracking is enabled, and treating failures as non‑fatal with V(1) logging, matches the “best‑effort hardening” goal and avoids breaking metrics collection in clusters without NetworkPolicy support. If you ever want to reduce churn, you could optionally gate this on isRunning, but the current approach is safe.

pkg/rhai/progression/networkpolicy.go (1)

137-168: Ensure OwnerReferences are reconciled on update and be explicit about label overwrites

ReconcileNetworkPolicy behaves correctly for the happy path, but two edge‑case behaviors are worth tightening:

  1. OwnerReferences on update
    On create, buildNetworkPolicy sets an OwnerReference so the policy is garbage‑collected with the TrainJob. On update you only copy Spec and Labels:

    existingPolicy.Spec = desiredPolicy.Spec
    existingPolicy.Labels = desiredPolicy.Labels

    If an older deployment or a user created a policy without OwnerReferences, it will remain orphaned even after this controller starts managing it. Consider also reconciling OwnerReferences (e.g., replace or at least set when empty) to guarantee cleanup semantics for all managed policies.

  2. Label replacement semantics
    Replacing the entire Labels map is usually fine for a controller‑owned object but will drop any user‑added labels on existing policies. If you expect users or other tooling to attach labels (e.g., for monitoring), merging controller‑owned labels into the existing map instead of wholesale replacement would be friendlier.

Both tweaks are backward‑compatible and make the reconciliation behavior more predictable.
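The label-merge semantics suggested in point 2 can be sketched as below. The helper name is hypothetical; controllers can also delegate this to server-side apply instead of hand-merging:

```go
package main

import "fmt"

// mergeLabels keeps user-added labels on the existing object while letting
// controller-owned labels win on conflict, instead of wholesale replacement.
func mergeLabels(existing, desired map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range desired { // controller-owned labels overwrite on conflict
		merged[k] = v
	}
	return merged
}

func main() {
	existing := map[string]string{
		"team":                         "ml",      // user-added, should survive
		"app.kubernetes.io/managed-by": "old",
	}
	desired := map[string]string{
		"app.kubernetes.io/managed-by": "trainer", // controller-owned
	}
	fmt.Println(mergeLabels(existing, desired))
}
```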

pkg/rhai/progression/networkpolicy_test.go (2)

119-307: Broader port validation and negative tests would strengthen coverage

TestBuildNetworkPolicy thoroughly checks metadata, selectors, owner refs, and both default/custom ports, plus the non‑numeric fallback case. Given the documented valid range, you might add one more table entry for a numeric but invalid port (e.g., "0" or "70000") to assert whatever behavior you choose in buildNetworkPolicy (fallback or clamp). That will lock in the semantics once you tighten the range handling.


309-461: Consider asserting OwnerReferences on update scenarios as well

TestReconcileNetworkPolicy covers create/no‑op/update flows with a fake client, which is great. For the “updates existing NetworkPolicy” case, you currently only validate that the port changed. Once ReconcileNetworkPolicy starts reconciling OwnerReferences on update, it’d be useful to extend this test to assert that:

  • the updated policy has exactly one OwnerReference, and
  • it points at the expected TrainJob UID.

That will prevent regressions in cleanup behavior.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fea7ab and a493113.

📒 Files selected for processing (5)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
  • pkg/rhai/progression/progression.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
  • DefaultControllerNamespace (72-72)
  • NetworkPolicyNameSuffix (69-69)
pkg/rhai/progression/progression.go (2)
  • GetMetricsPort (207-215)
  • IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (6)
manifests/rhoai/rbac_progression_patch.yaml (1)

14-28: RBAC additions are minimal and aligned with controller behavior

Granting get, create, and update on networkpolicies matches the actual usage in ReconcileNetworkPolicy and keeps the permission surface tight. No additional verbs seem necessary given the current implementation.

pkg/rhai/progression/networkpolicy.go (2)

36-42: Controller namespace helper aligns with env‑override design

getControllerNamespace correctly prefers the CONTROLLER_NAMESPACE env var and falls back to DefaultControllerNamespace. This matches the tests and RBAC patch; just ensure the operator manifest sets this env var appropriately in non‑default deployments.


48-126: Tie default port to shared constant and consider enforcing valid range

buildNetworkPolicy looks solid overall (selectors, owner refs, ingress rules), but two small robustness points:

  1. Default port literal
    You currently fall back to 28080 via a literal:

    portNum, err := strconv.Atoi(metricsPort)
    if err != nil {
        portNum = 28080
    }

    To avoid divergence if the default ever changes, consider deriving this from a shared constant instead of hard-coding.

  2. Range validation
    Numeric but out-of-range values (e.g., "0", "65536", negative) will pass Atoi and produce a NetworkPolicy that the API server may reject. Consider validating portNum and falling back to the default if it's outside the allowed range, and/or logging a warning when clamping/falling back so misconfigurations are visible.

These changes would keep metrics port handling consistent and harden against bad annotations.

pkg/rhai/progression/networkpolicy_test.go (3)

37-84: Env override tests are sound and restore global state correctly

TestGetControllerNamespace nicely exercises default and env‑override behavior while restoring CONTROLLER_NAMESPACE per subtest. Good pattern for avoiding cross‑test contamination.


463-520: Controller namespace from env is well‑covered

TestReconcileNetworkPolicy_ControllerNamespaceFromEnv correctly verifies that the NamespaceSelector on the controller peer uses the custom namespace when CONTROLLER_NAMESPACE is set. The save/restore logic around the env var is also correct.


522-604: Security‑property tests match the intended policy design

TestBuildNetworkPolicy_SecurityProperties clearly encodes the key invariants: ingress‑only, controller‑only access to the metrics port, same‑namespace restrictions for same‑job pods, and owner‑based cleanup. This is a solid safety net around the NetworkPolicy structure.

robert-bell (Collaborator) left a comment

lgtm - thanks for this @abhijeet-dhumal!

Have you been able to test this in a real cluster?

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/rhai/progression/networkpolicy.go (2)

59-64: Use the existing constant for the default port fallback.

The hardcoded 28080 duplicates the value defined in constants.DefaultMetricsPort. Using the constant ensures consistency if the default ever changes.

 	metricsPort := GetMetricsPort(trainJob)
 	portNum, err := strconv.Atoi(metricsPort)
 	if err != nil {
-		portNum = 28080 // default
+		portNum, _ = strconv.Atoi(constants.DefaultMetricsPort) // fallback to default
 	}

Alternatively, consider extracting a helper like GetMetricsPortInt() to avoid duplicate parsing logic.


171-175: Consider restoring OwnerReferences during update.

Currently, only Spec and Labels are synchronized during updates. If OwnerReferences are manually removed from the existing policy, they won't be restored, which could break automatic garbage collection.

 	existingPolicy.Spec = desiredPolicy.Spec
 	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 	if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {

This is a minor concern since OwnerReferences are rarely modified manually, but including them ensures complete reconciliation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a493113 and f3b0bd7.

📒 Files selected for processing (4)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/rhai/constants/constants.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
  • DefaultControllerNamespace (73-73)
  • NetworkPolicyNameSuffix (69-69)
pkg/rhai/progression/progression.go (2)
  • GetMetricsPort (207-215)
  • IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (5)
manifests/rhoai/rbac_progression_patch.yaml (1)

14-30: LGTM! Well-documented RBAC permissions for NetworkPolicy management.

The permissions are appropriately scoped:

  • get/list/watch enables controller-runtime's caching mechanism
  • create/update allows reconciliation
  • delete is correctly omitted since OwnerReference-based garbage collection handles cleanup

The inline comments clearly explain the security rationale.

pkg/rhai/progression/networkpolicy.go (1)

39-51: Solid namespace detection with proper fallback chain.

The priority order (SA file → env var → default) correctly handles both in-cluster and local development scenarios.

pkg/rhai/progression/networkpolicy_test.go (3)

62-85: Good test coverage with proper environment cleanup.

The save/restore pattern for environment variables ensures test isolation. The tests adequately cover the fallback chain when the SA namespace file isn't available.


311-463: Comprehensive reconciliation tests covering all key paths.

The tests cover:

  • Policy creation when progression tracking is enabled
  • No-op when tracking is disabled or missing
  • Policy updates with changed configuration

Good use of the fake client with proper scheme registration.


524-606: Excellent security property tests.

These tests effectively document and verify the security invariants of the NetworkPolicy:

  • Ingress-only enforcement
  • Controller-restricted metrics access
  • Namespace isolation for same-job communication
  • Automatic cleanup via OwnerReference

This is a good practice for security-critical code.

// NetworkPolicy constants for metrics endpoint security

// NetworkPolicyNameSuffix is appended to TrainJob name to create NetworkPolicy name.
NetworkPolicyNameSuffix string = "-metrics-netpol"


The NWP is not specific to metrics, it also covers pod-to-pod traffic.
It could be named as the train job.

abhijeet-dhumal (Member, Author):

Ah, that's an awesome catch!
The NetworkPolicy isn't just for metrics; it has two rules:

  • Controller → metrics port
  • Same-job pods → all ports (for NCCL/MPI/gRPC)

I will update it to the TrainJob name, so that the TrainJob and NetworkPolicy have the same name, as you suggested.

abhijeet-dhumal (Member, Author):

Thanks a lot @astefanutti for reviewing the PR!

}

// Ensure NetworkPolicy exists to restrict metrics endpoint access to controller only
if err := ReconcileNetworkPolicy(ctx, c, trainJob); err != nil {


The NWP scope is larger than metrics / progression.
Would that make sense to reconcile it in the TrainJob controller reconcileObjects method directly or in runtime.NewObjects?

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
pkg/rhai/progression/networkpolicy_test.go (2)

220-223: Fix inconsistent error message.

The error message says "want manager" but the code checks for "controller". This was flagged in a previous review but appears unfixed.

Apply this diff:

 		if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-			t.Errorf("Rule 1: Controller component label = %q, want manager",
+			t.Errorf("Rule 1: Controller component label = %q, want controller",
 				controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
 		}

424-426: Fix inconsistent error message.

The error message says "Missing manager component label requirement" but the code checks for "controller". This was flagged in a previous review but appears unfixed.

Apply this diff:

 		if peer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-			t.Error("Missing manager component label requirement")
+			t.Error("Missing controller component label requirement")
 		}
🧹 Nitpick comments (2)
pkg/rhai/constants/constants.go (1)

68-69: Consider a more generic default namespace for upstream.

The DefaultControllerNamespace = "opendatahub" is specific to Red Hat's RHOAI distribution. For the upstream Kubeflow project, a more generic default like "kubeflow" or making it configurable might be more appropriate. This fallback is rarely used (only when the service account namespace file is unavailable), but it could cause confusion for non-RHOAI deployments.

pkg/rhai/progression/networkpolicy.go (1)

134-140: Consider using standard pointer utilities.

The boolPtr and protocolPtr helpers duplicate functionality already available in k8s.io/utils/ptr.To[T](). Using the standard library improves consistency across the codebase.

For example:

import "k8s.io/utils/ptr"

// Replace boolPtr(true) with ptr.To(true)
// Replace protocolPtr(corev1.ProtocolTCP) with ptr.To(corev1.ProtocolTCP)

Note: I see that k8s.io/utils/ptr is already imported in trainjob_controller.go (line 33), suggesting it's an accepted pattern in this codebase.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a33fea0 and 9765b9a.

📒 Files selected for processing (4)
  • pkg/controller/trainjob_controller.go (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/rhai/progression/networkpolicy_test.go (3)
pkg/rhai/constants/constants.go (2)
  • AnnotationMetricsPort (38-38)
  • AnnotationProgressionTracking (26-26)
pkg/util/testing/client.go (1)
  • NewClientBuilder (35-47)
pkg/rhai/progression/networkpolicy.go (1)
  • ReconcileNetworkPolicy (144-170)
pkg/controller/trainjob_controller.go (1)
pkg/rhai/progression/networkpolicy.go (1)
  • ReconcileNetworkPolicy (144-170)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (1)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (9)
pkg/rhai/constants/constants.go (1)

34-36: LGTM: Clear documentation of port constraints.

The updated comment properly explains the valid port range and the security rationale for avoiding privileged ports.

pkg/rhai/progression/networkpolicy.go (4)

37-47: LGTM: Robust namespace detection with fallback.

The implementation correctly reads the service account namespace file with appropriate error handling and fallback to the default constant.


49-51: LGTM: Simplified naming aligns with PR feedback.

Based on the past review comments, this was appropriately changed from metrics-specific naming to match the TrainJob name directly, since the NetworkPolicy covers both metrics and pod-to-pod traffic.


64-131: LGTM: NetworkPolicy design follows security best practices.

The policy correctly:

  • Uses OwnerReferences for automatic cleanup
  • Restricts metrics port access to controller pods in the controller namespace
  • Allows same-job pod communication on all ports (for NCCL/MPI/gRPC)
  • Prevents cross-namespace and cross-job access via appropriate selectors
  • Uses Ingress-only policy type

The implementation aligns well with the multi-tenancy security goals stated in the PR.


143-170: LGTM: Standard Kubernetes reconciliation pattern.

The create-or-update logic follows best practices with:

  • Proper NotFound handling
  • Descriptive error wrapping
  • Updating both Spec and Labels on the update path
pkg/rhai/progression/networkpolicy_test.go (4)

36-67: LGTM: Clear test coverage for name derivation.

The test appropriately verifies that NetworkPolicy names match the TrainJob name.


69-257: LGTM: Comprehensive policy construction tests.

Excellent coverage of:

  • Default, custom, and invalid port handling
  • Metadata, labels, and OwnerReferences
  • PodSelector and ingress rules
  • Both controller and same-job peer configurations

259-380: LGTM: Solid reconciliation flow tests.

The tests properly verify both create and update paths using a fake client, including ownership and spec updates.


382-464: LGTM: Excellent security property validation.

The security-focused test suite effectively validates:

  • Ingress-only policy type
  • Controller-restricted metrics port access
  • Namespace isolation for same-job pods
  • OwnerReference cleanup semantics

This demonstrates a security-conscious approach to the NetworkPolicy design.

Comment on lines 161 to 164
// Reconcile NetworkPolicy for pod isolation
if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
return err
}
coderabbitai bot commented Dec 11, 2025

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the networkpolicy implementation and annotation checks
rg -n -A10 -B2 'func.*ReconcileNetworkPolicy' pkg/rhai/progression/

Repository: opendatahub-io/trainer

Length of output: 2193


🏁 Script executed:

#!/bin/bash
# Also search for annotation checks in the progression package
rg -n 'AnnotationProgressionTracking' pkg/rhai/progression/

Repository: opendatahub-io/trainer

Length of output: 1095


🏁 Script executed:

#!/bin/bash
# Check if there's an annotation constant defined
rg -n 'AnnotationProgressionTracking' --type-list | head -20
rg -n 'AnnotationProgressionTracking' pkg/

Repository: opendatahub-io/trainer

Length of output: 1746


🏁 Script executed:

#!/bin/bash
# Check progression.go around line 92 to see if annotation check wraps ReconcileNetworkPolicy
sed -n '85,110p' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer

Length of output: 1048


🏁 Script executed:

#!/bin/bash
# Check the buildNetworkPolicy function
rg -n -A15 'func buildNetworkPolicy' pkg/rhai/progression/networkpolicy.go

Repository: opendatahub-io/trainer

Length of output: 649


🏁 Script executed:

#!/bin/bash
# Check the full context of how ReconcileNetworkPolicy is called in progression.go
rg -n -B5 -A5 'ReconcileNetworkPolicy' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Check the trainjob_controller.go to understand the exact call context
sed -n '155,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 785


🏁 Script executed:

#!/bin/bash
# Search for all calls to ReconcileNetworkPolicy to see if they're conditional elsewhere
rg -n -B3 -A1 'ReconcileNetworkPolicy' pkg/

Repository: opendatahub-io/trainer

Length of output: 1970


🏁 Script executed:

#!/bin/bash
# Check if there are any conditions checking progression tracking before the NetworkPolicy call
sed -n '155,170p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 600


🏁 Script executed:

#!/bin/bash
# Check what happens in the broader reconciliation context
sed -n '130,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 1688


Add conditional check for progression-tracking annotation before NetworkPolicy reconciliation

The ReconcileNetworkPolicy call is unconditional, but the PR description states the NetworkPolicy is added "when progression-tracking is enabled." Since IsProgressionTrackingEnabled() already exists, wrap the call:

if progression.IsProgressionTrackingEnabled(trainJob) {
    if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
        return err
    }
}

Otherwise, NetworkPolicy is created for all TrainJobs regardless of the progression-tracking setting.

🤖 Prompt for AI Agents
In pkg/controller/trainjob_controller.go around lines 161 to 164, the call to
progression.ReconcileNetworkPolicy runs unconditionally causing NetworkPolicies
to be created for all TrainJobs; wrap that call with a conditional check using
progression.IsProgressionTrackingEnabled(trainJob) and only call
ReconcileNetworkPolicy if it returns true, preserving the existing error
handling to return the error when reconciliation fails.

nit: I think this is a good suggestion. Can we ensure that, for any existing TrainJob with progression tracking disabled, the pod behaviour does not change?
cc @abhijeet-dhumal


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

abhijeet-dhumal (Member, Author):

@kapil27 Actually, this was an intentional design decision. The NetworkPolicy serves two purposes, as described in the PR description:

  • Pod isolation → only same-job pods can communicate; cross-job traffic is blocked (the primary purpose, not dependent on progression)
  • Progression metrics port restriction → controller-only access

These benefit all TrainJobs, regardless of progression tracking. Without the policy, any pod in the namespace could access training pods.

abhijeet-dhumal (Member, Author):

Ah, come to think of it, you are right 🤔
We should make the progression-based rule conditional on IsProgressionTrackingEnabled(), since the metrics server only runs when progression tracking is enabled 👀
Thanks @kapil27, on it!

robert-bell (Collaborator) commented Dec 11, 2025

nit: does it make sense to move this netpol code into the rhai package, rather than progression package?

I'm happy for this to be merged as is though :)

Member Author

Yeah, definitely. Since this netpol is not limited to the progression scope, it would be good to refactor it accordingly.
Thanks Rob, on it!!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
pkg/rhai/networkpolicy.go (1)

54-61: Use the shared default metrics port constant instead of a hardcoded literal

The fallback to 28080 on parse error duplicates the default already encoded in constants.DefaultMetricsPort. Using the constant keeps behavior centralized and avoids drift if the default ever changes.

You can reuse the constant here:

 func buildNetworkPolicy(trainJob *trainer.TrainJob) *networkingv1.NetworkPolicy {
 	metricsPort := progression.GetMetricsPort(trainJob)
 	portNum, err := strconv.Atoi(metricsPort)
 	if err != nil {
-		portNum = 28080 // default
+		// Parse default port constant as fallback to keep it in sync with progression defaults
+		portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 	}
 	port := intstr.FromInt(portNum)
pkg/controller/trainjob_controller.go (1)

162-165: Gate NetworkPolicy reconciliation on progression-tracking being enabled

Right now rhai.ReconcileNetworkPolicy runs for every TrainJob, which contradicts the stated behavior “when progression-tracking is enabled” and may surprise users by tightening network policy even when they don’t use progression tracking.

Wrap the call with the existing progression flag check so NetworkPolicies are only managed when progression tracking is on:

 	for _, object := range objects {
 		if err := r.client.Apply(ctx, object, client.FieldOwner("trainer"), client.ForceOwnership); err != nil {
 			return err
 		}
 	}
-	// Reconcile NetworkPolicy for pod isolation
-	if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
-		return err
-	}
+	// Reconcile NetworkPolicy for pod isolation when progression tracking is enabled
+	if progression.IsProgressionTrackingEnabled(trainJob) {
+		if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
+			return err
+		}
+	}
 	return nil
pkg/rhai/networkpolicy_test.go (1)

220-223: Fix mismatch in expected component label in test failure message

The assertion checks for "controller" but the error message still says "want manager", which is confusing when the test fails.

Update the message to match the actual expectation:

-			if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-				t.Errorf("Rule 1: Controller component label = %q, want manager",
-					controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
-			}
+			if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
+				t.Errorf("Rule 1: Controller component label = %q, want controller",
+					controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
+			}
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

164-170: Consider also reconciling OwnerReferences when updating an existing NetworkPolicy

On update, you currently copy Spec and Labels but not OwnerReferences. If a matching NetworkPolicy pre-exists (same name/namespace) without the expected owner reference, it will be functionally “adopted” by this controller (spec/labels overwritten) but still won’t be garbage-collected with the TrainJob.

If you want full convergence toward the desired resource shape, including cleanup semantics, consider also syncing owner references:

-	existingPolicy.Spec = desiredPolicy.Spec
-	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.Spec = desiredPolicy.Spec
+	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences

This does change behavior for any manually created NP with the same name, so if that scenario matters you may want to guard reconciliation with a label/annotation check instead.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9765b9a and 070e02f.

📒 Files selected for processing (3)
  • pkg/controller/trainjob_controller.go (2 hunks)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/controller/trainjob_controller.go (1)
pkg/rhai/networkpolicy.go (1)
  • ReconcileNetworkPolicy (145-171)
pkg/rhai/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (1)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Generate
  • GitHub Check: Test
  • GitHub Check: pre-commit

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/rhai/networkpolicy.go (1)

77-79: Use the constant for the default metrics port.

This hardcodes 28080 as the fallback, but constants.DefaultMetricsPort already defines this value. Using the constant improves maintainability and consistency with other code paths.

 		if err != nil {
-			portNum = 28080 // default
+			portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 		}
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

38-48: Consider caching the controller namespace.

getControllerNamespace() reads from the filesystem on every call during reconciliation. Since the namespace is static for the controller's lifetime, caching would avoid repeated I/O.

+var cachedControllerNamespace string
+
 // getControllerNamespace returns the controller's namespace from SA mount.
 func getControllerNamespace() string {
+	if cachedControllerNamespace != "" {
+		return cachedControllerNamespace
+	}
 	if data, err := os.ReadFile(serviceAccountNamespaceFile); err == nil {
 		if ns := strings.TrimSpace(string(data)); ns != "" {
+			cachedControllerNamespace = ns
 			return ns
 		}
 	}
-	return constants.DefaultControllerNamespace
+	cachedControllerNamespace = constants.DefaultControllerNamespace
+	return cachedControllerNamespace
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 070e02f and 9e316d4.

📒 Files selected for processing (2)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (2)
  • IsProgressionTrackingEnabled (88-94)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (4)
pkg/rhai/networkpolicy.go (1)

149-177: LGTM on the reconciliation logic.

The create-or-update pattern is correctly implemented with proper error handling and context wrapping. The OwnerReference ensures automatic cleanup when the TrainJob is deleted.

pkg/rhai/networkpolicy_test.go (3)

36-67: LGTM!

Table-driven tests correctly verify the network policy name derivation.


69-258: LGTM!

Comprehensive test coverage for buildNetworkPolicy with well-structured assertions covering metadata, labels, OwnerReference, selectors, policy types, and ingress rule structure for both progression-enabled and disabled scenarios.


372-449: LGTM!

Security-focused tests correctly verify important properties: ingress-only policy type, controller-exclusive metrics access, namespace isolation for same-job pods, and OwnerReference for automatic cleanup.

@kapil27

kapil27 commented Dec 12, 2025

/lgtm

Collaborator

@sutaakar sutaakar left a comment

/lgtm

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

152-180: Consider preserving OwnerReferences on update.

The update path (lines 173-177) copies Spec and Labels from the desired policy but not OwnerReferences. If the OwnerReference is accidentally removed (e.g., by manual edit), it won't be restored, potentially leaving orphaned NetworkPolicies after TrainJob deletion.

 	existingPolicy.Spec = desiredPolicy.Spec
 	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 	if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e0cc4b and 54b3016.

📒 Files selected for processing (4)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (3 hunks)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • manifests/rhoai/rbac_progression_patch.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Generate
  • GitHub Check: Test
🔇 Additional comments (10)
pkg/rhai/networkpolicy_test.go (4)

36-67: LGTM!

Clean table-driven test that verifies the network policy name derivation correctly returns the TrainJob name.


69-258: LGTM!

Comprehensive test covering all key scenarios: progression enabled/disabled, default/custom ports, and proper verification of metadata, labels, OwnerReferences, selectors, and ingress rules. The rule identification logic based on port presence is sound.


260-380: LGTM!

The reconciliation tests properly verify create and update paths using a fake client. The update test case correctly includes the progression tracking annotation (addressing the past review comment).


382-459: LGTM!

Excellent security property tests that validate critical isolation guarantees: ingress-only policy, controller-restricted metrics access, same-namespace enforcement for pod isolation, and owner reference cleanup semantics.

pkg/rhai/constants/constants.go (2)

65-87: LGTM!

Well-documented constants with clear explanations of the deployment context differences between RHOAI and upstream Kubeflow. Using standard Kubernetes label conventions (app.kubernetes.io/) for controller pod identification is the right approach.


24-25: LGTM!

Helpful documentation improvements clarifying the annotation values and port restrictions for non-privileged environments.

Also applies to: 34-36

pkg/rhai/networkpolicy.go (4)

41-49: LGTM!

Standard pattern for namespace discovery from the service account mount with appropriate fallback. This correctly addresses the earlier review suggestion to read from the SA namespace file.


51-53: LGTM!

Simple function that correctly implements the reviewed naming strategy - the NetworkPolicy shares the TrainJob's name for easy correlation.


55-142: LGTM!

Well-structured NetworkPolicy construction with:

  • Same-job pod isolation rule (implicit same-namespace via nil NamespaceSelector)
  • Conditional controller access rule with proper namespace + pod selectors
  • Correct OwnerReference setup for garbage collection
  • Port parsing with warning log and constant fallback

144-150: LGTM!

Standard pointer helper functions for Kubernetes resource construction.

@robert-bell
Collaborator

@abhijeet-dhumal let's hold this PR now until after the 3.2 release.

It's too close to code freeze and adds unnecessary risk. We can release it in 3.3.

@robert-bell robert-bell marked this pull request as draft December 17, 2025 09:02
@ChughShilpa ChughShilpa self-requested a review January 5, 2026 07:50
Collaborator

@ChughShilpa ChughShilpa left a comment

Tested with the latest image (http://quay.io/abdhumal/trainer:v2.1.0-jan2-netpol) and it works as expected.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review January 5, 2026 10:53
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal merged commit c0df748 into opendatahub-io:main Jan 6, 2026
8 of 9 checks passed