
fix: Add NetworkPolicy for metrics endpoint security #49

Merged
abhijeet-dhumal merged 1 commit into opendatahub-io:main from abhijeet-dhumal:add-network-policy
Jan 6, 2026

Conversation

abhijeet-dhumal (Member) commented Dec 10, 2025

RHOAIENG-39323

Implement controller-managed NetworkPolicy to provide pod isolation for TrainJob workloads, addressing security concerns.

The NetworkPolicy provides pod isolation for TrainJob pods:

Rule 1: Same-Job Pod Communication (Always)

  • Pods belonging to the same TrainJob can communicate on all ports
  • Required for distributed training (NCCL, MPI, gRPC, etc.)
  • Applied to ALL TrainJobs

Rule 2: Controller Access to Metrics Port (When Progression Enabled)

  • Only controller pods can access the metrics port (default 28080)
  • Conditional on trainer.opendatahub.io/progression-tracking: "true" annotation
  • External pods/services cannot scrape training metrics

What gets blocked:

  • Pods from other TrainJobs in the same namespace cannot access your training pods
  • Random pods in the namespace cannot probe the metrics endpoint
  • Cross-job traffic is denied

This addresses multi-tenancy security concerns where training metrics could be accessed by other pods.
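Putting the two rules together, the generated policy for a TrainJob named `my-job` would look roughly like the following sketch. The label keys, controller namespace, and selector shapes are illustrative assumptions based on this PR's description, not the exact generated manifest:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-job                # named after the TrainJob (per review feedback)
  namespace: user-namespace
spec:
  podSelector:
    matchLabels:
      jobset.sigs.k8s.io/jobset-name: my-job   # assumed label key on job pods
  policyTypes:
    - Ingress                 # ingress-only; egress is unrestricted
  ingress:
    # Rule 1: same-job pods may communicate on all ports (NCCL/MPI/gRPC)
    - from:
        - podSelector:
            matchLabels:
              jobset.sigs.k8s.io/jobset-name: my-job
    # Rule 2 (only when progression tracking is enabled):
    # controller pods may reach the metrics port
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: opendatahub   # controller namespace
          podSelector:
            matchLabels:
              app.kubernetes.io/component: controller
      ports:
        - protocol: TCP
          port: 28080         # default metrics port
```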

(Three verification screenshots attached, dated Dec 11, 2025.)

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • New Features

    • Automatic creation and reconciliation of NetworkPolicy to isolate TrainJob pods and allow controlled metrics access, with a namespace fallback.
  • Security / Permissions

    • RBAC extended to permit managing NetworkPolicy resources (get, list, watch, create, update, patch).
  • Tests

    • Added comprehensive unit tests for policy construction, reconciliation, updates, and security behaviors.
  • Documentation

    • Clarified metrics port guidance and valid non-root port ranges.


coderabbitai bot commented Dec 10, 2025

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

Adds NetworkPolicy reconciliation for TrainJob pods: the RBAC rule is extended for networking.k8s.io/networkpolicies, new controller-namespace and pod-label constants are introduced, the TrainJob controller invokes Rhai.ReconcileNetworkPolicy, and a new networkpolicy implementation plus unit tests are added.

Changes

  • RBAC Configuration (manifests/rhoai/rbac_progression_patch.yaml): Appends a second RBAC rule granting networking.k8s.io networkpolicies verbs get, list, watch, create, update, patch; retains the existing pods rule and RHAI comments.
  • Constants & Configuration (pkg/rhai/constants/constants.go): Updates annotation docs (progression tracking value and metrics port range) and adds NetworkPolicy constants: DefaultControllerNamespace, ControllerPodLabelName, ControllerPodLabelNameValue, ControllerPodLabelComponent, ControllerPodLabelComponentValue.
  • Controller Integration (pkg/controller/trainjob_controller.go): Imports rhai and calls Rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob) after applying runtime objects; propagates reconciliation errors.
  • NetworkPolicy Implementation (pkg/rhai/networkpolicy.go): New: determines the controller namespace (SA file with fallback), builds the NetworkPolicy (podSelector by jobset-name, labels, OwnerReference, PolicyTypes=Ingress), adds ingress rules (same-job pods; optional controller metrics access when progression is enabled), and implements Get/Create/Update reconciliation with wrapped errors.
  • NetworkPolicy Tests (pkg/rhai/networkpolicy_test.go): New unit tests covering policy naming, build behavior (default/custom metrics ports), metadata/labels/ownerRefs, ingress rule semantics, reconciliation create/update flows, progression enabled/disabled cases, and security properties.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Controller as TrainJob Controller
  participant Rhai as Rhai.Module
  participant SA as ServiceAccount FS
  participant K8s as Kubernetes API

  Controller->>Rhai: ReconcileNetworkPolicy(ctx, client, trainJob)
  Rhai->>SA: read namespace file (if mounted)
  SA-->>Rhai: namespace (or none)
  Rhai->>Rhai: resolve controller namespace (fallback)
  Rhai->>Rhai: build desired NetworkPolicy (labels, podSelector, ingress rules, ownerRef)
  Rhai->>K8s: Get NetworkPolicy by name/namespace
  alt not found
    Rhai->>K8s: Create NetworkPolicy
    K8s-->>Rhai: Created
  else found
    Rhai->>K8s: Update NetworkPolicy (if diff)
    K8s-->>Rhai: Updated
  end
  Rhai-->>Controller: return result or error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through YAML, Go, and tests with care,
I stitched pod gates and labeled namespace lair,
A metrics gap for the trainer to peep,
OwnerRefs tidy, secrets tucked deep,
I nibble bugs away — a soft, patchy hare 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title 'fix: Add NetworkPolicy for metrics endpoint security' directly and accurately describes the main change: implementing a NetworkPolicy for securing the metrics endpoint in TrainJob workloads.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54b3016 and bb1e230.

📒 Files selected for processing (5)
  • manifests/rhoai/rbac_progression_patch.yaml
  • pkg/controller/trainjob_controller.go
  • pkg/rhai/constants/constants.go
  • pkg/rhai/networkpolicy.go
  • pkg/rhai/networkpolicy_test.go


coveralls commented Dec 10, 2025

Pull Request Test Coverage Report for Build 20747341008

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 106 of 122 (86.89%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.3%) to 54.988%

Changes missing coverage:

  • pkg/controller/trainjob_controller.go: 0 of 3 changed/added lines covered (0.0%)
  • pkg/rhai/networkpolicy.go: 106 of 119 changed/added lines covered (89.08%)

Files with coverage reduction:

  • pkg/controller/trainjob_controller.go: 1 new missed line (0.0%)

Totals:

  • Change from base Build 20170001728: +1.3%
  • Covered Lines: 1670
  • Relevant Lines: 3037

💛 - Coveralls

abhijeet-dhumal marked this pull request as ready for review December 10, 2025 07:33
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
pkg/rhai/constants/constants.go (2)

33-38: Clarify whether port range is enforced or purely advisory

The comment now specifies 1024–65535 as the valid metrics-port range, but neither GetMetricsPort nor the NetworkPolicy builder currently enforces it (only non-numeric values fall back to the default). Consider either:

  • enforcing the documented range (e.g., clamp or fall back to default), or
  • rephrasing the comment to make it clear the range is a recommendation, not a hard validation.
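If the documented range were enforced, a minimal sketch could look like the following. The function name and the default-port constant are illustrative, not the project's actual API:

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultMetricsPort stands in for the shared constant (assumed value).
const defaultMetricsPort = 28080

// resolveMetricsPort sketches the suggested hardening: fall back to the
// default both for non-numeric values and for numeric values outside the
// documented non-root range 1024-65535.
func resolveMetricsPort(annotation string) int {
	port, err := strconv.Atoi(annotation)
	if err != nil || port < 1024 || port > 65535 {
		return defaultMetricsPort
	}
	return port
}

func main() {
	fmt.Println(resolveMetricsPort("9090"))  // 9090 (valid custom port)
	fmt.Println(resolveMetricsPort("abc"))   // 28080 (non-numeric fallback)
	fmt.Println(resolveMetricsPort("70000")) // 28080 (out-of-range fallback)
}
```

This makes the "0" and "70000" cases mentioned below testable with a single table entry each.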

66-72: New NetworkPolicy constants look good; ensure deployment docs match

NetworkPolicyNameSuffix and DefaultControllerNamespace are reasonable defaults and are cleanly consumed from networkpolicy.go. Just make sure controller deployment manifests and docs clearly describe:

  • the default controller namespace (opendatahub), and
  • the CONTROLLER_NAMESPACE override behavior.
pkg/rhai/progression/progression.go (1)

697-702: Non‑fatal NetworkPolicy reconciliation fits the progression flow

Calling ReconcileNetworkPolicy once progression tracking is enabled, and treating failures as non‑fatal with V(1) logging, matches the “best‑effort hardening” goal and avoids breaking metrics collection in clusters without NetworkPolicy support. If you ever want to reduce churn, you could optionally gate this on isRunning, but the current approach is safe.

pkg/rhai/progression/networkpolicy.go (1)

137-168: Ensure OwnerReferences are reconciled on update and be explicit about label overwrites

ReconcileNetworkPolicy behaves correctly for the happy path, but two edge‑case behaviors are worth tightening:

  1. OwnerReferences on update
    On create, buildNetworkPolicy sets an OwnerReference so the policy is garbage‑collected with the TrainJob. On update you only copy Spec and Labels:

    existingPolicy.Spec = desiredPolicy.Spec
    existingPolicy.Labels = desiredPolicy.Labels

    If an older deployment or a user created a policy without OwnerReferences, it will remain orphaned even after this controller starts managing it. Consider also reconciling OwnerReferences (e.g., replace or at least set when empty) to guarantee cleanup semantics for all managed policies.

  2. Label replacement semantics
    Replacing the entire Labels map is usually fine for a controller‑owned object but will drop any user‑added labels on existing policies. If you expect users or other tooling to attach labels (e.g., for monitoring), merging controller‑owned labels into the existing map instead of wholesale replacement would be friendlier.

Both tweaks are backward‑compatible and make the reconciliation behavior more predictable.
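The label-merge semantics suggested in point 2 can be sketched as below. The helper name is hypothetical; controllers can also delegate this to server-side apply instead of hand-merging:

```go
package main

import "fmt"

// mergeLabels keeps user-added labels on the existing object while letting
// controller-owned labels win on conflict, instead of wholesale replacement.
func mergeLabels(existing, desired map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range desired { // controller-owned labels overwrite on conflict
		merged[k] = v
	}
	return merged
}

func main() {
	existing := map[string]string{
		"team":                         "ml",      // user-added, should survive
		"app.kubernetes.io/managed-by": "old",
	}
	desired := map[string]string{
		"app.kubernetes.io/managed-by": "trainer", // controller-owned
	}
	fmt.Println(mergeLabels(existing, desired))
}
```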

pkg/rhai/progression/networkpolicy_test.go (2)

119-307: Broader port validation and negative tests would strengthen coverage

TestBuildNetworkPolicy thoroughly checks metadata, selectors, owner refs, and both default/custom ports, plus the non‑numeric fallback case. Given the documented valid range, you might add one more table entry for a numeric but invalid port (e.g., "0" or "70000") to assert whatever behavior you choose in buildNetworkPolicy (fallback or clamp). That will lock in the semantics once you tighten the range handling.


309-461: Consider asserting OwnerReferences on update scenarios as well

TestReconcileNetworkPolicy covers create/no‑op/update flows with a fake client, which is great. For the “updates existing NetworkPolicy” case, you currently only validate that the port changed. Once ReconcileNetworkPolicy starts reconciling OwnerReferences on update, it’d be useful to extend this test to assert that:

  • the updated policy has exactly one OwnerReference, and
  • it points at the expected TrainJob UID.

That will prevent regressions in cleanup behavior.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fea7ab and a493113.

📒 Files selected for processing (5)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
  • pkg/rhai/progression/progression.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
  • DefaultControllerNamespace (72-72)
  • NetworkPolicyNameSuffix (69-69)
pkg/rhai/progression/progression.go (2)
  • GetMetricsPort (207-215)
  • IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (6)
manifests/rhoai/rbac_progression_patch.yaml (1)

14-28: RBAC additions are minimal and aligned with controller behavior

Granting get, create, and update on networkpolicies matches the actual usage in ReconcileNetworkPolicy and keeps the permission surface tight. No additional verbs seem necessary given the current implementation.

pkg/rhai/progression/networkpolicy.go (2)

36-42: Controller namespace helper aligns with env‑override design

getControllerNamespace correctly prefers the CONTROLLER_NAMESPACE env var and falls back to DefaultControllerNamespace. This matches the tests and RBAC patch; just ensure the operator manifest sets this env var appropriately in non‑default deployments.


48-126: Tie default port to shared constant and consider enforcing valid range

buildNetworkPolicy looks solid overall (selectors, owner refs, ingress rules), but two small robustness points:

  1. Default port literal
    You currently fall back to 28080 via a literal:

    portNum, err := strconv.Atoi(metricsPort)
    if err != nil {
        portNum = 28080
    }

    To avoid divergence if the default ever changes, consider deriving this from a shared constant instead of hard-coding.

  2. Range validation
    Numeric but out-of-range values (e.g., "0", "65536", negative) will pass Atoi and produce a NetworkPolicy that the API server may reject. Consider validating portNum and falling back to the default if it's outside the allowed range, and/or logging a warning when clamping/falling back so misconfigurations are visible.

These changes would keep metrics port handling consistent and harden against bad annotations.

pkg/rhai/progression/networkpolicy_test.go (3)

37-84: Env override tests are sound and restore global state correctly

TestGetControllerNamespace nicely exercises default and env‑override behavior while restoring CONTROLLER_NAMESPACE per subtest. Good pattern for avoiding cross‑test contamination.


463-520: Controller namespace from env is well‑covered

TestReconcileNetworkPolicy_ControllerNamespaceFromEnv correctly verifies that the NamespaceSelector on the controller peer uses the custom namespace when CONTROLLER_NAMESPACE is set. The save/restore logic around the env var is also correct.


522-604: Security‑property tests match the intended policy design

TestBuildNetworkPolicy_SecurityProperties clearly encodes the key invariants: ingress‑only, controller‑only access to the metrics port, same‑namespace restrictions for same‑job pods, and owner‑based cleanup. This is a solid safety net around the NetworkPolicy structure.

robert-bell (Collaborator) left a comment

lgtm - thanks for this @abhijeet-dhumal!

Have you been able to test this in a real cluster?

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/rhai/progression/networkpolicy.go (2)

59-64: Use the existing constant for the default port fallback.

The hardcoded 28080 duplicates the value defined in constants.DefaultMetricsPort. Using the constant ensures consistency if the default ever changes.

 	metricsPort := GetMetricsPort(trainJob)
 	portNum, err := strconv.Atoi(metricsPort)
 	if err != nil {
-		portNum = 28080 // default
+		portNum, _ = strconv.Atoi(constants.DefaultMetricsPort) // fallback to default
 	}

Alternatively, consider extracting a helper like GetMetricsPortInt() to avoid duplicate parsing logic.


171-175: Consider restoring OwnerReferences during update.

Currently, only Spec and Labels are synchronized during updates. If OwnerReferences are manually removed from the existing policy, they won't be restored, which could break automatic garbage collection.

 	existingPolicy.Spec = desiredPolicy.Spec
 	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 	if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {

This is a minor concern since OwnerReferences are rarely modified manually, but including them ensures complete reconciliation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a493113 and f3b0bd7.

📒 Files selected for processing (4)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/rhai/constants/constants.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (2)
  • DefaultControllerNamespace (73-73)
  • NetworkPolicyNameSuffix (69-69)
pkg/rhai/progression/progression.go (2)
  • GetMetricsPort (207-215)
  • IsProgressionTrackingEnabled (88-94)
🔇 Additional comments (5)
manifests/rhoai/rbac_progression_patch.yaml (1)

14-30: LGTM! Well-documented RBAC permissions for NetworkPolicy management.

The permissions are appropriately scoped:

  • get/list/watch enables controller-runtime's caching mechanism
  • create/update allows reconciliation
  • delete is correctly omitted since OwnerReference-based garbage collection handles cleanup

The inline comments clearly explain the security rationale.

pkg/rhai/progression/networkpolicy.go (1)

39-51: Solid namespace detection with proper fallback chain.

The priority order (SA file → env var → default) correctly handles both in-cluster and local development scenarios.

pkg/rhai/progression/networkpolicy_test.go (3)

62-85: Good test coverage with proper environment cleanup.

The save/restore pattern for environment variables ensures test isolation. The tests adequately cover the fallback chain when the SA namespace file isn't available.


311-463: Comprehensive reconciliation tests covering all key paths.

The tests cover:

  • Policy creation when progression tracking is enabled
  • No-op when tracking is disabled or missing
  • Policy updates with changed configuration

Good use of the fake client with proper scheme registration.


524-606: Excellent security property tests.

These tests effectively document and verify the security invariants of the NetworkPolicy:

  • Ingress-only enforcement
  • Controller-restricted metrics access
  • Namespace isolation for same-job communication
  • Automatic cleanup via OwnerReference

This is a good practice for security-critical code.

// NetworkPolicy constants for metrics endpoint security

// NetworkPolicyNameSuffix is appended to TrainJob name to create NetworkPolicy name.
NetworkPolicyNameSuffix string = "-metrics-netpol"


The NWP is not specific to metrics, it also covers pod-to-pod traffic.
It could be named as the train job.

abhijeet-dhumal (Member, Author):

Ah, that's an awesome catch!
The NetworkPolicy isn't just for metrics; it has two rules:

  • Controller → metrics port
  • Same-job pods → all ports (for NCCL/MPI/gRPC)

I will update it to the TrainJob name, so that the TrainJob and NetworkPolicy have the same name, as you suggested.

abhijeet-dhumal (Member, Author):

Thanks a lot @astefanutti for reviewing the PR!

}

// Ensure NetworkPolicy exists to restrict metrics endpoint access to controller only
if err := ReconcileNetworkPolicy(ctx, c, trainJob); err != nil {


The NWP scope is larger than metrics / progression.
Would that make sense to reconcile it in the TrainJob controller reconcileObjects method directly or in runtime.NewObjects?

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
pkg/rhai/progression/networkpolicy_test.go (2)

220-223: Fix inconsistent error message.

The error message says "want manager" but the code checks for "controller". This was flagged in a previous review but appears unfixed.

Apply this diff:

 		if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-			t.Errorf("Rule 1: Controller component label = %q, want manager",
+			t.Errorf("Rule 1: Controller component label = %q, want controller",
 				controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
 		}

424-426: Fix inconsistent error message.

The error message says "Missing manager component label requirement" but the code checks for "controller". This was flagged in a previous review but appears unfixed.

Apply this diff:

 		if peer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-			t.Error("Missing manager component label requirement")
+			t.Error("Missing controller component label requirement")
 		}
🧹 Nitpick comments (2)
pkg/rhai/constants/constants.go (1)

68-69: Consider a more generic default namespace for upstream.

The DefaultControllerNamespace = "opendatahub" is specific to Red Hat's RHOAI distribution. For the upstream Kubeflow project, a more generic default like "kubeflow" or making it configurable might be more appropriate. This fallback is rarely used (only when the service account namespace file is unavailable), but it could cause confusion for non-RHOAI deployments.

pkg/rhai/progression/networkpolicy.go (1)

134-140: Consider using standard pointer utilities.

The boolPtr and protocolPtr helpers duplicate functionality already available in k8s.io/utils/ptr.To[T](). Using the standard library improves consistency across the codebase.

For example:

import "k8s.io/utils/ptr"

// Replace boolPtr(true) with ptr.To(true)
// Replace protocolPtr(corev1.ProtocolTCP) with ptr.To(corev1.ProtocolTCP)

Note: I see that k8s.io/utils/ptr is already imported in trainjob_controller.go (line 33), suggesting it's an accepted pattern in this codebase.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a33fea0 and 9765b9a.

📒 Files selected for processing (4)
  • pkg/controller/trainjob_controller.go (1 hunks)
  • pkg/rhai/constants/constants.go (2 hunks)
  • pkg/rhai/progression/networkpolicy.go (1 hunks)
  • pkg/rhai/progression/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
pkg/rhai/progression/networkpolicy_test.go (3)
pkg/rhai/constants/constants.go (2)
  • AnnotationMetricsPort (38-38)
  • AnnotationProgressionTracking (26-26)
pkg/util/testing/client.go (1)
  • NewClientBuilder (35-47)
pkg/rhai/progression/networkpolicy.go (1)
  • ReconcileNetworkPolicy (144-170)
pkg/controller/trainjob_controller.go (1)
pkg/rhai/progression/networkpolicy.go (1)
  • ReconcileNetworkPolicy (144-170)
pkg/rhai/progression/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (1)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (9)
pkg/rhai/constants/constants.go (1)

34-36: LGTM: Clear documentation of port constraints.

The updated comment properly explains the valid port range and the security rationale for avoiding privileged ports.

pkg/rhai/progression/networkpolicy.go (4)

37-47: LGTM: Robust namespace detection with fallback.

The implementation correctly reads the service account namespace file with appropriate error handling and fallback to the default constant.


49-51: LGTM: Simplified naming aligns with PR feedback.

Based on the past review comments, this was appropriately changed from metrics-specific naming to match the TrainJob name directly, since the NetworkPolicy covers both metrics and pod-to-pod traffic.


64-131: LGTM: NetworkPolicy design follows security best practices.

The policy correctly:

  • Uses OwnerReferences for automatic cleanup
  • Restricts metrics port access to controller pods in the controller namespace
  • Allows same-job pod communication on all ports (for NCCL/MPI/gRPC)
  • Prevents cross-namespace and cross-job access via appropriate selectors
  • Uses Ingress-only policy type

The implementation aligns well with the multi-tenancy security goals stated in the PR.


143-170: LGTM: Standard Kubernetes reconciliation pattern.

The create-or-update logic follows best practices with:

  • Proper NotFound handling
  • Descriptive error wrapping
  • Updating both Spec and Labels on the update path
pkg/rhai/progression/networkpolicy_test.go (4)

36-67: LGTM: Clear test coverage for name derivation.

The test appropriately verifies that NetworkPolicy names match the TrainJob name.


69-257: LGTM: Comprehensive policy construction tests.

Excellent coverage of:

  • Default, custom, and invalid port handling
  • Metadata, labels, and OwnerReferences
  • PodSelector and ingress rules
  • Both controller and same-job peer configurations

259-380: LGTM: Solid reconciliation flow tests.

The tests properly verify both create and update paths using a fake client, including ownership and spec updates.


382-464: LGTM: Excellent security property validation.

The security-focused test suite effectively validates:

  • Ingress-only policy type
  • Controller-restricted metrics port access
  • Namespace isolation for same-job pods
  • OwnerReference cleanup semantics

This demonstrates a security-conscious approach to the NetworkPolicy design.

Comment on lines 161 to 164
// Reconcile NetworkPolicy for pod isolation
if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
return err
}
coderabbitai bot commented Dec 11, 2025

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for the networkpolicy implementation and annotation checks
rg -n -A10 -B2 'func.*ReconcileNetworkPolicy' pkg/rhai/progression/

Repository: opendatahub-io/trainer

Length of output: 2193


🏁 Script executed:

#!/bin/bash
# Also search for annotation checks in the progression package
rg -n 'AnnotationProgressionTracking' pkg/rhai/progression/

Repository: opendatahub-io/trainer

Length of output: 1095


🏁 Script executed:

#!/bin/bash
# Check if there's an annotation constant defined
rg -n 'AnnotationProgressionTracking' --type-list | head -20
rg -n 'AnnotationProgressionTracking' pkg/

Repository: opendatahub-io/trainer

Length of output: 1746


🏁 Script executed:

#!/bin/bash
# Check progression.go around line 92 to see if annotation check wraps ReconcileNetworkPolicy
sed -n '85,110p' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer

Length of output: 1048


🏁 Script executed:

#!/bin/bash
# Check the buildNetworkPolicy function
rg -n -A15 'func buildNetworkPolicy' pkg/rhai/progression/networkpolicy.go

Repository: opendatahub-io/trainer

Length of output: 649


🏁 Script executed:

#!/bin/bash
# Check the full context of how ReconcileNetworkPolicy is called in progression.go
rg -n -B5 -A5 'ReconcileNetworkPolicy' pkg/rhai/progression/progression.go

Repository: opendatahub-io/trainer

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Check the trainjob_controller.go to understand the exact call context
sed -n '155,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 785


🏁 Script executed:

#!/bin/bash
# Search for all calls to ReconcileNetworkPolicy to see if they're conditional elsewhere
rg -n -B3 -A1 'ReconcileNetworkPolicy' pkg/

Repository: opendatahub-io/trainer

Length of output: 1970


🏁 Script executed:

#!/bin/bash
# Check if there are any conditions checking progression tracking before the NetworkPolicy call
sed -n '155,170p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 600


🏁 Script executed:

#!/bin/bash
# Check what happens in the broader reconciliation context
sed -n '130,175p' pkg/controller/trainjob_controller.go

Repository: opendatahub-io/trainer

Length of output: 1688


Add conditional check for progression-tracking annotation before NetworkPolicy reconciliation

The ReconcileNetworkPolicy call is unconditional, but the PR description states the NetworkPolicy is added "when progression-tracking is enabled." Since IsProgressionTrackingEnabled() already exists, wrap the call:

if progression.IsProgressionTrackingEnabled(trainJob) {
    if err := progression.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
        return err
    }
}

Otherwise, NetworkPolicy is created for all TrainJobs regardless of the progression-tracking setting.

🤖 Prompt for AI Agents
In pkg/controller/trainjob_controller.go around lines 161 to 164, the call to
progression.ReconcileNetworkPolicy runs unconditionally causing NetworkPolicies
to be created for all TrainJobs; wrap that call with a conditional check using
progression.IsProgressionTrackingEnabled(trainJob) and only call
ReconcileNetworkPolicy if it returns true, preserving the existing error
handling to return the error when reconciliation fails.

nit: I think this is a good suggestion. Can we ensure that, for any existing TrainJob with progression tracking disabled, the pod behaviour does not change?
cc @abhijeet-dhumal


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

abhijeet-dhumal (Member, Author):

@kapil27 Actually, this was an intentional design decision. The NetworkPolicy serves two purposes, as described in the PR description:

  • Pod isolation → only same-job pods can communicate; cross-job traffic is blocked (the primary purpose, not dependent on progression)
  • Progression metrics port restriction → controller-only access

These benefit all TrainJobs, regardless of progression tracking. Without the policy, any pod in the namespace could access training pods.

abhijeet-dhumal (Member, Author):

Ah, come to think of it, you are right 🤔
We should make the progression-based rule conditional on IsProgressionTrackingEnabled(), since the metrics server only runs when progression tracking is enabled 👀
Thanks @kapil27, on it!

robert-bell (Collaborator) commented Dec 11, 2025

nit: does it make sense to move this netpol code into the rhai package, rather than progression package?

I'm happy for this to be merged as is though :)

Member Author

Yeah, definitely. Since this netpol is not limited to the progression scope, it would be good to refactor it accordingly.
Thanks Rob, on it!!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
pkg/rhai/networkpolicy.go (1)

54-61: Use the shared default metrics port constant instead of a hardcoded literal

The fallback to 28080 on parse error duplicates the default already encoded in constants.DefaultMetricsPort. Using the constant keeps behavior centralized and avoids drift if the default ever changes.

You can reuse the constant here:

 func buildNetworkPolicy(trainJob *trainer.TrainJob) *networkingv1.NetworkPolicy {
 	metricsPort := progression.GetMetricsPort(trainJob)
 	portNum, err := strconv.Atoi(metricsPort)
 	if err != nil {
-		portNum = 28080 // default
+		// Parse default port constant as fallback to keep it in sync with progression defaults
+		portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 	}
 	port := intstr.FromInt(portNum)
pkg/controller/trainjob_controller.go (1)

162-165: Gate NetworkPolicy reconciliation on progression-tracking being enabled

Right now rhai.ReconcileNetworkPolicy runs for every TrainJob, which contradicts the stated behavior “when progression-tracking is enabled” and may surprise users by tightening network policy even when they don’t use progression tracking.

Wrap the call with the existing progression flag check so NetworkPolicies are only managed when progression tracking is on:

 	for _, object := range objects {
 		if err := r.client.Apply(ctx, object, client.FieldOwner("trainer"), client.ForceOwnership); err != nil {
 			return err
 		}
 	}
-	// Reconcile NetworkPolicy for pod isolation
-	if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
-		return err
-	}
+	// Reconcile NetworkPolicy for pod isolation when progression tracking is enabled
+	if progression.IsProgressionTrackingEnabled(trainJob) {
+		if err := rhai.ReconcileNetworkPolicy(ctx, r.client, trainJob); err != nil {
+			return err
+		}
+	}
 	return nil
pkg/rhai/networkpolicy_test.go (1)

220-223: Fix mismatch in expected component label in test failure message

The assertion checks for "controller" but the error message still says "want manager", which is confusing when the test fails.

Update the message to match the actual expectation:

-			if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
-				t.Errorf("Rule 1: Controller component label = %q, want manager",
-					controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
-			}
+			if controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"] != "controller" {
+				t.Errorf("Rule 1: Controller component label = %q, want controller",
+					controllerPeer.PodSelector.MatchLabels["app.kubernetes.io/component"])
+			}
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

164-170: Consider also reconciling OwnerReferences when updating an existing NetworkPolicy

On update, you currently copy Spec and Labels but not OwnerReferences. If a matching NetworkPolicy pre-exists (same name/namespace) without the expected owner reference, it will be functionally “adopted” by this controller (spec/labels overwritten) but still won’t be garbage-collected with the TrainJob.

If you want full convergence toward the desired resource shape, including cleanup semantics, consider also syncing owner references:

-	existingPolicy.Spec = desiredPolicy.Spec
-	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.Spec = desiredPolicy.Spec
+	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences

This does change behavior for any manually created NP with the same name, so if that scenario matters you may want to guard reconciliation with a label/annotation check instead.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9765b9a and 070e02f.

📒 Files selected for processing (3)
  • pkg/controller/trainjob_controller.go (2 hunks)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/controller/trainjob_controller.go (1)
pkg/rhai/networkpolicy.go (1)
  • ReconcileNetworkPolicy (145-171)
pkg/rhai/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (1)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Generate
  • GitHub Check: Test
  • GitHub Check: pre-commit

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/rhai/networkpolicy.go (1)

77-79: Use the constant for the default metrics port.

This hardcodes 28080 as the fallback, but constants.DefaultMetricsPort already defines this value. Using the constant improves maintainability and consistency with other code paths.

 		if err != nil {
-			portNum = 28080 // default
+			portNum, _ = strconv.Atoi(constants.DefaultMetricsPort)
 		}
🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

38-48: Consider caching the controller namespace.

getControllerNamespace() reads from the filesystem on every call during reconciliation. Since the namespace is static for the controller's lifetime, caching would avoid repeated I/O.

+var cachedControllerNamespace string
+
 // getControllerNamespace returns the controller's namespace from SA mount.
 func getControllerNamespace() string {
+	if cachedControllerNamespace != "" {
+		return cachedControllerNamespace
+	}
 	if data, err := os.ReadFile(serviceAccountNamespaceFile); err == nil {
 		if ns := strings.TrimSpace(string(data)); ns != "" {
+			cachedControllerNamespace = ns
 			return ns
 		}
 	}
-	return constants.DefaultControllerNamespace
+	cachedControllerNamespace = constants.DefaultControllerNamespace
+	return cachedControllerNamespace
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 070e02f and 9e316d4.

📒 Files selected for processing (2)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/rhai/networkpolicy.go (2)
pkg/rhai/constants/constants.go (1)
  • DefaultControllerNamespace (69-69)
pkg/rhai/progression/progression.go (2)
  • IsProgressionTrackingEnabled (88-94)
  • GetMetricsPort (207-215)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (4)
pkg/rhai/networkpolicy.go (1)

149-177: LGTM on the reconciliation logic.

The create-or-update pattern is correctly implemented with proper error handling and context wrapping. The OwnerReference ensures automatic cleanup when the TrainJob is deleted.

pkg/rhai/networkpolicy_test.go (3)

36-67: LGTM!

Table-driven tests correctly verify the network policy name derivation.


69-258: LGTM!

Comprehensive test coverage for buildNetworkPolicy with well-structured assertions covering metadata, labels, OwnerReference, selectors, policy types, and ingress rule structure for both progression-enabled and disabled scenarios.


372-449: LGTM!

Security-focused tests correctly verify important properties: ingress-only policy type, controller-exclusive metrics access, namespace isolation for same-job pods, and OwnerReference for automatic cleanup.

@kapil27

kapil27 commented Dec 12, 2025

/lgtm

Collaborator

@sutaakar sutaakar left a comment

/lgtm

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/rhai/networkpolicy.go (1)

152-180: Consider preserving OwnerReferences on update.

The update path (lines 173-177) copies Spec and Labels from the desired policy but not OwnerReferences. If the OwnerReference is accidentally removed (e.g., by manual edit), it won't be restored, potentially leaving orphaned NetworkPolicies after TrainJob deletion.

 	existingPolicy.Spec = desiredPolicy.Spec
 	existingPolicy.Labels = desiredPolicy.Labels
+	existingPolicy.OwnerReferences = desiredPolicy.OwnerReferences
 	if updateErr := c.Update(ctx, existingPolicy); updateErr != nil {
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e0cc4b and 54b3016.

📒 Files selected for processing (4)
  • manifests/rhoai/rbac_progression_patch.yaml (1 hunks)
  • pkg/rhai/constants/constants.go (3 hunks)
  • pkg/rhai/networkpolicy.go (1 hunks)
  • pkg/rhai/networkpolicy_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • manifests/rhoai/rbac_progression_patch.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Generate
  • GitHub Check: Test
🔇 Additional comments (10)
pkg/rhai/networkpolicy_test.go (4)

36-67: LGTM!

Clean table-driven test that verifies the network policy name derivation correctly returns the TrainJob name.


69-258: LGTM!

Comprehensive test covering all key scenarios: progression enabled/disabled, default/custom ports, and proper verification of metadata, labels, OwnerReferences, selectors, and ingress rules. The rule identification logic based on port presence is sound.


260-380: LGTM!

The reconciliation tests properly verify create and update paths using a fake client. The update test case correctly includes the progression tracking annotation (addressing the past review comment).


382-459: LGTM!

Excellent security property tests that validate critical isolation guarantees: ingress-only policy, controller-restricted metrics access, same-namespace enforcement for pod isolation, and owner reference cleanup semantics.

pkg/rhai/constants/constants.go (2)

65-87: LGTM!

Well-documented constants with clear explanations of the deployment context differences between RHOAI and upstream Kubeflow. Using standard Kubernetes label conventions (app.kubernetes.io/) for controller pod identification is the right approach.


24-25: LGTM!

Helpful documentation improvements clarifying the annotation values and port restrictions for non-privileged environments.

Also applies to: 34-36

pkg/rhai/networkpolicy.go (4)

41-49: LGTM!

Standard pattern for namespace discovery from the service account mount with appropriate fallback. This correctly addresses the earlier review suggestion to read from the SA namespace file.


51-53: LGTM!

Simple function that correctly implements the reviewed naming strategy - the NetworkPolicy shares the TrainJob's name for easy correlation.


55-142: LGTM!

Well-structured NetworkPolicy construction with:

  • Same-job pod isolation rule (implicit same-namespace via nil NamespaceSelector)
  • Conditional controller access rule with proper namespace + pod selectors
  • Correct OwnerReference setup for garbage collection
  • Port parsing with warning log and constant fallback

144-150: LGTM!

Standard pointer helper functions for Kubernetes resource construction.

@robert-bell
Collaborator

@abhijeet-dhumal let's hold this PR now until after the 3.2 release.

It's too close to code freeze and adds unnecessary risk. We can release it in 3.3.

@robert-bell robert-bell marked this pull request as draft December 17, 2025 09:02
@ChughShilpa ChughShilpa self-requested a review January 5, 2026 07:50
Collaborator

@ChughShilpa ChughShilpa left a comment

Tested with the latest image (http://quay.io/abdhumal/trainer:v2.1.0-jan2-netpol) and it works as expected.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review January 5, 2026 10:53
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal merged commit c0df748 into opendatahub-io:main Jan 6, 2026
8 of 9 checks passed