[Bug] Katib Controller fails to create Suggestion pods on clusters with "Restricted" Pod Security Standards enabled #2613

@xeonliu

Description

What happened?

When following the "Getting Started with Katib Python SDK" guide on a modern Kubernetes cluster (e.g., Kind with K8s v1.25+), the experiment gets stuck in the Created state with 0 Trials.

The Katib Controller fails to create the Suggestion Pods because the default Suggestion containers do not comply with the Restricted Pod Security Standards (PSS), which are often enabled by default in newer environments.

The katib-controller logs show the following Admission Controller denial:

{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"suggestion-controller","msg":"Creating Deployment","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment-random"}
{"level":"info","ts":"2026-02-02T08:57:14Z","msg":"would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"suggestion\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"suggestion\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"suggestion\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"suggestion\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}
{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T08:57:14Z","logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"tune-experiment\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2026-02-02T09:00:02Z","msg":"metadata.finalizers: \"update-prometheus-metrics\": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers"}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"tune-experiment","namespace":"kubeflow"},"namespace":"kubeflow","name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-suggestion-client","msg":"Suggestion created","experiment":{"name":"tune-experiment","namespace":"kubeflow"},"namespace":"kubeflow","name":"tune-experiment"}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"suggestion-controller","msg":"Creating Service","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment-random"}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"suggestion-controller","msg":"Creating Deployment","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment-random"}
{"level":"info","ts":"2026-02-02T09:00:02Z","msg":"would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"suggestion\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"suggestion\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"suggestion\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"suggestion\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:00:02Z","logger":"suggestion-controller","msg":"Update suggestion instance status failed, reconciler requeued","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"err":"Operation cannot be fulfilled on suggestions.kubeflow.org \"tune-experiment\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2026-02-02T09:09:05Z","msg":"metadata.finalizers: \"update-prometheus-metrics\": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers"}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-suggestion-client","msg":"Creating Suggestion","experiment":{"name":"tune-experiment","namespace":"kubeflow"},"namespace":"kubeflow","name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-suggestion-client","msg":"Suggestion created","experiment":{"name":"tune-experiment","namespace":"kubeflow"},"namespace":"kubeflow","name":"tune-experiment"}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"suggestion-controller","msg":"Creating Service","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment-random"}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"suggestion-controller","msg":"Creating Deployment","Suggestion":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment-random"}
{"level":"info","ts":"2026-02-02T09:09:05Z","msg":"would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"suggestion\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"suggestion\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"suggestion\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"suggestion\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Statistics","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"addCount":3}
{"level":"info","ts":"2026-02-02T09:09:05Z","logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":{"name":"tune-experiment","namespace":"kubeflow"},"name":"tune-experiment","Suggestion Requests":3}
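For reference, the PodSecurity message in the logs names four settings that the `suggestion` container does not set. The sketch below writes them out as a container-level `securityContext` and adds a small helper that reports which of the four are missing from a given context. The names `RESTRICTED_SECURITY_CONTEXT` and `missing_restricted_fields` are hypothetical and are not part of Katib or the Kubernetes client. It only covers the four checks quoted in the log message, not the full Restricted profile, and it ignores values that could also be set at the pod level.

```python
# Illustrative only: mirrors the four requirements quoted in the
# "would violate PodSecurity" message. Not Katib's actual fix.

# A container securityContext that sets all four fields named in the log.
RESTRICTED_SECURITY_CONTEXT = {
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": True,
    "seccompProfile": {"type": "RuntimeDefault"},
}


def missing_restricted_fields(security_context):
    """Return the settings from the log message that the given
    container securityContext (a dict, or None) does not satisfy."""
    ctx = security_context or {}
    missing = []

    # Must be explicitly false; an unset value is treated as a violation.
    if ctx.get("allowPrivilegeEscalation") is not False:
        missing.append("allowPrivilegeEscalation=false")

    # The drop list must include "ALL".
    drops = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in drops:
        missing.append('capabilities.drop=["ALL"]')

    # Must be explicitly true.
    if ctx.get("runAsNonRoot") is not True:
        missing.append("runAsNonRoot=true")

    # Only these two seccomp profile types are accepted per the message.
    seccomp_type = (ctx.get("seccompProfile") or {}).get("type")
    if seccomp_type not in ("RuntimeDefault", "Localhost"):
        missing.append("seccompProfile.type")

    return missing


if __name__ == "__main__":
    # An empty securityContext fails all four checks.
    print(missing_restricted_fields({}))
    # The reference context above passes them.
    print(missing_restricted_fields(RESTRICTED_SECURITY_CONTEXT))
```

Note that `runAsNonRoot: true` also requires the image to actually run as a non-root user; whether the default Suggestion images do is not established in this report.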

The documentation should be updated to warn users about Pod Security Standards. Users need to label their namespace as privileged or baseline if the default Katib images are not PSS-compliant.
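A minimal sketch of the namespace-labelling workaround, assuming the `kubeflow` namespace from the logs and the standard `pod-security.kubernetes.io/enforce` label. The helper names `pss_label_patch` and `relax_namespace` are hypothetical. It uses the `kubernetes` Python client (already a dependency of `kubeflow-katib`) and needs permission to patch namespaces. Lowering the enforcement level weakens the namespace's security posture, so this is a stopgap rather than a fix.

```python
# Illustrative workaround: set the Pod Security Admission "enforce" level
# on a namespace. Helper names are hypothetical, not part of the Katib SDK.

ALLOWED_LEVELS = ("privileged", "baseline", "restricted")


def pss_label_patch(level="baseline"):
    """Build a patch body that sets the PSS enforce label to `level`."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {ALLOWED_LEVELS}, got {level!r}")
    return {
        "metadata": {
            "labels": {"pod-security.kubernetes.io/enforce": level},
        }
    }


def relax_namespace(namespace="kubeflow", level="baseline"):
    """Apply the label patch to `namespace` using the current kubeconfig."""
    # Imported lazily so the patch builder above works without a cluster.
    from kubernetes import client, config

    config.load_kube_config()
    return client.CoreV1Api().patch_namespace(
        name=namespace, body=pss_label_patch(level)
    )


if __name__ == "__main__":
    # Print the patch body without touching any cluster.
    print(pss_label_patch("baseline"))
```

Calling `relax_namespace()` against the affected cluster should have the same effect as `kubectl label namespace kubeflow pod-security.kubernetes.io/enforce=baseline --overwrite`. Whether `baseline` is sufficient for the Suggestion pods has not been verified here.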

What did you expect to happen?

The Experiment should proceed to the Running state, and Trial pods should be created.
The documentation should mention this problem.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.35.0
Kustomize Version: v5.7.1
Server Version: v1.34.0

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/katib/katib-controller:v0.19.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.19.0
Location: /home/liu/Documents/katib-python/.venv/lib/python3.12/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍
