Problem
Cloud Native Stack playbooks currently have a few behaviors that cause problems
on GB200 systems, retry/reapply workflows, and newer Ubuntu runtimes:
- Calico autodetection does not include enP* interfaces.
- GPU Operator install tasks use helm install --generate-name, which is not retry-safe.
- Docker CE installs are selected as latest, which can drift between runs and does not encode the apt architecture.
- Ubuntu apt repository key tasks still use Ansible apt_key, which depends on the deprecated apt-key executable.
Observed Failure
During Ubuntu 26.04 runtime validation, CNS install failed at:
TASK [Add an Kubernetes apt signing key for Ubuntu]
With:
Failed to find required executable "apt-key"
Proposed Fix
Apply the prepared field-fix patch series:
- Add enP* to Calico autodetection.
- Convert GPU Operator Helm installs to helm upgrade --install gpu-operator.
- Add Docker CE version and architecture variables and use pinned apt package installs.
- Replace active Ubuntu apt_key usage with keyring downloads/dearmor and signed-by apt repository handling.
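As an illustration of the last item, the keyring approach could look roughly like the tasks below. This is a minimal sketch, not the patch itself: the task names, the temporary file path, and the k8s_minor variable are hypothetical, and the pkgs.k8s.io URL layout is assumed to be the repository in use. Only the keyring path is taken from the review checklist.

```yaml
# Sketch only: task names, /tmp path, and k8s_minor are assumptions.
# The keyring path matches the one listed in the review checklist.
- name: Ensure the apt keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Download the armored Kubernetes repository key
  ansible.builtin.get_url:
    url: "https://pkgs.k8s.io/core:/stable:/v{{ k8s_minor }}/deb/Release.key"
    dest: /tmp/kubernetes-release.key
    mode: "0644"

# "creates" makes the task a no-op on reruns once the keyring is present.
- name: Dearmor the key into a binary keyring
  ansible.builtin.command:
    cmd: gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg /tmp/kubernetes-release.key
    creates: /etc/apt/keyrings/kubernetes-apt-keyring.gpg

# signed-by scopes the key to this repository instead of trusting it globally.
- name: Add the Kubernetes repository with an explicit signed-by keyring
  ansible.builtin.apt_repository:
    repo: >-
      deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg]
      https://pkgs.k8s.io/core:/stable:/v{{ k8s_minor }}/deb/ /
    filename: kubernetes
    state: present
```

None of these tasks call the apt-key executable, so they should not hit the failure shown under Observed Failure.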
Patch commit summary:
- Add enP* as a device prefix for Calico.
- Make GPU Operator install retry-safe.
- Add Docker CE versioning and architecture support.
- Replace apt-key usage for Ubuntu runtimes.
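The GPU Operator and Docker CE changes might take a shape like the following. This is a hedged sketch: the chart reference, namespace, and the gpu_operator_version, docker_apt_arch, and docker_ce_version variables are hypothetical names chosen for illustration, and the actual patch may structure these differently. The release name gpu-operator and the /etc/apt/keyrings/docker.asc path come from this description.

```yaml
# Sketch only: chart reference, namespace, and all variable names are assumptions.
# A fixed release name lets a rerun upgrade the existing release instead of
# creating a second one, which is what --generate-name would do.
- name: Install or upgrade GPU Operator under a fixed release name
  ansible.builtin.command:
    argv:
      - helm
      - upgrade
      - --install
      - gpu-operator
      - nvidia/gpu-operator
      - --namespace
      - nvidia-gpu-operator
      - --create-namespace
      - --version
      - "{{ gpu_operator_version }}"
      - --wait
  changed_when: true  # helm output is not parsed here, so always report changed

# docker_apt_arch is a hypothetical variable, e.g. amd64 or arm64.
- name: Add the Docker repository for the host architecture
  ansible.builtin.apt_repository:
    repo: >-
      deb [arch={{ docker_apt_arch }} signed-by=/etc/apt/keyrings/docker.asc]
      https://download.docker.com/linux/ubuntu
      {{ ansible_facts['distribution_release'] }} stable
    filename: docker
    state: present

# docker_ce_version is a hypothetical variable holding the full apt version
# string, as listed by "apt-cache madison docker-ce".
- name: Install a pinned Docker CE version
  ansible.builtin.apt:
    name:
      - "docker-ce={{ docker_ce_version }}"
      - "docker-ce-cli={{ docker_ce_version }}"
    state: present
    update_cache: true
```

Pinning the version means two runs on different days resolve to the same package, at the cost of having to bump the default per CNS release.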
Validation
Local validation passed against the prepared patch series:
- git diff --check
- YAML parse checks for touched YAML files
- Ansible syntax checks for:
  - playbooks/prerequisites.yaml
  - playbooks/nvidia-driver.yaml
  - playbooks/nvidia-docker.yaml
  - playbooks/operators-install.yaml
  - playbooks/k8s-install.yaml
  - playbooks/cns.yaml
Maintainer Review Checklist
- Confirm whether gpu-operator is acceptable as the stable Helm release name.
- Confirm Docker CE version defaults for CNS 16.0, 16.1, and 17.0.
- Confirm the enP* Calico autodetection prefix matches expected GB200 NIC naming.
- Confirm keyring paths:
  - /etc/apt/keyrings/kubernetes-apt-keyring.gpg
  - /etc/apt/keyrings/cri-o-apt-keyring.gpg
  - /etc/apt/keyrings/docker.asc
  - /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
- Decide whether to split this into separate PRs by topic or accept as one field-fix bundle.
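For the enP* checklist item, one way to see which interfaces the prefix would cover on a GB200 node is a small fact-gathering play such as the one below. It is a review helper under stated assumptions, not part of the patch series: the play targets whatever hosts the reviewer's inventory defines, and it only checks the literal prefix enP, not how Calico itself interprets the autodetection value.

```yaml
# Sketch only: a review helper, not part of the patch series.
- name: Show interfaces that start with the enP prefix
  hosts: all
  gather_facts: true
  tasks:
    # The "match" test anchors at the start of the string and is case-sensitive,
    # so enp* interfaces are not listed here.
    - name: List interface names starting with enP
      ansible.builtin.debug:
        msg: "{{ ansible_facts['interfaces'] | select('match', '^enP') | list }}"
```

An empty list on a GB200 host would suggest the NIC naming differs from what the prefix expects.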