GB200 and newer Ubuntu runtime fixes needed in Cloud Native Stack playbooks #140

@ericrife

Problem

The Cloud Native Stack (CNS) playbooks currently have four behaviors that cause
problems on GB200 systems, in retry/reapply workflows, and on newer Ubuntu
runtimes:

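  • Calico interface autodetection does not include the enP* prefix, so enP*
    interfaces are not matched.
  • GPU Operator install tasks use helm install --generate-name, which is not
    retry-safe: every rerun creates a new release under a new generated name
    instead of reusing the existing one.
  • Docker CE packages are installed at the latest available version, which can
    drift between runs, and the install does not encode the apt architecture.
  • Ubuntu apt repository key tasks still use the Ansible apt_key module, which
    depends on the deprecated apt-key executable.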

Observed Failure

During Ubuntu 26.04 runtime validation, the CNS install failed at:

TASK [Add an Kubernetes apt signing key for Ubuntu]

With:

Failed to find required executable "apt-key"

Proposed Fix

Apply the prepared field-fix patch series:

  1. Add enP* to Calico autodetection.
  2. Convert GPU Operator Helm installs to helm upgrade --install gpu-operator.
  3. Add Docker CE version and architecture variables and use pinned apt package
    installs.
  4. Replace active Ubuntu apt_key tasks with keyring files (downloaded and,
    where needed, dearmored) and apt repository entries that reference them
    through signed-by.
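
As a rough illustration of item 4, the Kubernetes key task could take the shape below. This is a sketch, not the contents of the patch series: k8s_apt_key_url and k8s_apt_repo_url are hypothetical variables standing in for whatever key and repository URLs the playbooks already use, and the temporary path under /tmp is an assumption. Only the keyring path is taken from the review checklist.

```yaml
# Sketch only: k8s_apt_key_url and k8s_apt_repo_url are hypothetical variables.
- name: Ensure the apt keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Download the armored Kubernetes signing key
  ansible.builtin.get_url:
    url: "{{ k8s_apt_key_url }}"
    dest: /tmp/kubernetes-apt-key.asc   # assumed scratch location
    mode: "0644"

- name: Dearmor the key into the keyring (skipped once the keyring exists)
  ansible.builtin.command:
    cmd: >-
      gpg --batch --yes
      --output /etc/apt/keyrings/kubernetes-apt-keyring.gpg
      --dearmor /tmp/kubernetes-apt-key.asc
    creates: /etc/apt/keyrings/kubernetes-apt-keyring.gpg

- name: Add the Kubernetes repository, bound to that keyring via signed-by
  ansible.builtin.apt_repository:
    repo: >-
      deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg]
      {{ k8s_apt_repo_url }} /
    filename: kubernetes
    state: present
    update_cache: true
```

None of these tasks call the apt-key executable, and the creates guard keeps the dearmor step from rerunning on a reapply.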

Patch commit summary:

  • Add enP* as a device prefix for Calico.
  • Make GPU Operator install retry-safe.
  • Add Docker CE versioning and architecture support.
  • Replace apt-key usage for Ubuntu runtimes.
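
The first two commits could be sketched as the tasks below. The calico_interface_prefixes list, the kube-system namespace, the nvidia/gpu-operator chart reference, and the gpu_operator_* variables are assumptions for illustration; the playbooks may set autodetection in a manifest rather than through kubectl.

```yaml
# Sketch only: variable names, namespace, and chart reference are assumptions.
- name: "Set Calico IP autodetection prefixes, including enP*"
  vars:
    calico_interface_prefixes:   # hypothetical list; only enP* comes from this issue
      - "ens*"
      - "eth*"
      - "enp*"
      - "enP*"
  # command does not invoke a shell, so the asterisks reach kubectl unexpanded
  ansible.builtin.command:
    cmd: >-
      kubectl -n kube-system set env daemonset/calico-node
      IP_AUTODETECTION_METHOD=interface={{ calico_interface_prefixes | join(',') }}

- name: Install or upgrade GPU Operator under a fixed release name
  ansible.builtin.command:
    cmd: >-
      helm upgrade --install gpu-operator nvidia/gpu-operator
      --namespace {{ gpu_operator_namespace }}
      --create-namespace
      --version {{ gpu_operator_version }}
      --wait
```

Because the release name is fixed, rerunning the second task upgrades the existing release rather than creating another one.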

Validation

Local validation passed against the prepared patch series:

  • git diff --check
  • YAML parse checks for touched YAML files
  • Ansible syntax checks for:
    • playbooks/prerequisites.yaml
    • playbooks/nvidia-driver.yaml
    • playbooks/nvidia-docker.yaml
    • playbooks/operators-install.yaml
    • playbooks/k8s-install.yaml
    • playbooks/cns.yaml

Maintainer Review Checklist

  • Confirm whether gpu-operator is acceptable as the stable Helm release name.
  • Confirm Docker CE version defaults for CNS 16.0, 16.1, and 17.0.
  • Confirm the enP* Calico autodetection prefix matches expected GB200 NIC
    naming.
  • Confirm keyring paths:
    • /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    • /etc/apt/keyrings/cri-o-apt-keyring.gpg
    • /etc/apt/keyrings/docker.asc
    • /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  • Decide whether to split this into separate PRs by topic or accept as one
    field-fix bundle.
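
For the Docker CE item, a pinned install with an explicit apt architecture might look like the play below. It is a sketch under assumptions: docker_ce_version and docker_apt_arch are hypothetical variable names, the version value is a placeholder pending the per-release defaults listed in the checklist, and the repository key is assumed to already exist at /etc/apt/keyrings/docker.asc.

```yaml
# Sketch only: variable names and the placeholder version are assumptions.
- hosts: all
  become: true
  vars:
    # Placeholder: replace with the exact string shown by `apt-cache madison docker-ce`.
    docker_ce_version: "CHANGE_ME"
    # Map the kernel architecture fact to the apt architecture name.
    docker_apt_arch: "{{ {'x86_64': 'amd64', 'aarch64': 'arm64'}[ansible_architecture] }}"
  tasks:
    - name: Add the Docker repository with explicit arch and signed-by
      ansible.builtin.apt_repository:
        repo: >-
          deb [arch={{ docker_apt_arch }} signed-by=/etc/apt/keyrings/docker.asc]
          https://download.docker.com/linux/ubuntu
          {{ ansible_distribution_release }} stable
        filename: docker
        state: present
        update_cache: true

    - name: Install the pinned Docker CE packages
      ansible.builtin.apt:
        name:
          - "docker-ce={{ docker_ce_version }}"
          - "docker-ce-cli={{ docker_ce_version }}"
        state: present
```

Pinning through a variable keeps repeated runs on the same package build, and the architecture map lets one play cover both x86_64 and aarch64 hosts.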
