
[Bug]: Slurm Operator Does Not Respect Controller Service Configuration - Breaking StatefulSet Pod DNS Resolution #62

@jskswamy

Description

The Slurm operator ignores the controller.spec.service.spec.clusterIP setting in the Controller Custom Resource and always creates a regular ClusterIP Service, even when a headless Service is requested. This breaks the StatefulSet pod DNS resolution pattern that worker nodes depend on to connect to the controller.

Error symptoms:

error: _xgetaddrinfo: getaddrinfo(slurm-controller-0.slurm-controller.slurm:6817) failed: Name or service not known
error: slurm_set_addr: Unable to resolve "slurm-controller-0.slurm-controller.slurm"

Root cause: The operator ignores the clusterIP: None specification and always creates regular ClusterIP services. StatefulSet per-pod DNS names (pod-name.service-name.namespace) only resolve with headless services.

Impact: Installing the SlinkyProject Helm chart with headless service configuration fails due to this bug.

Steps to Reproduce

Environment:

  • Kubernetes cluster (any version)
  • slurm-operator installed
  • Helm chart deployment or direct Controller CR

Reproduction steps:

  1. Create a Controller CR with headless service configuration:

    apiVersion: slinky.slurm.net/v1alpha1
    kind: Controller
    metadata:
      name: slurm
      namespace: slurm
    spec:
      clusterName: slurm-test # Custom cluster name (doesn't affect DNS)
      service:
        spec:
          clusterIP: None # This should create headless service
      slurmKeyRef:
        name: slurm-auth-slurm
        key: slurm.key
      jwtHs256KeyRef:
        name: slurm-auth-jwths256
        key: jwt_hs256.key
  2. Deploy the Controller CR:

    kubectl apply -f controller.yaml
  3. Check the generated service:

    kubectl get service slurm-controller -n slurm -o yaml | grep clusterIP
    # Actual: clusterIP: 10.73.116.152  (regular ClusterIP)
    # Expected: clusterIP: null (headless service)
  4. Deploy worker NodeSet that references this controller

  5. Observe worker pod DNS resolution failure:

    kubectl logs -n slurm slurm-worker-slinky-0 -c slurmd
    # Shows: error: slurm_set_addr: Unable to resolve "slurm-controller-0.slurm-controller.slurm"

Alternative reproduction via Helm:

# values.yaml
clusterName: slurm-test
controller:
  service:
    spec:
      clusterIP: None

helm install slurm ./slurm -f values.yaml -n slurm
# Results in same DNS resolution failure

Expected Behavior

When a Controller CR specifies:

spec:
  service:
    spec:
      clusterIP: None

The operator should create a headless service:

apiVersion: v1
kind: Service
metadata:
  name: slurm-controller
spec:
  clusterIP: None # Headless service
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: slurm
    app.kubernetes.io/name: slurmctld
  ports:
    - name: slurmctld
      port: 6817
      protocol: TCP
      targetPort: slurmctld

This would enable StatefulSet pod DNS resolution:

# Should work
kubectl exec pod -- getent hosts slurm-controller-0.slurm-controller.slurm
# 10.72.1.133     slurm-controller-0.slurm-controller.slurm.svc.cluster.local

Additional Context

Code Location

The bug is in /internal/builder/controller_service.go:15-25:

func (b *Builder) BuildControllerService(controller *slinkyv1alpha1.Controller) (*corev1.Service, error) {
  spec := controller.Spec.Service
  opts := ServiceOpts{
    Key:         controller.ServiceKey(),
    Metadata:    controller.Spec.Template.PodMetadata,
    ServiceSpec: controller.Spec.Service.ServiceSpecWrapper.ServiceSpec,
    Selector: labels.NewBuilder().
      WithControllerSelectorLabels(controller).
      Build(),
    // Missing: Headless field is not being set based on ServiceSpec.ClusterIP
  }
  // ...
  return b.BuildService(opts, controller)
}

Root Cause Analysis

The BuildService function in /internal/builder/service.go properly supports headless services:

if opts.Headless {
  out.Spec.ClusterIP = corev1.ClusterIPNone
  out.Spec.PublishNotReadyAddresses = true
}

But BuildControllerService never sets opts.Headless = true even when the user specifies clusterIP: None.

Proposed Fix

Add this line to BuildControllerService:

Headless: controller.Spec.Service.ServiceSpecWrapper.ServiceSpec.ClusterIP == corev1.ClusterIPNone,
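
For reference, a minimal sketch of BuildControllerService with the fix applied. It reuses only the ServiceOpts fields and call chain shown in the excerpt above; anything not in that excerpt (such as the local serviceSpec variable) is illustrative rather than verified against the current source:

func (b *Builder) BuildControllerService(controller *slinkyv1alpha1.Controller) (*corev1.Service, error) {
  serviceSpec := controller.Spec.Service.ServiceSpecWrapper.ServiceSpec
  opts := ServiceOpts{
    Key:         controller.ServiceKey(),
    Metadata:    controller.Spec.Template.PodMetadata,
    ServiceSpec: serviceSpec,
    Selector: labels.NewBuilder().
      WithControllerSelectorLabels(controller).
      Build(),
    // Propagate the user's clusterIP: None request so that BuildService
    // emits a headless service with publishNotReadyAddresses: true.
    Headless: serviceSpec.ClusterIP == corev1.ClusterIPNone,
  }
  return b.BuildService(opts, controller)
}

With Headless set this way, the existing branch in BuildService (quoted under Root Cause Analysis) handles the rest, so no other change should be needed for the controller service to become headless.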

Workaround

Manual service fix after deployment:

  1. Scale down operator: kubectl scale deployment slurm-operator -n slurm --replicas=0
  2. Delete broken service: kubectl delete service slurm-controller -n slurm
  3. Create the headless service manually with clusterIP: None (for example, the Service manifest shown under Expected Behavior)
  4. Scale up operator: kubectl scale deployment slurm-operator -n slurm --replicas=1

Impact Assessment

  • Critical: Worker nodes cannot connect to controller
  • Common: Affects all Helm chart deployments that specify headless services
  • Silent: Configuration is ignored without warning, making debugging difficult
  • Standard pattern: Headless services are required for StatefulSet pod-to-pod communication
