Description
The Slurm operator ignores the `controller.spec.service.spec.clusterIP` configuration from the Controller Custom Resource, always creating a regular ClusterIP service even when a headless service is specified. This breaks the StatefulSet pod DNS resolution pattern that worker nodes depend on to connect to the controller.
Error symptoms:

```
error: xgetaddrinfo: getaddrinfo(slurm-controller-0.slurm-controller.slurm:6817) failed: Name or service not known
error: slurm_set_addr: Unable to resolve "slurm-controller-0.slurm-controller.slurm"
```

Root cause: The operator ignores the `clusterIP: None` specification and always creates a regular ClusterIP service. StatefulSet pod DNS names (`pod-name.service-name.namespace`) only resolve against headless services.
Impact: Installing the SlinkyProject Helm chart with headless service configuration fails due to this bug.
Steps to Reproduce
Environment:
- Kubernetes cluster (any version)
- slurm-operator installed
- Helm chart deployment or direct Controller CR
Reproduction steps:
1. Create a Controller CR with headless service configuration:

```yaml
apiVersion: slinky.slurm.net/v1alpha1
kind: Controller
metadata:
  name: slurm
  namespace: slurm
spec:
  clusterName: slurm-test  # Custom cluster name (doesn't affect DNS)
  service:
    spec:
      clusterIP: None  # This should create a headless service
  slurmKeyRef:
    name: slurm-auth-slurm
    key: slurm.key
  jwtHs256KeyRef:
    name: slurm-auth-jwths256
    key: jwt_hs256.key
```
2. Deploy the Controller CR:

```bash
kubectl apply -f controller.yaml
```
3. Check the generated service:

```bash
kubectl get service slurm-controller -n slurm -o yaml | grep clusterIP
# Actual:   clusterIP: 10.73.116.152 (regular ClusterIP)
# Expected: clusterIP: None (headless service)
```
4. Deploy a worker NodeSet that references this controller.
5. Observe the worker pod's DNS resolution failure:

```bash
kubectl logs -n slurm slurm-worker-slinky-0 -c slurmd
# Shows: error: slurm_set_addr: Unable to resolve "slurm-controller-0.slurm-controller.slurm"
```
Alternative reproduction via Helm:

```yaml
# values.yaml
clusterName: slurm-test
controller:
  service:
    spec:
      clusterIP: None
```

```bash
helm install slurm ./slurm -f values.yaml -n slurm
# Results in the same DNS resolution failure
```
Expected Behavior
When a Controller CR specifies:

```yaml
spec:
  service:
    spec:
      clusterIP: None
```
The operator should create a headless service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: slurm-controller
spec:
  clusterIP: None  # Headless service
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: slurm
    app.kubernetes.io/name: slurmctld
  ports:
    - name: slurmctld
      port: 6817
      protocol: TCP
      targetPort: slurmctld
```
This would enable StatefulSet pod DNS resolution:

```bash
# Should work
kubectl exec pod -- getent hosts slurm-controller-0.slurm-controller.slurm
# 10.72.1.133  slurm-controller-0.slurm-controller.slurm.svc.cluster.local
```
Additional Context
Code Location
The bug is in /internal/builder/controller_service.go:15-25:

```go
func (b *Builder) BuildControllerService(controller *slinkyv1alpha1.Controller) (*corev1.Service, error) {
	spec := controller.Spec.Service
	opts := ServiceOpts{
		Key:         controller.ServiceKey(),
		Metadata:    controller.Spec.Template.PodMetadata,
		ServiceSpec: spec.ServiceSpecWrapper.ServiceSpec,
		Selector: labels.NewBuilder().
			WithControllerSelectorLabels(controller).
			Build(),
		// Missing: Headless field is not being set based on ServiceSpec.ClusterIP
	}
	// ...
	return b.BuildService(opts, controller)
}
```
Root Cause Analysis
The BuildService function in /internal/builder/service.go properly supports headless services:

```go
if opts.Headless {
	out.Spec.ClusterIP = corev1.ClusterIPNone
	out.Spec.PublishNotReadyAddresses = true
}
```

But BuildControllerService never sets opts.Headless = true, even when the user specifies `clusterIP: None`.
Proposed Fix
Add this field to the ServiceOpts literal in BuildControllerService:

```go
Headless: controller.Spec.Service.ServiceSpecWrapper.ServiceSpec.ClusterIP == corev1.ClusterIPNone,
```
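For context, a sketch of the full function with the fix applied, assembled from the snippet quoted above (the elided parts are unchanged):

```go
func (b *Builder) BuildControllerService(controller *slinkyv1alpha1.Controller) (*corev1.Service, error) {
	spec := controller.Spec.Service
	opts := ServiceOpts{
		Key:         controller.ServiceKey(),
		Metadata:    controller.Spec.Template.PodMetadata,
		ServiceSpec: spec.ServiceSpecWrapper.ServiceSpec,
		// Propagate the user's clusterIP: None request so BuildService
		// takes its headless branch and sets publishNotReadyAddresses.
		Headless: spec.ServiceSpecWrapper.ServiceSpec.ClusterIP == corev1.ClusterIPNone,
		Selector: labels.NewBuilder().
			WithControllerSelectorLabels(controller).
			Build(),
	}
	// ...
	return b.BuildService(opts, controller)
}
```

With this change, `clusterIP: None` flows through to the headless branch of BuildService shown above, and the generated service matches the expected manifest.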
Workaround
Manual service fix after deployment:

1. Scale down the operator:

```bash
kubectl scale deployment slurm-operator -n slurm --replicas=0
```

2. Delete the broken service:

```bash
kubectl delete service slurm-controller -n slurm
```

3. Manually create a headless service with `clusterIP: None` (see the example manifest after this list).

4. Scale the operator back up:

```bash
kubectl scale deployment slurm-operator -n slurm --replicas=1
```
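For step 3, a manifest sketch based on the expected service shown under "Expected Behavior" (the selector labels and port are taken from that example; verify them against the service the operator originally generated before applying):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: slurm-controller
  namespace: slurm
spec:
  clusterIP: None  # headless, so slurm-controller-0.slurm-controller.slurm resolves
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: slurm
    app.kubernetes.io/name: slurmctld
  ports:
    - name: slurmctld
      port: 6817
      protocol: TCP
      targetPort: slurmctld
```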
Impact Assessment
- Critical: Worker nodes cannot connect to controller
- Common: Affects all Helm chart deployments that specify headless services
- Silent: Configuration is ignored without warning, making debugging difficult
- Standard pattern: Headless services are required for StatefulSet pod-to-pod communication