The GPU Scheduler uses a single ServiceAccount with three separate ClusterRoles for granular permission management:
- Scheduler Role - Comprehensive permissions for scheduling decisions
- Agent Role - Limited permissions for GPU inventory reporting
- Webhook Role - Minimal permissions for admission control
All three components use the same ServiceAccount but are bound to different ClusterRoles based on their needs.
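This pattern can be illustrated with a binding sketch (names follow the chart's conventions shown later in this document; the actual manifests are rendered by the Helm chart):

```yaml
# Sketch: one ClusterRoleBinding per ClusterRole, all pointing at the
# same ServiceAccount. The agent and webhook bindings look identical
# apart from the role name.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-scheduler-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-scheduler-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-scheduler
    namespace: gpu-scheduler # the Helm release namespace
```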
Name: gpu-scheduler (configurable via values.yaml)
Namespace: Same as the Helm release namespace
Used by:
- Scheduler deployment
- Webhook deployment
- Agent daemonset
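For reference, a minimal sketch of the ServiceAccount and of how a workload opts into it (the chart's deployments and daemonset set this automatically):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-scheduler
  namespace: gpu-scheduler # the Helm release namespace
---
# Each workload references it in its pod template:
# spec:
#   template:
#     spec:
#       serviceAccountName: gpu-scheduler
```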
The scheduler requires extensive read permissions to make informed scheduling decisions.
- pods, nodes, pods/status # Schedule and track pods
- pods/binding # Bind pods to nodes
- events # Record scheduling events
- namespaces # Understand namespace boundaries
- services # Service topology awareness
- configmaps # Authentication config (kube-system)
- replicationcontrollers # Pod controllers
- replicasets, statefulsets # Workload types
- persistentvolumes # Volume information
- persistentvolumeclaims # Volume claims
- storageclasses # Storage class details
- csinodes, csidrivers # CSI resources
- csistoragecapacities # Storage capacity
- volumeattachments # Volume attachments
- poddisruptionbudgets # Pod disruption policies
- leases (coordination.k8s.io) # GPU locking mechanism
- gpuclaims # GPU allocation requests
- gpuclaims/status # Claim status updates
- gpunodestatuses # GPU node inventory (read-only)

The agent needs minimal permissions to report GPU inventory.
- nodes # Read node information
- gpunodestatuses # Create and update GPU inventory
- gpunodestatuses/status # Update status subresource

Key Feature: The agent can create and patch GpuNodeStatus resources to report GPU availability and health.
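A sketch of what the agent's ClusterRole rules might look like, assuming the `gpu.scheduling` API group used elsewhere in this document (the authoritative rules live in charts/gpu-scheduler/templates/rbac.yaml):

```yaml
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["gpu.scheduling"]
  resources: ["gpunodestatuses"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["gpu.scheduling"]
  resources: ["gpunodestatuses/status"]
  verbs: ["get", "update", "patch"]
```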
The webhook requires minimal read-only access for validation.
- pods # Read pod specifications
- gpuclaims # Validate claim references

Security Note: The webhook operates with least privilege - read-only access only.
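The webhook's rules are small enough to show in full (a sketch, assuming the `gpu.scheduling` API group used elsewhere in this document):

```yaml
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: ["gpu.scheduling"]
  resources: ["gpuclaims"]
  verbs: ["get", "list"]
```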
| Resource | Scheduler | Agent | Webhook |
|---|---|---|---|
| pods | get, list, watch, update, patch | - | get, list |
| nodes | get, list, watch | get, list, watch | - |
| leases | get, list, watch, create, update, patch, delete | - | - |
| gpuclaims | get, list, watch, update, patch | - | get, list |
| gpunodestatuses | get, list, watch | get, list, watch, create, update, patch | - |
| gpunodestatuses/status | - | get, update, patch | - |
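As concrete rules, the scheduler rows of the matrix might render as follows (a sketch; binding a pod is conventionally a `create` on the `pods/binding` subresource, and the authoritative rules live in charts/gpu-scheduler/templates/rbac.yaml):

```yaml
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["pods/binding"]
  verbs: ["create"] # conventional verb for binding; not shown in the matrix
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"] # GPU locking mechanism
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```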
The RBAC resources are automatically created when you install the Helm chart:
```bash
helm install gpu-scheduler charts/gpu-scheduler
```

This creates:
- 1 ServiceAccount
- 3 ClusterRoles
- 3 ClusterRoleBindings
```bash
kubectl get serviceaccount gpu-scheduler -n gpu-scheduler

kubectl get clusterrole | grep gpu-scheduler
# Output:
# gpu-scheduler-scheduler
# gpu-scheduler-agent
# gpu-scheduler-webhook

kubectl get clusterrolebinding | grep gpu-scheduler
# Output:
# gpu-scheduler-scheduler
# gpu-scheduler-agent
# gpu-scheduler-webhook

# Check scheduler permissions
kubectl auth can-i list pods \
  --as=system:serviceaccount:gpu-scheduler:gpu-scheduler

# Check agent permissions
kubectl auth can-i patch gpunodestatuses/status \
  --as=system:serviceaccount:gpu-scheduler:gpu-scheduler

# Check webhook permissions
kubectl auth can-i get gpuclaims \
  --as=system:serviceaccount:gpu-scheduler:gpu-scheduler
```

If you see errors like:
"pods" is forbidden: User "system:serviceaccount:gpu-scheduler:gpu-scheduler"
cannot list resource "pods" in API group "" at the cluster scope
Solution:

- Verify ClusterRoleBindings exist:

  ```bash
  kubectl get clusterrolebinding gpu-scheduler-scheduler -o yaml
  ```

- Check that the binding references the correct ServiceAccount:

  ```yaml
  subjects:
    - kind: ServiceAccount
      name: gpu-scheduler
      namespace: gpu-scheduler # Should match your namespace
  ```

- Re-apply RBAC if needed:

  ```bash
  kubectl apply -f charts/gpu-scheduler/templates/rbac.yaml
  ```
Error:

```
"gpunodestatuses/status" is forbidden: cannot patch resource "gpunodestatuses/status"
```
Solution: Ensure the agent role includes the status subresource:
```bash
kubectl get clusterrole gpu-scheduler-agent -o yaml | grep -A2 gpunodestatuses
```

Should show:

```yaml
- apiGroups: ["gpu.scheduling"]
  resources: ["gpunodestatuses/status"]
  verbs: ["get", "update", "patch"]
```

Solution: Verify the webhook role exists and is bound:

```bash
kubectl get clusterrole gpu-scheduler-webhook
kubectl get clusterrolebinding gpu-scheduler-webhook
```

Each component only has the permissions it needs:
- ✅ Scheduler: Extensive read access, limited write to pods/leases
- ✅ Agent: Only writes to gpunodestatuses
- ✅ Webhook: Read-only access
Three separate ClusterRoles instead of one monolithic role:
- Easier to audit
- Principle of least privilege
- Clear permission boundaries
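One way to audit the effective permissions of each role is `kubectl auth can-i --list`, which prints every verb/resource pair the identity holds (requires a live cluster):

```
kubectl auth can-i --list \
  --as=system:serviceaccount:gpu-scheduler:gpu-scheduler
```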
- ❌ No `create` on pods (scheduler uses binding)
- ❌ No `delete` on pods
- ❌ No cluster-admin
- ❌ No access to secrets
All actions are logged with the ServiceAccount identity:
```
User: system:serviceaccount:gpu-scheduler:gpu-scheduler
```
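If your API server has audit logging enabled, a policy rule scoped to this identity might look like the following (a sketch, assuming the standard `audit.k8s.io/v1` policy format; the `Metadata` level is an illustrative choice):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata # record request metadata for this ServiceAccount
    users: ["system:serviceaccount:gpu-scheduler:gpu-scheduler"]
```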
Edit values.yaml:
```yaml
serviceAccountName: my-custom-sa
```

Then upgrade:

```bash
helm upgrade gpu-scheduler charts/gpu-scheduler
```

If you need additional permissions, edit charts/gpu-scheduler/templates/rbac.yaml:
```yaml
# Add to the appropriate ClusterRole
- apiGroups: ["custom.api.group"]
  resources: ["customresources"]
  verbs: ["get", "list"]
```

To limit permissions to specific namespaces, convert ClusterRoles to Roles:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-scheduler-scheduler
  namespace: specific-namespace
rules:
  # ... same rules ...
```

Note: The scheduler typically needs cluster-wide access to schedule pods across all namespaces.
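A namespaced Role also needs a matching RoleBinding in that namespace (a sketch; the subject still points at the ServiceAccount in its own namespace):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-scheduler-scheduler
  namespace: specific-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gpu-scheduler-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-scheduler
    namespace: gpu-scheduler # the ServiceAccount's own namespace
```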