Commit ee50119

Merge pull request #77 from opencadc/kueue-dep
Kueue dep
2 parents a655f7d + 5bc209b commit ee50119

14 files changed, +1034 -267 lines changed

helm/applications/canfar/README.md

Lines changed: 884 additions & 0 deletions
Large diffs are not rendered by default.
38.8 KB

helm/applications/canfar/skaha.png

82.2 KB

helm/applications/skaha/CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -1,4 +1,10 @@
-# CHANGELOG for Skaha User Session API (Chart 1.0.3)
+# CHANGELOG for Skaha User Session API (Chart 1.0.4)
+
+## 2025.09.11 (1.0.4)
+- Provide Kueue examples with documentation
+- Fix typo in headless priority class name setting
+- Fix environment variable name to properly be uppercase for headless priority class
+- Bump Skaha API image to `1.0.3`
 
 ## 2025.09.10 (1.0.3)
 - Fix to display GPU Cores

helm/applications/skaha/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -15,13 +15,13 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 1.0.3
+version: 1.0.4
 
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: "1.0.2"
+appVersion: "1.0.3"
 
 dependencies:
   - name: "redis"

helm/applications/skaha/README.md

Lines changed: 9 additions & 34 deletions
@@ -63,6 +63,7 @@ The following table lists the configurable parameters for the Skaha Helm chart:
 | `deployment.skaha.adminsGroup` | GMS style Group URI for Skaha admins to belong to | `""` |
 | `deployment.skaha.headlessGroup` | GMS style Group URI whose members can submit headless jobs | `""` |
 | `deployment.skaha.headlessPriorityGroup` | GMS style Group URI whose members' headless jobs can pre-empt others'. Useful for tight deadlines in processing | `""` |
+| `deployment.skaha.headlessPriorityClass` | Name of the `priorityClass` for headless jobs to allow some pre-emption | `""` |
 | `deployment.skaha.loggingGroups` | List of GMS style Group URIs whose members can alter the log level. See [cadc-log](https://github.com/opencadc/core/tree/main/cadc-log) regarding the `/logControl` endpoint. | `[]` |
 | `deployment.skaha.posixMapperResourceID` | Resource ID (URI) for the POSIX Mapper service containing the UIDs and GIDs | `""` |
 | `deployment.skaha.oidcURI` | URI (or URL) for the OIDC service | `""` |
@@ -84,10 +85,10 @@ The following table lists the configurable parameters for the Skaha Helm chart:
 | `deployment.skaha.sessions.minEphemeralStorage` | Minimum ephemeral storage, in [Kubernetes quantity](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/quantity/), for interactive sessions. Defaults to 20Gi. | `"20Gi"` |
 | `deployment.skaha.sessions.maxEphemeralStorage` | Maximum ephemeral storage, in [Kubernetes quantity](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/quantity/), for interactive sessions. Defaults to 200Gi. | `"200Gi"` |
 | `deployment.skaha.sessions.initContainerImage` | Init container image for Skaha User Sessions. | `redis-7.4.2-alpine3.21` |
-| `deployment.skaha.sessions.queue.default.queueName` | Name of the default `LocalQueue` instance from Kueue for all types | `""` |
-| `deployment.skaha.sessions.queue.default.priorityClass` | Name of the `priorityClass` for the all types to allow some pre-emption | `""` |
-| `deployment.skaha.sessions.queue.<typename>.queueName` | Name of the `LocalQueue` instance from Kueue for the given type | `""` |
-| `deployment.skaha.sessions.queue.<typename>.priorityClass` | Name of the `priorityClass` for the given type to allow some pre-emption | `""` |
+| `deployment.skaha.sessions.kueue.default.queueName` | Name of the default `LocalQueue` instance from Kueue for all types | `""` |
+| `deployment.skaha.sessions.kueue.default.priorityClass` | Name of the `priorityClass` for all types to allow some pre-emption | `""` |
+| `deployment.skaha.sessions.kueue.<typename>.queueName` | Name of the `LocalQueue` instance from Kueue for the given type | `""` |
+| `deployment.skaha.sessions.kueue.<typename>.priorityClass` | Name of the `priorityClass` for the given type to allow some pre-emption | `""` |
 | `deployment.skaha.sessions.hostname` | Hostname to access user sessions on. Defaults to `deployment.hostname` | `deployment.hostname` |
 | `deployment.skaha.sessions.tls` | TLS configuration for the User Sessions IngressRoute. | `{}` |
 | `deployment.skaha.sessions.extraVolumes` | List of extra `volume` and `volumeMount` to be mounted in User Sessions. See the `values.yaml` file for examples. | `[]` |
@@ -106,46 +107,20 @@ Ensure that `tolerations` and `nodeAffinity` are at the expected indentation! T
 ## Kueue
 Skaha leverages Kueue for efficient job queueing and management when properly installed and configured in your cluster. For detailed information on Kueue's features and setup, refer to the [Kueue documentation](https://kueue.sigs.k8s.io/docs/).
 
-Choosing to install Kueue:
-`values.yaml`
-```yaml
-kueue:
-  # Set to false by default
-  install: true
-```
+### Installation
+https://kueue.sigs.k8s.io/docs/installation/#install-a-released-version
 
-Will install the Kueue Chart, with a default `ClusterQueue`, and whatever defined `LocalQueues` were declared in the `deployment.skaha.sessions.queue` section:
+Will install the Kueue Chart, with a default `ClusterQueue`, and whatever defined `LocalQueues` were declared in the `deployment.skaha.sessions.kueue` section:
 ```yaml
 deployment:
   skaha:
     sessions:
-      queue:
+      kueue:
         notebook:
           queueName: some-local-queue
           priorityClass: med
 ```
 
-In which case Helm would ensure the `some-local-queue` `LocalQueue` is installed.
-
-Kueue will also need to know about the Kubernetes Cluster configuration. Setting the values to 60% to 80% of the cluster resources is recommended for optimal performance.
-```yaml
-kueue:
-  install: true
-  # 60% of cluster resources
-  clusterQueueResources:
-    - name: "cpu"
-      nominalQuota: "28"
-      borrowingLimit: "0"
-      lendingLimit: "0"
-    - name: "memory"
-      nominalQuota: "100Gi"
-      borrowingLimit: "0Gi"
-      lendingLimit: "0Gi"
-    - name: "ephemeral-storage"
-      nominalQuota: "500Gi"
-      borrowingLimit: "0Gi"
-      lendingLimit: "0Gi"
-```
 
 To determine your cluster's allocatable resources, check out a small Python utility (requires [`uv`](https://github.com/astral-sh/uv?tab=readme-ov-file#installation)):
 https://github.com/opencadc/deployments/tree/main/configs/kueue/kueuer
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+# Example Cluster Queue configuration for Kueue. Usually a single ClusterQueue is sufficient, but read the documentation for more details.
+# @see https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
+#
+# Priority Classes are used in Skaha's configuration to allow Workloads to be prioritized.
+#
+# jenkinsd 2025.09.11
+#
+---
+# Resource Flavors
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: canfar-science-platform-default
+---
+# ClusterQueue
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: canfar-science-platform-cluster-queue
+spec:
+  namespaceSelector:
+    matchExpressions:
+      - key: kubernetes.io/metadata.name
+        operator: In
+        values: [ skaha-workload ]
+  queueingStrategy: BestEffortFIFO
+  cohort: canfar-science-platform-cohort
+  resourceGroups:
+    - coveredResources:
+        - "cpu"
+        - "memory"
+        - "ephemeral-storage"
+        - "nvidia.com/gpu"
+      flavors:
+        - name: "canfar-science-platform-default"
+          # These values represent the total resources available to this ClusterQueue, and require some knowledge of the cluster's capacity.
+          # Omit the nvidia.com/gpu resource if GPUs are not available in the cluster (also from the coveredResources above).
+          resources:
+            - name: "cpu"
+              nominalQuota: "1680"
+              borrowingLimit: "420"
+              lendingLimit: "420"
+            - name: "memory"
+              nominalQuota: "6562Gi"
+              borrowingLimit: "1640Gi"
+              lendingLimit: "1640Gi"
+            - name: "ephemeral-storage"
+              nominalQuota: "52636Gi"
+              borrowingLimit: "13159Gi"
+              lendingLimit: "13159Gi"
+            - name: "nvidia.com/gpu"
+              nominalQuota: "18"
+              borrowingLimit: "0"
+              lendingLimit: "0"
+  preemption:
+    reclaimWithinCohort: LowerPriority
+    borrowWithinCohort:
+      policy: LowerPriority
+      maxPriorityThreshold: 10000
+    withinClusterQueue: LowerPriority
+  stopPolicy: None
+---
+# WorkloadPriorityClass
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: WorkloadPriorityClass
+metadata:
+  name: low
+value: 10000
+description: "Low Priority"
+---
+# WorkloadPriorityClass
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: WorkloadPriorityClass
+metadata:
+  name: medium
+value: 100000
+description: "Medium Priority"
+---
+# WorkloadPriorityClass
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: WorkloadPriorityClass
+metadata:
+  name: high
+value: 1000000
+description: "High Priority"
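
The comment in the ClusterQueue example says to omit `nvidia.com/gpu` from both `coveredResources` and `resources` when the cluster has no GPUs. A minimal sketch of that variant follows. The quota numbers are placeholders (an assumption, not values from this commit) and should be replaced with figures derived from your own cluster's allocatable capacity. The sketch reuses the `canfar-science-platform-default` ResourceFlavor and assumes it has already been applied.

```yaml
# Sketch only: CPU/memory/storage ClusterQueue for a cluster without GPUs.
# All quota values below are illustrative placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: canfar-science-platform-cluster-queue
spec:
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values: [ skaha-workload ]
  queueingStrategy: BestEffortFIFO
  cohort: canfar-science-platform-cohort
  resourceGroups:
    # nvidia.com/gpu is removed here AND from the resources list below.
    - coveredResources:
        - "cpu"
        - "memory"
        - "ephemeral-storage"
      flavors:
        - name: "canfar-science-platform-default"
          resources:
            - name: "cpu"
              nominalQuota: "28"       # placeholder
              borrowingLimit: "0"
              lendingLimit: "0"
            - name: "memory"
              nominalQuota: "100Gi"    # placeholder
              borrowingLimit: "0Gi"
              lendingLimit: "0Gi"
            - name: "ephemeral-storage"
              nominalQuota: "500Gi"    # placeholder
              borrowingLimit: "0Gi"
              lendingLimit: "0Gi"
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
```

Every entry in `coveredResources` needs a matching entry under `resources` for each flavor, which is why the GPU line has to be dropped in both places.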
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Example Local Queue configuration for Kueue. This assumes the skaha-workload namespace exists, and the referenced ClusterQueue exists.
+# This is a single LocalQueue, but feel free to add as many as needed.
+# @see https://kueue.sigs.k8s.io/docs/concepts/local_queue/
+#
+# jenkinsd 2025.09.11
+#
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  namespace: skaha-workload
+  name: canfar-science-platform-local-queue
+spec:
+  clusterQueue: canfar-science-platform-cluster-queue
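
To tie the examples together: once the LocalQueue and WorkloadPriorityClass objects have been applied, a Skaha `values.yaml` could reference them through the `deployment.skaha.sessions.kueue` keys documented in the chart README. The fragment below is a sketch, not a tested configuration. It assumes that `priorityClass` takes the name of a Kueue WorkloadPriorityClass (`low`, `medium`, `high` in the example file), and it borrows the per-type `notebook` key from the README example.

```yaml
# Sketch only: wiring Skaha sessions to the example Kueue objects.
deployment:
  skaha:
    # Assumed to name one of the example WorkloadPriorityClass objects.
    headlessPriorityClass: high
    sessions:
      kueue:
        # Fallback for every session type without its own entry.
        default:
          queueName: canfar-science-platform-local-queue
          priorityClass: low
        # Per-type override; "notebook" comes from the README example.
        notebook:
          queueName: canfar-science-platform-local-queue
          priorityClass: medium
```

Because the chart no longer installs Kueue itself, the queue and priority class names here must match objects that already exist in the cluster.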
Lines changed: 14 additions & 16 deletions
@@ -1,4 +1,3 @@
-{{- if eq .Values.kueue.install true }}
 # Grant the service account from the Released and Workload namespaces access to the Kueue resources
 #
 # Workload Namespace RBAC
@@ -7,8 +6,8 @@
 apiVersion: rbac.authorization.k8s.io/v1
 kind: Role
 metadata:
-  name: {{ .Release.Name }}-kueue-role
-  namespace: {{ .Values.skahaWorkload.namespace }}
+  name: canfar-science-platform-kueue-role
+  namespace: skaha-workload
 rules:
   - apiGroups:
       - "kueue.x-k8s.io"
@@ -22,16 +21,16 @@ rules:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
 metadata:
-  name: {{ .Release.Name }}-read-kueue
-  namespace: {{ .Values.skahaWorkload.namespace }}
+  name: canfar-science-platform-read-kueue
+  namespace: skaha-workload
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: Role
-  name: {{ .Release.Name }}-kueue-role
+  name: canfar-science-platform-kueue-role
 subjects:
   - kind: ServiceAccount
-    name: {{ .Values.deployment.skaha.serviceAccountName }}
-    namespace: {{ .Release.Namespace }}
+    name: skaha
+    namespace: skaha-system
 
 #
 # Released Namespace RBAC
@@ -40,8 +39,8 @@ subjects:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: Role
 metadata:
-  name: {{ .Release.Name }}-kueue-role
-  namespace: {{ .Release.Namespace }}
+  name: canfar-science-platform-kueue-role
+  namespace: skaha-system
 rules:
   - apiGroups:
       - "kueue.x-k8s.io"
@@ -55,14 +54,13 @@ rules:
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
 metadata:
-  name: {{ .Release.Name }}-read-kueue
-  namespace: {{ .Release.Namespace }}
+  name: canfar-science-platform-read-kueue
+  namespace: skaha-system
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: Role
-  name: {{ .Release.Name }}-kueue-role
+  name: canfar-science-platform-kueue-role
 subjects:
   - kind: ServiceAccount
-    name: {{ .Values.deployment.skaha.serviceAccountName }}
-    namespace: {{ .Release.Namespace }}
-{{- end }}
+    name: skaha
+    namespace: skaha-system

helm/applications/skaha/templates/NOTES.txt

Lines changed: 10 additions & 13 deletions
@@ -8,29 +8,26 @@ Namespace: {{ .Release.Namespace }}
 Skaha workload namespace: {{ .Values.skahaWorkload.namespace | default "skaha-workload" }}
 
 Kueue:
-{{- if .Values.kueue.install }}
-- Installed by this release: yes (chart {{ "kueue" }})
-- ClusterQueue: {{ .Release.Name }}-cluster-queue
-- LocalQueues (in {{ .Values.skahaWorkload.namespace }}):
-{{- $queues := (default dict .Values.deployment.skaha.sessions.queue) }}
+{{- if .Values.deployment.skaha.sessions.kueue }}
+- The Kueue CRDs and controllers are expected to be present if you set session queues.
+{{- $queues := (default dict .Values.deployment.skaha.sessions.kueue) }}
 {{- if $queues }}
 {{- range $type, $cfg := $queues }}
 - {{ $cfg.queueName | default (printf "%s-%s" $.Release.Name $type) }} (type: {{ $type }})
 {{- end }}
 {{- else }}
-- none configured; Jobs will run without Kueue unless annotated by Skaha.
+- none configured; Jobs will NOT run with Kueue unless annotated by Skaha.
 {{- end }}
 {{- else }}
-- Installed by this release: no
-- Expect Kueue CRDs and controllers to be present if you set session queues.
+- Not configured
 {{- end }}
 
 Quick checks:
-- kubectl get deployment {{ .Release.Name }}-skaha-tomcat -n {{ .Release.Namespace }}
-- kubectl get clusterqueue {{ .Release.Name }}-cluster-queue 2>/dev/null || true
-- kubectl get localqueue -n {{ .Values.skahaWorkload.namespace }} 2>/dev/null || true
+- kubectl get deployment {{ .Release.Name }}-skaha-tomcat -n {{ .Release.Namespace }}
+- kubectl get clusterqueues 2>/dev/null || true
+- kubectl get localqueues -n {{ .Values.skahaWorkload.namespace }} 2>/dev/null || true
 
 If user Jobs remain Pending:
-- Ensure LocalQueue exists in {{ .Values.skahaWorkload.namespace }} and references ClusterQueue {{ .Release.Name }}-cluster-queue.
-- Verify ClusterQueue quotas (Values.kueue.clusterQueueResources) are sufficient.
+- Ensure LocalQueue exists in {{ .Values.skahaWorkload.namespace }} and references your configured ClusterQueue.
+- Verify ClusterQueue quotas are sufficient.
 - Confirm priority classes referenced by sessions exist or are created.
