This module creates a Slinky cluster and nodesets for a Slurm-on-Kubernetes HPC setup.
The setup closely follows the documented quickstart installation (v0.3.1), except for a more lightweight monitoring/metrics setup: consider scraping the Slurm Exporter with Google Managed Prometheus and a PodMonitoring resource, rather than a cluster-local Kube Prometheus Stack (although both are possible with module parameterizations). The module also provisions a login node (pod).
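As a sketch of that lighter-weight path, a PodMonitoring resource for Managed Prometheus could look like the following. Assumptions: Managed Prometheus is enabled on the cluster, the exporter runs in the `slurm` namespace, and the label selector and port name match your Slurm Exporter deployment — adjust them before applying.

```yaml
# Hypothetical PodMonitoring manifest for Google Managed Prometheus.
# The label selector and port name are assumptions; match them to your
# Slurm Exporter deployment.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: slurm-exporter
  namespace: slurm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: slurm-exporter
  endpoints:
    - port: metrics
      interval: 30s
```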
Through `cert_manager_values`, `prometheus_values`, `slurm_operator_values`, and `slurm_values`, you can customize the Helm releases that constitute Slinky. The Cert Manager, Slurm Operator, and Slurm Helm installations are required, whereas the Prometheus Helm chart is optional (and not included by default). Set `install_kube_prometheus_stack=true` to install Prometheus.
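For instance, opting into the cluster-local Prometheus stack and overriding one of its chart values could look like this in a blueprint (a sketch; `grafana.enabled` is an illustrative kube-prometheus-stack key passed through via `prometheus_values`, not a module requirement):

```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      install_kube_prometheus_stack: true
      prometheus_values:
        grafana:
          enabled: false # illustrative override passed through to the chart
```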
```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      slurm_values:
        compute:
          nodesets:
            - name: h3
              enabled: true
              replicas: 2
              image:
                # Use the default nodeset image
                repository: ""
                tag: ""
              resources:
                requests:
                  cpu: 86
                  memory: 324Gi
                limits:
                  cpu: 86
                  memory: 324Gi
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: "node.kubernetes.io/instance-type"
                            operator: In
                            values:
                              - h3-standard-88
              partition:
                enabled: true
        login: # Login node
          enabled: true
          replicas: 1
          rootSshAuthorizedKeys: []
          image:
            # Use the default login image
            repository: ""
            tag: ""
          resources:
            requests:
              cpu: 500m
              memory: 4Gi
            limits:
              cpu: 500m
              memory: 4Gi
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: "node.kubernetes.io/instance-type"
                        operator: In
                        values:
                          - e2-standard-8 # base_pool's machine-type
```

This creates a Slinky cluster with the following attributes:
- Slinky Helm releases are installed atop the `gke_cluster` (from the `gke-cluster` module).
- Slinky system components and a login node are scheduled on the `base_pool` (from the `gke-node-pool` module).
  - This node affinity specification is recommended, to save HPC hardware for HPC nodesets, and to ensure Helm releases are fully uninstalled before all node pools are deleted during a `gcluster destroy`.
- One Slurm nodeset is provisioned, with resource requests/limits and node affinities aligned to h3-standard-88 VMs.
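Placement of Slinky system components can also be steered with the module's `node_pool_names` input, when it is not wired in via `use`. A minimal sketch (the pool name `base-pool` is a hypothetical example; substitute the names of your actual node pools):

```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      # Hypothetical pool name; use the names of your gke-node-pool modules.
      node_pool_names: [base-pool]
```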
To test Slurm functionality, connect to the controller or the login node and use Slurm client commands:
```shell
gcloud container clusters get-credentials YOUR_CLUSTER --region YOUR_REGION
```

Connect to the controller:
```shell
kubectl exec -it statefulsets/slurm-controller \
  --namespace=slurm \
  -- bash --login
```

Connect to the login node:
```shell
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
## Assuming your public SSH key was configured in `login.rootSshAuthorizedKeys[]`.
ssh -p 2222 root@${SLURM_LOGIN_IP}
## Assuming SSSD is configured.
ssh -p 2222 ${USER}@${SLURM_LOGIN_IP}
```

On the connected pod (e.g. host `slurm@slurm-controller-0`), run the following commands to quickly test if Slurm is functioning:
```shell
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
```

## Requirements

| Name | Version |
|---|---|
| terraform | = 1.12.2 |
| google | >= 6.16 |
| helm | ~> 2.17 |
## Providers

| Name | Version |
|---|---|
| google | >= 6.16 |
| helm | ~> 2.17 |
## Modules

No modules.
## Resources

| Name | Type |
|---|---|
| helm_release.cert_manager | resource |
| helm_release.prometheus | resource |
| helm_release.slurm | resource |
| helm_release.slurm_operator | resource |
| google_client_config.default | data source |
| google_container_cluster.gke_cluster | data source |
## Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| cert_manager_chart_version | Version of the Cert Manager chart to install. | `string` | `"v1.18.2"` | no |
| cert_manager_values | Value overrides for the Cert Manager release. | `any` | `{...}` | no |
| cluster_id | An identifier for the GKE cluster resource, with format `projects/<project_id>/locations/<location>/clusters/<cluster_name>`. | `string` | n/a | yes |
| install_kube_prometheus_stack | Install the Kube Prometheus Stack. | `bool` | `false` | no |
| install_slurm_chart | Install the Slurm chart. | `bool` | `true` | no |
| install_slurm_operator_chart | Install the Slurm Operator chart. | `bool` | `true` | no |
| node_pool_names | Names of node pools, for use in node affinities (Slinky system components). | `list(string)` | `null` | no |
| project_id | The project ID that hosts the GKE cluster. | `string` | n/a | yes |
| prometheus_chart_version | Version of the Kube Prometheus Stack chart to install. | `string` | `"77.0.1"` | no |
| prometheus_values | Value overrides for the Prometheus release. | `any` | `{...}` | no |
| slurm_chart_version | Version of the Slurm chart to install. | `string` | `"0.3.1"` | no |
| slurm_namespace | Namespace for the Slurm chart. | `string` | `"slurm"` | no |
| slurm_operator_chart_version | Version of the Slurm Operator chart to install. | `string` | `"0.3.1"` | no |
| slurm_operator_namespace | Namespace for the Slurm Operator chart. | `string` | `"slinky"` | no |
| slurm_operator_repository | Repository of the Slurm Operator chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_operator_values | Value overrides for the Slurm Operator release. | `any` | `{}` | no |
| slurm_repository | Repository of the Slurm chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_values | Value overrides for the Slurm release. | `any` | `{}` | no |
## Outputs

| Name | Description |
|---|---|
| slurm_namespace | Namespace for the Slurm chart. |
| slurm_operator_namespace | Namespace for the Slinky operator chart. |