Description

This module creates a Slinky cluster and nodesets, for a Slurm-on-Kubernetes HPC setup.

The setup closely follows the documented quickstart installation v0.3.1, with the exception of a lighter-weight monitoring/metrics setup. Consider scraping the Slurm Exporter with Google Managed Prometheus via a PodMonitoring resource, rather than running a cluster-local Kube Prometheus Stack (both are possible through module parameterization). The module also provisions a login node (pod).
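
For the lighter-weight option, a Managed Prometheus scrape of the Slurm Exporter can be declared with a PodMonitoring resource. A minimal sketch, assuming the exporter runs in the slurm namespace; the label selector and port name below are illustrative and must match your exporter pod:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: slurm-exporter
  namespace: slurm
spec:
  selector:
    matchLabels:
      # Illustrative label; match your Slurm Exporter pod's labels.
      app.kubernetes.io/name: slurm-exporter
  endpoints:
  - port: metrics # Illustrative port name on the exporter pod.
    interval: 30s
```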

Through cert_manager_values, prometheus_values, slurm_operator_values, and slurm_values, you can customize the Helm releases that constitute Slinky. The Cert Manager, Slurm Operator, and Slurm Helm releases are required, while the Kube Prometheus Stack chart is optional and not installed by default; set install_kube_prometheus_stack=true to install it.
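
If you do opt into the cluster-local stack, the release can be tuned through prometheus_values. A hedged sketch of the relevant blueprint settings (the grafana override is illustrative, not a module default):

```yaml
- id: slinky
  source: community/modules/scheduler/slinky
  use: [gke_cluster, base_pool]
  settings:
    install_kube_prometheus_stack: true
    prometheus_values:
      installCRDs: true
      # Illustrative override: disable the bundled Grafana.
      grafana:
        enabled: false
```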

Example

```yaml
- id: slinky
  source: community/modules/scheduler/slinky
  use: [gke_cluster, base_pool]
  settings:
    slurm_values:
      compute:
        nodesets:
        - name: h3
          enabled: true
          replicas: 2
          image:
            # Use the default nodeset image
            repository: ""
            tag: ""
          resources:
            requests:
              cpu: 86
              memory: 324Gi
            limits:
              cpu: 86
              memory: 324Gi
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values:
                    - h3-standard-88
          partition:
            enabled: true
      login: # Login node
        enabled: true
        replicas: 1
        rootSshAuthorizedKeys: []
        image:
          # Use the default login image
          repository: ""
          tag: ""
        resources:
          requests:
            cpu: 500m
            memory: 4Gi
          limits:
            cpu: 500m
            memory: 4Gi
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "node.kubernetes.io/instance-type"
                  operator: In
                  values:
                  - e2-standard-8 # base_pool's machine-type
```

This creates a Slinky cluster with the following attributes:

  • Slinky Helm releases are installed atop the gke_cluster (from the gke-cluster module).
  • Slinky system components and a login node are scheduled on the base_pool (from the gke-node-pool module).
    • This node affinity specification is recommended: it reserves HPC hardware for the HPC nodesets, and it ensures Helm releases are fully uninstalled before the node pools are deleted during a gcluster destroy.
  • One Slurm nodeset is provisioned, with resource requests/limits and node affinities aligned to h3-standard-88 VMs.
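
The example above pins pods by instance type in the Helm values; alternatively, the module's node_pool_names input steers Slinky system components onto specific pools. A minimal sketch (the pool name here is illustrative):

```yaml
- id: slinky
  source: community/modules/scheduler/slinky
  use: [gke_cluster, base_pool]
  settings:
    # Illustrative pool name; Slinky system components receive
    # node affinities for the listed pools.
    node_pool_names: ["base-pool"]
```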

Usage

To test Slurm functionality, connect to the controller or the login node and use Slurm client commands:

```shell
gcloud container clusters get-credentials YOUR_CLUSTER --region YOUR_REGION
```

Connect to the controller:

```shell
kubectl exec -it statefulsets/slurm-controller \
  --namespace=slurm \
  -- bash --login
```

Connect to the login node:

```shell
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
## Assuming your public SSH key was configured in `login.rootSshAuthorizedKeys[]`.
ssh -p 2222 root@${SLURM_LOGIN_IP}
## Assuming SSSD is configured.
ssh -p 2222 ${USER}@${SLURM_LOGIN_IP}
```

On the connected pod (for example, as slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

```shell
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
```
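
The submit-and-check steps above can also be scripted. A sketch using standard Slurm CLI flags, assuming a shell on the controller or login pod with accounting enabled:

```shell
# Submit a 60-second test job; --parsable prints only the job ID.
JOB_ID="$(sbatch --parsable --wrap='sleep 60')"

# Show the job while it is pending or running...
squeue --jobs "${JOB_ID}"

# ...and its final state once it completes (requires accounting).
sacct --jobs "${JOB_ID}" --format=JobID,State,Elapsed
```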

Requirements

| Name | Version |
|------|---------|
| terraform | = 1.12.2 |
| google | >= 6.16 |
| helm | ~> 2.17 |

Providers

| Name | Version |
|------|---------|
| google | >= 6.16 |
| helm | ~> 2.17 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| helm_release.cert_manager | resource |
| helm_release.prometheus | resource |
| helm_release.slurm | resource |
| helm_release.slurm_operator | resource |
| google_client_config.default | data source |
| google_container_cluster.gke_cluster | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| cert_manager_chart_version | Version of the Cert Manager chart to install. | `string` | `"v1.18.2"` | no |
| cert_manager_values | Value overrides for the Cert Manager release. | `any` | `{"crds": {"enabled": true}}` | no |
| cluster_id | An identifier for the GKE cluster resource, with format `projects/<project_id>/locations/<location>/clusters/<cluster_name>`. | `string` | n/a | yes |
| install_kube_prometheus_stack | Install the Kube Prometheus Stack. | `bool` | `false` | no |
| install_slurm_chart | Install the Slurm chart. | `bool` | `true` | no |
| install_slurm_operator_chart | Install the Slurm Operator chart. | `bool` | `true` | no |
| node_pool_names | Names of node pools, for use in node affinities (Slinky system components). | `list(string)` | `null` | no |
| project_id | The project ID that hosts the GKE cluster. | `string` | n/a | yes |
| prometheus_chart_version | Version of the Kube Prometheus Stack chart to install. | `string` | `"77.0.1"` | no |
| prometheus_values | Value overrides for the Prometheus release. | `any` | `{"installCRDs": true}` | no |
| slurm_chart_version | Version of the Slurm chart to install. | `string` | `"0.3.1"` | no |
| slurm_namespace | Namespace for the Slurm chart. | `string` | `"slurm"` | no |
| slurm_operator_chart_version | Version of the Slurm Operator chart to install. | `string` | `"0.3.1"` | no |
| slurm_operator_namespace | Namespace for the Slurm Operator chart. | `string` | `"slinky"` | no |
| slurm_operator_repository | Repository of the Slurm Operator chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_operator_values | Value overrides for the Slurm Operator release. | `any` | `{}` | no |
| slurm_repository | Repository of the Slurm chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_values | Value overrides for the Slurm release. | `any` | `{}` | no |

Outputs

| Name | Description |
|------|-------------|
| slurm_namespace | Namespace for the Slurm chart. |
| slurm_operator_namespace | Namespace for the Slinky operator chart. |