This module creates a Slinky cluster and nodesets for a Slurm-on-Kubernetes HPC setup.
The setup closely follows the documented quickstart installation (v0.3.1), except for a more lightweight monitoring/metrics setup: consider scraping the Slurm Exporter with Google Managed Prometheus and a PodMonitoring resource, rather than a cluster-local Kube Prometheus Stack (although both are possible with module parameterizations). The module also provisions a login node (pod).
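As a sketch of that lighter-weight path, a PodMonitoring resource for Managed Prometheus could look like the following. Assumptions: Managed Prometheus is enabled on the cluster, the exporter runs in the `slurm` namespace, and the label selector and port name match your Slurm Exporter deployment — adjust them before applying.

```yaml
# Hypothetical PodMonitoring manifest for Google Managed Prometheus.
# The label selector and port name are assumptions; match them to your
# Slurm Exporter deployment.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: slurm-exporter
  namespace: slurm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: slurm-exporter
  endpoints:
    - port: metrics
      interval: 30s
```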
Through `cert_manager_values`, `prometheus_values`, `slurm_operator_values`, and `slurm_values`, you can customize the Helm releases that constitute Slinky. The Cert Manager, Slurm Operator, and Slurm Helm installations are required, whereas the Prometheus Helm chart is optional (and not included by default). Set `install_kube_prometheus_stack=true` to install Prometheus.
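For instance, opting into the cluster-local Prometheus stack and overriding one of its chart values could look like this in a blueprint (a sketch; `grafana.enabled` is an illustrative kube-prometheus-stack key passed through via `prometheus_values`, not a module requirement):

```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      install_kube_prometheus_stack: true
      prometheus_values:
        grafana:
          enabled: false # illustrative override passed through to the chart
```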
```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      slurm_values:
        compute:
          nodesets:
            - name: h3
              enabled: true
              replicas: 2
              image:
                # Use the default nodeset image
                repository: ""
                tag: ""
              resources:
                requests:
                  cpu: 86
                  memory: 324Gi
                limits:
                  cpu: 86
                  memory: 324Gi
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: "node.kubernetes.io/instance-type"
                            operator: In
                            values:
                              - h3-standard-88
              partition:
                enabled: true
        login: # Login node
          enabled: true
          replicas: 1
          rootSshAuthorizedKeys: []
          image:
            # Use the default login image
            repository: ""
            tag: ""
          resources:
            requests:
              cpu: 500m
              memory: 4Gi
            limits:
              cpu: 500m
              memory: 4Gi
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: "node.kubernetes.io/instance-type"
                        operator: In
                        values:
                          - e2-standard-8 # base_pool's machine-type
```

This creates a Slinky cluster with the following attributes:
- Slinky Helm releases are installed atop the `gke_cluster` (from the `gke-cluster` module).
- Slinky system components and a login node are scheduled on the `base_pool` (from the `gke-node-pool` module).
  - This node affinity specification is recommended, to save HPC hardware for HPC nodesets, and to ensure Helm releases are fully uninstalled before all node pools are deleted during a `gcluster destroy`.
- One Slurm nodeset is provisioned, with resource requests/limits and node affinities aligned to h3-standard-88 VMs.
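Placement of Slinky system components can also be steered with the module's `node_pool_names` input, when it is not wired in via `use`. A minimal sketch (the pool name `base-pool` is a hypothetical example; substitute the names of your actual node pools):

```yaml
  - id: slinky
    source: community/modules/scheduler/slinky
    use: [gke_cluster, base_pool]
    settings:
      # Hypothetical pool name; use the names of your gke-node-pool modules.
      node_pool_names: [base-pool]
```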
To test Slurm functionality, connect to the controller or the login node and use Slurm client commands:
```shell
gcloud container clusters get-credentials YOUR_CLUSTER --region YOUR_REGION
```

Connect to the controller:
```shell
kubectl exec -it statefulsets/slurm-controller \
  --namespace=slurm \
  -- bash --login
```

Connect to the login node:
```shell
SLURM_LOGIN_IP="$(kubectl get services -n slurm -l app.kubernetes.io/instance=slurm,app.kubernetes.io/name=login -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
## Assuming your public SSH key was configured in `login.rootSshAuthorizedKeys[]`.
ssh -p 2222 root@${SLURM_LOGIN_IP}
## Assuming SSSD is configured.
ssh -p 2222 ${USER}@${SLURM_LOGIN_IP}
```

On the connected pod (e.g. host `slurm@slurm-controller-0`), run the following commands to quickly test if Slurm is functioning:
```shell
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
```

## Requirements

| Name | Version |
|---|---|
| terraform | = 1.12.2 |
| google | >= 6.16 |
| helm | ~> 2.17 |
## Providers

| Name | Version |
|---|---|
| google | >= 6.16 |
| helm | ~> 2.17 |
## Modules

No modules.
## Resources

| Name | Type |
|---|---|
| helm_release.cert_manager | resource |
| helm_release.prometheus | resource |
| helm_release.slurm | resource |
| helm_release.slurm_operator | resource |
| google_client_config.default | data source |
| google_container_cluster.gke_cluster | data source |
## Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| cert_manager_chart_version | Version of the Cert Manager chart to install. | `string` | `"v1.18.2"` | no |
| cert_manager_values | Value overrides for the Cert Manager release. | `any` | `{...}` | no |
| cluster_id | An identifier for the GKE cluster resource, with format `projects/<project_id>/locations/<location>/clusters/<cluster_name>`. | `string` | n/a | yes |
| install_kube_prometheus_stack | Install the Kube Prometheus Stack. | `bool` | `false` | no |
| install_slurm_chart | Install the Slurm chart. | `bool` | `true` | no |
| install_slurm_operator_chart | Install the Slurm Operator chart. | `bool` | `true` | no |
| node_pool_names | Names of node pools, for use in node affinities (Slinky system components). | `list(string)` | `null` | no |
| project_id | The project ID that hosts the GKE cluster. | `string` | n/a | yes |
| prometheus_chart_version | Version of the Kube Prometheus Stack chart to install. | `string` | `"77.0.1"` | no |
| prometheus_values | Value overrides for the Prometheus release. | `any` | `{...}` | no |
| slurm_chart_version | Version of the Slurm chart to install. | `string` | `"0.3.1"` | no |
| slurm_namespace | Namespace for the Slurm chart. | `string` | `"slurm"` | no |
| slurm_operator_chart_version | Version of the Slurm Operator chart to install. | `string` | `"0.3.1"` | no |
| slurm_operator_namespace | Namespace for the Slurm Operator chart. | `string` | `"slinky"` | no |
| slurm_operator_repository | Repository of the Slurm Operator chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_operator_values | Value overrides for the Slurm Operator release. | `any` | `{}` | no |
| slurm_repository | Repository of the Slurm chart. | `string` | `"oci://ghcr.io/slinkyproject/charts"` | no |
| slurm_values | Value overrides for the Slurm release. | `any` | `{}` | no |
## Outputs

| Name | Description |
|---|---|
| slurm_namespace | Namespace for the Slurm chart. |
| slurm_operator_namespace | Namespace for the Slinky operator chart. |