58 changes: 29 additions & 29 deletions README.md
@@ -34,50 +34,38 @@ All config files in `slurm-cluster-chart/files` will be mounted into the contain

## Deploying the Cluster

### Generating Cluster Secrets
### Storage

On initial deployment ONLY, run
```console
./generate-secrets.sh [<target-namespace>]
```
This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster".

Be sure to take note of the Open OnDemand credentials; you will need them to access the cluster through a browser.

### Connecting RWX Volume

A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart in order to provide a RWX capable Storage Class for the required shared volume. If the target Kubernetes cluster has an existing storage class which should be used instead, then `storageClass` in `values.yaml` should be set to the name of this existing class and the RookNFS dependency should be disabled by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured by setting the value of `storage.capacity`.
A ReadWriteMany (RWX) volume is required to provide shared storage across the Slurm pods. By default, the RookNFS Helm chart is installed as a dependency of the Slurm cluster chart to provide this capability. If the target Kubernetes cluster has an existing storage class which should be used instead, set `storageClass` in `values.yaml` to the name of this existing class and disable the RookNFS dependency by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured by setting the value of `storage.capacity`.

See the separate RookNFS chart [values.yaml](./rooknfs/values.yaml) for further configuration options when using the RookNFS chart to provide the shared storage volume.
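
As a rough sketch (not part of the documented workflow), an existing RWX-capable storage class could also be selected with Helm `--set` overrides at install time. The key paths shown (`storageClass`, `rooknfs.enabled`, `storage.capacity`) should be checked against `slurm-cluster-chart/values.yaml`, and `managed-nfs` and `100Gi` are placeholder values:

```console
# Sketch: use an existing RWX-capable storage class instead of the RookNFS dependency.
helm install <deployment-name> slurm-cluster-chart \
  --set storageClass=managed-nfs \
  --set rooknfs.enabled=false \
  --set storage.capacity=100Gi
```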

### Supplying Public Keys
### Public Keys

To access the cluster via `ssh`, you will need to make your public keys available. All of the public keys on your localhost can be added by running

```console
./publish-keys.sh [<target-namespace>]
```
where `<target-namespace>` is the namespace in which the Slurm cluster chart will be deployed (i.e. using `helm install -n <target-namespace> ...`). This will create a Kubernetes Secret in the appropriate namespace for the Slurm cluster to use. Omitting the namespace arg will install the secrets in the default namespace.
where `<target-namespace>` is the namespace in which the Slurm cluster chart will be deployed. This will create a Kubernetes Secret in the appropriate namespace for the Slurm cluster to use. Omitting the namespace argument will create the Secret in the default namespace.

### Deploying with Helm
Alternatively, public keys can be defined in `slurm-cluster-chart/values.yaml:sshPublicKey`.
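
As an illustrative alternative to `publish-keys.sh`, a single key could be supplied at install time. The value name `sshPublicKey` comes from `values.yaml`, while the key path below is only a placeholder:

```console
# Sketch: pass one public key as a Helm value override.
helm install <deployment-name> slurm-cluster-chart \
  --set sshPublicKey="$(cat ~/.ssh/id_ed25519.pub)"
```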

After configuring `kubectl` with the appropriate `kubeconfig` file, deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
```
### Installation with Helm

NOTE: If using the RookNFS dependency, then the following must be run before installing the Slurm cluster chart
```console
helm dependency update slurm-cluster-chart
```
- Configure `kubectl` with the appropriate `kubeconfig` file.

Subsequent releases can be deployed using:
- If necessary, adjust the configuration in `slurm-cluster-chart/values.yaml`, e.g. `openOnDemand.password`.

```console
helm upgrade <deployment-name> slurm-cluster-chart
```
- If using the RookNFS dependency, run the following before installing the Slurm cluster chart:
```console
helm dependency update slurm-cluster-chart
```

Note: When updating the cluster with `helm upgrade`, a pre-upgrade hook will prevent upgrades if there are running jobs in the Slurm queue. Attempting to upgrade will set all Slurm nodes to `DRAINED` state. If an upgrade fails due to running jobs, you can undrain the nodes either by waiting for running jobs to complete and then retrying the upgrade or by manually undraining them by accessing the cluster as a privileged user. Alternatively you can bypass the hook by running `helm upgrade` with the `--no-hooks` flag (may result in running jobs being lost)
- Deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
```
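
Putting the steps above together, a typical first install might look like the sketch below; the release name, namespace, and password are placeholders, the namespace is assumed to already exist, and the `openOnDemand.password` key is taken from the configuration step above:

```console
# Sketch: update dependencies, then install into a dedicated namespace
# with an explicit Open OnDemand password.
helm dependency update slurm-cluster-chart
helm install <deployment-name> slurm-cluster-chart \
  -n <target-namespace> \
  --set openOnDemand.password=<password>
```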

## Accessing the Cluster

@@ -180,4 +168,16 @@ and then restart the other dependent deployments to propagate changes:
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```
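
After restarting, the rollouts can be watched until the pods are ready again. This is standard `kubectl` usage rather than anything chart-specific, and the resource names simply mirror the restart command above:

```console
# Wait for each restarted workload to become ready again.
# If slurmd is deployed as a StatefulSet in your chart version,
# use "statefulset/slurmd" instead of "deployment/slurmd".
kubectl rollout status deployment/slurmctld
kubectl rollout status deployment/slurmdbd
kubectl rollout status deployment/login
kubectl rollout status deployment/slurmd
```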

# Known Issues
## Upgrading the Cluster

Subsequent Helm releases can be deployed using:
```console
helm upgrade <deployment-name> slurm-cluster-chart
```

A pre-upgrade hook will prevent upgrades if there are running jobs in the Slurm queue. Attempting an upgrade sets all Slurm nodes to the `DRAINED` state. If an upgrade fails because jobs are still running, either wait for those jobs to complete and retry the upgrade, or undrain the nodes manually by accessing the cluster as a privileged user. Alternatively, you can bypass the hook by running `helm upgrade` with the `--no-hooks` flag (this may result in running jobs being lost).
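
If you choose to undrain manually, a sketch of the usual Slurm commands is shown below; run them as a privileged user inside the cluster, and replace the node name with one reported by `sinfo`:

```console
# Check queue and node states, then return a drained node to service.
squeue                                            # confirm no jobs are still running
sinfo                                             # list nodes and their states
scontrol update NodeName=slurmd-0 State=RESUME    # "slurmd-0" is a placeholder node name
```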

# Known Issues and Limitations
- Only a single user (`rocky`) is supported.
- All nodes are in a single partition, `all`.
- Scaling down the `slurmd` StatefulSet will not remove nodes from Slurm; they will eventually be marked `DOWN`. However, they will return to `IDLE` if the StatefulSet is scaled back up (see the sketch below).
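
For reference, scaling the StatefulSet and observing the effect on Slurm node state might look like the following sketch; the replica counts are arbitrary and `sinfo` must be run from inside the cluster (e.g. via the login node):

```console
# Scale the slurmd StatefulSet down and back up, then watch node states.
kubectl scale statefulset slurmd --replicas=2
kubectl scale statefulset slurmd --replicas=4
# From a shell inside the cluster: nodes should return to IDLE once they re-register.
sinfo
```
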
2 changes: 1 addition & 1 deletion image/Dockerfile
@@ -6,7 +6,7 @@ LABEL org.opencontainers.image.source="https://github.com/stackhpc/slurm-docker-
org.label-schema.docker.cmd="docker-compose up -d" \
maintainer="StackHPC"

ARG SLURM_TAG=slurm-23.02
ARG SLURM_TAG=slurm-23-02-5-1
ARG GOSU_VERSION=1.11

COPY kubernetes.repo /etc/yum.repos.d/kubernetes.repo
4 changes: 2 additions & 2 deletions slurm-cluster-chart/files/slurm.conf
@@ -47,12 +47,12 @@ AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
#
SlurmctldParameters=cloud_dns,cloud_reg_addrs
SlurmctldParameters=cloud_reg_addrs
TreeWidth=65533
CommunicationParameters=NoAddrCache

# NODES
MaxNodeCount=10
NodeName=slurmd-[0-9] State=FUTURE CPUs=4

# PARTITIONS
PartitionName=all Default=yes Nodes=ALL
2 changes: 1 addition & 1 deletion slurm-cluster-chart/templates/slurmd.yaml
@@ -23,7 +23,7 @@ spec:
containers:
- args:
- slurmd
- -F
- -Z
- -vvv
- -N
- "$(POD_NAME)"
2 changes: 1 addition & 1 deletion slurm-cluster-chart/values.yaml
@@ -1,4 +1,4 @@
slurmImage: ghcr.io/stackhpc/slurm-docker-cluster:1f51003
slurmImage: ghcr.io/stackhpc/slurm-k8s-cluster:3c456bd

login:
# Deployment resource name