# Flux Framework in Slurm!

Here is a small tutorial to run Flux in Slurm! Lots of people have heard of Sunk (and other Slurm alternatives that have popped up), but I made a [slurm-operator](https://github.com/converged-computing/slurm-operator) about a year before all of that. We are going to use it here to bring up a Slurm cluster. You'll need to install the JobSet API, which eventually will be added to Kubernetes proper (but is not yet!).

 - [Video on YouTube](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ)

## 1. Create a Cluster

```bash
kind create cluster --config ./kind-config.yaml
```
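
If you don't already have a `kind-config.yaml`, here is a minimal sketch of one (an assumption for illustration, not necessarily the exact file from this repository) that creates a control plane plus two workers, so the Slurm pods spread over more than one node:

```bash
# Hypothetical kind-config.yaml: one control-plane node and two workers.
# Adjust the number of workers to match the cluster size you want.
cat <<EOF > ./kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF
```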

## 2. Install the Slurm Operator

Install JobSet:

```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml
```
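
Before moving on, you can check that the JobSet controller is running. The release manifests install it into the `jobset-system` namespace (double-check with `kubectl get namespaces` if yours differs):

```bash
# The JobSet controller should show a Running pod here
kubectl get pods -n jobset-system
```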

Install the Slurm operator:

```bash
kubectl apply -f https://raw.githubusercontent.com/converged-computing/slurm-operator/refs/heads/main/examples/dist/slurm-operator.yaml
```
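
Then wait for the operator deployment to finish rolling out. The deployment name below is inferred from the controller pod name in the next step (the usual kubebuilder convention), so verify it with `kubectl get deploy -n slurm-operator-system` if the command complains:

```bash
# Wait until the operator's controller-manager deployment is available
# (deployment name inferred from the pod name shown below)
kubectl rollout status deployment/slurm-operator-controller-manager -n slurm-operator-system
```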

See the logs for the operator (the pod name suffix will differ in your cluster):

```bash
kubectl logs -n slurm-operator-system slurm-operator-controller-manager-6f6945579-9pknp
```

## 3. Create a Slurm Cluster in Kubernetes

Once you see the operator running, create our demo cluster:

```bash
kubectl apply -f ./slurm.yaml
```

Wait until all of the containers are running:

```bash
kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
slurm-sample-d-0-0-45trk    1/1     Running   0          4m27s  # this is the database daemon (slurmdbd)
slurm-sample-db-0-0-6jqkz   1/1     Running   0          4m27s  # this is the MariaDB database
slurm-sample-s-0-0-xj5zr    1/1     Running   0          4m27s  # this is the login node (slurmctld)
slurm-sample-w-0-0-8xtvw    1/1     Running   0          4m27s  # this is worker 0
slurm-sample-w-0-1-f25rp    1/1     Running   0          4m27s  # this is worker 1
```

You'll first want to see the database connect successfully:

```bash
kubectl logs slurm-sample-d-0-0-45trk -f
```
```console
slurmdbd: debug2: StorageType = accounting_storage/mysql
slurmdbd: debug2: StorageUser = slurm
slurmdbd: debug2: TCPTimeout = 2
slurmdbd: debug2: TrackWCKey = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: slurmdbd version 21.08.6 started
slurmdbd: debug2: running rollup at Fri Jun 09 04:14:37 2023
slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Everything rolled up
slurmdbd: debug: REQUEST_PERSIST_INIT: CLUSTER:linux VERSION:9472 UID:0 IP:10.244.0.152 CONN:7
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:0
slurmdbd: debug2: DBD_GET_CLUSTERS: called in CONN 7
slurmdbd: debug2: DBD_ADD_CLUSTERS: called in CONN 7
slurmdbd: dropping key time_start_end from table "linux_step_table"
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:1
slurmdbd: debug2: DBD_FINI: CLOSE:1 COMMIT:0
```

And then watch the login node, which starts the controller, registers the cluster, and starts it again. Normally this would happen via a node reboot, but here we run it in a loop instead (and it seems to work).

```bash
kubectl logs slurm-sample-s-0-0-xj5zr -f
```
```console
Hello, I am a server with slurm-sample-s-0-0.slurm-svc.slurm-operator.svc.cluster.local
Sleeping waiting for database...
---> Starting the MUNGE Authentication service (munged) ...
---> Sleeping for slurmdbd to become active before starting slurmctld ...
---> Starting the Slurm Controller Daemon (slurmctld) ...
 Adding Cluster(s)
  Name = linux
slurmctld: debug: slurmctld log levels: stderr=debug2 logfile=debug2 syslog=quiet
slurmctld: debug: Log file re-opened
...
```

You'll see a lot of output stream to this log when it's finally running.

```console
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_PING
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug2: node_did_resp slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
```

Once you've verified the controller is running, you can shell into the login (control) node and run `sinfo` or try a job:

```bash
kubectl exec -it slurm-sample-s-0-0-xj5zr -- bash
```
```bash
sinfo
```
```console
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local,slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
```
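
For a quick smoke test from inside the login node, standard Slurm commands work as usual:

```bash
# Run hostname across both workers to confirm jobs get scheduled
srun -N2 hostname
```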

## 4. Deploy Flux!

We probably don't need to ask for an exclusive allocation (we own this entire cluster), but let's pretend we don't. We could do this:

```bash
salloc -N2 --exclusive
srun -v --mpi=pmi2 -N2 --pty /opt/conda/bin/flux -v start
```

And then boom, you're in Flux.
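
Once you're inside the Flux instance, a couple of generic flux-core commands (on older Flux versions `flux run` is spelled `flux mini run`) let you confirm both nodes joined and run a test job:

```bash
# Inside the Flux instance started by srun above
flux resource list      # show the nodes/cores that Flux discovered
flux run -N2 hostname   # run a quick job across both brokers
```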