# Flux Framework in Slurm!

Here is a small tutorial to run Flux in Slurm! Lots of people have heard of Sunk (and other Slurm alternatives that have popped up), but I made a [slurm-operator](https://github.com/converged-computing/slurm-operator) about a year before all of that. We are going to use it here to bring up a Slurm cluster. You'll need to install the JobSet API, which eventually will be added to Kubernetes proper (but is not yet!).

 - [Video on YouTube](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ)

## 1. Create a Cluster

```bash
kind create cluster --config ./kind-config.yaml
```
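
If you don't already have a `kind-config.yaml`, here is a minimal sketch of one (an assumption for illustration, not necessarily the exact file from this repository) that creates a control plane plus two workers, so the Slurm pods spread over more than one node:

```bash
# Hypothetical kind-config.yaml: one control-plane node and two workers.
# Adjust the number of workers to match the cluster size you want.
cat <<EOF > ./kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF
```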

## 2. Install the Slurm Operator

Install JobSet:

```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml
```
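
Before moving on, you can check that the JobSet controller is running. The release manifests install it into the `jobset-system` namespace (double-check with `kubectl get namespaces` if yours differs):

```bash
# The JobSet controller should show a Running pod here
kubectl get pods -n jobset-system
```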

Install the Slurm operator:

```bash
kubectl apply -f https://raw.githubusercontent.com/converged-computing/slurm-operator/refs/heads/main/examples/dist/slurm-operator.yaml
```
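
Then wait for the operator deployment to finish rolling out. The deployment name below is inferred from the controller pod name in the next step (the usual kubebuilder convention), so verify it with `kubectl get deploy -n slurm-operator-system` if the command complains:

```bash
# Wait until the operator's controller-manager deployment is available
# (deployment name inferred from the pod name shown below)
kubectl rollout status deployment/slurm-operator-controller-manager -n slurm-operator-system
```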

See the logs for the operator (the pod name suffix will differ in your cluster):

```bash
kubectl logs -n slurm-operator-system slurm-operator-controller-manager-6f6945579-9pknp
```

## 3. Create a Slurm Cluster in Kubernetes

Once you see the operator running, create our demo cluster:

```bash
kubectl apply -f ./slurm.yaml
```

Wait until all of the containers are running:

```bash
kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
slurm-sample-d-0-0-45trk    1/1     Running   0          4m27s  # this is the database daemon (slurmdbd)
slurm-sample-db-0-0-6jqkz   1/1     Running   0          4m27s  # this is the MariaDB database
slurm-sample-s-0-0-xj5zr    1/1     Running   0          4m27s  # this is the login node (slurmctld)
slurm-sample-w-0-0-8xtvw    1/1     Running   0          4m27s  # this is worker 0
slurm-sample-w-0-1-f25rp    1/1     Running   0          4m27s  # this is worker 1
```

You'll first want to see the database connect successfully:

```bash
kubectl logs slurm-sample-d-0-0-45trk -f
```
```console
slurmdbd: debug2: StorageType = accounting_storage/mysql
slurmdbd: debug2: StorageUser = slurm
slurmdbd: debug2: TCPTimeout = 2
slurmdbd: debug2: TrackWCKey = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: slurmdbd version 21.08.6 started
slurmdbd: debug2: running rollup at Fri Jun 09 04:14:37 2023
slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Everything rolled up
slurmdbd: debug: REQUEST_PERSIST_INIT: CLUSTER:linux VERSION:9472 UID:0 IP:10.244.0.152 CONN:7
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:0
slurmdbd: debug2: DBD_GET_CLUSTERS: called in CONN 7
slurmdbd: debug2: DBD_ADD_CLUSTERS: called in CONN 7
slurmdbd: dropping key time_start_end from table "linux_step_table"
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:1
slurmdbd: debug2: DBD_FINI: CLOSE:1 COMMIT:0
```

And then watch the login node, which starts the controller, registers the cluster, and starts it again. Normally this would happen via a node reboot, but here we run it in a loop instead (and it seems to work).

```bash
kubectl logs slurm-sample-s-0-0-xj5zr -f
```
```console
Hello, I am a server with slurm-sample-s-0-0.slurm-svc.slurm-operator.svc.cluster.local
Sleeping waiting for database...
---> Starting the MUNGE Authentication service (munged) ...
---> Sleeping for slurmdbd to become active before starting slurmctld ...
---> Starting the Slurm Controller Daemon (slurmctld) ...
 Adding Cluster(s)
  Name = linux
slurmctld: debug: slurmctld log levels: stderr=debug2 logfile=debug2 syslog=quiet
slurmctld: debug: Log file re-opened
...
```

You'll see a lot of output stream to this log when it's finally running.

```console
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_PING
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug2: node_did_resp slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
```

Once you've verified the controller is running, you can shell into the login (control) node and run `sinfo` or try a job:

```bash
kubectl exec -it slurm-sample-s-0-0-xj5zr -- bash
```
```bash
sinfo
```
```console
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local,slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
```
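
For a quick smoke test from inside the login node, standard Slurm commands work as usual:

```bash
# Run hostname across both workers to confirm jobs get scheduled
srun -N2 hostname
```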

## 4. Deploy Flux!

We probably don't need to ask for an exclusive allocation (we own this entire cluster), but let's pretend we don't. We could do this:

```bash
salloc -N2 --exclusive
srun -v --mpi=pmi2 -N2 --pty /opt/conda/bin/flux -v start
```

And then boom, you're in Flux.
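
Once you're inside the Flux instance, a couple of generic flux-core commands (on older Flux versions `flux run` is spelled `flux mini run`) let you confirm both nodes joined and run a test job:

```bash
# Inside the Flux instance started by srun above
flux resource list      # show the nodes/cores that Flux discovered
flux run -N2 hostname   # run a quick job across both brokers
```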