
Commit 4a48f12

init: hello world flux tutorials!
This is the start of the Dinosaur Tutorial series to teach about Flux Framework. The first tutorial added shows bringing up Flux under Slurm.

Signed-off-by: vsoch <[email protected]>

9 files changed, +419 −0 lines changed

COPYRIGHT

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
Intellectual Property Notice
----------------------------

HPCIC DevTools is licensed under the MIT license (LICENSE).

Copyrights and patents in this project are retained by
contributors. No copyright assignment is required to contribute to
HPCIC DevTools.

SPDX usage
----------

Individual files contain SPDX tags instead of the full license text.
This enables machine processing of license information based on the SPDX
License Identifiers that are available here: https://spdx.org/licenses/

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022-2023 LLNS, LLC and other HPCIC DevTools Developers.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

NOTICE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
This work was produced under the auspices of the U.S. Department of
Energy by Lawrence Livermore National Laboratory under Contract
DE-AC52-07NA27344.

This work was prepared as an account of work sponsored by an agency of
the United States Government. Neither the United States Government nor
Lawrence Livermore National Security, LLC, nor any of their employees
makes any warranty, expressed or implied, or assumes any legal liability
or responsibility for the accuracy, completeness, or usefulness of any
information, apparatus, product, or process disclosed, or represents that
its use would not infringe privately owned rights.

Reference herein to any specific commercial product, process, or service
by trade name, trademark, manufacturer, or otherwise does not necessarily
constitute or imply its endorsement, recommendation, or favoring by the
United States Government or Lawrence Livermore National Security, LLC.

The views and opinions of authors expressed herein do not necessarily
state or reflect those of the United States Government or Lawrence
Livermore National Security, LLC, and shall not be used for advertising
or product endorsement purposes.

README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Flux Tutorials

> A Dinosaur Tutorial Series!

## Tutorials

- [flux-in-slurm](tutorial/flux-in-slurm): Bring up a Flux instance (in user space) in a Slurm allocation, with both running in Kubernetes ([video](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ))
- [HPCIC Tutorial 2024](https://youtu.be/Dt4CSZWSEJE?si=b2O7lQrJixcKh-EJ)

## What is this?

This repository is a response to user feedback that you wanted different content for our Flux tutorials, including us walking through content, and comparisons with Slurm. I hear you! In response, I've decided to have a little fun and make these shorter, "bite-sized" tutorials. I'm aiming for 10-15 minutes per video, and I'll make a set of scoped topics around Flux, along with videos at your request. These will be released when I think of fun new things to share about Flux, or when you have an idea. My goal is to teach you about Flux and have fun along the way.

## What about the HPCIC Tutorial Series?

We still host our official tutorial once a year! For the official tutorial, we give a longer talk followed by hosting an autoscaling Kubernetes cluster where you get an interactive notebook. If you are looking for this official tutorial material, you can find it at [flux-framework/Tutorials](https://github.com/flux-framework/Tutorials), and the latest video is [here](https://youtu.be/Dt4CSZWSEJE?si=b2O7lQrJixcKh-EJ).

## How do I request a tutorial?

You can [open up an issue](https://github.com/converged-computing/flux-tutorials/issues), ping me on Slack (I'm on [hpc.social](https://hpc.social) and several others, usually as "v"), or ping me on Twitter or GitHub (I am vsoch). Any way you can get the message across is good! And if you want to participate in a recording with me? I'd love that! Most of these are unpracticed - I put together a quick README document with some commands to run, and then just record (and see what happens). It's easier that way.

## Tutorials Coming Soon...

- Flux on AWS
- Flux on Google Cloud
- Flux on Azure
- The Jobsetta Stone (Comparing Flux and Slurm)
- The Flux Operator
- Fluence - Using flux-sched "Fluxion" to schedule jobs to Flux
- The Ensemble Operator (Running Flux Ensembles in Kubernetes)
- ...and a few more!

## License

HPCIC DevTools is distributed under the terms of the MIT license.
All new contributions must be made under this license.

See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),
[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and
[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614

tutorial/flux-in-slurm/README.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Flux Framework in Slurm!

Here is a small tutorial to run Flux in Slurm! Lots of people have heard of Sunk (and other Slurm alternatives that have popped up), but I made a [slurm-operator](https://github.com/converged-computing/slurm-operator) about a year before all of that. We are going to use it here to bring up a Slurm cluster. You'll need to install the JobSet API, which eventually will be added to Kubernetes proper (but is not yet!).

- [Video on YouTube](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ)

## 1. Create a Cluster

```bash
kind create cluster --config ./kind-config.yaml
```
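
If the cluster came up, you should see one control plane and the four workers defined in kind-config.yaml. This sanity check is my addition rather than part of the original steps:

```bash
# Should list kind-control-plane plus four kind-worker nodes, all Ready
kubectl get nodes
```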

## 2. Install the Slurm Operator

Install JobSet:

```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.7.0/manifests.yaml
```
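
If you want to confirm the JobSet controller is running before continuing (assuming the release manifests' default jobset-system namespace), you can check its pods:

```bash
# The jobset-controller-manager pod should be Running
kubectl get pods -n jobset-system
```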

Install the Slurm operator:

```bash
kubectl apply -f https://raw.githubusercontent.com/converged-computing/slurm-operator/refs/heads/main/examples/dist/slurm-operator.yaml
```

See logs for the operator:

```bash
kubectl logs -n slurm-operator-system slurm-operator-controller-manager-6f6945579-9pknp
```
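
Note that the pod's hash suffix will differ on your cluster. A convenient alternative that avoids copying the exact pod name (assuming the deployment shares the pod's name prefix, the usual kubebuilder layout) is to follow the deployment's logs:

```bash
# kubectl resolves deploy/<name> to one of its pods for you
kubectl logs -n slurm-operator-system deploy/slurm-operator-controller-manager -f
```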

## 3. Create a Slurm Cluster in Kubernetes

Wait until you see the operator running, then create our demo cluster:

```bash
kubectl apply -f ./slurm.yaml
```

Wait until all of the containers are running:

```bash
kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
slurm-sample-d-0-0-45trk    1/1     Running   0          4m27s   # this is the daemon (slurmdbd)
slurm-sample-db-0-0-6jqkz   1/1     Running   0          4m27s   # this is the MariaDB database
slurm-sample-s-0-0-xj5zr    1/1     Running   0          4m27s   # this is the login node (runs slurmctld)
slurm-sample-w-0-0-8xtvw    1/1     Running   0          4m27s   # this is worker 0
slurm-sample-w-0-1-f25rp    1/1     Running   0          4m27s   # this is worker 1
```

You'll first want to see the database connect successfully:

```bash
kubectl logs slurm-sample-d-0-0-45trk -f
```
```console
slurmdbd: debug2: StorageType = accounting_storage/mysql
slurmdbd: debug2: StorageUser = slurm
slurmdbd: debug2: TCPTimeout = 2
slurmdbd: debug2: TrackWCKey = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: slurmdbd version 21.08.6 started
slurmdbd: debug2: running rollup at Fri Jun 09 04:14:37 2023
slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Everything rolled up
slurmdbd: debug: REQUEST_PERSIST_INIT: CLUSTER:linux VERSION:9472 UID:0 IP:10.244.0.152 CONN:7
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: Attempting to connect to slurm-sample-db-0-0.slurm-svc.slurm-operator.svc.cluster.local:3306
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:0
slurmdbd: debug2: DBD_GET_CLUSTERS: called in CONN 7
slurmdbd: debug2: DBD_ADD_CLUSTERS: called in CONN 7
slurmdbd: dropping key time_start_end from table "linux_step_table"
slurmdbd: debug2: DBD_FINI: CLOSE:0 COMMIT:1
slurmdbd: debug2: DBD_FINI: CLOSE:1 COMMIT:0
```

And then watch the login node, which starts the controller, registers the cluster, and starts it again. That would normally happen via a node reboot, but we instead run it in a loop (and it seems to work).

```bash
kubectl logs -n slurm-operator slurm-sample-s-0-0-xj5zr -f
```
```bash
Hello, I am a server with slurm-sample-s-0-0.slurm-svc.slurm-operator.svc.cluster.local
Sleeping waiting for database...
---> Starting the MUNGE Authentication service (munged) ...
---> Sleeping for slurmdbd to become active before starting slurmctld ...
---> Starting the Slurm Controller Daemon (slurmctld) ...
Adding Cluster(s)
Name = linux
slurmctld: debug: slurmctld log levels: stderr=debug2 logfile=debug2 syslog=quiet
slurmctld: debug: Log file re-opened
...
```
You'll see a lot of output stream to this log when it's finally running.

```console
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_PING
slurmctld: debug2: Tree head got back 0 looking for 2
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: node_did_resp slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug2: node_did_resp slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: sched/backfill: _attempt_backfill: beginning
slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
```

Once you've verified the controller is running, you can shell into the login node and run sinfo or try a job:

```bash
kubectl exec -it slurm-sample-s-0-0-xj5zr -- bash
```
```bash
sinfo
```
```console
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle slurm-sample-w-0-0.slurm-svc.slurm-operator.svc.cluster.local,slurm-sample-w-0-1.slurm-svc.slurm-operator.svc.cluster.local
```
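
Before moving on to Flux, a quick Slurm smoke test (standard commands, my addition rather than part of the original walkthrough):

```bash
# Run hostname on both workers to confirm jobs actually schedule
srun -N2 hostname
```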

## 4. Deploy Flux!

We probably don't need to ask for an exclusive allocation (we own this entire cluster), but let's pretend we don't. We could do this:

```bash
salloc -N2 --exclusive
srun -v --mpi=pmi2 -N2 --pty /opt/conda/bin/flux -v start
```

And then boom, you're in Flux.
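
Once you're inside the Flux instance, a few things you might try (a sketch assuming a reasonably recent flux-core, where flux run replaces the older flux mini run):

```bash
# Both Slurm nodes should show up as Flux brokers
flux resource list

# Run the same two-node test, now through Flux
flux run -N2 hostname
```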
tutorial/flux-in-slurm/kind-config.yaml

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 8080
    hostPort: 8080
    protocol: TCP
  - containerPort: 4242
    hostPort: 4242
    protocol: TCP
  - containerPort: 4243
    hostPort: 4243
    protocol: TCP
- role: worker
- role: worker
- role: worker
- role: worker
tutorial/flux-in-slurm/Dockerfile

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
FROM ghcr.io/converged-computing/slurm

# docker build -t ghcr.io/converged-computing/slurm-operator:with-flux .

# This is the standard slurm operator base, but we add flux from conda-forge
RUN yum update -y && yum install -y bzip2 curl iproute munge && \
    curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba && yum clean all

# COPY ./env.yaml /tmp/env.yaml
# RUN
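
The last two lines are left commented out in this commit. One hypothetical way to finish the image (my assumption, not part of the commit) would be to have micromamba create an environment at /opt/conda with flux-core from conda-forge, which matches the /opt/conda/bin/flux path the tutorial uses:

```bash
# Hypothetical completion: install flux-core from conda-forge into /opt/conda,
# so /opt/conda/bin/flux exists as the tutorial expects (run as a Dockerfile RUN step)
./bin/micromamba create -y -p /opt/conda -c conda-forge flux-core
```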
