Commit f8dbe0e

installing flux view an on demand volume is almost working! (#68)
* installing flux view an on demand volume is almost working! Signed-off-by: vsoch <[email protected]>
1 parent 8fdddab commit f8dbe0e

14 files changed, +911 -140 lines changed
docs/_static/data/addons.json

Lines changed: 5 additions & 0 deletions
@@ -43,5 +43,10 @@
     "name": "volume-secret",
     "description": "secret volume type",
     "family": "volume"
+  },
+  {
+    "name": "workload-flux",
+    "description": "hierarchical graph-based scheduler and resource manager",
+    "family": "workload"
   }
 ]

docs/getting_started/addons.md

Lines changed: 57 additions & 0 deletions
@@ -157,6 +157,63 @@ spec:
 
 **Note that we have support for a custom application container, but haven't written any good examples yet!**
 
+## Workload
+
+### workload-flux
+
+If you need to "throw in" Flux Framework into your container to use as a scheduler, you can do that with an addon!
+
+> Yes, it's astounding. 🦩️
+
+This works by way of the same trick that we use for other addons that have a complex (and/or large) install setup. We:
+
+- Build the software into an isolated spack "copy" view
+- The software is then (generally) at some `/opt/view` and `/opt/software`
+- The flux container is added as a sidecar container to your pod for your replicated job
+- Additional setup / configuration is done here
+- We can then create an empty volume that is shared by your metric or scaled application
+- The entire tree is copied over into the empty volume
+- When the copy is done, indicated by the final touch of a file, the updated container entrypoint is run
+- This typically means we have taken your metric command, and wrapped it in a Flux submit.
+
+It's really cool because it means you can run a metric / application with Flux without needing
+to install it into your container to begin with. The one important detail is a matching of
+general operating system. The current view uses rocky, however the image is customizable
+(and we can provide other bases if/when requested). Here are the arguments you can customize
+under the metric -> options.
+
+| Name | Description | Type | Default |
+|-----|-------------|------------|------|
+| mount | Path to mount flux view in application container | string | /opt/share |
+| tasks | Number of tasks `-n` to give to flux (not provided if not set) | string | unset |
+| image | Customize the container image | string | `ghcr.io/rse-ops/spack-flux-rocky-view:tag-8` |
+| fluxUser | The flux user (currently not used, but TBA) | string | flux |
+| fluxUid | The flux user ID (currently not used, but TBA) | string | 1004 |
+| interactive | Run flux in interactive mode | string | "false" |
+| connectTimeout | How long zeroMQ should wait to retry | string | "5s" |
+| quorum | The number of brokers to require before starting the cluster | string | (total brokers or pods) |
+| debugZeroMQ | Turn on zeroMQ debugging | string | "false" |
+| logLevel | Customize the flux log level | string | "6" |
+| queuePolicy | Queue policy for flux to use | string | fcfs |
+| workerLetter | The letter that the worker job is expected to have | string | w |
+| launcherLetter | The letter that the launcher job is expected to have | string | w |
+| workerIndex | The index of the replicated job for the worker | string | 0 |
+| launcherIndex | The index of the replicated job for the launcher | string | 0 |
+| preCommand | Pre-command logic to run in launcher/workers before flux is started (after setup in flux container) | string | unset |
+
+Note that the number of pods for flux defaults to the number in your MetricSet, along
+with the namespace and service name.
+
+**Important** the flux addon is currently supported for metric types that:
+
+1. have the launcher / worker design (so the hostlist.txt is present in the PWD)
+2. Have scp installed, as the shared certificate needs to be copied from the lead broker to all followers
+3. Ideally have munge installed - we do try to install it (but better to already be there)
+
+We also currently run flux as root. This is considered bad practice, but probably OK
+for this early development work. We don't see a need to have shared namespace / operator
+environments at this point, which is why I didn't add it.
+
 ## Performance
 
 ### perf-hpctoolkit
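
Reviewer note: the "wait for the touched file, then run the wrapped entrypoint" step described in the new workload-flux docs can be sketched in Go as below. This is only an illustration under stated assumptions, not the operator's implementation (which presumably generates a shell entrypoint). The names `waitForFile` and `wrapWithFlux` are hypothetical; the only details taken from the docs are that the metric command is wrapped in a Flux submit and that `-n` is passed only when `tasks` is set.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// waitForFile (hypothetical helper) polls until path exists, returning an
// error once the timeout has elapsed. It stands in for "wait until the copy
// into the shared empty volume is finished, indicated by a touched file".
func waitForFile(path string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %s", path)
		}
		time.Sleep(interval)
	}
}

// wrapWithFlux (hypothetical helper) builds the wrapped command line.
// The -n flag is only added when tasks is non-empty, mirroring the
// "not provided if not set" default in the options table.
func wrapWithFlux(command, tasks string) string {
	parts := []string{"flux", "submit"}
	if tasks != "" {
		parts = append(parts, "-n", tasks)
	}
	parts = append(parts, command)
	return strings.Join(parts, " ")
}
```

A caller would first block on `waitForFile` with whatever marker path the flux sidecar touches (the docs do not name it, so any path used here is an assumption), and only then execute the string returned by `wrapWithFlux`.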
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+apiVersion: flux-framework.org/v1alpha2
+kind: MetricSet
+metadata:
+  labels:
+    app.kubernetes.io/name: metricset
+    app.kubernetes.io/instance: metricset-sample
+  name: metricset-sample
+spec:
+  # Number of pods for lammps (one launcher, the rest workers)
+  pods: 4
+  logging:
+    interactive: true
+
+  metrics:
+
+    # Running more scaled lammps is our main goal
+    - name: app-lammps
+
+      # This flux addon is built on rocky, and we can provide additional os bases
+      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky
+
+      options:
+        command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
+        workdir: /opt/lammps/examples/reaxff/HNS
+
+      # Add on hpctoolkit, will mount a volume and wrap lammps
+      addons:
+        - name: workload-flux
+          options:
+            # Ensure intel environment is setup
+            preCommand: . /opt/intel/mpi/latest/env/vars.sh
+            workdir: /opt/lammps/examples/reaxff/HNS

pkg/addons/addons.go

Lines changed: 5 additions & 4 deletions
@@ -26,6 +26,7 @@ var (
 	AddonFamilyPerformance = "performance"
 	AddonFamilyVolume = "volume"
 	AddonFamilyApplication = "application"
+	AddonFamilyWorkload = "workload"
 )
 
 // A general metric is a container added to a JobSet
@@ -37,7 +38,7 @@ type Addon interface {
 	Description() string
 
 	// Options and exportable attributes
-	SetOptions(*api.MetricAddon)
+	SetOptions(*api.MetricAddon, *api.MetricSet)
 	Options() map[string]intstr.IntOrString
 	ListOptions() map[string][]intstr.IntOrString
 	MapOptions() map[string]map[string]intstr.IntOrString
@@ -65,7 +66,7 @@ type AddonBase struct {
 	mapOptions map[string]map[string]intstr.IntOrString
 }
 
-func (b *AddonBase) SetOptions(metric *api.MetricAddon) {}
+func (b *AddonBase) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {}
 func (b *AddonBase) CustomizeEntrypoints([]*specs.ContainerSpec, []*jobset.ReplicatedJob) {}
 
 func (b *AddonBase) Validate() bool {
@@ -97,7 +98,7 @@ func (b *AddonBase) MapOptions() map[string]map[string]intstr.IntOrString {
 }
 
 // GetAddon looks up and validates an addon
-func GetAddon(a *api.MetricAddon) (Addon, error) {
+func GetAddon(a *api.MetricAddon, set *api.MetricSet) (Addon, error) {
 
 	// We don't want to change the addon interface/struct itself
 	template, ok := Registry[a.Name]
@@ -111,7 +112,7 @@ func GetAddon(a *api.MetricAddon) (Addon, error) {
 	addon := reflect.New(templateType.Type()).Interface().(Addon)
 
 	// Set options before validation
-	addon.SetOptions(a)
+	addon.SetOptions(a, set)
 
 	// Validate the addon
 	if !addon.Validate() {
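
Reviewer note: `GetAddon` avoids mutating the registered template by allocating a fresh value with `reflect.New`. The standalone sketch below shows that pattern with stand-in types. `Addon`, `demoAddon`, `registry`, and `newFromTemplate` here are hypothetical simplifications, and deriving `templateType` via `reflect.ValueOf(template).Elem()` is an assumption, since that line falls outside the hunks shown.

```go
package main

import (
	"fmt"
	"reflect"
)

// Addon is a reduced, hypothetical stand-in for the real interface.
type Addon interface {
	SetName(string)
	Name() string
}

type demoAddon struct{ name string }

func (d *demoAddon) SetName(n string) { d.name = n }
func (d *demoAddon) Name() string     { return d.name }

// registry holds one template per addon name; templates are never modified.
var registry = map[string]Addon{"demo": &demoAddon{name: "template"}}

// newFromTemplate returns a new zero-valued instance of the registered type.
func newFromTemplate(name string) (Addon, error) {
	template, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("%s is not a registered addon", name)
	}
	// Elem() unwraps the pointer so reflect.New allocates a fresh struct
	// of the same concrete type and returns a pointer to it.
	templateType := reflect.ValueOf(template).Elem()
	return reflect.New(templateType.Type()).Interface().(Addon), nil
}
```

Because each lookup returns a new pointer, options set on one addon instance (including anything derived from the MetricSet now passed to `SetOptions`) do not leak into the template or into other metrics.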

pkg/addons/commands.go

Lines changed: 4 additions & 4 deletions
@@ -42,9 +42,9 @@ func (a *PerfAddon) CustomizeEntrypoints(
 	}
 }
 
-func (a *PerfAddon) SetOptions(metric *api.MetricAddon) {
+func (a *PerfAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
 	a.Identifier = perfCommandsName
-	a.SetSharedCommandOptions(metric)
+	a.SetSharedCommandOptions(addon)
 }
 
 // addContainerCaps adds capabilities to a container spec
@@ -102,9 +102,9 @@ func (m CommandAddon) Family() string {
 	return AddonFamilyApplication
 }
 
-func (a *CommandAddon) SetOptions(metric *api.MetricAddon) {
+func (a *CommandAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
 	a.Identifier = commandsName
-	a.SetSharedCommandOptions(metric)
+	a.SetSharedCommandOptions(addon)
 }
 
 // Set custom options / attributes for the metric

pkg/addons/containers.go

Lines changed: 2 additions & 2 deletions
@@ -139,8 +139,8 @@ func (a *ApplicationAddon) setDefaultEntrypoint() {
 }
 
 // Calling the default allows a custom application that uses this to do the same
-func (a *ApplicationAddon) SetOptions(metric *api.MetricAddon) {
-	a.SetDefaultOptions(metric)
+func (a *ApplicationAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
+	a.SetDefaultOptions(addon)
 }
 
 // Underlying function that can be shared
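
Reviewer note: the point of threading `*api.MetricSet` through every `SetOptions` is that an addon can now derive defaults from the set, as the docs say the flux addon does for pod count, namespace, and service name. A minimal sketch of that idea follows. The flattened `MetricSet` and `MetricAddon` structs, the plain string options, and the `FluxAddon` fields are all hypothetical stand-ins (the real API types are not shown in this diff and options are `intstr.IntOrString`).

```go
package main

import "strconv"

// MetricAddon and MetricSet are simplified, hypothetical stand-ins
// for the api types referenced in the diff.
type MetricAddon struct {
	Name    string
	Options map[string]string
}

type MetricSet struct {
	Namespace   string
	ServiceName string
	Pods        int32
}

// FluxAddon is a hypothetical addon that needs values from the MetricSet.
type FluxAddon struct {
	pods        int32
	namespace   string
	serviceName string
	quorum      string
}

// SetOptions takes defaults from the MetricSet first, then lets explicit
// addon options override them.
func (f *FluxAddon) SetOptions(addon *MetricAddon, set *MetricSet) {
	f.pods = set.Pods
	f.namespace = set.Namespace
	f.serviceName = set.ServiceName

	// Quorum defaults to the total number of pods unless set by the user.
	f.quorum = strconv.Itoa(int(set.Pods))
	if q, ok := addon.Options["quorum"]; ok {
		f.quorum = q
	}
}
```

Addons that do not need the set, like the three updated in this commit, simply ignore the second argument and keep passing only the addon to their shared option helpers.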
