
Commit 85c05a9

Refactor/standalone design (#41)
* add quicksilver
* save state of refactored app/network
  This adds a shared template used for networking and app. I think I can simplify this further to have a common shared template.
* refactor to add common design jobset package
  If this works it could be really cool - we have a jobs package that defines common jobset (metric set) interfaces that can be shared between standalone apps. As an example, the launcher/worker design works for both network and traditional apps, with slight differences that are determined based on the struct that inherits the interface. I hope this works!
* add pennant
* add ensure default name to success jobs
* simplify entrypoint script logic further
  the common prefix to generate the hosts lists can be shared between applications. We can also define it to be a more general prefix field because it does not necessarily need to be mpirun
* quicksilver can use common prefix too
* also simplify storage interfaces! This is so satisfying!

Signed-off-by: vsoch <[email protected]>
1 parent 646c17e commit 85c05a9

File tree

20 files changed

+1307
-1117
lines changed

.github/workflows/main.yaml

Lines changed: 3 additions & 1 deletion
@@ -70,7 +70,9 @@ jobs:
7070
["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test
7171
["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test
7272
["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
73-
["app-kripke", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
73+
["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test
74+
["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test
75+
["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120], # standalone app test
7476
["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120]] # standalone app test
7577

7678
steps:

README.md

Lines changed: 3 additions & 1 deletion
@@ -18,9 +18,11 @@ To learn more:
1818
- For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
1919
- Look into pod affinity/anti-affinity vs. topology constraint (which do we want)?
2020
- Add assertions checking for python tests
21-
- Plotting examples needed for
21+
- Plotting examples (python parsers) needed for
2222
- io-sysstat
2323
- app-kripke
24+
- app-quicksilver
25+
- app-pennant
2426

2527
## License
2628

docs/_static/data/metrics.json

Lines changed: 14 additions & 0 deletions
@@ -20,6 +20,20 @@
2020
"image": "ghcr.io/converged-computing/metric-lammps:latest",
2121
"url": "https://www.lammps.org/"
2222
},
23+
{
24+
"name": "app-pennant",
25+
"description": "Unstructured mesh hydrodynamics for advanced architectures ",
26+
"type": "standalone",
27+
"image": "ghcr.io/converged-computing/metric-pennant:latest",
28+
"url": "https://github.com/LLNL/pennant"
29+
},
30+
{
31+
"name": "app-quicksilver",
32+
"description": "A proxy app for the Monte Carlo Transport Code",
33+
"type": "standalone",
34+
"image": "ghcr.io/converged-computing/metric-quicksilver:latest",
35+
"url": "https://github.com/LLNL/Quicksilver"
36+
},
2337
{
2438
"name": "io-fio",
2539
"description": "Flexible IO Tester (FIO)",

docs/getting_started/metrics.md

Lines changed: 164 additions & 1 deletion
@@ -7,7 +7,7 @@ The following metrics are under development (or being planned).
77
- [Application Metrics](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#application)
88
- [Standalone Metrics](https://converged-computing.github.io/metrics-operator/getting_started/metrics.html#standalone)
99

10-
<iframe src="../_static/data/table.html" style="width:100%; height:650px;" frameBorder="0"></iframe>
10+
<iframe src="../_static/data/table.html" style="width:100%; height:700px;" frameBorder="0"></iframe>
1111

1212
All metrics can be customized with the following variables
1313

@@ -290,6 +290,169 @@ More likely you want an actual problem size on a specific number of node and tas
290290
run a larger problem and the parser does not work as expected, please [send us the output](https://github.com/converged-computing/metrics-operator/issues) and we will provide an updated parser.
291291
See [this guide](https://asc.llnl.gov/sites/asc/files/2020-09/AMG_Summary_v1_7.pdf) for more detail.
292292
293+
294+
#### app-quicksilver
295+
296+
- [Standalone Metric Set](user-guide.md#application-metric-set)
297+
- *[app-quicksilver](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-quicksilver)*
298+
299+
Quicksilver is a proxy app for a Monte Carlo transport code. You can learn more about it in the [GitHub repository](https://github.com/LLNL/Quicksilver/).
300+
As with other apps, we expose the entire mpirun command, along with the working directory, for you to adjust.
301+
302+
| Name | Description | Option Key | Type | Default |
303+
|-----|-------------|------------|------|---------|
304+
| command | The qs command (without mpirun) | options->command |string | (see below) |
305+
| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) |
306+
| workdir | The working directory for the command | options->workdir | string | /opt/AMG |
307+
308+
When these options are not set, the default runs the qs (Quicksilver) binary on a sample problem, defined by an input text file:
309+
310+
```bash
311+
# mpirun
312+
mpirun --hostfile ./hostlist.txt
313+
314+
# command
315+
qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp
316+
317+
# Assembled into problem.sh as follows:
318+
mpirun --hostfile ./hostlist.txt ./problem.sh
319+
```
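One way to picture the assembly described above is the following Python sketch. It is only an illustration under stated assumptions: the function name `assemble` and the exact contents of `problem.sh` (a shebang followed by the command) are hypothetical and not taken from the operator's source. Only the `mpirun` prefix, the `qs` command, and the final launch line come from the defaults shown above.

```python
def assemble(prefix: str, command: str, script: str = "./problem.sh"):
    """Return (script_body, launch_line) for a prefix plus a command.

    Assumption (hypothetical): the command is written into a small shell
    script, and the prefix (e.g. mpirun and its arguments) launches it.
    """
    # The script body here is a guess at a minimal wrapper, not the real one
    script_body = "#!/bin/bash\n" + command.strip() + "\n"
    # The prefix is placed in front of the script path, as in the docs above
    launch_line = f"{prefix.strip()} {script}"
    return script_body, launch_line


body, launch = assemble(
    "mpirun --hostfile ./hostlist.txt",
    "qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp",
)
print(launch)
```

Because the prefix is kept separate from the command, swapping `mpirun` for another launcher only changes the first argument, which matches the commit's note that the prefix does not necessarily need to be mpirun.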
320+
321+
The container ships with many example problems. Here are their full paths:
322+
323+
```console
324+
# Example command
325+
qs /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp
326+
327+
# All examples:
328+
/opt/quicksilver/Examples/AllScattering/scatteringOnly.inp
329+
/opt/quicksilver/Examples/NoCollisions/no.collisions.inp
330+
/opt/quicksilver/Examples/NonFlatXC/NonFlatXC.inp
331+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem2/Coral2_P2_4096.inp
332+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem2/Coral2_P2.inp
333+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem2/Coral2_P2_1.inp
334+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp
335+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1_1.inp
336+
/opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1_4096.inp
337+
/opt/quicksilver/Examples/CTS2_Benchmark/CTS2.inp
338+
/opt/quicksilver/Examples/CTS2_Benchmark/CTS2_36.inp
339+
/opt/quicksilver/Examples/CTS2_Benchmark/CTS2_1.inp
340+
/opt/quicksilver/Examples/AllAbsorb/allAbsorb.inp
341+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v4_ts.inp
342+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v5_ts.inp
343+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem.inp
344+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v3_wq.inp
345+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v7_ts.inp
346+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v4_tm.inp
347+
/opt/quicksilver/Examples/Homogeneous/homogeneousProblem_v3.inp
348+
/opt/quicksilver/Examples/AllEscape/allEscape.inp
349+
/opt/quicksilver/Examples/NoFission/noFission.inp
350+
```
351+
352+
You can also look more closely in the [GitHub repository](https://github.com/LLNL/Quicksilver/tree/master/Examples).
353+
354+
#### app-pennant
355+
356+
- [Standalone Metric Set](user-guide.md#application-metric-set)
357+
- *[app-pennant](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-pennant)*
358+
359+
Pennant is an unstructured mesh hydrodynamics mini-app for advanced architectures. The documentation is sparse, but you
360+
can find the [source code on GitHub](https://github.com/llnl/pennant).
361+
As with other apps, we expose the entire mpirun prefix and command, along with the working directory, for you to adjust.
362+
363+
| Name | Description | Option Key | Type | Default |
364+
|-----|-------------|------------|------|---------|
365+
| command | The pennant command (without mpirun) | options->command |string | (see below) |
366+
| mpirun | The mpirun command (and arguments) | options->mpirun | string | (see below) |
367+
| workdir | The working directory for the command | options->workdir | string | /opt/pennant/test |
368+
369+
When these options are not set, the default runs pennant on a test problem, defined by an input text file:
370+
371+
```bash
372+
# mpirun
373+
mpirun --hostfile ./hostlist.txt
374+
375+
# command
376+
pennant /opt/pennant/test/sedovsmall/sedovsmall.pnt
377+
378+
# Assembled into problem.sh as follows:
379+
mpirun --hostfile ./hostlist.txt ./problem.sh
380+
```
381+
382+
The container ships with many input files. Here is the directory tree under `/opt/pennant/test`:
383+
384+
<details>
385+
386+
<summary>Input files available to pennant</summary>
387+
388+
```console
389+
|-- leblanc
390+
| |-- leblanc.pnt
391+
| |-- leblanc.xy.std
392+
| `-- leblanc.xy.std4
393+
|-- leblancbig
394+
| `-- leblancbig.pnt
395+
|-- leblancx16
396+
| `-- leblancx16.pnt
397+
|-- leblancx4
398+
| `-- leblancx4.pnt
399+
|-- leblancx48
400+
| `-- leblancx48.pnt
401+
|-- leblancx64
402+
| `-- leblancx64.pnt
403+
|-- noh
404+
| |-- noh.pnt
405+
| |-- noh.xy.std
406+
| `-- noh.xy.std4
407+
|-- nohpoly
408+
| `-- nohpoly.pnt
409+
|-- nohsmall
410+
| |-- nohsmall.pnt
411+
| |-- nohsmall.xy.std
412+
| `-- nohsmall.xy.std4
413+
|-- nohsquare
414+
| `-- nohsquare.pnt
415+
|-- sample_outputs
416+
| |-- edison
417+
| | |-- leblancbig.thr1.out
418+
| | |-- leblancx16.thr1024.out
419+
| | |-- leblancx4.thr16.out
420+
| | |-- leblancx64.mpi2048.out
421+
| | `-- nohpoly.thr1.out
422+
| `-- vulcan
423+
| |-- leblancx16.out
424+
| |-- leblancx48.out
425+
| |-- sedovflat.out
426+
| |-- sedovflatx16.out
427+
| |-- sedovflatx4.out
428+
| `-- sedovflatx40.out
429+
|-- sedov
430+
| |-- sedov.pnt
431+
| |-- sedov.xy.std
432+
| `-- sedov.xy.std4
433+
|-- sedovbig
434+
| `-- sedovbig.pnt
435+
|-- sedovflat
436+
| `-- sedovflat.pnt
437+
|-- sedovflatx120
438+
| `-- sedovflatx120.pnt
439+
|-- sedovflatx16
440+
| `-- sedovflatx16.pnt
441+
|-- sedovflatx4
442+
| `-- sedovflatx4.pnt
443+
|-- sedovflatx40
444+
| `-- sedovflatx40.pnt
445+
`-- sedovsmall
446+
|-- sedovsmall.pnt
447+
|-- sedovsmall.xy
448+
|-- sedovsmall.xy.std
449+
`-- sedovsmall.xy.std4
450+
```
451+
452+
</details>
453+
454+
You will likely also need to adjust the mpirun parameters.
455+
293456
#### app-kripke
294457
295458
- [Standalone Metric Set](user-guide.md#application-metric-set)
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
1+
# Pennant Example
2+
3+
This is an example of a metric app, Pennant, which is part of the [CORAL-2 benchmarks](https://asc.llnl.gov/coral-2-benchmarks).
4+
We have not yet added a Python example as we want a use case first, but can and will when it is warranted.
5+
6+
## Usage
7+
8+
Create a cluster
9+
10+
```bash
11+
kind create cluster
12+
```
13+
14+
and install JobSet to it.
15+
16+
```bash
17+
VERSION=v0.2.0
18+
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
19+
```
20+
21+
Install the operator (from the development manifest here):
22+
23+
```bash
24+
$ kubectl apply -f ../../dist/metrics-operator-dev.yaml
25+
```
26+
27+
How to see metrics operator logs:
28+
29+
```bash
30+
$ kubectl logs -n metrics-system metrics-controller-manager-859c66464c-7rpbw
31+
```
32+
33+
Then create the metrics set. This is going to run a single run of Pennant over MPI!
35+
36+
```bash
37+
kubectl apply -f metrics.yaml
38+
```
39+
40+
Wait until you see pods created by the job and then running (there should be two - a launcher and worker for Pennant):
41+
42+
```bash
43+
kubectl get pods
44+
```
45+
```console
46+
NAME READY STATUS RESTARTS AGE
47+
metricset-sample-l-0-0-lt782 1/1 Running 0 3s
48+
metricset-sample-w-0-0-4s5p9 1/1 Running 0 3s
49+
```
50+
51+
In the above, "l" is a launcher pod, and "w" is a worker pod.
52+
If you inspect the log for the launcher you'll see a short sleep (the network isn't up immediately)
53+
and then the example running, and the log is printed to the console.
54+
55+
```bash
56+
kubectl logs metricset-sample-l-0-0-lt782 -f
57+
```
58+
```console
59+
METADATA START {"pods":2,"completions":2,"metricName":"app-pennant","metricDescription":"Unstructured mesh hydrodynamics for advanced architectures ","metricType":"standalone","metricOptions":{"command":"pennant /opt/pennant/test/sedovsmall/sedovsmall.pnt","completions":0,"mpirun":"mpirun --hostfile ./hostlist.txt","rate":10,"workdir":"/opt/pennant/test"}}
60+
METADATA END
61+
Sleeping for 10 seconds waiting for network...
62+
METRICS OPERATOR COLLECTION START
63+
METRICS OPERATOR TIMEPOINT
64+
********************
65+
Running PENNANT v0.9
66+
********************
67+
68+
Running on 2 MPI PE(s)
69+
Running on 8 thread(s)
70+
--- Mesh Information ---
71+
Points: 100
72+
Zones: 81
73+
Sides: 324
74+
Edges: 189
75+
Side chunks: 21
76+
Point chunks: 8
77+
Zone chunks: 6
78+
Chunk size: 16
79+
------------------------
80+
Energy check: total energy = 2.467991e-01
81+
(internal = 2.467991e-01, kinetic = 0.000000e+00)
82+
End cycle 1, time = 2.50000e-03, dt = 2.50000e-03, wall = 1.64902e-01
83+
dt limiter: Initial timestep
84+
End cycle 10, time = 2.85593e-02, dt = 2.58849e-03, wall = 1.72612e+00
85+
dt limiter: PE 0, Hydro dV/V limit for z = 0
86+
87+
Run complete
88+
cycle = 10, cstop = 10
89+
time = 2.855932e-02, tstop = 1.000000e+00
90+
91+
************************************
92+
hydro cycle run time= 1.892289e+00
93+
************************************
94+
Energy check: total energy = 2.512181e-01
95+
(internal = 1.874053e-01, kinetic = 6.381282e-02)
96+
Writing .xy file...
97+
METRICS OPERATOR COLLECTION END
98+
```
99+
100+
The output above is structured so that our Python parsing script can easily
101+
find sections of data. Also note that the worker will only be alive long enough for the main job to
102+
finish, and once it does, the worker goes away! Here is what you'll see in its brief life:
103+
104+
```console
105+
METADATA START {"pods":2,"completions":2,"metricName":"app-pennant","metricDescription":"Unstructured mesh hydrodynamics for advanced architectures ","metricType":"standalone","metricOptions":{"command":"pennant /opt/pennant/test/sedovsmall/sedovsmall.pnt","completions":0,"mpirun":"mpirun --hostfile ./hostlist.txt","rate":10,"workdir":"/opt/pennant/test"}}
106+
METADATA END
107+
Sleeping for 10 seconds waiting for network...
108+
METRICS OPERATOR COLLECTION START
109+
```
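The section markers in the launcher log (`METADATA START`, `METRICS OPERATOR COLLECTION START`, `METRICS OPERATOR TIMEPOINT`, `METRICS OPERATOR COLLECTION END`) lend themselves to simple line-based parsing. The following is a minimal sketch and not the project's actual parser: the function names `parse_log` and `hydro_run_time` are hypothetical, and `SAMPLE_LOG` is an abbreviated copy of the launcher output shown earlier, with the metadata JSON trimmed.

```python
import json
import re

# Abbreviated sample based on the launcher log (metadata JSON trimmed)
SAMPLE_LOG = """METADATA START {"pods":2,"metricName":"app-pennant"}
METADATA END
Sleeping for 10 seconds waiting for network...
METRICS OPERATOR COLLECTION START
METRICS OPERATOR TIMEPOINT
Running on 2 MPI PE(s)
hydro cycle run time= 1.892289e+00
METRICS OPERATOR COLLECTION END
"""


def parse_log(text):
    """Split a launcher log into metadata (dict) and collected output lines."""
    metadata = None
    collected = []
    collecting = False
    for line in text.splitlines():
        if line.startswith("METADATA START"):
            # The JSON document follows the marker on the same line
            metadata = json.loads(line[len("METADATA START"):].strip())
        elif line.startswith("METRICS OPERATOR COLLECTION START"):
            collecting = True
        elif line.startswith("METRICS OPERATOR COLLECTION END"):
            collecting = False
        elif collecting and not line.startswith("METRICS OPERATOR TIMEPOINT"):
            # Keep application output, skip the timepoint separators
            collected.append(line)
    return metadata, collected


def hydro_run_time(lines):
    """Return the 'hydro cycle run time' as a float, or None if absent."""
    pattern = re.compile(r"hydro cycle run time=\s*([0-9.eE+-]+)")
    for line in lines:
        match = pattern.search(line)
        if match:
            return float(match.group(1))
    return None


metadata, lines = parse_log(SAMPLE_LOG)
print(metadata["metricName"], hydro_run_time(lines))
```

To try it against a real run, the output of `kubectl logs` for the launcher pod could be saved to a file and its contents passed to `parse_log`.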
110+
111+
We never parse the output of the worker, so it isn't important.
112+
This works because of JobSet logic that treats the entire set as done when the launcher is done.
113+
114+
```bash
115+
$ kubectl get pods
116+
```
117+
```console
118+
NAME READY STATUS RESTARTS AGE
119+
metricset-sample-l-0-0-vfz4w 0/1 Completed 0 68s
120+
```
121+
122+
When you are done, the job and jobset will be completed.
123+
124+
```bash
125+
$ kubectl get jobset
126+
```
127+
```console
128+
NAME RESTARTS COMPLETED AGE
129+
metricset-sample True 82s
130+
```
131+
```bash
132+
$ kubectl get jobs
133+
```
134+
```console
135+
NAME COMPLETIONS DURATION AGE
136+
metricset-sample-n-0 1/1 18s 84s
137+
```
138+
139+
And then you can clean up!
140+
141+
```bash
142+
kubectl delete -f metrics.yaml
143+
```
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
apiVersion: flux-framework.org/v1alpha1
2+
kind: MetricSet
3+
metadata:
4+
  labels:
5+
    app.kubernetes.io/name: metricset
6+
    app.kubernetes.io/instance: metricset-sample
7+
  name: metricset-sample
8+
spec:
9+
  # Number of indexed jobs to run pennant on
10+
  pods: 2
11+
  metrics:
12+
    # This uses the default commands
13+
    - name: app-pennant
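
The manifest above relies on the default commands. As a hedged sketch of how the documented option keys (`options->command`, `options->mpirun`, `options->workdir`) might be set, the Python below builds an equivalent manifest with a custom command. The helper name `metricset` is hypothetical, and placing `options` directly under the metric entry is an assumption inferred from the option keys in the docs, not confirmed by this diff. The input file `sedov/sedov.pnt` is one of the files listed under `/opt/pennant/test`.

```python
import json


def metricset(name, metric, pods, options=None):
    """Build a MetricSet manifest as a plain dictionary.

    Assumption (not confirmed by the diff): per-metric options live in an
    "options" mapping next to the metric name.
    """
    entry = {"name": metric}
    if options:
        entry["options"] = dict(options)
    return {
        "apiVersion": "flux-framework.org/v1alpha1",
        "kind": "MetricSet",
        "metadata": {"name": name},
        "spec": {"pods": pods, "metrics": [entry]},
    }


manifest = metricset(
    "metricset-sample",
    "app-pennant",
    2,
    {"command": "pennant /opt/pennant/test/sedov/sedov.pnt"},
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.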
