
Commit d2ab423

feat: graph working with tests
I need to test on corona (to use nvidia/rocm-smi for gpu) and then to add the graphic support.

Signed-off-by: vsoch <[email protected]>
1 parent: 8468a4b

27 files changed: 2887 additions & 163 deletions


README.md

Lines changed: 37 additions & 2 deletions
@@ -6,6 +6,29 @@
![img/fluxbind.png](img/fluxbind-small.png)

## How does this work?

OK I think I know what we might do. The top-level description of a job's resources is the jobspec ("job specification"), which might look like this:

```yaml
resources:
- type: slot
  count: 4
  with:
  - type: core
    count: 8
```

Flux run / submit is going to use flux-sched (or a scheduler) to assign jobs to nodes, go into the flux exec shell, and execute some number of tasks per node. Each of those tasks is what is going to hit and then execute our bash script, with a view of the entire node, and will need to run `fluxbind shape` to derive the binding for the task. We can technically derive a shapefile from a jobspec. It is the same, but it only needs to describe the shape of one slot, for which the task that receives it is responsible for some N. So a shapefile that describes the shape of a slot looks like this:

```yaml
resources:
- type: core
  count: 8
```

And that is to say: "On this node, we are breaking resources down into this slot shape." We calculate the number of slots that the task is handling based on `FLUX_` environment variables. For now this assumes exclusive resources per node, so if we are slightly off it's not a huge deal, but in the future (given a slice of a node for a slot) we will need to be right, because we might see an entire node with hwloc but already be in a cpuset. Right now I'm also assuming the `fluxbind run` request matches the topology of the shapefile. If you do something that doesn't match, it probably can't be satisfied and you'll get an error, but that's not guaranteed.
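To make the bookkeeping above concrete, here is a minimal sketch (not part of this commit, and the helper names are hypothetical rather than fluxbind's API) of deriving the slot shape from a jobspec and estimating this task's share of slots, assuming exclusive nodes and Flux's per-task environment variables:

```python
# Hypothetical sketch, not fluxbind's implementation: derive the slot shape
# from a jobspec and estimate how many slots this one task should handle.
import os

# The jobspec from the example above, as plain Python data.
jobspec = {
    "resources": [
        {"type": "slot", "count": 4, "with": [{"type": "core", "count": 8}]}
    ]
}


def slot_shape(spec):
    """Return (shapefile resources, total slot count) for a jobspec."""
    for resource in spec["resources"]:
        if resource["type"] == "slot":
            return {"resources": resource["with"]}, resource.get("count", 1)
    # No explicit slot: treat the whole resource list as one slot's shape.
    return {"resources": spec["resources"]}, 1


def slots_for_this_task(total_slots):
    """Rough share of slots per task, assuming exclusive nodes.

    FLUX_JOB_NNODES and FLUX_JOB_SIZE are set by the Flux shell for each
    task; the defaults below are only for running outside of Flux.
    """
    nnodes = int(os.environ.get("FLUX_JOB_NNODES", 1))
    ntasks = int(os.environ.get("FLUX_JOB_SIZE", 1))
    tasks_per_node = max(ntasks // nnodes, 1)
    slots_per_node = max(total_slots // nnodes, 1)
    return max(slots_per_node // tasks_per_node, 1)


shape, total_slots = slot_shape(jobspec)
print("slot shape:", shape)  # {'resources': [{'type': 'core', 'count': 8}]}
print("slots handled by this task:", slots_for_this_task(total_slots))
```

Which particular slot indices the task then owns would follow from something like its local rank (`FLUX_TASK_LOCAL_ID`), and as noted above, being slightly off is tolerable as long as resources per node are exclusive.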
## Run

Use fluxbind to run a job binding to specific cores. For flux, this means we require exclusive nodes, and then for each node we customize the binding exactly as we want it. We do this via a shape file.
@@ -14,11 +37,15 @@ Use fluxbind to run a job binding to specific cores. For flux, this means we req
### Basic Examples

```bash
-# Start with a first match policy
+# Start with a first match policy (I think this just works one node)
flux start --config ./examples/config/match-first.toml

+# This works >1 node
+flux alloc --conf ./examples/config/match-first.toml

# 1. Bind each task to a unique physical core, starting from core:0 (common case)
-fluxbind run -n 8 --quiet --shape ./examples/shape/1node/packed-cores-shapefile.yaml sleep 1
+# STOPPED HERE - get this working with run! I don't think --graph is being passed.
+fluxbind run -n 8 --quiet --shape ./examples/shape-graph/single-node/simple_cores/shape.yaml --graph sleep 1

# 2. Reverse it!
fluxbind run -n 8 --quiet --shape ./examples/shape/1node/packed-cores-reversed-shapefile.yaml sleep 1
@@ -61,6 +88,14 @@ fluxbind run -N 1 --tasks-per-core 2 --shape ./examples/shape/kripke/packed-pus-
fluxbind run -N 1 -n 2 --env OMP_NUM_THREADS=4 --env OMP_PLACES=cores --shape ./examples/shape/kripke/hybrid-l3-shapefile.yaml kripke --zones 16,16,16 --niter 500 --procs 2,1,1 --layout GZD
```

## Shape

The run command generates a shape, and we can test the shape command without it by providing a shapefile (basically, a modified jobspec with a pattern and optional affinity). Currently, the basic shape works as expected, but we are still working on the `--graph` implementation (which takes a Flux jobspec and builds it into a graph of nodes).

```bash
fluxbind shape --file examples/shape-graph/basic/4-cores-anywhere/shape.yaml --rank 0 --node-id 1 --local-rank 0 --gpus-per-task 0 --graph --global-ranks $(nproc) --nodes 1
```
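As a rough mental model for "builds it into a graph of nodes" (again, not from this commit; the structure below is hypothetical and not fluxbind's internal representation), nested jobspec resources can be expanded into vertices connected by containment edges:

```python
# Hypothetical sketch of expanding jobspec resources into a containment graph.
# This is a mental model only, not how fluxbind's --graph mode is implemented.
def expand(resources, parent=None, nodes=None, edges=None):
    """Turn nested jobspec resources into (vertices, containment edges)."""
    nodes = [] if nodes is None else nodes
    edges = [] if edges is None else edges
    for resource in resources:
        for index in range(resource.get("count", 1)):
            node_id = len(nodes)
            nodes.append({"id": node_id, "type": resource["type"],
                          "name": f"{resource['type']}{index}"})
            if parent is not None:
                edges.append((parent, node_id))  # parent contains child
            expand(resource.get("with", []), node_id, nodes, edges)
    return nodes, edges


shape = {"resources": [{"type": "socket", "count": 1,
                        "with": [{"type": "core", "count": 8}]}]}
nodes, edges = expand(shape["resources"])
print(len(nodes), "vertices,", len(edges), "edges")  # 9 vertices, 8 edges
```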

## Predict

examples/shape-graph/affinity/cores-near-gpu/shape.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

examples/shape-graph/affinity/cores-near-nic/shape.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

examples/shape-graph/affinity/gpu-and-nic-in-same-numa/shape.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

examples/shape-graph/basic/4-cores-anywhere/shape.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/shape-graph/basic/8-cores-in-one-socket/shape.yaml

Lines changed: 0 additions & 9 deletions
This file was deleted.

examples/shape-graph/complex/gpu-compute-and-nic-comm/shape.yaml

Lines changed: 0 additions & 28 deletions
This file was deleted.

examples/shape-graph/multi-rank/4-ranks-16-cores/shape.yaml

Lines changed: 0 additions & 11 deletions
This file was deleted.

examples/shape-graph/multi-rank/4-ranks-4-gpus-near-nic/shape.yaml

Lines changed: 0 additions & 16 deletions
This file was deleted.

examples/shape-graph/patterns/local-cores/shape.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

0 commit comments
