README.md: 37 additions & 2 deletions
@@ -6,6 +6,29 @@
## How does this work?

Here is the approach I think we can take. The top-level description of a job's resources is the jobspec ("job specification"), which might look like this:

```yaml
resources:
  - type: slot
    count: 4
    with:
      - type: core
        count: 8
```
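
To make the nesting concrete, here is a minimal sketch that walks the jobspec above as a plain Python dictionary and multiplies counts down the tree. It is not part of fluxbind: `count_resources` is a hypothetical helper, and it assumes every `count` is a plain integer, which is only one of the forms a Flux jobspec allows.

```python
def count_resources(resources, multiplier=1, totals=None):
    """Total each resource type, multiplying counts down the nesting."""
    totals = {} if totals is None else totals
    for entry in resources:
        # Assumption: count is a plain integer (not a min/max range).
        amount = multiplier * entry["count"]
        totals[entry["type"]] = totals.get(entry["type"], 0) + amount
        # Children under "with" are repeated once per parent instance.
        count_resources(entry.get("with", []), amount, totals)
    return totals


# The jobspec from above, written as a Python dictionary.
jobspec = {
    "resources": [
        {"type": "slot", "count": 4, "with": [{"type": "core", "count": 8}]}
    ]
}

totals = count_resources(jobspec["resources"])
print(totals)  # {'slot': 4, 'core': 32}
```
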

`flux run` / `flux submit` uses flux-sched (or another scheduler) to assign jobs to nodes, and then the flux exec shell executes some number of tasks per node. Each of those tasks executes our bash script with a view of the entire node, and needs to run `fluxbind shape` to derive the binding for the task. We can technically derive a shapefile from a jobspec. It is the same format, but it only needs to describe the shape of one slot, and the task that receives it is responsible for some number N of those slots. So a shapefile that describes the shape of a slot looks like this:

```yaml
resources:
  - type: core
    count: 8
```
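
Deriving the shapefile from the jobspec could be sketched as below. This is an illustration under assumptions, not fluxbind's implementation: `slot_shape` is a hypothetical helper, both documents are treated as plain Python dictionaries, and the first `slot` found in a depth-first walk is taken to be the one that matters.

```python
def slot_shape(jobspec):
    """Return a shapefile-like dict describing the contents of one slot."""

    def find_slot(resources):
        # Depth-first search for the first entry of type "slot".
        for entry in resources:
            if entry.get("type") == "slot":
                return entry
            found = find_slot(entry.get("with", []))
            if found is not None:
                return found
        return None

    slot = find_slot(jobspec.get("resources", []))
    if slot is None:
        raise ValueError("jobspec has no slot")
    # Keep only what is inside the slot; the slot count is dropped
    # because the shapefile describes a single slot.
    return {"resources": slot.get("with", [])}


jobspec = {
    "resources": [
        {"type": "slot", "count": 4, "with": [{"type": "core", "count": 8}]}
    ]
}

shape = slot_shape(jobspec)
print(shape)  # {'resources': [{'type': 'core', 'count': 8}]}
```
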

That is to say, "On this node, we are breaking resources down into this slot shape." We calculate the number of slots that the task is handling based on `FLUX_` environment variables. For now this assumes exclusive resources per node, so if we are slightly off it's not a huge deal. In the future (given a slice of a node for a slot) we will need to be right, because we might see an entire node with hwloc but already be in a cpuset. Right now I'm also assuming that `fluxbind run` matches the topology of the shapefile. If you request something that doesn't match, it probably can't be satisfied and will produce an error, but that is not guaranteed.

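
One way the per-task slot count might be computed is sketched below. Everything here is an assumption for illustration: the text only says that `FLUX_` environment variables are used, so the choice of `FLUX_JOB_SIZE` (total tasks) and `FLUX_JOB_NNODES` (node count), the hypothetical helper `slots_per_task`, and the even split of tasks and slots across exclusive nodes are guesses, not fluxbind's documented behavior.

```python
import os


def slots_per_task(node_cores, cores_per_slot, env=None):
    """Estimate how many slots each task on an exclusive node handles."""
    env = os.environ if env is None else env
    # Assumed variables: total tasks in the job and number of nodes.
    total_tasks = int(env["FLUX_JOB_SIZE"])
    nnodes = int(env["FLUX_JOB_NNODES"])
    # Assume tasks are spread evenly; round up for the fullest node.
    tasks_on_node = -(-total_tasks // nnodes)
    # Slots of the given shape that fit on one whole (exclusive) node.
    slots_on_node = node_cores // cores_per_slot
    if slots_on_node < tasks_on_node:
        raise ValueError("node cannot hold one slot per task")
    return slots_on_node // tasks_on_node


# Example: 2 nodes with 32 cores each, 4 tasks, slots of 8 cores.
example_env = {"FLUX_JOB_SIZE": "4", "FLUX_JOB_NNODES": "2"}
result = slots_per_task(node_cores=32, cores_per_slot=8, env=example_env)
print(result)  # 2
```
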
## Run

Use fluxbind to run a job bound to specific cores. For Flux, this means we require exclusive node allocation, and then on each node we customize the binding exactly as we want it. We do this via a shapefile.

@@ -14,11 +37,15 @@ Use fluxbind to run a job binding to specific cores. For flux, this means we req
### Basic Examples

```bash
-# Start with a first match policy
+# Start with a first match policy (I think this just works on one node)
```

The run command generates a shape. We can also test the shape command on its own by providing a shapefile (basically a modified jobspec with a pattern and optional affinity). Currently the basic shape works as expected, and we are working on the `--graph` implementation, which takes a Flux jobspec and builds it into a graph of nodes.