39 changes: 37 additions & 2 deletions README.md

![img/fluxbind.png](img/fluxbind-small.png)

## How does this work?

OK, I think I know what we might do. The top-level description of a job's resources is the jobspec ("job specification"), which might look like this:

```yaml
resources:
- type: slot
count: 4
with:
- type: core
count: 8
```

Flux run / submit uses flux-sched (or another scheduler) to assign jobs to nodes, and then the Flux exec shell executes some number of tasks per node. Each of those tasks is what hits and executes our bash script, with a view of the entire node, and it needs to run `fluxbind shape` to derive the binding for that task. We can technically derive a shapefile from a jobspec: it has the same structure, but only needs to describe the shape of one slot, and the task that receives it is responsible for some number N of slots. So a shapefile that describes the shape of a slot looks like this:

```yaml
resources:
- type: core
count: 8
```
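
As a rough illustration of that derivation, the shapefile is just the slot's `with` block lifted out of the jobspec. This is a minimal sketch, not the actual fluxbind code; the `jobspec_to_shapefile` name is hypothetical and PyYAML is assumed:

```python
# Sketch only: derive a per-slot shapefile from a jobspec like the one above.
import yaml


def jobspec_to_shapefile(jobspec_yaml):
    """Return a shapefile dict describing the shape of a single slot."""
    spec = yaml.safe_load(jobspec_yaml)
    for resource in spec.get("resources", []):
        if resource.get("type") == "slot":
            # One slot's worth of resources is exactly the slot's children.
            return {"resources": resource.get("with", [])}
    raise ValueError("jobspec has no slot entry")


jobspec = """
resources:
- type: slot
  count: 4
  with:
  - type: core
    count: 8
"""
print(yaml.safe_dump(jobspec_to_shapefile(jobspec)))
# resources:
# - count: 8
#   type: core
```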

That is to say, "on this node, we are breaking resources down into this slot shape." We calculate the number of slots that the task is handling based on `FLUX_` environment variables. For now this assumes exclusive resources per node, so if we are slightly off it's not a huge deal, but in the future (given a slice of a node for a slot) we will need to be exactly right, because we might see an entire node with hwloc while already being confined to a cpuset. Right now I'm also assuming that `fluxbind run` matches the topology of the shapefile. If you request something that doesn't match, it probably can't be satisfied and you will get an error, but that isn't guaranteed.
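
For instance, the per-task slot count might be estimated roughly like this. This is a sketch only, assuming exclusive nodes and an even distribution of tasks across nodes; `slots_for_this_task` is a hypothetical helper, not the actual fluxbind code, and it reads the standard `FLUX_JOB_SIZE` / `FLUX_JOB_NNODES` job environment variables:

```python
import os


def slots_for_this_task(total_slots):
    """Estimate how many slots from the jobspec this task is responsible for."""
    job_size = int(os.environ["FLUX_JOB_SIZE"])      # total tasks in the job
    nnodes = int(os.environ["FLUX_JOB_NNODES"])      # nodes assigned to the job
    local_size = max(1, job_size // nnodes)          # tasks on this node (approximate)
    slots_per_node = max(1, total_slots // nnodes)   # slots landing on this node
    return max(1, slots_per_node // local_size)      # slots per task on this node
```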

## Run

Use fluxbind to run a job, binding to specific cores. For Flux, this means we require exclusive nodes, and then for each node we customize the binding exactly as we want it. We do this via a shape file.
### Basic Examples

```bash
# Start with a first match policy (I think this only works for one node)
flux start --config ./examples/config/match-first.toml

# This works for >1 node
flux alloc --conf ./examples/config/match-first.toml

# 1. Bind each task to a unique physical core, starting from core:0 (common case)
fluxbind run -n 8 --quiet --shape ./examples/shape/1node/packed-cores-shapefile.yaml sleep 1
fluxbind run -n 8 --quiet --shape ./examples/shape-graph/single-node/simple_cores/shape.yaml --graph sleep 1

# 2. Reverse it!
fluxbind run -n 8 --quiet --shape ./examples/shape/1node/packed-cores-reversed-shapefile.yaml sleep 1
fluxbind run -N 1 -n 2 --env OMP_NUM_THREADS=4 --env OMP_PLACES=cores --shape ./examples/shape/kripke/hybrid-l3-shapefile.yaml kripke --zones 16,16,16 --niter 500 --procs 2,1,1 --layout GZD
```

## Shape

The run command generates a shape under the hood, and the shape command can be tested on its own by providing a shapefile (basically, a modified jobspec with a pattern and optional affinity). Currently the basic shape works as expected, but we are still working on the `--graph` implementation (which takes a Flux jobspec and builds it into a graph of nodes).

```bash
fluxbind shape --file examples/shape-graph/basic/4-cores-anywhere/shape.yaml --rank 0 --node-id 1 --local-rank 0 --gpus-per-task 0 --graph --global-ranks $(nproc) --nodes 1
```


## Predict

```bash
fluxbind predict core:0-7
fluxbind predict --xml ./examples/topology/corona.xml numa:0,1 x core:0-2
```


## License

DevTools is distributed under the terms of the MIT license.
180 changes: 180 additions & 0 deletions docs/README.md
# fluxbind

## Binding

How does fluxbind handle the cpuset calculation? When we bind to PUs (processing units) we are doing something akin to SMT; choosing to bind to `Core` avoids that. The objects above those are containers - we don't really bind to them, we select them and then bind to a child PU or Core (as I understand it). Since we are controlling the binding in the library, we need to think about both how the user specifies this and what the defaults are if they do not. We will implement a hierarchy of rules (checks) that the library walks through to determine what to do.

### Highest Priority: Explicit Request

The highest priority is an explicit request from the user in the shapefile - "this is my shape, but bind to PU/Core."
For this request, the shape.yaml file can have an `options` block with `bind`.

```yaml
# Avoid SMT and bind to physical cores.
options:
bind: core

resources:
- type: l3cache
count: 1
```

In the above, the `options.bind` key exists, so we honor it no matter what. This selection has to be Core or PU (someone tell me if I'm off about this - I'm pretty sure the cpusets in hwloc on the containers are going to select the lower levels).


### Second level: Implicit Intent

This comes from the resource request. If the user's lowest (most granular) level is Core or PU, we can assume that is what they want to bind to. This example says "bind to Core":

```yaml
resources:
- type: socket
count: 1
with:
- type: core
count: 4
```

This example says bind to PU (and the user is required to know the count):

```yaml
resources:
- type: socket
count: 1
with:
- type: core
count: 4
with:
- type: process
count: 4
```

If they don't know the count, they can use the first strategy and request it explicitly:

```yaml
options:
bind: process

resources:
- type: l3cache
count: 1
```

Note that I'm mapping "process" to "pu" because I don't think most users are familiar with PU. Probably I should support both.
In other words, if there is no `options.bind`, we inspect the `resources` and check whether the final (most granular) level is Core or PU. If yes, we assume that is what we bind to.


### Lowest Priority: HPC Default

If we don't have an explicit request for binding and the lowest level is not PU or Core, we have to assume some default - e.g., "start with this container and bind to `<abstraction>` under it." Since most HPC workloads run single-threaded per core, I think we should assume Core. People who want SMT need to specify something special. Here is an example where we cannot know:

```yaml
resources:
- type: l3cache
count: 1
```

We will allocate one `L3Cache` object, and when it's time to bind, we won't have a bind directive or a PU/Core at the lowest level. We have to assume the default, which will be Core.
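
Putting the three levels together, the decision could be sketched roughly like this. The `resolve_bind_unit` function and the dictionary layout are hypothetical illustrations of the rules above, not the actual fluxbind API:

```python
# Sketch of the bind-unit decision: explicit request > implicit intent > HPC default.
PU_ALIASES = {"pu", "process"}  # "process" is accepted as a friendlier name for PU


def deepest_type(resources):
    """Follow the 'with' chain down to the most granular resource type."""
    level = resources[0]
    while "with" in level:
        level = level["with"][0]
    return level["type"].lower()


def resolve_bind_unit(shape):
    # 1. Highest priority: an explicit options.bind is always honored.
    explicit = shape.get("options", {}).get("bind")
    if explicit:
        return "pu" if explicit in PU_ALIASES else explicit
    # 2. Implicit intent: if the most granular level is Core or PU, bind to it.
    deepest = deepest_type(shape["resources"])
    if deepest in PU_ALIASES:
        return "pu"
    if deepest == "core":
        return "core"
    # 3. HPC default: containers only (l3cache, numanode, socket, ...) -> Core.
    return "core"


# These match rows in the examples table below.
assert resolve_bind_unit({"resources": [{"type": "l3cache", "count": 1}]}) == "core"
assert resolve_bind_unit({"options": {"bind": "process"},
                          "resources": [{"type": "socket", "count": 1}]}) == "pu"
```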

### Special Cases

#### Unbound

A special case is unbound. I didn't add this at first because I figured if the user didn't want binding, they wouldn't use the tool. But the exception is devices! I might want to be close to a GPU or NIC but not actually bind any processes. In that case I would use fluxbind and specify the shape, but I'd ask for unbound:


```yaml
options:
bind: none

resources:
- type: core
count: 4
affinity:
type: gpu
count: 1
```

Note that the affinity spec above is still a WIP. I have something implemented for my first approach but am still working on this new graph one. The above is subject to change, but it illustrates the point - we don't want to bind processes, but we want the cores to have affinity with (be close to) a GPU.

#### GPU

This might be an alternative to the above - I'm not decided yet. GPU affinity (local or remote) means we want a GPU that is close by (same NUMA node) or remote (a different NUMA node). I haven't tested this yet, but it will look like this:

```yaml
options:
bind: gpu-local

resources:
- type: core
count: 4
```

Right now I have this request akin to `bind` (as a bind type I mean) because then the pattern defaults to `packed`. I think that is OK. I like this maybe a little better than the one before because we don't change the jobspec too much... :)


### Examples

Here are examples for different scenarios.

| `shape.yaml` | Logic Used | Final Binding Unit |
| :--- | :--- | :--- |
| **`options: {bind: process}`**, `resources: [{type: socket}]` | Explicit Request | `pu` |
| **`options: {bind: core}`**, `resources: [{type: socket}]` | Explicit Request | `core` |
| No options, `resources: [{type: core, count: 4}]` | Implicit Intent | `core` |
| No options, `resources: [{type: pu, count: 4}]` | Implicit Intent | `pu` |
| No options, `resources: [{type: l3cache, count: 1}]` | HPC Default | `core` |
| No options, `resources: [{type: numanode, count: 1}]` | HPC Default | `core` |
| `options: {bind: process}`, `resources: [{type: core, count: 2}]` | Explicit Request | `pu` |


## Patterns

The binding rules determine *what* kind of hardware to bind to (physical cores vs. hardware threads) and patterns determine *how* a total pool of those resources is distributed among the tasks on a node. When a `shape.yaml` describes a total pool of resources (e.g., `core: 8`) and a job is launched with multiple tasks on the node (e.g., `local_size=4`), `fluxbind` must have a deterministic strategy to give each task its own unique slice of the total pool. This strategy is controlled by the `pattern` key.

### packed

> Default

The packed pattern assigns resources in contiguous, dense blocks. This is the default behavior if no pattern is specified, because I think it is generally what would be wanted - the cores assigned to each task are physically close together. As an example, given 8 available cores and 4 tasks, packed assigns resources like this:
* `local_rank=0` gets `[Core:0, Core:1]`
* `local_rank=1` gets `[Core:2, Core:3]`
* `local_rank=2` gets `[Core:4, Core:5]`
* `local_rank=3` gets `[Core:6, Core:7]`

```yaml
# pattern: packed is optional as it's the default, so you could leave this out.
resources:
- type: core
count: 8
pattern: packed
```

### scatter (spread)

> The pattern that makes you think of peanut butter

The scatter pattern distributes resources with the largest possible stride, like dealing out cards to each task. I think this is similar to [cyclic](https://hpc.llnl.gov/sites/default/files/distributions_0.gif) or round-robin distribution. We'd want this for memory-intensive tasks, where we want cores physically far apart so each task gets its own cache (L2/L3).

```yaml
# 'spread' is an alias for 'scatter'.
resources:
- type: core
count: 8
pattern: spread
```
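
To make the two patterns concrete, here is a rough sketch (not the actual fluxbind code) of how the pool could be carved up per task, reproducing the 8-core / 4-task example above:

```python
# Sketch: distribute a pool of resource indices among the local tasks on a node.
def packed(pool, local_size, local_rank):
    """Contiguous, dense blocks: rank 0 gets the first chunk, rank 1 the next, ..."""
    per_task = len(pool) // local_size
    start = local_rank * per_task
    return pool[start:start + per_task]


def scatter(pool, local_size, local_rank):
    """Largest stride, like dealing cards: rank r gets pool[r], pool[r + local_size], ..."""
    return pool[local_rank::local_size]


cores = list(range(8))  # Core:0 .. Core:7
for rank in range(4):
    print(rank, packed(cores, 4, rank), scatter(cores, 4, rank))
# 0 [0, 1] [0, 4]
# 1 [2, 3] [1, 5]
# 2 [4, 5] [2, 6]
# 3 [6, 7] [3, 7]
```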

Right now I'm calling this interleaved, but I think scatter and interleaved are actually different, and if we want that case we need to add it. Interleaved would be like filling up all cores first (one PU each) before going back and filling the other PUs. Like filling cookies with jam, but only every other cookie.

## Modifiers

### reverse

The reverse modifier is a boolean (true/false) that can be combined with any pattern. It simply reverses the canonical list of available resources before the distribution pattern is applied. I'm not sure when it's useful, but maybe we'd want to test one end of the resource list and then "the other end."

```yaml
resources:
- type: core
count: 8
reverse: true
```
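
In terms of the pattern sketch above, `reverse` would simply flip the pool before the pattern runs (again, just an illustration, not the actual fluxbind code):

```python
# Sketch: reverse flips the canonical resource list before a pattern is applied.
cores = list(range(8))
pool = list(reversed(cores))  # [7, 6, 5, 4, 3, 2, 1, 0]
# With packed and 4 tasks, local_rank=0 now gets [7, 6] instead of [0, 1].
```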