Commit d45b66e

committed
feat: add graphic generation that isnot as terrible
Signed-off-by: vsoch <[email protected]>
1 parent d2ab423 commit d45b66e

17 files changed: +523, -237 lines

docs/README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# fluxbind

## Binding

How does fluxbind handle the cpuset calculation? I realize that when we bind to PUs (processing units) we are doing something akin to SMT; choosing to bind to `Core` avoids that. The objects above those are containers - we don't really bind to them, we select them in order to bind to a child PU or Core (as I understand it). Since we are controlling the binding in the library, we need to think about both how the user specifies this and what the defaults are if they do not. We will implement a hierarchy of rules (checks) that the library walks through to determine what to do.

### Highest Priority: Explicit Request

The Shapefile can carry an explicit request from the user - "This is my shape, but bind to PU/Core." For this request, the `shape.yaml` file can have an `options` block with `bind`:

```yaml
# Avoid SMT and bind to physical cores.
options:
  bind: core

resources:
  - type: l3cache
    count: 1
```

In the above, the `options.bind` key exists, so we honor it no matter what. This selection has to be Core or PU (someone tell me if I'm off about this - I'm pretty sure the hwloc cpusets on the container objects ultimately resolve to those lower levels).

### Second Level: Implicit Intent

This comes from the resource request itself. If the lowest level a user requests is Core or PU, we can assume that is what they want to bind to. This would say "bind to Core":

```yaml
resources:
  - type: socket
    count: 1
    with:
      - type: core
        count: 4
```

This would say bind to PU (and the user is required to know the count):

```yaml
resources:
  - type: socket
    count: 1
    with:
      - type: core
        count: 4
        with:
          - type: process
            count: 4
```

If they don't know the count, they can use the first strategy and request it explicitly:

```yaml
options:
  bind: process

resources:
  - type: l3cache
    count: 1
```

And note that I'm mapping "process" to "pu" because I don't think people (users) are familiar with "pu" - probably I should support both.

In other words, if there is no `options.bind`, we inspect the `resources` and see if the final (most granular) level is Core or PU. If it is, we assume that is what we bind to.

### Lowest Priority: HPC Default

If we don't have an explicit request for binding and the lowest level is not PU or Core, we have to assume some default - e.g., "start with this container and bind to `<abstraction>` under it." Since most HPC workloads run single-threaded, I think we should assume Core; people that want SMT need to specify something special. Here is an example where we cannot know:

```yaml
resources:
  - type: l3cache
    count: 1
```

We will allocate one `L3Cache` object, and when it's time to bind, we won't have a `bind` directive or a PU/Core at the lowest level. We have to assume the default, which will be Core.

### Special Cases

#### Unbound

A special case is unbound. I didn't add this at first because I figured if the user didn't want binding, they wouldn't use the tool. But the exception is devices! I might want to be close to a GPU or NIC but not actually bind any processes. In that case I would use fluxbind and specify the shape, but I'd ask for unbound:

```yaml
options:
  bind: none

resources:
  - type: core
    count: 4
    affinity:
      type: gpu
      count: 1
```

Note that the affinity spec above is still a WIP. I have something implemented for my first approach but am still working on this new graph-based one. The above is subject to change, but it illustrates the point - we don't want to bind processes, but we do want the cores to have affinity with (be close to) a GPU.

#### GPU

This might be an alternative to the above - I haven't decided yet. GPU affinity (local or remote) means we want a GPU that is close by (same NUMA node) or remote (a different NUMA node). I haven't tested this yet, but it will look like this:

```yaml
options:
  bind: gpu-local

resources:
  - type: core
    count: 4
```

Right now I treat this request as a kind of `bind` (a bind type, I mean) because then the pattern defaults to `packed`, and I think that is OK. I maybe like this a little better than the previous approach because we don't change the jobspec too much... :)

### Examples

Here are examples for different scenarios; a sketch of the resolution logic follows the table.

| `shape.yaml` | Logic Used | Final Binding Unit |
| :--- | :--- | :--- |
| **`options: {bind: process}`**, `resources: [{type: socket}]` | Explicit Request | `pu` |
| **`options: {bind: core}`**, `resources: [{type: socket}]` | Explicit Request | `core` |
| No options, `resources: [{type: core, count: 4}]` | Implicit Intent | `core` |
| No options, `resources: [{type: pu, count: 4}]` | Implicit Intent | `pu` |
| No options, `resources: [{type: l3cache, count: 1}]` | HPC Default | `core` |
| No options, `resources: [{type: numanode, count: 1}]` | HPC Default | `core` |
| `options: {bind: process}`, `resources: [{type: core, count: 2}]` | Explicit Request | `pu` |
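
To make the precedence concrete, here is a minimal sketch of the resolution in Python. It is illustrative only - the function and dictionary layout are hypothetical, not the actual fluxbind API - and it assumes the "process" -> "pu" alias mentioned above.

```python
ALIASES = {"process": "pu"}  # assumed alias: "process" is friendlier than "pu"
BINDABLE = {"core", "pu"}


def deepest_type(resources):
    """Follow nested 'with' blocks to find the most granular resource type."""
    level = resources[0] if resources else {}
    while level.get("with"):
        level = level["with"][0]
    return level.get("type")


def resolve_bind_unit(shape):
    """Return the binding unit for a parsed shape.yaml dictionary (sketch)."""
    options = shape.get("options") or {}

    # 1. Explicit request always wins (including 'none' for unbound).
    if options.get("bind"):
        bind = options["bind"]
        return ALIASES.get(bind, bind)

    # 2. Implicit intent: the deepest requested level is already Core or PU.
    deepest = deepest_type(shape.get("resources") or [])
    deepest = ALIASES.get(deepest, deepest)
    if deepest in BINDABLE:
        return deepest

    # 3. HPC default: assume physical cores (no SMT).
    return "core"


# Matches the first row of the table above.
shape = {"options": {"bind": "process"}, "resources": [{"type": "socket", "count": 1}]}
assert resolve_bind_unit(shape) == "pu"
```
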
## Patterns

The binding rules determine *what* kind of hardware to bind to (physical cores vs. hardware threads) and patterns determine *how* a total pool of those resources is distributed among the tasks on a node. When a `shape.yaml` describes a total pool of resources (e.g., `core: 8`) and a job is launched with multiple tasks on the node (e.g., `local_size=4`), `fluxbind` must have a deterministic strategy to give each task its own unique slice of the total pool. This strategy is controlled by the `pattern` key.
### packed

> Default

The packed pattern assigns resources in contiguous, dense blocks. This is the default behavior if no pattern is specified, because it is generally what you want: each task's cores are physically close together. As an example, given 8 available cores and 4 tasks, packed assigns resources like this:

* `local_rank=0` gets `[Core:0, Core:1]`
* `local_rank=1` gets `[Core:2, Core:3]`
* `local_rank=2` gets `[Core:4, Core:5]`
* `local_rank=3` gets `[Core:6, Core:7]`

```yaml
# pattern: packed is optional as it's the default, so you could leave this out.
resources:
  - type: core
    count: 8
    pattern: packed
```
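
As a rough illustration (a sketch, not the actual fluxbind implementation), contiguous slicing for a given rank could look like this, where `pool` stands in for the node's sorted list of available cores:

```python
def packed_slice(pool, local_size, local_rank):
    """Give each local rank a contiguous, dense block of the resource pool (sketch)."""
    per_task = len(pool) // local_size
    start = local_rank * per_task
    return pool[start : start + per_task]


cores = [f"Core:{i}" for i in range(8)]
print(packed_slice(cores, local_size=4, local_rank=1))  # ['Core:2', 'Core:3']
```
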

### scatter (spread)

> The pattern that makes you think of peanut butter

The scatter pattern distributes resources with the largest possible stride, like dealing out cards to each task. I think this is similar to [cyclic](https://hpc.llnl.gov/sites/default/files/distributions_0.gif) or round-robin distribution. We'd likely want this for memory-intensive tasks, where we want cores physically far apart so each gets its own memory (L2/L3 caches).

```yaml
# 'spread' is an alias for 'scatter'.
resources:
  - type: core
    count: 8
    pattern: spread
```
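
A matching sketch of scatter, again with a hypothetical `pool` list rather than the real fluxbind internals, is just a strided slice:

```python
def scatter_slice(pool, local_size, local_rank):
    """Deal resources out to ranks round-robin, maximizing the stride (sketch)."""
    return pool[local_rank::local_size]


cores = [f"Core:{i}" for i in range(8)]
print(scatter_slice(cores, local_size=4, local_rank=1))  # ['Core:1', 'Core:5']
```
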

Right now I'm calling this interleaved, but I think interleaved is actually different, and if we want that case we need to add it. Interleaved would be like filling up all cores first (one PU each) before going back and filling the other PUs - like filling cookies with jam, but only every other cookie.

## Modifiers

### reverse

The reverse modifier is a boolean (true/false) that can be combined with any pattern. It simply reverses the canonical list of available resources before the distribution pattern is applied. Not sure when it's useful, but maybe we'd want to test one end and then "the other end." A sketch of how it composes with a pattern follows the example below.

```yaml
resources:
  - type: core
    count: 8
    reverse: true
```
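
Here is a tiny sketch of that composition (hypothetical helper names, reusing the packed logic from the sketch above), showing reverse applied before the pattern:

```python
def packed_slice(pool, local_size, local_rank):
    """Contiguous block per rank (same sketch as in the packed section)."""
    per_task = len(pool) // local_size
    start = local_rank * per_task
    return pool[start : start + per_task]


def assign(pool, local_size, local_rank, reverse=False):
    """Apply the reverse modifier to the canonical list, then the pattern (sketch)."""
    if reverse:
        pool = list(reversed(pool))
    return packed_slice(pool, local_size, local_rank)


cores = [f"Core:{i}" for i in range(8)]
print(assign(cores, local_size=4, local_rank=0, reverse=True))  # ['Core:7', 'Core:6']
```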

fluxbind/graph/graph.py

Lines changed: 9 additions & 2 deletions
```diff
@@ -31,6 +31,8 @@ def load(self, xml_input, max_workers=None):
         Load the graph, including distances, and pre-calculate
         entire set of affinities for objects.
         """
+        self.last_affinity_target = None
+
         # If we don't have an xml file, derive from system
         if not xml_input:
             xml_input = commands.lstopo.get_xml()
@@ -558,7 +560,11 @@ def get_descendants(self, gp_index, **filters):
         ]

     def get_ancestor_of_type(self, start_node_gp, ancestor_type):
+        """
+        Given a starting node, return its nearest ancestor of the given type (or None).
+        """
         current_gp = start_node_gp
+
         # Walk up the hierarchy tree one parent at a time.
         while current_gp in self.hierarchy_view:
             # Get the parent (should only be one in a tree)
@@ -573,6 +579,9 @@ def get_ancestor_of_type(self, start_node_gp, ancestor_type):
         return None

     def get_sort_key_for_node(self, leaf_node):
+        """
+        Return a tuple sorting key, e.g., (0, package_id, core_id) -> (0, 0, 5)
+        """
         gp, data = leaf_node

         # TYPE_ORDER: CPU types < PCI types < Other OS types < Nameless types
@@ -582,8 +591,6 @@ def get_sort_key_for_node(self, leaf_node):
         if data.get("type") in ["Core", "PU"]:
             package = self.get_ancestor_of_type(gp, "Package")
             package_idx = package[1].get("os_index", -1) if package else -1
-
-            # Returns (0, package_id, core_id) -> e.g., (0, 0, 5)
             return (0, int(package_idx), int(data.get("os_index", -1)))

         # Handle PCI devices (GPUs, NICs)
```

fluxbind/graph/graphic.py

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
```python
import logging

import networkx as nx

try:
    import matplotlib.pyplot as plt
    import pydot

    VISUALIZATION_ENABLED = True
except ImportError:
    VISUALIZATION_ENABLED = False

log = logging.getLogger(__name__)


class TopologyVisualizer:
    """
    Creates a simplified, contextual block diagram of a hardware allocation
    that shows assigned nodes in the context of their unassigned siblings.
    """

    def __init__(self, topology: "HwlocTopology", assigned_nodes: list, affinity_target=None):
        if not VISUALIZATION_ENABLED:
            raise ImportError("Visualization libraries (matplotlib, pydot) are not installed.")

        self.topology = topology
        self.assigned_nodes = assigned_nodes
        self.assigned_gps = {gp for gp, _ in assigned_nodes}
        self.affinity_target_gp = affinity_target[0] if affinity_target else None
        self.title = "Hardware Allocation"  # Public attribute for a descriptive title

    def _build_contextual_subgraph(self):
        """
        Constructs a new, clean graph for drawing that includes assigned nodes,
        their unassigned siblings, and their parent containers for context.
        """
        if not self.assigned_nodes:
            return nx.DiGraph()

        # Step 1: Identify the type of resource we are drawing (e.g., 'core', 'pu').
        leaf_type = self.assigned_nodes[0][1].get("type")
        if not leaf_type:
            return nx.DiGraph()

        # Step 2: Find a common parent container for the allocated resources.
        first_node_gp = self.assigned_nodes[0][0]

        # Use the existing, correct helper function.
        parent = self.topology.get_ancestor_of_type(
            first_node_gp, "Package"
        ) or self.topology.get_ancestor_of_type(first_node_gp, "NUMANode")

        search_domain_gp = None
        if parent:
            search_domain_gp = parent[0]
        elif leaf_type in ["Package", "NUMANode", "Machine"]:
            search_domain_gp = first_node_gp

        # Step 3: Get all sibling nodes of the same type within that context.
        if search_domain_gp:
            all_siblings = self.topology.get_descendants(search_domain_gp, type=leaf_type)
            if not all_siblings and leaf_type in ["Package", "NUMANode"]:
                all_siblings = self.assigned_nodes
        else:
            all_siblings = self.assigned_nodes

        # Step 4: Build the final set of nodes to draw.
        nodes_to_draw_gps = set()
        for gp, _ in all_siblings:
            nodes_to_draw_gps.add(gp)
            nodes_to_draw_gps.update(nx.ancestors(self.topology.hierarchy_view, gp))

        final_subgraph = self.topology.graph.subgraph(nodes_to_draw_gps).copy()

        # Filter out types we don't want to see, for clarity.
        nodes_to_remove = [
            gp
            for gp, data in final_subgraph.nodes(data=True)
            if data.get("type") not in ["Core", "PU"]
        ]
        final_subgraph.remove_nodes_from(nodes_to_remove)

        return final_subgraph

    def draw(self, filename: str):
        log.info(f"Generating allocation graphic at '{filename}'...")

        subgraph = self._build_contextual_subgraph()
        if subgraph.number_of_nodes() == 0:
            log.warning("Cannot generate graphic: No nodes to draw.")
            return

        labels = {}
        colors = {}
        sorted_nodes = sorted(
            subgraph.nodes(data=True),
            key=lambda item: (item[1].get("depth", 0), self.topology.get_sort_key_for_node(item)),
        )

        for gp, data in sorted_nodes:
            node_type = data.get("type", "Unknown")
            os_index = data.get("os_index")
            labels[gp] = (
                f"{node_type.capitalize()}:{os_index}"
                if os_index is not None
                else node_type.capitalize()
            )
            # Color nodes by role: affinity target, assigned resources, or container type.
            if gp == self.affinity_target_gp:
                colors[gp] = "gold"
            elif gp in self.assigned_gps:
                colors[gp] = "lightgreen"
            elif node_type == "numanode":
                colors[gp] = "skyblue"
            elif node_type == "package":
                colors[gp] = "coral"
            else:
                colors[gp] = "lightgray"

        node_colors = [colors.get(gp, "lightgray") for gp in subgraph.nodes()]
        pos = nx.drawing.nx_pydot.graphviz_layout(subgraph, prog="dot")

        plt.figure(figsize=(12, 8))
        nx.draw_networkx(
            subgraph,
            pos,
            labels=labels,
            node_color=node_colors,
            node_size=2000,
            node_shape="s",
            edgecolors="black",
            font_size=8,
            font_weight="bold",
            arrows=False,
            width=1.5,
        )

        plt.title(self.title, fontsize=16)
        plt.box(False)
        plt.tight_layout()
        plt.savefig(filename, bbox_inches="tight", dpi=150)
        plt.close()

        log.info("...graphic saved successfully.")
```

fluxbind/graph/shape.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -102,7 +102,7 @@ def get_binding_for_rank(

         if graphic and mapping.nodes:
             visualizer = TopologyVisualizer(
-                mapping.topo, mapping.nodes, affinity_target=self.last_affinity_target
+                mapping.topo, mapping.nodes, affinity_target=mapping.topo.last_affinity_target
             )
             visualizer.draw(graphic)

@@ -158,6 +158,7 @@ def get_gpu_binding(self, topology, local_rank, gpus_per_task, bind_mode):
         A tuple containing the GPUAssignment object and a set of graph pointers
         to the Package(s) that should be used for the CPU search.
         """
+        gpus_per_task = gpus_per_task or 0
         if gpus_per_task <= 0:
             raise ValueError(f"'bind: {bind_mode}' requires --gpus-per-task to be > 0.")
```

Binary image files added under tests/img/ (viewer not shown): tests/img/02_explicit_pu_rank0.png (83.1 KB), tests/img/02_explicit_pu_rank1.png (82.5 KB), tests/img/03_implicit_core.png (23.4 KB), plus three additional images (25.7 KB, 25.1 KB, 27 KB) whose names were not captured.

0 commit comments