Commit d45b66e

committed
feat: add graphic generation that isnot as terrible
Signed-off-by: vsoch <[email protected]>
1 parent d2ab423 commit d45b66e

17 files changed: +523, -237 lines

docs/README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# fluxbind

## Binding

How does fluxbind handle the cpuset calculation? I realize that when we bind to PUs (processing units) we are doing something akin to SMT; choosing to bind to `Core` avoids that. The objects above those are containers - we don't really bind to them, we select them in order to bind to a child PU or Core (as I understand it). Since we are controlling the binding in the library, we need to think about both how the user specifies this and what the defaults are if they do not. We will implement a hierarchy of rules (checks) that the library walks through to determine what to do.

### Highest Priority: Explicit Request

The Shapefile can carry an explicit request from the user - "This is my shape, but bind to PU/Core." For this request, the `shape.yaml` file can have an `options` block with `bind`:

```yaml
# Avoid SMT and bind to physical cores.
options:
  bind: core

resources:
  - type: l3cache
    count: 1
```

In the above, the `options.bind` key exists, so we honor it no matter what. This selection has to be Core or PU (someone tell me if I'm off about this - I'm pretty sure the hwloc cpusets on the container objects ultimately resolve to those lower levels).

### Second Level: Implicit Intent

This comes from the resource request itself. If the lowest level a user requests is Core or PU, we can assume that is what they want to bind to. This would say "bind to Core":

```yaml
resources:
  - type: socket
    count: 1
    with:
      - type: core
        count: 4
```

This would say bind to PU (and the user is required to know the count):

```yaml
resources:
  - type: socket
    count: 1
    with:
      - type: core
        count: 4
        with:
          - type: process
            count: 4
```

If they don't know the count, they can use the first strategy and request it explicitly:

```yaml
options:
  bind: process

resources:
  - type: l3cache
    count: 1
```

And note that I'm mapping "process" to "pu" because I don't think people (users) are familiar with "pu" - probably I should support both.

In other words, if there is no `options.bind`, we inspect the `resources` and see if the final (most granular) level is Core or PU. If it is, we assume that is what we bind to.

### Lowest Priority: HPC Default

If we don't have an explicit request for binding and the lowest level is not PU or Core, we have to assume some default - e.g., "start with this container and bind to `<abstraction>` under it." Since most HPC workloads run single-threaded, I think we should assume Core; people that want SMT need to specify something special. Here is an example where we cannot know:

```yaml
resources:
  - type: l3cache
    count: 1
```

We will allocate one `L3Cache` object, and when it's time to bind, we won't have a `bind` directive or a PU/Core at the lowest level. We have to assume the default, which will be Core.

### Special Cases

#### Unbound

A special case is unbound. I didn't add this at first because I figured if the user didn't want binding, they wouldn't use the tool. But the exception is devices! I might want to be close to a GPU or NIC but not actually bind any processes. In that case I would use fluxbind and specify the shape, but I'd ask for unbound:

```yaml
options:
  bind: none

resources:
  - type: core
    count: 4
    affinity:
      type: gpu
      count: 1
```

Note that the affinity spec above is still a WIP. I have something implemented for my first approach but am still working on this new graph-based one. The above is subject to change, but it illustrates the point - we don't want to bind processes, but we do want the cores to have affinity with (be close to) a GPU.

#### GPU

This might be an alternative to the above - I haven't decided yet. GPU affinity (local or remote) means we want a GPU that is close by (same NUMA node) or remote (a different NUMA node). I haven't tested this yet, but it will look like this:

```yaml
options:
  bind: gpu-local

resources:
  - type: core
    count: 4
```

Right now I treat this request as a kind of `bind` (a bind type, I mean) because then the pattern defaults to `packed`, and I think that is OK. I maybe like this a little better than the previous approach because we don't change the jobspec too much... :)

### Examples

Here are examples for different scenarios; a sketch of the resolution logic follows the table.

| `shape.yaml` | Logic Used | Final Binding Unit |
| :--- | :--- | :--- |
| **`options: {bind: process}`**, `resources: [{type: socket}]` | Explicit Request | `pu` |
| **`options: {bind: core}`**, `resources: [{type: socket}]` | Explicit Request | `core` |
| No options, `resources: [{type: core, count: 4}]` | Implicit Intent | `core` |
| No options, `resources: [{type: pu, count: 4}]` | Implicit Intent | `pu` |
| No options, `resources: [{type: l3cache, count: 1}]` | HPC Default | `core` |
| No options, `resources: [{type: numanode, count: 1}]` | HPC Default | `core` |
| `options: {bind: process}`, `resources: [{type: core, count: 2}]` | Explicit Request | `pu` |
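
To make the precedence concrete, here is a minimal sketch of the resolution in Python. It is illustrative only - the function and dictionary layout are hypothetical, not the actual fluxbind API - and it assumes the "process" -> "pu" alias mentioned above.

```python
ALIASES = {"process": "pu"}  # assumed alias: "process" is friendlier than "pu"
BINDABLE = {"core", "pu"}


def deepest_type(resources):
    """Follow nested 'with' blocks to find the most granular resource type."""
    level = resources[0] if resources else {}
    while level.get("with"):
        level = level["with"][0]
    return level.get("type")


def resolve_bind_unit(shape):
    """Return the binding unit for a parsed shape.yaml dictionary (sketch)."""
    options = shape.get("options") or {}

    # 1. Explicit request always wins (including 'none' for unbound).
    if options.get("bind"):
        bind = options["bind"]
        return ALIASES.get(bind, bind)

    # 2. Implicit intent: the deepest requested level is already Core or PU.
    deepest = deepest_type(shape.get("resources") or [])
    deepest = ALIASES.get(deepest, deepest)
    if deepest in BINDABLE:
        return deepest

    # 3. HPC default: assume physical cores (no SMT).
    return "core"


# Matches the first row of the table above.
shape = {"options": {"bind": "process"}, "resources": [{"type": "socket", "count": 1}]}
assert resolve_bind_unit(shape) == "pu"
```
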
## Patterns

The binding rules determine *what* kind of hardware to bind to (physical cores vs. hardware threads) and patterns determine *how* a total pool of those resources is distributed among the tasks on a node. When a `shape.yaml` describes a total pool of resources (e.g., `core: 8`) and a job is launched with multiple tasks on the node (e.g., `local_size=4`), `fluxbind` must have a deterministic strategy to give each task its own unique slice of the total pool. This strategy is controlled by the `pattern` key.
### packed

> Default

The packed pattern assigns resources in contiguous, dense blocks. This is the default behavior if no pattern is specified, because it is generally what you want: each task's cores are physically close together. As an example, given 8 available cores and 4 tasks, packed assigns resources like this:

* `local_rank=0` gets `[Core:0, Core:1]`
* `local_rank=1` gets `[Core:2, Core:3]`
* `local_rank=2` gets `[Core:4, Core:5]`
* `local_rank=3` gets `[Core:6, Core:7]`

```yaml
# pattern: packed is optional as it's the default, so you could leave this out.
resources:
  - type: core
    count: 8
    pattern: packed
```
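
As a rough illustration (a sketch, not the actual fluxbind implementation), contiguous slicing for a given rank could look like this, where `pool` stands in for the node's sorted list of available cores:

```python
def packed_slice(pool, local_size, local_rank):
    """Give each local rank a contiguous, dense block of the resource pool (sketch)."""
    per_task = len(pool) // local_size
    start = local_rank * per_task
    return pool[start : start + per_task]


cores = [f"Core:{i}" for i in range(8)]
print(packed_slice(cores, local_size=4, local_rank=1))  # ['Core:2', 'Core:3']
```
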

### scatter (spread)

> The pattern that makes you think of peanut butter

The scatter pattern distributes resources with the largest possible stride, like dealing out cards to each task. I think this is similar to [cyclic](https://hpc.llnl.gov/sites/default/files/distributions_0.gif) or round-robin distribution. We'd likely want this for memory-intensive tasks, where we want cores physically far apart so each gets its own memory (L2/L3 caches).

```yaml
# 'spread' is an alias for 'scatter'.
resources:
  - type: core
    count: 8
    pattern: spread
```
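
A matching sketch of scatter, again with a hypothetical `pool` list rather than the real fluxbind internals, is just a strided slice:

```python
def scatter_slice(pool, local_size, local_rank):
    """Deal resources out to ranks round-robin, maximizing the stride (sketch)."""
    return pool[local_rank::local_size]


cores = [f"Core:{i}" for i in range(8)]
print(scatter_slice(cores, local_size=4, local_rank=1))  # ['Core:1', 'Core:5']
```
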

Right now I'm calling this interleaved, but I think interleaved is actually different, and if we want that case we need to add it. Interleaved would be like filling up all cores first (one PU each) before going back and filling the other PUs - like filling cookies with jam, but only every other cookie.

## Modifiers

### reverse

The reverse modifier is a boolean (true/false) that can be combined with any pattern. It simply reverses the canonical list of available resources before the distribution pattern is applied. Not sure when it's useful, but maybe we'd want to test one end and then "the other end." A sketch of how it composes with a pattern follows the example below.

```yaml
resources:
  - type: core
    count: 8
    reverse: true
```
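
Here is a tiny sketch of that composition (hypothetical helper names, reusing the packed logic from the sketch above), showing reverse applied before the pattern:

```python
def packed_slice(pool, local_size, local_rank):
    """Contiguous block per rank (same sketch as in the packed section)."""
    per_task = len(pool) // local_size
    start = local_rank * per_task
    return pool[start : start + per_task]


def assign(pool, local_size, local_rank, reverse=False):
    """Apply the reverse modifier to the canonical list, then the pattern (sketch)."""
    if reverse:
        pool = list(reversed(pool))
    return packed_slice(pool, local_size, local_rank)


cores = [f"Core:{i}" for i in range(8)]
print(assign(cores, local_size=4, local_rank=0, reverse=True))  # ['Core:7', 'Core:6']
```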

fluxbind/graph/graph.py

Lines changed: 9 additions & 2 deletions
```diff
@@ -31,6 +31,8 @@ def load(self, xml_input, max_workers=None):
         Load the graph, including distances, and pre-calculate
         entire set of affinities for objects.
         """
+        self.last_affinity_target = None
+
         # If we don't have an xml file, derive from system
         if not xml_input:
             xml_input = commands.lstopo.get_xml()
@@ -558,7 +560,11 @@ def get_descendants(self, gp_index, **filters):
         ]

     def get_ancestor_of_type(self, start_node_gp, ancestor_type):
+        """
+        Given a starting node, return its nearest ancestor of the given type (or None).
+        """
         current_gp = start_node_gp
+
         # Walk up the hierarchy tree one parent at a time.
         while current_gp in self.hierarchy_view:
             # Get the parent (should only be one in a tree)
@@ -573,6 +579,9 @@ def get_ancestor_of_type(self, start_node_gp, ancestor_type):
         return None

     def get_sort_key_for_node(self, leaf_node):
+        """
+        Return a tuple sorting key, e.g., (0, package_id, core_id) -> (0, 0, 5)
+        """
         gp, data = leaf_node

         # TYPE_ORDER: CPU types < PCI types < Other OS types < Nameless types
@@ -582,8 +591,6 @@ def get_sort_key_for_node(self, leaf_node):
         if data.get("type") in ["Core", "PU"]:
             package = self.get_ancestor_of_type(gp, "Package")
             package_idx = package[1].get("os_index", -1) if package else -1
-
-            # Returns (0, package_id, core_id) -> e.g., (0, 0, 5)
             return (0, int(package_idx), int(data.get("os_index", -1)))

         # Handle PCI devices (GPUs, NICs)
```

fluxbind/graph/graphic.py

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
```python
import logging

import networkx as nx

try:
    import matplotlib.pyplot as plt
    import pydot

    VISUALIZATION_ENABLED = True
except ImportError:
    VISUALIZATION_ENABLED = False

log = logging.getLogger(__name__)


class TopologyVisualizer:
    """
    Creates a simplified, contextual block diagram of a hardware allocation
    that shows assigned nodes in the context of their unassigned siblings.
    """

    def __init__(self, topology: "HwlocTopology", assigned_nodes: list, affinity_target=None):
        if not VISUALIZATION_ENABLED:
            raise ImportError("Visualization libraries (matplotlib, pydot) are not installed.")

        self.topology = topology
        self.assigned_nodes = assigned_nodes
        self.assigned_gps = {gp for gp, _ in assigned_nodes}
        self.affinity_target_gp = affinity_target[0] if affinity_target else None
        self.title = "Hardware Allocation"  # Public attribute for a descriptive title

    def _build_contextual_subgraph(self):
        """
        Constructs a new, clean graph for drawing that includes assigned nodes,
        their unassigned siblings, and their parent containers for context.
        """
        if not self.assigned_nodes:
            return nx.DiGraph()

        # Step 1: Identify the type of resource we are drawing (e.g., 'core', 'pu').
        leaf_type = self.assigned_nodes[0][1].get("type")
        if not leaf_type:
            return nx.DiGraph()

        # Step 2: Find a common parent container for the allocated resources.
        first_node_gp = self.assigned_nodes[0][0]

        # Use the existing, correct helper function.
        parent = self.topology.get_ancestor_of_type(
            first_node_gp, "Package"
        ) or self.topology.get_ancestor_of_type(first_node_gp, "NUMANode")

        search_domain_gp = None
        if parent:
            search_domain_gp = parent[0]
        elif leaf_type in ["Package", "NUMANode", "Machine"]:
            search_domain_gp = first_node_gp

        # Step 3: Get all sibling nodes of the same type within that context.
        if search_domain_gp:
            all_siblings = self.topology.get_descendants(search_domain_gp, type=leaf_type)
            if not all_siblings and leaf_type in ["Package", "NUMANode"]:
                all_siblings = self.assigned_nodes
        else:
            all_siblings = self.assigned_nodes

        # Step 4: Build the final set of nodes to draw.
        nodes_to_draw_gps = set()
        for gp, _ in all_siblings:
            nodes_to_draw_gps.add(gp)
            nodes_to_draw_gps.update(nx.ancestors(self.topology.hierarchy_view, gp))

        final_subgraph = self.topology.graph.subgraph(nodes_to_draw_gps).copy()

        # Filter out types we don't want to see, for clarity.
        nodes_to_remove = [
            gp
            for gp, data in final_subgraph.nodes(data=True)
            if data.get("type") not in ["Core", "PU"]
        ]
        final_subgraph.remove_nodes_from(nodes_to_remove)

        return final_subgraph

    def draw(self, filename: str):
        log.info(f"Generating allocation graphic at '{filename}'...")

        subgraph = self._build_contextual_subgraph()
        if subgraph.number_of_nodes() == 0:
            log.warning("Cannot generate graphic: No nodes to draw.")
            return

        labels = {}
        colors = {}
        sorted_nodes = sorted(
            subgraph.nodes(data=True),
            key=lambda item: (item[1].get("depth", 0), self.topology.get_sort_key_for_node(item)),
        )

        for gp, data in sorted_nodes:
            node_type = data.get("type", "Unknown")
            os_index = data.get("os_index")
            labels[gp] = (
                f"{node_type.capitalize()}:{os_index}"
                if os_index is not None
                else node_type.capitalize()
            )
            # Color nodes by role: affinity target, assigned resources, or container type.
            if gp == self.affinity_target_gp:
                colors[gp] = "gold"
            elif gp in self.assigned_gps:
                colors[gp] = "lightgreen"
            elif node_type == "numanode":
                colors[gp] = "skyblue"
            elif node_type == "package":
                colors[gp] = "coral"
            else:
                colors[gp] = "lightgray"

        node_colors = [colors.get(gp, "lightgray") for gp in subgraph.nodes()]
        pos = nx.drawing.nx_pydot.graphviz_layout(subgraph, prog="dot")

        plt.figure(figsize=(12, 8))
        nx.draw_networkx(
            subgraph,
            pos,
            labels=labels,
            node_color=node_colors,
            node_size=2000,
            node_shape="s",
            edgecolors="black",
            font_size=8,
            font_weight="bold",
            arrows=False,
            width=1.5,
        )

        plt.title(self.title, fontsize=16)
        plt.box(False)
        plt.tight_layout()
        plt.savefig(filename, bbox_inches="tight", dpi=150)
        plt.close()

        log.info("...graphic saved successfully.")
```

fluxbind/graph/shape.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -102,7 +102,7 @@ def get_binding_for_rank(

         if graphic and mapping.nodes:
             visualizer = TopologyVisualizer(
-                mapping.topo, mapping.nodes, affinity_target=self.last_affinity_target
+                mapping.topo, mapping.nodes, affinity_target=mapping.topo.last_affinity_target
             )
             visualizer.draw(graphic)

@@ -158,6 +158,7 @@ def get_gpu_binding(self, topology, local_rank, gpus_per_task, bind_mode):
         A tuple containing the GPUAssignment object and a set of graph pointers
         to the Package(s) that should be used for the CPU search.
         """
+        gpus_per_task = gpus_per_task or 0
         if gpus_per_task <= 0:
             raise ValueError(f"'bind: {bind_mode}' requires --gpus-per-task to be > 0.")
```

Binary image files added under tests/img/ (viewer not shown): tests/img/02_explicit_pu_rank0.png (83.1 KB), tests/img/02_explicit_pu_rank1.png (82.5 KB), tests/img/03_implicit_core.png (23.4 KB), plus three additional images (25.7 KB, 25.1 KB, 27 KB) whose names were not captured.

0 commit comments