Merged
Changes from 7 commits
2 changes: 2 additions & 0 deletions ansible/.gitignore
@@ -96,3 +96,5 @@ roles/*
!roles/nhc/**
!roles/eessi/
!roles/eessi/**
!roles/topology/
!roles/topology/**
14 changes: 14 additions & 0 deletions ansible/roles/topology/README.md
@@ -0,0 +1,14 @@
topology
========

Templates out the /etc/slurm/topology.conf file based on an OpenStack project, for use by
Slurm's [topology/tree plugin](https://slurm.schedmd.com/topology.html). Models the
project as a tree with a hierarchy of:

Project -> Availability Zones -> Hypervisors -> VMs

Role Variables
--------------

- `topology_topology_nodes`: Required list[str]. List of nodes to include in the topology tree. Must be set to include all compute nodes in the Slurm cluster. Default `[]`.
- `topology_topology_override`: Optional str. If set, overrides templating and is used verbatim as the topology.conf content. Undefined by default.
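
As a rough illustration of the tree described above, the following Python sketch renders a topology map of the shape `{az: {host: [nodes]}}` into `topology.conf` lines. It is not part of the role: the function name `render_topology_conf` is hypothetical, and the sample data is the one given in the `map_hosts` module's return documentation.

```python
def render_topology_conf(topology):
    """Render a {az: {host: [nodes]}} map as topology.conf switch lines."""
    lines = []
    for az, hosts in topology.items():
        # One leaf "switch" per hypervisor, listing the nodes it hosts
        for host, nodes in hosts.items():
            lines.append(f"SwitchName={host} Nodes={','.join(nodes)}")
        # One "switch" per availability zone, linking its hypervisors
        lines.append(f"SwitchName={az} Switches={','.join(hosts)}")
    # A single root "switch" linking all availability zones
    lines.append(f"SwitchName=master Switches={','.join(topology)}")
    return "\n".join(lines)


example = {
    "nova-az": {
        "afe9": ["mycluster-compute-0", "mycluster-compute-1"],
        "00f9": ["mycluster-compute-vm-on-other-hypervisor"],
    }
}
print(render_topology_conf(example))
```

Each hypervisor and each availability zone is modelled as a Slurm "switch", so nodes sharing a hypervisor are considered closest to each other.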
5 changes: 5 additions & 0 deletions ansible/roles/topology/defaults/main.yml
@@ -0,0 +1,5 @@
# Nodes to be included in topology tree, must include all Slurm compute nodes
topology_topology_nodes: []

# If set, will override topology.conf file auto-detected from OpenStack project
# topology_topology_override:
98 changes: 98 additions & 0 deletions ansible/roles/topology/library/map_hosts.py
@@ -0,0 +1,98 @@
#!/usr/bin/python

# Copyright: (c) 2025, StackHPC
# Apache 2 License

from ansible.module_utils.basic import AnsibleModule
import openstack

DOCUMENTATION = """
---
module: map_hosts
short_description: Creates map of OpenStack VM network topology
description:
    - Creates map representing the network topology tree of an OpenStack project with a hierarchy
      of Availability Zone -> Hypervisors/Baremetal nodes -> VMs/Baremetal instances
options:
    compute_vms:
        description:
            - List of VM names within the target OpenStack project to include in the tree
        required: true
        type: list
        elements: str
author:
    - Steve Brasier, William Tripp, StackHPC
"""

RETURN = """
topology:
    description:
        Map representing tree of project topology. Top level keys are AZ names, their values
        are maps of shortened unique identifiers of host UUIDs to lists of VM names
    returned: success
    type: dict[str, dict[str,list[str]]]
    sample:
        "nova-az":
            "afe9":
                - "mycluster-compute-0"
                - "mycluster-compute-1"
            "00f9":
                - "mycluster-compute-vm-on-other-hypervisor"
"""

EXAMPLES = """
- name: Get topology map
  map_hosts:
    compute_vms:
      - mycluster-compute-0
      - mycluster-compute-1
"""

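def min_prefix(uuids, start=4):
    """ Take a list of uuids and return the smallest length >= start which keeps them unique """
    if not uuids:
        # Nothing to shorten, e.g. no server reported a host ID
        return start
    for length in range(start, len(uuids[0]) + 1):
        prefixes = set(uuid[:length] for uuid in uuids)
        if len(prefixes) == len(uuids):
            return length
    # No shorter prefix is unique: fall back to the full identifier
    return len(uuids[0])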

def run_module():
    module_args = dict(
        compute_vms=dict(type='list', elements='str', required=True)
    )
    module = AnsibleModule(argument_spec=module_args, supports_check_mode=True)

    conn = openstack.connection.from_config()

    servers = [s for s in conn.compute.servers() if s["name"] in module.params["compute_vms"]]

    topo = {}
    all_host_ids = []
    for s in servers:
        az = s['availability_zone']
        host_id = s['host_id']
        # Skip servers which report no host ID, as they cannot be placed in the tree
        if host_id != '':
            all_host_ids.append(host_id)
            if az not in topo:
                topo[az] = {}
            if host_id not in topo[az]:
                topo[az][host_id] = []
            topo[az][host_id].append(s['name'])

    # Shorten host IDs to the minimum prefix length which keeps them unique
    uuid_len = min_prefix(list(set(all_host_ids)))

    for az in topo:
        topo[az] = dict((k[:uuid_len], v) for (k, v) in topo[az].items())

    result = {
        "changed": False,  # read-only query: nothing on the cloud or host is modified
        "topology": topo,
    }

    module.exit_json(**result)


def main():
    run_module()


if __name__ == "__main__":
    main()
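
The host ID shortening used by the module can be tried in isolation. The sketch below is a standalone variant of the same idea, not the module's own code: the name `shortest_unique_prefix_len` and the sample host IDs are made up for illustration, and it assumes the input may be empty or contain duplicates.

```python
def shortest_unique_prefix_len(ids, start=4):
    """Return the smallest prefix length >= start that keeps all ids distinct."""
    unique = set(ids)
    if not unique:
        # Assumption: with no ids there is nothing to disambiguate
        return start
    longest = max(len(i) for i in unique)
    for length in range(start, longest + 1):
        if len({i[:length] for i in unique}) == len(unique):
            return length
    # No shorter prefix is unique, so the full identifier is needed
    return longest


# Hypothetical, shortened host IDs: the first two share a 4-character prefix
host_ids = ["afe9c1d2", "afe9ffff", "00f9aaaa"]
n = shortest_unique_prefix_len(host_ids)
print(n, sorted(i[:n] for i in host_ids))  # 5 ['00f9a', 'afe9c', 'afe9f']
```

Short prefixes keep the generated switch names readable while still mapping each one to exactly one hypervisor.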
16 changes: 16 additions & 0 deletions ansible/roles/topology/tasks/main.yml
@@ -0,0 +1,16 @@
- name: Map instances to hosts
  become: false
  map_hosts:
    compute_vms: "{{ topology_topology_nodes }}"
  register: _topology
  delegate_to: localhost
  run_once: true

- name: Template topology.conf
  become: true
  ansible.builtin.template:
    src: templates/topology.conf.j2
    dest: /etc/slurm/topology.conf
    owner: root
    group: root
    mode: "0644"
13 changes: 13 additions & 0 deletions ansible/roles/topology/templates/topology.conf.j2
@@ -0,0 +1,13 @@
# topology.conf
# Switch Configuration
{% if topology_topology_override is defined %}
{{ topology_topology_override }}
{% else %}
{% for az in _topology.topology.keys() %}
{% for instance_host in _topology.topology[az].keys() %}
SwitchName={{ instance_host }} Nodes={{ _topology.topology[az][instance_host] | join(",") }}
{% endfor %}
SwitchName={{ az }} Switches={{ _topology.topology[az].keys() | join(",") }}
{% endfor %}
SwitchName=master Switches={{ _topology.topology.keys() | join(",") }}
{% endif %}
5 changes: 5 additions & 0 deletions ansible/slurm.yml
@@ -61,6 +61,11 @@
tags:
- openhpc
tasks:
- include_role:
name: topology
# Gated on topology group having compute nodes but role also
# needs to run on control and login nodes
when: appliances_mode == 'configure' and (groups['topology'] | length) > 0
- include_role:
name: stackhpc.openhpc
tasks_from: "{{ 'runtime.yml' if appliances_mode == 'configure' else 'main.yml' }}"
1 change: 1 addition & 0 deletions environments/common/inventory/group_vars/all/openhpc.yml
@@ -57,6 +57,7 @@ openhpc_config_default:
- enable_configless
TaskPlugin: task/cgroup,task/affinity
ReturnToService: 2 # workaround for templating bug TODO: Remove once on stackhpc.openhpc v1.2.0
TopologyPlugin: topology/tree

# default additional slurm.conf parameters when "rebuild" enabled:
openhpc_config_rebuild:
@@ -0,0 +1 @@
topology_topology_nodes: "{{ groups['topology'] }}"
6 changes: 6 additions & 0 deletions environments/common/inventory/groups
@@ -24,6 +24,12 @@ openhpc
[builder]
# Do not add hosts here manually - used as part of Packer image build pipeline. See packer/README.md.

[topology]
# Compute nodes to be included in the Slurm topology plugin's topology tree
# Should be set to `compute` if enabled
# Note that this feature currently assumes all compute nodes are VMs. Enabling it
# when the cluster contains baremetal compute nodes may lead to unexpected scheduling behaviour

[podman:children]
# Hosts running containers for below services:
opensearch
3 changes: 3 additions & 0 deletions environments/common/layouts/everything
@@ -135,3 +135,6 @@ builder
[nhc:children]
# Hosts to configure for node health checks
compute

[topology:children]
compute
@@ -25,6 +25,8 @@ locals {
}
)
}

baremetal_az = var.availability_zone != null ? var.availability_zone : "nova"
}

resource "openstack_blockstorage_volume_v3" "compute" {
@@ -115,7 +117,7 @@ resource "openstack_compute_instance_v2" "compute_fixed_image" {
fqdn: ${local.fqdns[each.key]}
EOF

availability_zone = var.match_ironic_node ? "${var.availability_zone}::${var.baremetal_nodes[each.key]}" : null
availability_zone = var.match_ironic_node ? "${local.baremetal_az}::${var.baremetal_nodes[each.key]}" : var.availability_zone

lifecycle {
ignore_changes = [
@@ -170,7 +172,7 @@ resource "openstack_compute_instance_v2" "compute" {
fqdn: ${local.fqdns[each.key]}
EOF

availability_zone = var.match_ironic_node ? "${var.availability_zone}::${var.baremetal_nodes[each.key]}" : null
availability_zone = var.match_ironic_node ? "${local.baremetal_az}::${var.baremetal_nodes[each.key]}" : var.availability_zone

}

@@ -150,9 +150,8 @@ variable "match_ironic_node" {

variable "availability_zone" {
type = string
description = "Name of availability zone - ignored unless match_ironic_node is true"
default = "nova"
nullable = false
description = "Name of availability zone. If undefined, defaults to 'nova' if match_ironic_node is true, deferred to OpenStack otherwise"
default = null
}
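
The availability zone resolution in this change can be summarised with a small Python sketch. It only mirrors the Terraform conditional shown in the diff for illustration: the function name `resolve_availability_zone` and its parameters are hypothetical, with `ironic_node` standing in for `var.baremetal_nodes[each.key]`.

```python
def resolve_availability_zone(match_ironic_node, availability_zone=None, ironic_node=None):
    """Mirror the Terraform logic for an instance's availability_zone argument."""
    if match_ironic_node:
        # Targeting a specific Ironic node needs an explicit AZ, so fall back to "nova"
        az = availability_zone if availability_zone is not None else "nova"
        return f"{az}::{ironic_node}"
    # Otherwise pass the value through; None leaves the choice to OpenStack
    return availability_zone


print(resolve_availability_zone(True, None, "node-1"))   # nova::node-1
print(resolve_availability_zone(False, "az1"))            # az1
print(resolve_availability_zone(False))                   # None
```

Before this change the non-baremetal branch always returned null, so a configured availability zone had no effect on VMs.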

variable "baremetal_nodes" {
@@ -79,7 +79,8 @@ variable "login" {
For any networks not specified here the cloud will
select addresses.
match_ironic_node: Set true to launch instances on the Ironic node of the same name as each cluster node
availability_zone: Name of availability zone - ignored unless match_ironic_node is true (default: "nova")
availability_zone: Name of availability zone. If undefined, defaults to 'nova'
if match_ironic_node is true, deferred to OpenStack otherwise
gateway_ip: Address to add default route via
nodename_template: Overrides variable cluster_nodename_template
EOF
@@ -122,7 +123,8 @@ variable "compute" {
For any networks not specified here the cloud will
select addresses.
match_ironic_node: Set true to launch instances on the Ironic node of the same name as each cluster node
availability_zone: Name of availability zone - ignored unless match_ironic_node is true (default: "nova")
availability_zone: Name of availability zone. If undefined, defaults to 'nova'
if match_ironic_node is true, deferred to OpenStack otherwise
gateway_ip: Address to add default route via
nodename_template: Overrides variable cluster_nodename_template
