From 765c0ed01fceefc9b98277c8ef89df78bc7ad0d0 Mon Sep 17 00:00:00 2001 From: Steve Brasier <33413598+sjpb@users.noreply.github.com> Date: Tue, 13 Aug 2024 09:20:00 +0100 Subject: [PATCH 01/11] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index f1d6f461a..c27db366d 100644 --- a/README.md +++ b/README.md @@ -34,6 +34,8 @@ These instructions assume the deployment host is running Rocky Linux 8: cd ansible-slurm-appliance ./dev/setup-env.sh +You will also need to install [OpenTofu](https://opentofu.org/docs/intro/install/rpm/). + ## Overview of directory structure - `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates. From d585e91a5532cadb58a4309f3016d4796f11d5d2 Mon Sep 17 00:00:00 2001 From: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com> Date: Tue, 13 Aug 2024 16:24:07 +0100 Subject: [PATCH 02/11] OSes supported as deploy hosts --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index c27db366d..c7e1ebe9d 100644 --- a/README.md +++ b/README.md @@ -27,6 +27,12 @@ It is recommended to check the following before starting: ## Installation on deployment host +Current Operating Systems supported to be deploy hosts: + +- Rocky Linux 9 +- Rocky Linux 8 +- Ubuntu 22.04 + These instructions assume the deployment host is running Rocky Linux 8: sudo yum install -y git python38 From 39f99a60091bba59c5844d273129470c50fc54b4 Mon Sep 17 00:00:00 2001 From: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com> Date: Tue, 13 Aug 2024 16:44:57 +0100 Subject: [PATCH 03/11] undo readme OSes supported --- README.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/README.md b/README.md index c7e1ebe9d..c27db366d 100644 --- a/README.md +++ b/README.md @@ -27,12 +27,6 @@ It is recommended to check the following before starting: ## Installation on deployment host -Current Operating Systems supported to be deploy hosts: - -- Rocky Linux 9 -- Rocky Linux 8 -- Ubuntu 22.04 - These instructions assume the deployment host is running Rocky Linux 8: sudo yum install -y git python38 From 0d2367ede18dbf9242146c76eca5e7c1e44f6761 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Fri, 4 Oct 2024 14:59:04 +0000 Subject: [PATCH 04/11] add operations docs --- docs/operations.md | 140 +++++++++++++++++++++++++++++++++++++++++++++ docs/site.md | 6 ++ 2 files changed, 146 insertions(+) create mode 100644 docs/operations.md create mode 100644 docs/site.md diff --git a/docs/operations.md b/docs/operations.md new file mode 100644 index 000000000..a2feedd1f --- /dev/null +++ b/docs/operations.md @@ -0,0 +1,140 @@ +# Operations + +This page describes the commands required for common operations. + +All subsequent sections assume that: +- Commands are run from the repository root, unless otherwise indicated by a `cd` command. +- An Ansible vault secret is configured. +- The correct private key is available to Ansible. +- Appropriate OpenStack credentials are available. +- Any non-appliance controlled infrastructure is avaialble (e.g. networks, volumes, etc.). +- `$ENV` is your current, activated environment, as defined by e.g. `environments/production/`. +- `$SITE_ENV` is the base site-specific environment, as defined by e.g. `environments/mysite/`. 
+- A string `some/path/to/file.yml:myvar` defines a path relative to the repository root and an Ansible variable in that file.
+- Configuration is generally common to all environments at a site, i.e. is made in `environments/$SITE_ENV` not `environments/$ENV`.
+
+Review any [site-specific documentation](docs/site.md) for more details on the above.
+
+# Deploying a Cluster
+
+This follows the same process as defined in the main [README.md](../README.md) for the default configuration.
+
+Note that tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the tasks in `site.yml`.
+
+# SSH to Cluster Nodes
+This depends on how the cluster is accessed.
+
+The script `dev/ansible-ssh` may generally be used to connect to a host specified by an `inventory_hostname`, using the same connection details as Ansible. If this does not work:
+- Instance IPs are normally defined in `ansible_host` variables in an inventory file `environments/$ENV/inventory/hosts{,.yml}`.
+- The SSH user is defined by `ansible_user`; the default is `rocky`. This may be overridden in your environment.
+- If a jump host is required, the user and address may be defined in the above inventory file.
+
+# Modifying general Slurm.conf parameters
+Parameters for [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) can be added to an `openhpc_config_extra` mapping in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`. Note that values in this mapping may be:
+- A string, which will be inserted as-is.
+- A list, which will be converted to a comma-separated string.
+
+This allows specifying `slurm.conf` contents in a YAML-format, Ansible-native way.
+
+**NB:** The appliance provides some default values in `environments/common/inventory/group_vars/all/openhpc.yml:openhpc_config_default` which is combined with the above. The `enable_configless` flag which this sets in the `SlurmCtldParameters` key must not be overridden - a validation step checks this has not happened.
+
+See [Reconfiguring Slurm](#Reconfiguring-Slurm) to apply changes.
+
+# Modifying Slurm Partition-specific Configuration
+
+Modify the `openhpc_slurm_partitions` mapping, usually in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`, as described for [stackhpc.openhpc:slurmconf](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) (note the relevant version of this role is defined in `requirements.yml`).
+
+Note that an Ansible inventory group for the partition is required. This is generally auto-defined by a template in the OpenTofu configuration.
+
+**NB:** `default:NO` must be set on all non-default partitions, otherwise the last defined partition will always be set as the default.
+
+See [Reconfiguring Slurm](#Reconfiguring-Slurm) to apply changes.
+
+# Adding an Additional Partition
+This is usually a two-step process:
+
+- If new nodes are required, define a new node group by adding an entry to the `compute` mapping in `environments/$ENV/tofu/main.tf`, assuming the default OpenTofu configuration (see the example after this list):
+  - The key is the partition name.
+  - The value should be a mapping, with the parameters defined in `environments/$SITE_ENV/terraform/compute/variables.tf`, but in brief it will need at least `flavor` (a flavor name) and `nodes` (a list of node name suffixes).
+- Add a new partition to the partition configuration as described under [Modifying Slurm Partition-specific Configuration](#Modifying-Slurm-Partition-specific-Configuration).
+
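+For example, following the format of the `compute` mapping shown in the main [README.md](../README.md), a sketch of a new node group for a hypothetical `gpu` partition might look like the following (the group key, node names and flavor name are all illustrative and must be adapted to your site):
+
+    gpu = {
+        # hypothetical example only - two nodes using a site-specific flavor
+        nodes: ["gpu-0", "gpu-1"]
+        flavor: "gpu_flavor_name"
+    }
+
+The full set of supported parameters may vary between appliance versions, so check `environments/$SITE_ENV/terraform/compute/variables.tf` as noted above.
+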
+Deploying the additional nodes and applying these changes requires rerunning both Terraform and the Ansible `site.yml` playbook - follow [Deploying a Cluster](#Deploying-a-Cluster).
+
+# Adding Additional Packages
+Packages from any enabled DNF repositories (which always includes EPEL, PowerTools and OpenHPC) can be added to all nodes by defining a list `openhpc_packages_extra` (defaulted to the empty list in the common environment) in e.g. `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`. For example:
+
+    # environments/foo-base/inventory/group_vars/all/openhpc.yml:
+    openhpc_packages_extra:
+    - somepackage
+    - anotherpackage
+
+The packages available from the OpenHPC repos are described in Appendix E of the OpenHPC installation guide (linked from the [OpenHPC releases page](https://github.com/openhpc/ohpc/releases/)). Note that "user-facing" OpenHPC packages such as compilers, MPI libraries etc. include corresponding `lmod` modules.
+
+To add these packages to the current cluster, run the same command as for [Reconfiguring Slurm](#Reconfiguring-Slurm). TODO: describe what's required to add these to site-specific images.
+
+If additional repositories are required, these could be added/enabled as necessary in a play added to `environments/$SITE_ENV/hooks/{pre,post}.yml` as appropriate. Note such a play should NOT exclude the builder group, so that the repositories are also added to built images. There are various Ansible modules which might be useful for this:
+  - `ansible.builtin.yum_repository`: Add a repo from a URL providing a 'repodata' directory.
+  - `ansible.builtin.rpm_key`: Add a GPG key to the RPM database.
+  - `ansible.builtin.get_url`: Can be used to install a repofile directly from a URL (e.g. https://turbovnc.org/pmwiki/uploads/Downloads/TurboVNC.repo).
+  - `ansible.builtin.dnf`: Can be used to install 'release packages' providing repos, e.g. `epel-release`, `ohpc-release`.
+
+The packages to be installed from that repo could also be defined in that play. Note that using the `dnf` module with a list for its `name` parameter is more efficient and allows better dependency resolution than calling the module in a loop.
+
+Adding these repos/packages to the cluster/image would then require running:
+
+    ansible-playbook environments/$SITE_ENV/hooks/{pre,post}.yml
+
+as appropriate.
+
+TODO: improve description about adding these to extra images.
+
+# Reconfiguring Slurm
+
+At a minimum run:
+
+    ansible-playbook ansible/slurm.yml --tags openhpc
+
+**NB:** This will restart all daemons if the `slurm.conf` has any changes, even if technically only a `scontrol reconfigure` is required.
+
+# Running the MPI Test Suite
+
+See [ansible/roles/hpctests/README.md](ansible/roles/hpctests/README.md) for a description of these. They can be run using:
+
+    ansible-playbook ansible/adhoc/hpctests.yml
+
+Note that:
+- The above role provides variables to select specific partitions, nodes and interfaces which may be required. If not set in inventory, these can be passed as extravars:
+
+      ansible-playbook ansible/adhoc/hpctests.yml -e hpctests_myvar=foo
+- The HPL-based test is only reasonably optimised on Intel processors due to the libraries and default parallelisation scheme used. For AMD processors it is recommended this
+is skipped using:
+
+    ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo.
+
+Review any [site-specific documentation](docs/site.md) for more details.
+
+ +# Running CUDA Tests +This uses the [cuda-samples](https://github.com/NVIDIA/cuda-samples/) utilities "deviceQuery" and "bandwidthTest" to test GPU functionality. It automatically runs on any +host in the `cuda` inventory group: + + ansible-playbook ansible/adhoc/cudatests.yml + +**NB:** This test is not launched through Slurm, so confirm nodes are free/out of service or use `--limit` appropriately. + +# Ad-hoc Commands + +5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using: + + ansible-playbook ansible/adhoc/$PLAYBOOK + + Currently they include the following (see each playbook for links to documentation): + - `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance. + - `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster). + - `restart-slurm.yml`: Restart all Slurm daemons in the correct order. + - `update-packages.yml`: Update specified packages on cluster nodes (NB: not recommended for routine use). diff --git a/docs/site.md b/docs/site.md new file mode 100644 index 000000000..ee147875c --- /dev/null +++ b/docs/site.md @@ -0,0 +1,6 @@ +# Site-specific Documentation + +This document is a placeholder for any site-specific documentation, e.g. environment descriptions. + +#TODO: list things which should commonly be specified here. + From 9f97e0245053d10d4d2ea6d6b78aab7cc1fa5646 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Tue, 8 Oct 2024 14:36:44 +0000 Subject: [PATCH 05/11] simplify main README.md to only cover default configuration --- README.md | 174 +++++++++++++++++++----------------------------------- 1 file changed, 61 insertions(+), 113 deletions(-) diff --git a/README.md b/README.md index b81d35862..bda8aa6da 100644 --- a/README.md +++ b/README.md @@ -2,32 +2,42 @@ # StackHPC Slurm Appliance -This repository contains playbooks and configuration to define a Slurm-based HPC environment including: -- A Rocky Linux 9 and OpenHPC v3-based Slurm cluster. -- Shared fileystem(s) using NFS (with servers within or external to the cluster). -- Slurm accounting using a MySQL backend. -- A monitoring backend using Prometheus and ElasticSearch. -- Grafana with dashboards for both individual nodes and Slurm jobs. -- Production-ready Slurm defaults for access and memory. -- A Packer-based build pipeline for compute and login node images. - -The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us! - -While it is tested on OpenStack it should work on any cloud, except for node rebuild/reimaging features which are currently OpenStack-specific. - -## Prerequisites -It is recommended to check the following before starting: -- You have root access on the "ansible deploy host" which will be used to deploy the appliance. +This repository contains playbooks and configuration to define a Slurm-based HPC environment. This includes: +- [Rocky Linux](https://rockylinux.org/)-based hosts. +- [OpenTofu](https://opentofu.org/) configurations to define the cluster's infrastructure-as-code. +- Packages for Slurm and MPI software stacks from [OpenHPC](https://openhpc.community/). 
+- Shared fileystem(s) using NFS (with in-cluster or external servers) or [CephFS](https://docs.ceph.com/en/latest/cephfs/) via [Openstack Manila](https://wiki.openstack.org/wiki/Manila). +- Slurm accounting using a MySQL database. +- Monitoring integrated with Slurm jobs using Prometheus, ElasticSearch and Grafana. +- A web-based portal from [OpenOndemand](https://openondemand.org/). +- Production-ready default Slurm configurations for access and memory. +- [Packer](https://developer.hashicorp.com/packer)-based image build configurations for node images. + +The repository is expected to be forked for a specific HPC site but can contain multiple environments for e.g. development, staging and production clusters based off a single configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us! + +While it is tested on OpenStack it would work on any cloud with appropriate OpenTofu configuration files. + +## Demonstration Deployment + +The default configuration in this repository may be used to create a cluster to explore use of the appliance. It provides: +- Persistent state backed by an OpenStack volume. +- NFS-based shared file system backed by another OpenStack volume. + +Note that the OpenOndemand portal and its remote apps are not usable with this default configuration. + +It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud. + +Before starting ensure that: +- You have root access on the deploy host. - You can create instances using a Rocky 9 GenericCloud image (or an image based on that). - **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI. However the appliance will install the necessary packages if a GenericCloud image is used. -- SSH keys get correctly injected into instances. -- Instances have access to internet (note proxies can be setup through the appliance if necessary). -- DNS works (if not this can be partially worked around but additional configuration will be required). +- You have a SSH keypair defined in OpenStack, with the private key available on the deploy host. +- Created instances have access to internet (note proxies can be setup through the appliance if necessary). - Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance). -## Installation on deployment host +### Setup deploy host -Current Operating Systems supported to be deploy hosts: +The following operating systems are supported for the deploy host: - Rocky Linux 9 - Rocky Linux 8 @@ -42,28 +52,9 @@ These instructions assume the deployment host is running Rocky Linux 8: You will also need to install [OpenTofu](https://opentofu.org/docs/intro/install/rpm/). -## Overview of directory structure - -- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates. -- `ansible/`: Contains the ansible playbooks to configure the infrastruture. 
-- `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information. -- `dev/`: Contains development tools. +### Create a new environment -## Environments - -### Overview - -An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing: -- Any deployment automation required - e.g. Terraform configuration or HEAT templates. -- An ansible `inventory/` directory. -- An `activate` script which sets environment variables to point to this configuration. -- Optionally, additional playbooks in `/hooks` to run before or after the main tasks. - -All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required. - -### Creating a new environment - -This repo contains a `cookiecutter` template which can be used to create a new environment from scratch. Run the [installation on deployment host](#Installation-on-deployment-host) instructions above, then in the repo root run: +Use the `cookiecutter` template to create a new environment to hold your configuration. In the repository root run: . venv/bin/activate cd environments @@ -71,86 +62,43 @@ This repo contains a `cookiecutter` template which can be used to create a new e and follow the prompts to complete the environment name and description. -Alternatively, you could copy an existing environment directory. - -Now add deployment automation if required, and then complete the environment-specific inventory as described below. - -### Environment-specific inventory structure - -The ansible inventory for the environment is in `environments//inventory/`. It should generally contain: -- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc. -- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/commmon/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality. -- Optionally, group variable files in `group_vars//overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/.yml` (the use of `all` here is due to ansible's precedence rules). - -Although most of the inventory uses the group convention described above there are a few special cases: -- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file. 
-- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient. -- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using a `environments//inventory/group_vars/all/overrides.yml` file. -- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments//inventory/group_vars/all/secrets.yml`. -- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path. -- Each Slurm partition must have: - - An inventory group `_` defining the hosts it contains - these must be homogenous w.r.t CPU and memory. - - An entry in the `openhpc_slurm_partitions` mapping in `environments//inventory/group_vars/openhpc/overrides.yml`. - See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options. -- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file). - -## Creating a Slurm appliance - -NB: This section describes generic instructions - check for any environment-specific instructions in `environments//README.md` before starting. - -1. Activate the environment - this **must be done** before any other commands are run: - - source environments//activate - -2. Deploy instances - see environment-specific instructions. - -3. Generate passwords: - - ansible-playbook ansible/adhoc/generate-passwords.yml - - This will output a set of passwords in `environments//inventory/group_vars/all/secrets.yml`. It is recommended that these are encrpyted and then commited to git using: - - ansible-vault encrypt inventory/group_vars/all/secrets.yml - - See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details. - -4. Deploy the appliance: - - ansible-playbook ansible/site.yml - - or if you have encrypted secrets use: +**NB:** In subsequent sections this new environment is refered to as `$ENV`. - ansible-playbook ansible/site.yml --ask-vault-password +Now generate secrets for this environment: - Tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the `site` tasks. + ansible-playbook ansible/adhoc/generate-passwords.yml -5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using: +### Define infrastructure configuration - ansible-playbook ansible/adhoc/ +Create an OpenTofu variables file to define the required infrastructure, e.g.: - Currently they include the following (see each playbook for links to documentation): - - `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance. - - `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster - see `ansible/roles/rebuild` for that). - - `restart-slurm.yml`: Restart all Slurm daemons in the correct order. - - `update-packages.yml`: Update specified packages on cluster nodes. 
+ # environments/$ENV/terraform/terrraform.tfvars: -## Adding new functionality -Please contact us for specific advice, but in outline this generally involves: -- Adding a role. -- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`. -- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`. -- Adding new default group vars into `environments/common/inventory/group_vars/all//`. -- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`. -- Updating READMEs. + cluster_name = "mycluster" + cluster_net = "some_network" # * + cluster_subnet = "some_subnet" # * + key_pair = "my_key" # * + control_node_flavor = "some_flavor_name" + login_nodes = { + login-0: "login_flavor_name" + } + cluster_image_id = "rocky_linux_9_image_uuid" + compute = { + general = { + nodes: ["compute-0", "compute-1"] + flavor: "compute_flavor_name" + } + } -## Monitoring and logging +Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables +and descriptions see `environments/$ENV/terraform/terrraform.tfvars`. -Please see the [monitoring-and-logging.README.md](docs/monitoring-and-logging.README.md) for details. +### Deploy appliance -## CI/CD automation + ansible-playbook ansible/site.yml -The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintainence tasks. These include: +You can now log in to the cluster using: -- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published. + ssh rocky@$login_ip -- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud. \ No newline at end of file +where the IP of the login node is given in `environments/$ENV/inventory/hosts.yml` From 47092999220db09b6cc501b9889eff5e26811a43 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Tue, 8 Oct 2024 14:51:53 +0000 Subject: [PATCH 06/11] move more-specific documentation into their own files --- README.md | 10 ++++++++++ docs/adding-functionality.md | 9 +++++++++ docs/ci.md | 8 ++++++++ docs/environments.md | 30 ++++++++++++++++++++++++++++++ docs/production.md | 9 +++++++++ 5 files changed, 66 insertions(+) create mode 100644 docs/adding-functionality.md create mode 100644 docs/ci.md create mode 100644 docs/environments.md create mode 100644 docs/production.md diff --git a/README.md b/README.md index bda8aa6da..66373843b 100644 --- a/README.md +++ b/README.md @@ -102,3 +102,13 @@ You can now log in to the cluster using: ssh rocky@$login_ip where the IP of the login node is given in `environments/$ENV/inventory/hosts.yml` + + +## Overview of directory structure + +- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates. 
+- `ansible/`: Contains the ansible playbooks to configure the infrastruture. +- `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information. +- `dev/`: Contains development tools. + +For further information see the [docs/](docs/) directory. diff --git a/docs/adding-functionality.md b/docs/adding-functionality.md new file mode 100644 index 000000000..4f2546135 --- /dev/null +++ b/docs/adding-functionality.md @@ -0,0 +1,9 @@ +# Adding new functionality + +Please contact us for specific advice, but in outline this generally involves: +- Adding a role. +- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`. +- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`. +- Adding new default group vars into `environments/common/inventory/group_vars/all//`. +- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`. +- Updating READMEs. diff --git a/docs/ci.md b/docs/ci.md new file mode 100644 index 000000000..c6fa8900d --- /dev/null +++ b/docs/ci.md @@ -0,0 +1,8 @@ +# CI/CD automation + +The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintainence tasks. These include: + +- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published. + +- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud. + diff --git a/docs/environments.md b/docs/environments.md new file mode 100644 index 000000000..686ad7d1a --- /dev/null +++ b/docs/environments.md @@ -0,0 +1,30 @@ +# Environments + +## Overview + +An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing: +- Any deployment automation required - e.g. OpenTofu configuration or HEAT templates. +- An Ansible `inventory/` directory. +- An `activate` script which sets environment variables to point to this configuration. +- Optionally, additional playbooks in `hooks/` to run in addition to the default playbooks. + +All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required. + +### Environment-specific inventory structure + +The ansible inventory for the environment is in `environments//inventory/`. It should generally contain: +- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc. +- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a a group of the same name, e.g. 
`openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/commmon/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality. +- Optionally, group variable files in `group_vars//overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/.yml` (the use of `all` here is due to ansible's precedence rules). + +Although most of the inventory uses the group convention described above there are a few special cases: +- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file. +- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient. +- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using a `environments//inventory/group_vars/all/overrides.yml` file. +- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments//inventory/group_vars/all/secrets.yml`. +- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path. +- Each Slurm partition must have: + - An inventory group `_` defining the hosts it contains - these must be homogenous w.r.t CPU and memory. + - An entry in the `openhpc_slurm_partitions` mapping in `environments//inventory/group_vars/openhpc/overrides.yml`. + See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options. +- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file). diff --git a/docs/production.md b/docs/production.md new file mode 100644 index 000000000..dda65310c --- /dev/null +++ b/docs/production.md @@ -0,0 +1,9 @@ +# Production Deployments + +This page contains some brief notes about differences between the default/demo configuration, as described in the main [README.md](../README.md) and production-ready deployments. + +- Create a site environment. Usually at least production, staging and possibly development environments are required. To avoid divergence of configuration these should all have an `inventory` path referencing a shared, site-specific base environment. Where possible hooks should also be placed in this site-specific environment. +- Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`. To ensure staging environments are a good model for production this should generally be moved into the site-specific environment. 
It can be be encrypted using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) and then committed to the repository. +- Ensure created instances have accurate/synchronised time. For VM instances this is usually provided by the hypervisor, but if not or for bare metal instances it may be necessary to configure or proxy `chronyd` via an environment hook. +- Remove production volumes from OpenTofu control. In the default OpenTofu configuration, deleting the resources also deletes the volumes used for persistent state and home directories. This is usually undesirable for production, so these resources should be removed from the OpenTofu configurations and manually deployed once. However note that for development environments leaving them under OpenTofu control is usually best. +- Configure Open OpenOndemand - see [specific documentation](openondemand.README.md). From eda090c71403c94583e8a2db45e472546e6170dc Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Wed, 9 Oct 2024 07:56:18 +0000 Subject: [PATCH 07/11] provide site docs directory --- docs/operations.md | 4 ++-- docs/{site.md => site/README.md} | 0 2 files changed, 2 insertions(+), 2 deletions(-) rename docs/{site.md => site/README.md} (100%) diff --git a/docs/operations.md b/docs/operations.md index a2feedd1f..2eaf26dfc 100644 --- a/docs/operations.md +++ b/docs/operations.md @@ -13,7 +13,7 @@ All subsequent sections assume that: - A string `some/path/to/file.yml:myvar` defines a path relative to the repository root and an Ansible variable in that file. - Configuration is generally common to all environments at a site, i.e. is made in `environments/$SITE_ENV` not `environments/$ENV`. -Review any [site-specific documentation](docs/site.md) for more details on the above. +Review any [site-specific documentation](site/README.md) for more details on the above. # Deploying a Cluster @@ -117,7 +117,7 @@ is skipped using: ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo. -Review any [site-specific documentation](docs/site.md) for more details. +Review any [site-specific documentation](site/README.md) for more details. # Running CUDA Tests This uses the [cuda-samples](https://github.com/NVIDIA/cuda-samples/) utilities "deviceQuery" and "bandwidthTest" to test GPU functionality. It automatically runs on any diff --git a/docs/site.md b/docs/site/README.md similarity index 100% rename from docs/site.md rename to docs/site/README.md From b3dddceaef7b849822deb6cc4ef02f8df170a878 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Wed, 9 Oct 2024 12:23:27 +0000 Subject: [PATCH 08/11] address docs review comments --- README.md | 12 +++++++----- docs/adding-functionality.md | 2 +- docs/environments.md | 2 +- docs/operations.md | 1 + 4 files changed, 10 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 66373843b..52db635b8 100644 --- a/README.md +++ b/README.md @@ -10,12 +10,14 @@ This repository contains playbooks and configuration to define a Slurm-based HPC - Slurm accounting using a MySQL database. - Monitoring integrated with Slurm jobs using Prometheus, ElasticSearch and Grafana. - A web-based portal from [OpenOndemand](https://openondemand.org/). -- Production-ready default Slurm configurations for access and memory. +- Production-ready default Slurm configurations for access and memory limits. - [Packer](https://developer.hashicorp.com/packer)-based image build configurations for node images. 
-The repository is expected to be forked for a specific HPC site but can contain multiple environments for e.g. development, staging and production clusters based off a single configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us! +The repository is expected to be forked for a specific HPC site but can contain multiple environments for e.g. development, staging and production clusters +sharing a common configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs +back upstream to us! -While it is tested on OpenStack it would work on any cloud with appropriate OpenTofu configuration files. +While it is tested on OpenStack it should work on any cloud with appropriate OpenTofu configuration files. ## Demonstration Deployment @@ -72,7 +74,7 @@ Now generate secrets for this environment: Create an OpenTofu variables file to define the required infrastructure, e.g.: - # environments/$ENV/terraform/terrraform.tfvars: + # environments/$ENV/terraform/terraform.tfvars: cluster_name = "mycluster" cluster_net = "some_network" # * @@ -106,7 +108,7 @@ where the IP of the login node is given in `environments/$ENV/inventory/hosts.ym ## Overview of directory structure -- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates. +- `environments/`: See [docs/environments.md](docs/environments.md). - `ansible/`: Contains the ansible playbooks to configure the infrastruture. - `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information. - `dev/`: Contains development tools. diff --git a/docs/adding-functionality.md b/docs/adding-functionality.md index 4f2546135..69d3b3a3f 100644 --- a/docs/adding-functionality.md +++ b/docs/adding-functionality.md @@ -1,6 +1,6 @@ # Adding new functionality -Please contact us for specific advice, but in outline this generally involves: +Please contact us for specific advice, but this generally involves: - Adding a role. - Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`. - Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`. diff --git a/docs/environments.md b/docs/environments.md index 686ad7d1a..92bf9464a 100644 --- a/docs/environments.md +++ b/docs/environments.md @@ -19,7 +19,7 @@ The ansible inventory for the environment is in `environments//inve Although most of the inventory uses the group convention described above there are a few special cases: - The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file. -- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient. +- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient. 
- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using a `environments//inventory/group_vars/all/overrides.yml` file. - The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments//inventory/group_vars/all/secrets.yml`. - The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path. diff --git a/docs/operations.md b/docs/operations.md index 2eaf26dfc..6b75f92ec 100644 --- a/docs/operations.md +++ b/docs/operations.md @@ -22,6 +22,7 @@ This follows the same process as defined in the main [README.md](../README.md) f Note that tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the tasks in `site.yml`. # SSH to Cluster Nodes + This depends on how the cluster is accessed. The script `dev/ansible-ssh` may generally be used to connect to a host specified by a `inventory_hostname` using the same connection details as Ansible. If this does not work: From 0b1eabcdb1b3f2eef11ff9128e37d6969dfb96df Mon Sep 17 00:00:00 2001 From: Steve Brasier <33413598+sjpb@users.noreply.github.com> Date: Thu, 10 Oct 2024 14:54:34 +0100 Subject: [PATCH 09/11] Fix a / in docs Co-authored-by: Scott Davidson <49713135+sd109@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 52db635b8..d292b138d 100644 --- a/README.md +++ b/README.md @@ -113,4 +113,4 @@ where the IP of the login node is given in `environments/$ENV/inventory/hosts.ym - `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information. - `dev/`: Contains development tools. -For further information see the [docs/](docs/) directory. +For further information see the [docs](docs/) directory. From fd0b51a4abb953dcf79798c32275b9e407c63e72 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Tue, 15 Oct 2024 09:36:17 +0000 Subject: [PATCH 10/11] address PR comments on docs --- README.md | 5 ++--- docs/environments.md | 4 ++-- ...ng.README.md => monitoring-and-logging.md} | 0 docs/operations.md | 21 ++++++++++++------- docs/production.md | 2 +- 5 files changed, 19 insertions(+), 13 deletions(-) rename docs/{monitoring-and-logging.README.md => monitoring-and-logging.md} (100%) diff --git a/README.md b/README.md index d292b138d..b54cd110a 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,6 @@ The following operating systems are supported for the deploy host: - Rocky Linux 9 - Rocky Linux 8 -- Ubuntu 22.04 These instructions assume the deployment host is running Rocky Linux 8: @@ -93,7 +92,7 @@ Create an OpenTofu variables file to define the required infrastructure, e.g.: } Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables -and descriptions see `environments/$ENV/terraform/terrraform.tfvars`. +and descriptions see `environments/$ENV/terraform/terraform.tfvars`. ### Deploy appliance @@ -110,7 +109,7 @@ where the IP of the login node is given in `environments/$ENV/inventory/hosts.ym - `environments/`: See [docs/environments.md](docs/environments.md). 
- `ansible/`: Contains the ansible playbooks to configure the infrastruture. -- `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information. +- `packer/`: Contains automation to use Packer to build machine images for an enviromment - see the README in this directory for further information. - `dev/`: Contains development tools. For further information see the [docs](docs/) directory. diff --git a/docs/environments.md b/docs/environments.md index 92bf9464a..d1c492312 100644 --- a/docs/environments.md +++ b/docs/environments.md @@ -6,7 +6,7 @@ An environment defines the configuration for a single instantiation of this Slur - Any deployment automation required - e.g. OpenTofu configuration or HEAT templates. - An Ansible `inventory/` directory. - An `activate` script which sets environment variables to point to this configuration. -- Optionally, additional playbooks in `hooks/` to run in addition to the default playbooks. +- Optionally, additional playbooks in `hooks/` to run before or after to the default playbooks. All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required. @@ -14,7 +14,7 @@ All environments load the inventory from the `common` environment first, with th The ansible inventory for the environment is in `environments//inventory/`. It should generally contain: - A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc. -- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/commmon/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality. +- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/commmon/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality. - Optionally, group variable files in `group_vars//overrides.yml`, where the group names match the functional groups described above. 
These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/.yml` (the use of `all` here is due to ansible's precedence rules). Although most of the inventory uses the group convention described above there are a few special cases: diff --git a/docs/monitoring-and-logging.README.md b/docs/monitoring-and-logging.md similarity index 100% rename from docs/monitoring-and-logging.README.md rename to docs/monitoring-and-logging.md diff --git a/docs/operations.md b/docs/operations.md index 6b75f92ec..c1672e7d2 100644 --- a/docs/operations.md +++ b/docs/operations.md @@ -128,14 +128,21 @@ host in the `cuda` inventory group: **NB:** This test is not launched through Slurm, so confirm nodes are free/out of service or use `--limit` appropriately. -# Ad-hoc Commands +# Ad-hoc Commands and Playbooks -5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using: +"Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using: ansible-playbook ansible/adhoc/$PLAYBOOK - Currently they include the following (see each playbook for links to documentation): - - `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance. - - `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster). - - `restart-slurm.yml`: Restart all Slurm daemons in the correct order. - - `update-packages.yml`: Update specified packages on cluster nodes (NB: not recommended for routine use). +Currently they include the following (see each playbook for links to documentation): + +- `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance. +- `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster). +- `restart-slurm.yml`: Restart all Slurm daemons in the correct order. +- `update-packages.yml`: Update specified packages on cluster nodes (NB: not recommended for routine use). + +The `ansible` binary [can be used](https://docs.ansible.com/ansible/latest/command_guide/intro_adhoc.html) to run arbitrary shell commands against inventory groups or hosts, for example: + + ansible [--become] -m shell -a "" + +This can be useful for debugging and development but any modifications made this way will be lost if nodes are rebuilt/reimaged. diff --git a/docs/production.md b/docs/production.md index dda65310c..7219ee7fc 100644 --- a/docs/production.md +++ b/docs/production.md @@ -4,6 +4,6 @@ This page contains some brief notes about differences between the default/demo c - Create a site environment. Usually at least production, staging and possibly development environments are required. To avoid divergence of configuration these should all have an `inventory` path referencing a shared, site-specific base environment. Where possible hooks should also be placed in this site-specific environment. - Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`. To ensure staging environments are a good model for production this should generally be moved into the site-specific environment. 
It can be be encrypted using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) and then committed to the repository. -- Ensure created instances have accurate/synchronised time. For VM instances this is usually provided by the hypervisor, but if not or for bare metal instances it may be necessary to configure or proxy `chronyd` via an environment hook. +- Ensure created instances have accurate/synchronised time. For VM instances this is usually provided by the hypervisor, but if not (or for bare metal instances) it may be necessary to configure or proxy `chronyd` via an environment hook. - Remove production volumes from OpenTofu control. In the default OpenTofu configuration, deleting the resources also deletes the volumes used for persistent state and home directories. This is usually undesirable for production, so these resources should be removed from the OpenTofu configurations and manually deployed once. However note that for development environments leaving them under OpenTofu control is usually best. - Configure Open OpenOndemand - see [specific documentation](openondemand.README.md). From 91db27a2213dd73c6ef69370c74c60847db90368 Mon Sep 17 00:00:00 2001 From: Steve Brasier Date: Tue, 15 Oct 2024 09:38:07 +0000 Subject: [PATCH 11/11] address PR comments on docs --- docs/operations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operations.md b/docs/operations.md index c1672e7d2..a20d7f10c 100644 --- a/docs/operations.md +++ b/docs/operations.md @@ -130,7 +130,7 @@ host in the `cuda` inventory group: # Ad-hoc Commands and Playbooks -"Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using: +A set of utility playbooks for managing a running appliance are provided in `ansible/adhoc` - run these by activating the environment and using: ansible-playbook ansible/adhoc/$PLAYBOOK