Commit 4709299

move more-specific documentation into their own files

1 parent 9f97e02 commit 4709299

File tree

5 files changed: +66 −0 lines changed


README.md

Lines changed: 10 additions & 0 deletions

@@ -102,3 +102,13 @@ You can now log in to the cluster using:

    ssh rocky@$login_ip

where the IP of the login node is given in `environments/$ENV/inventory/hosts.yml`

## Overview of directory structure

- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates.
- `ansible/`: Contains the ansible playbooks to configure the infrastructure.
- `packer/`: Contains automation to use Packer to build compute node images for an environment - see the README in this directory for further information.
- `dev/`: Contains development tools.

For further information see the [docs/](docs/) directory.

docs/adding-functionality.md

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@

# Adding new functionality

Please contact us for specific advice, but in outline this generally involves:

- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml` - see the sketch below.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.
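For concreteness, a new playbook in `ansible/` calling the new role might look like the following minimal sketch. The role/group name `mymonitoring` is hypothetical, and the appliance's actual playbook conventions may differ:

```yaml
# ansible/mymonitoring.yml - hypothetical new playbook; also import it from site.yml
- name: Set up my monitoring service
  hosts: mymonitoring   # empty in the common environment, so a no-op by default
  become: yes
  tasks:
    - import_role:
        name: mymonitoring
```

Because the group is empty in the common environment, the play does nothing until an environment's `groups` file populates it.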

docs/ci.md

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@

# CI/CD automation

The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintenance tasks. These include:

- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published.
- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud.
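For orientation, a scheduled check of this kind might follow the rough shape below. This is a guess at the structure, not the actual sample workflow; the cron schedule and step names are placeholders:

```yaml
# Illustrative skeleton only - see the real .yml.sample files in .github/workflows/
name: upgrade-check
on:
  schedule:
    - cron: '0 9 * * 1'   # placeholder: check weekly
  workflow_dispatch: {}
jobs:
  check-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Query latest upstream release tag
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh release view --repo stackhpc/ansible-slurm-appliance \
            --json tagName --jq .tagName
```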

docs/environments.md

Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@

# Environments

## Overview

An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing:

- Any deployment automation required - e.g. OpenTofu configuration or HEAT templates.
- An Ansible `inventory/` directory.
- An `activate` script which sets environment variables to point to this configuration.
- Optionally, additional playbooks in `hooks/` to run in addition to the default playbooks.

All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required.

### Environment-specific inventory structure

The ansible inventory for the environment is in `environments/<environment>/inventory/`. It should generally contain:

- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation, so it is also a convenient place to define variables which depend on the deployed hosts, such as connection variables, IP addresses, ssh proxy arguments etc.
- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/common/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality.
- Optionally, group variable files in `group_vars/<group_name>/overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which is defined in `environments/common/inventory/group_vars/all/<group_name>.yml` (the use of `all` here is due to ansible's precedence rules) - see the sketch below.
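For example, if the `grafana` group is enabled, defaults from `environments/common/inventory/group_vars/all/grafana.yml` could be overridden as below. The variable name here is purely illustrative, not a real variable of the appliance:

```yaml
# environments/<environment>/inventory/group_vars/grafana/overrides.yml
# Hypothetical override - real variable names are defined in the common
# environment's group_vars for the role in question.
grafana_some_setting: site-specific-value
```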
Although most of the inventory uses the group convention described above, there are a few special cases:

- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file.
- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient (see the sketch after this list).
- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using an `environments/<environment>/inventory/group_vars/all/overrides.yml` file.
- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments/<environment>/inventory/group_vars/all/secrets.yml`.
- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path.
- Each Slurm partition must have:
  - An inventory group `<cluster_name>_<partition_name>` defining the hosts it contains - these must be homogeneous w.r.t. CPU and memory.
  - An entry in the `openhpc_slurm_partitions` mapping in `environments/<environment>/inventory/group_vars/openhpc/overrides.yml`.

  See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options.
- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file).
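Pulling these special cases together, a templated-out inventory might look like the following sketch. All host names, the cluster name `mycluster` and the `small` partition are hypothetical; this uses YAML inventory format (as in `hosts.yml`), where the `all.vars` key is the equivalent of an INI-style `[all:vars]` section:

```yaml
# environments/<environment>/inventory/hosts.yml - illustrative sketch only
all:
  vars:
    openhpc_cluster_name: mycluster       # required on all hosts
  children:
    control:
      hosts:
        mycluster-control:
    login:
      hosts:
        mycluster-login-0:
    mycluster_small:                      # partition group: <cluster_name>_<partition_name>
      hosts:
        mycluster-compute-[0:3]:          # must be homogeneous in CPU and memory
    compute:
      children:
        mycluster_small:
    rebuild:                              # opt-in: enables reimaging from Slurm
      children:
        compute:
```

The matching partition entry (values again hypothetical) would then go in `environments/<environment>/inventory/group_vars/openhpc/overrides.yml`:

```yaml
openhpc_slurm_partitions:
  - name: small
```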

docs/production.md

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@

# Production Deployments

This page contains some brief notes about differences between the default/demo configuration, as described in the main [README.md](../README.md), and production-ready deployments.

- Create a site environment. Usually at least production, staging and possibly development environments are required. To avoid divergence of configuration, these should all have an `inventory` path referencing a shared, site-specific base environment. Where possible, hooks should also be placed in this site-specific environment.
- Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`. To ensure staging environments are a good model for production, this should generally be moved into the site-specific environment. It can be encrypted using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) and then committed to the repository.
- Ensure created instances have accurate/synchronised time. For VM instances this is usually provided by the hypervisor, but if not, or for bare metal instances, it may be necessary to configure or proxy `chronyd` via an environment hook.
- Remove production volumes from OpenTofu control. In the default OpenTofu configuration, deleting the resources also deletes the volumes used for persistent state and home directories. This is usually undesirable for production, so these resources should be removed from the OpenTofu configuration and deployed manually once. Note, however, that for development environments leaving them under OpenTofu control is usually best.
- Configure Open OnDemand - see the [specific documentation](openondemand.README.md).
