Commit 331a29f: WIP add operations docs (parent: c9663ac)

File tree: 2 files changed (+141 -0)

docs/operations.md (135 additions, 0 deletions)
# Operations

This page describes the commands required for common operations.

All subsequent sections assume that:
- Commands are run from the repository root, unless otherwise indicated by a `cd` command.
- An Ansible vault secret is configured.
- The correct private key is available to Ansible.
- Appropriate OpenStack credentials are available.
- Any infrastructure not controlled by the appliance (e.g. networks, volumes, etc.) is available.
- `$ENV` is your current, activated environment, as defined by e.g. `environments/production/`.
- `$SITE_ENV` is the base site-specific environment, as defined by e.g. `environments/mysite/`.
- A string of the form `some/path/to/file.yml:myvar` refers to the Ansible variable `myvar` in the file at that path, relative to the repository root.
- Configuration is generally common to all environments at a site, i.e. it is made in `environments/$SITE_ENV`, not `environments/$ENV`.

Review any [site-specific documentation](docs/site.md) for more details on the above.

# Deploying a Cluster

#TODO

# SSH to Cluster Nodes

This depends on how the cluster is accessed.

The script `dev/ansible-ssh` can generally be used to connect to a host specified by an `inventory_hostname`, using the same connection details as Ansible. If this does not work:
- Instance IPs are normally defined in `ansible_host` variables in an inventory file `environments/$ENV/inventory/hosts{,.yml}`.
- The SSH user is defined by `ansible_user`; the default is `rocky`. This may be overridden in your environment.
- If a jump host is required, the user and address may be defined in the above inventory file, as in the sketch below.
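
A minimal sketch of such an inventory file (the hostname, addresses and jump-host details are purely illustrative):

```yaml
# environments/$ENV/inventory/hosts.yml (illustrative values only)
all:
  vars:
    ansible_user: rocky
  hosts:
    login-0:
      ansible_host: 10.0.0.10
      # Hypothetical jump host: proxy SSH connections for this node through it
      ansible_ssh_common_args: '-o ProxyJump=rocky@192.0.2.1'
```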

# Modifying General slurm.conf Parameters

Parameters for [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) can be added to an `openhpc_config_extra` mapping in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`.
Note that values in this mapping may be:
- A string, which will be inserted as-is.
- A list, which will be converted to a comma-separated string.

This allows specifying `slurm.conf` contents in a YAML-format, Ansible-native way.
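
For example (the parameters and values here are illustrative only, not recommendations):

```yaml
# environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml (illustrative):
openhpc_config_extra:
  # A string is inserted as-is:
  PreemptMode: suspend,gang
  # A list is converted to a comma-separated string:
  SchedulerParameters:
    - defer
    - batch_sched_delay=10
```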

**NB:** The appliance provides some default values in `environments/common/inventory/group_vars/all/openhpc.yml:openhpc_config_default`, which are combined with the above. The `enable_configless` flag in the `SlurmctldParameters` key this sets must not be overridden - a validation step checks this has not happened.

See [Reconfiguring Slurm](#Reconfiguring-Slurm) to apply changes.

# Modifying Slurm Partition-specific Configuration

Modify the `openhpc_slurm_partitions` mapping, usually in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`, as described for [stackhpc.openhpc:slurmconf](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) (note the relevant version of this role is defined in `requirements.yml`).

Note that an Ansible inventory group for the partition is required. This is generally auto-defined by a template in the OpenTofu configuration.

**NB:** `default:NO` must be set on all non-default partitions, otherwise the last-defined partition will always be set as the default.
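
A minimal sketch (the partition names are hypothetical; see the role's README for the full syntax):

```yaml
# environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml (illustrative):
openhpc_slurm_partitions:
  - name: general         # the default partition
  - name: highmem         # a hypothetical additional partition
    default: NO           # required on all non-default partitions
```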

See [Reconfiguring Slurm](#Reconfiguring-Slurm) to apply changes.

# Adding an Additional Partition

This is usually a two-step process:

- If new nodes are required, define a new node group by adding an entry to the `compute` mapping in `environments/$ENV/tofu/main.tf`, assuming the default OpenTofu configuration (see the sketch after this list):
  - The key is the partition name.
  - The value should be a mapping, with the parameters defined in `environments/$SITE_ENV/terraform/compute/variables.tf`; in brief, it will need at least `flavor` (a flavor name) and `nodes` (a list of node name suffixes).
- Add a new partition to the partition configuration as described under [Modifying Slurm Partition-specific Configuration](#Modifying-Slurm-Partition-specific-Configuration).
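
A minimal sketch of such a node-group entry (the partition name, flavor and node suffixes are hypothetical):

```hcl
# environments/$ENV/tofu/main.tf (illustrative fragment):
compute = {
  # Key is the partition name; the parameters are those defined in
  # environments/$SITE_ENV/terraform/compute/variables.tf.
  highmem = {
    flavor = "highmem.large"      # hypothetical flavor name
    nodes  = ["hm-0", "hm-1"]     # node name suffixes
  }
}
```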

Deploying the additional nodes and applying these changes requires rerunning both OpenTofu and the Ansible `site.yml` playbook - follow [Deploying a Cluster](#Deploying-a-Cluster).

# Adding Additional Packages

Packages from any enabled DNF repositories (which always include EPEL, PowerTools and OpenHPC) can be added to all nodes by defining a list `openhpc_packages_extra` (defaulted to the empty list in the common environment) in e.g. `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`. For example:

```yaml
# environments/foo-base/inventory/group_vars/all/openhpc.yml:
openhpc_packages_extra:
  - somepackage
  - anotherpackage
```

The packages available from the OpenHPC repos are described in Appendix E of the OpenHPC installation guide (linked from the [OpenHPC releases page](https://github.com/openhpc/ohpc/releases/)). Note that "user-facing" OpenHPC packages such as compilers, MPI libraries etc. include corresponding `lmod` modules.

To add these packages to the current cluster, run the same command as for [Reconfiguring Slurm](#Reconfiguring-Slurm). TODO: describe what's required to add these to site-specific images.

If additional repositories are required, these could be added/enabled as necessary in a play added to `environments/$SITE_ENV/hooks/{pre,post}.yml` as appropriate. Note that such a play should NOT exclude the `builder` group, so that the repositories are also added to built images. There are various Ansible modules which might be useful for this:
- `ansible.builtin.yum_repository`: Add a repo from a URL providing a `repodata` directory.
- `ansible.builtin.rpm_key`: Add a GPG key to the RPM database.
- `ansible.builtin.get_url`: Can be used to install a repofile directly from a URL (e.g. https://turbovnc.org/pmwiki/uploads/Downloads/TurboVNC.repo).
- `ansible.builtin.dnf`: Can be used to install 'release packages' providing repos, e.g. `epel-release`, `ohpc-release`.

The packages to be installed from that repo could also be defined in that play, as in the sketch below. Note that using the `dnf` module with a list for its `name` parameter is more efficient and allows better dependency resolution than calling the module in a loop.
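
A minimal sketch of such a hook play (the repository URL, GPG key and package names are all hypothetical):

```yaml
# environments/$SITE_ENV/hooks/post.yml (illustrative fragment):
- hosts: all        # deliberately does NOT exclude the builder group
  become: true
  tasks:
    - name: Add GPG key for the extra repo (hypothetical URL)
      ansible.builtin.rpm_key:
        key: https://example.com/repo/RPM-GPG-KEY-example
        state: present

    - name: Add the extra repo (hypothetical URL)
      ansible.builtin.yum_repository:
        name: example
        description: Example extra repo
        baseurl: https://example.com/repo/el8/
        gpgcheck: true

    - name: Install packages from the extra repo (hypothetical names)
      ansible.builtin.dnf:
        name:       # a list gives better dependency resolution than a loop
          - example-package
          - another-example-package
        state: present
```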

Adding these repos/packages to the cluster/image would then require running:

```shell
ansible-playbook environments/$SITE_ENV/hooks/{pre,post}.yml
```

as appropriate.

TODO: improve description about adding these to extra images.

# Reconfiguring Slurm

At a minimum, run:

```shell
ansible-playbook ansible/slurm.yml --tags openhpc
```

**NB:** This will restart all daemons if the `slurm.conf` has any changes, even if technically only a `scontrol reconfigure` is required.

# Running the MPI Test Suite

See [ansible/roles/hpctests/README.md](ansible/roles/hpctests/README.md) for a description of these. They can be run using:

```shell
ansible-playbook ansible/adhoc/hpctests.yml
```

Note that:
- The above role provides variables to select specific partitions, nodes and interfaces, which may be required. If not set in inventory, these can be passed as extravars:

  ```shell
  ansible-playbook ansible/adhoc/hpctests.yml -e hpctests_myvar=foo
  ```
- The HPL-based test is only reasonably optimised on Intel processors, due to the libraries and default parallelisation scheme used. For AMD processors it is recommended this is skipped using:

  ```shell
  ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo
  ```

Review any [site-specific documentation](docs/site.md) for more details.

# Running CUDA Tests

This uses the [cuda-samples](https://github.com/NVIDIA/cuda-samples/) utilities `deviceQuery` and `bandwidthTest` to test GPU functionality. It automatically runs on any host in the `cuda` inventory group:

```shell
ansible-playbook ansible/adhoc/cudatests.yml
```

**NB:** This test is not launched through Slurm, so confirm nodes are free/out of service or use `--limit` appropriately.

# Upgrading the Cluster

#TODO

# Building a Site-specific Image

#TODO

docs/site.md (6 additions, 0 deletions)

# Site-specific Documentation

This document is a placeholder for any site-specific documentation, e.g. environment descriptions.

#TODO: list things which should commonly be specified here.
