# Operations

This page describes the commands required for common operations.

All subsequent sections assume that:
- Commands are run from the repository root, unless otherwise indicated by a `cd` command.
- An Ansible vault secret is configured.
- The correct private key is available to Ansible.
- Appropriate OpenStack credentials are available.
- Any non-appliance-controlled infrastructure is available (e.g. networks, volumes, etc.).
- `$ENV` is your current, activated environment, as defined by e.g. `environments/production/`.
- `$SITE_ENV` is the base site-specific environment, as defined by e.g. `environments/mysite/`.
- A string `some/path/to/file.yml:myvar` defines a path relative to the repository root and an Ansible variable in that file.
- Configuration is generally common to all environments at a site, i.e. is made in `environments/$SITE_ENV` not `environments/$ENV`.

Review any [site-specific documentation](docs/site.md) for more details on the above.

# Deploying a Cluster

#TODO

# SSH to Cluster Nodes
This depends on how the cluster is accessed.

The script `dev/ansible-ssh` can generally be used to connect to a host specified by an `inventory_hostname`, using the same connection details as Ansible. If this does not work:
- Instance IPs are normally defined in `ansible_host` variables in an inventory file `environments/$ENV/inventory/hosts{,.yml}`.
- The SSH user is defined by `ansible_user`; the default is `rocky`. This may be overridden in your environment.
- If a jump host is required, its user and address may be defined in the above inventory file (see the sketch after this list).
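
For illustration only, connection details of roughly this shape might appear in `environments/$ENV/inventory/hosts.yml`; the hostname, IP addresses and jump host below are hypothetical, and the exact layout depends on your environment:

    # environments/$ENV/inventory/hosts.yml (illustrative sketch only)
    all:
      vars:
        ansible_user: rocky                                       # SSH user
      hosts:
        mycluster-login-0:
          ansible_host: 10.0.0.10                                 # instance IP
          # hypothetical jump host, only if the node is not directly reachable:
          ansible_ssh_common_args: '-o ProxyJump=rocky@192.0.2.1'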

# Modifying General slurm.conf Parameters
Parameters for [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) can be added to an `openhpc_config_extra` mapping in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`.
Note that values in this mapping may be:
- A string, which will be inserted as-is.
- A list, which will be converted to a comma-separated string.

This allows `slurm.conf` contents to be specified in a YAML-format, Ansible-native way.
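
For illustration, a sketch of such a mapping is shown below; the parameter values are examples of the string and list forms only, not recommended settings:

    # environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml (illustrative only)
    openhpc_config_extra:
      SlurmctldDebug: debug              # string value, inserted as-is
      PrologFlags:                       # list value, rendered as "Alloc,Contain"
        - Alloc
        - Contain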

**NB:** The appliance provides some default values in `environments/common/inventory/group_vars/all/openhpc.yml:openhpc_config_default`, which is combined with the above. The `enable_configless` flag in the `SlurmctldParameters` key which this sets must not be overridden; a validation step checks that this has not happened.

See [Reconfiguring Slurm](#reconfiguring-slurm) to apply changes.

# Modifying Slurm Partition-specific Configuration

Modify the `openhpc_slurm_partitions` mapping, usually in `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`, as described for [stackhpc.openhpc:slurmconf](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) (note the relevant version of this role is defined in `requirements.yml`).

Note that an Ansible inventory group for the partition is required. This is generally auto-defined by a template in the OpenTofu configuration.

**NB:** `default: NO` must be set on all non-default partitions, otherwise the last-defined partition will always become the default.
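
As a sketch only (the partition names and options here are hypothetical; see the role's README linked above for the full schema):

    # environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml (illustrative only)
    openhpc_slurm_partitions:
    - name: general                      # the default partition
    - name: gpu                          # hypothetical additional partition
      default: NO                        # required on all non-default partitions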

See [Reconfiguring Slurm](#reconfiguring-slurm) to apply changes.

# Adding an Additional Partition
This is usually a two-step process:

- If new nodes are required, define a new node group by adding an entry to the `compute` mapping in `environments/$ENV/tofu/main.tf`, assuming the default OpenTofu configuration:
  - The key is the partition name.
  - The value should be a mapping, using the parameters defined in `environments/$SITE_ENV/terraform/compute/variables.tf`; in brief it will need at least `flavor` (a flavor name) and `nodes` (a list of node name suffixes). A sketch is shown after this list.
- Add a new partition to the partition configuration as described under [Modifying Slurm Partition-specific Configuration](#modifying-slurm-partition-specific-configuration).
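
As a minimal sketch, assuming the default OpenTofu configuration exposes a `compute` mapping of this shape (the partition name, flavor and node suffixes below are hypothetical):

    # environments/$ENV/tofu/main.tf (illustrative fragment only)
    compute = {
      gpu = {                              # key = partition name
        flavor = "g1.large.hypothetical"   # flavor name
        nodes  = ["gpu-0", "gpu-1"]        # node name suffixes
      }
    }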

Deploying the additional nodes and applying these changes requires rerunning both OpenTofu and the Ansible `site.yml` playbook; follow [Deploying a Cluster](#deploying-a-cluster).

# Adding Additional Packages
Packages from any enabled DNF repositories (which always includes EPEL, PowerTools and OpenHPC) can be added to all nodes by defining a list `openhpc_packages_extra` (defaulted to the empty list in the common environment) in e.g. `environments/$SITE_ENV/inventory/group_vars/all/openhpc.yml`. For example:

    # environments/foo-base/inventory/group_vars/all/openhpc.yml:
    openhpc_packages_extra:
    - somepackage
    - anotherpackage

The packages available from the OpenHPC repos are described in Appendix E of the OpenHPC installation guide (linked from the [OpenHPC releases page](https://github.com/openhpc/ohpc/releases/)). Note that "user-facing" OpenHPC packages such as compilers, MPI libraries, etc. include corresponding `lmod` modules.

To add these packages to the current cluster, run the same command as for [Reconfiguring Slurm](#reconfiguring-slurm). TODO: describe what's required to add these to site-specific images.

If additional repositories are required, these could be added/enabled as necessary in a play added to `environments/$SITE_ENV/hooks/{pre,post}.yml` as appropriate. Note that such a play should NOT exclude the `builder` group, so that the repositories are also added to built images. There are various Ansible modules which might be useful for this:
- `ansible.builtin.yum_repository`: Add a repo from a URL providing a 'repodata' directory.
- `ansible.builtin.rpm_key`: Add a GPG key to the RPM database.
- `ansible.builtin.get_url`: Can be used to install a repofile directly from a URL (e.g. https://turbovnc.org/pmwiki/uploads/Downloads/TurboVNC.repo).
- `ansible.builtin.dnf`: Can be used to install 'release packages' providing repos, e.g. `epel-release`, `ohpc-release`.

The packages to be installed from that repo could also be defined in that play; a sketch of such a play is shown below. Note that using the `dnf` module with a list for its `name` parameter is more efficient and allows better dependency resolution than calling the module in a loop.
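
For illustration only, a post-hook play of roughly this shape could add a repository and install packages from it; the repository name, URL and package names below are placeholders, not real endpoints:

    # environments/$SITE_ENV/hooks/post.yml (illustrative only)
    - hosts: all                         # deliberately does not exclude the builder group
      become: true
      tasks:
        - name: Add an extra DNF repository
          ansible.builtin.yum_repository:
            name: examplerepo                          # placeholder
            description: Example extra repository
            baseurl: https://repo.example.com/el9/     # placeholder URL
            gpgcheck: false
        - name: Install packages from that repository
          ansible.builtin.dnf:
            name:                        # list form for better dependency resolution
              - examplepackage1
              - examplepackage2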

Adding these repos/packages to the cluster/image would then require running:

    ansible-playbook environments/$SITE_ENV/hooks/{pre,post}.yml

as appropriate.

TODO: improve description about adding these to extra images.

# Reconfiguring Slurm

At a minimum run:

    ansible-playbook ansible/slurm.yml --tags openhpc

**NB:** This will restart all daemons if the `slurm.conf` has any changes, even if technically only a `scontrol reconfigure` is required.

# Running the MPI Test Suite

See [ansible/roles/hpctests/README.md](ansible/roles/hpctests/README.md) for a description of these tests. They can be run using:

    ansible-playbook ansible/adhoc/hpctests.yml

Note that:
- The above role provides variables to select specific partitions, nodes and interfaces, which may be required. If not set in inventory, these can be passed as extra vars:

      ansible-playbook ansible/adhoc/hpctests.yml -e hpctests_myvar=foo
- The HPL-based test is only reasonably optimised on Intel processors, due to the libraries and default parallelisation scheme used. For AMD processors it is recommended this is skipped using:

      ansible-playbook ansible/adhoc/hpctests.yml --skip-tags hpl-solo

Review any [site-specific documentation](docs/site.md) for more details.

# Running CUDA Tests
This uses the [cuda-samples](https://github.com/NVIDIA/cuda-samples/) utilities "deviceQuery" and "bandwidthTest" to test GPU functionality. It automatically runs on any host in the `cuda` inventory group:

    ansible-playbook ansible/adhoc/cudatests.yml

**NB:** This test is not launched through Slurm, so confirm nodes are free/out of service or use `--limit` appropriately.

# Upgrading the Cluster

#TODO

# Building a Site-specific Image

#TODO