From 0fb76756bff855ed3cf0b551aa801dcdfd6928c3 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Fri, 25 Oct 2024 13:27:50 +0000
Subject: [PATCH 1/5] add upgrade docs

---
 docs/upgrades.md | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 docs/upgrades.md

diff --git a/docs/upgrades.md b/docs/upgrades.md
new file mode 100644
index 000000000..0280c9d4c
--- /dev/null
+++ b/docs/upgrades.md
@@ -0,0 +1,92 @@
+# Upgrades
+
+This document explains the generic steps required to upgrade a deployment of the Slurm Appliance with upstream changes from StackHPC.
+Generally, upstream releases will happen roughly monthly. Releases may contain new functionality and/or updated images.
+
+Any site-specific instructions in [docs/site/README.md](site/README.md) should be reviewed in tandem with this.
+
+This document assumes the deployment repository has:
+1. Remotes:
+    - `origin` referring to the site-specific remote repository.
+    - `stackhpc` referring to the StackHPC repository at https://github.com/stackhpc/ansible-slurm-appliance.git.
+2. Branches:
+    - `main` - following `origin/main`, the current site-specific code deployed to production.
+    - `upstream` - following `stackhpc/main`, i.e. the upstream `main` branch from `stackhpc`.
+
+It also assumes the site has `staging` and `production` environments.
+
+**NB:** Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`.
+All other commands should be run on the Ansible deploy host.
+
+1. Update the `upstream` branch from the `stackhpc` remote, including tags:
+
+        git fetch stackhpc main --tags
+
+1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below this is shown as `vX.Y`, which is the
+
+1. Ensure your local site branch is up to date and create a new branch from it for the
+    site-specific release code:
+
+        git checkout main
+        git pull --prune
+        git checkout -b update/vX.Y
+
+1. Merge the upstream code into your release branch:
+
+        git merge stackhpc/vX.Y
+
+    It is possible this will introduce merge conflicts; fix these following the usual git
+    prompts. Generally merge conflicts should only exist where functionality which was added
+    for your site (not in a hook) has subsequently been merged upstream.
+
+1. Push this branch and create a PR:
+
+        git push
+        # follow instructions
+
+1. Review the PR to see if any added/changed functionality requires alteration of
+    site-specific configuration. In general changes to existing functionality will aim to be
+    backward compatible. Alteration of site-specific configuration will usually only be
+    necessary to use new functionality or where functionality has been upstreamed as above.
+
+    Make changes as necessary.
+
+1. Download the relevant release image(s) using the link from the relevant [Slur
+m appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), e.g.:
+
+        wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb
+
+    Note that some releases may not include new images. In this case use the image from the latest previous release with new images.
+
+1. If required, build an "extra" image with local modifications. See site-specific instructions in [docs/site/README.md](site/README.md).
+
+1. Modify your environments to use this image, test it in your staging cluster, and push commits to the PR created above. See site-specific instructions in [docs/site/README.md](site/README.md).
+
+1. Declare a future outage window to cluster users and create a [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) to prevent jobs running during that window, e.g.:
+
+        [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root
+
+1. At the outage window, check there are no jobs running:
+
+        [LOGIN]$ squeue
+
+1. Deploy the branch created above to production. See site-specific instructions in [docs/site/README.md](site/README.md).
+
+1. Check Slurm is up:
+
+        [LOGIN]$ sinfo -R
+
+    The `-R` option shows the reason for any nodes being down.
+
+1. If the above shows nodes down for having been "unexpectedly rebooted", set them up again:
+
+        [LOGIN]$ sudo scontrol update state=RESUME nodename=$HOSTLIST_EXPR
+
+    where the hostlist expression might look like e.g. `general-[0-1]` to reset state for nodes 0 and 1 of the general partition.
+
+1. Delete the reservation:
+
+        [LOGIN]$ sudo scontrol delete ReservationName="upgrade-vX.Y"
+
+1. Tell users the cluster is available again.
+

From 89abd58e343b6d21b4ad96a1140524b5d056b0ef Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Wed, 30 Oct 2024 09:53:59 +0000
Subject: [PATCH 2/5] link to generic image build docs from upgrade docs

---
 docs/upgrades.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/upgrades.md b/docs/upgrades.md
index 0280c9d4c..7b22a40a9 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -58,7 +58,7 @@ m appliance release](https://github.com/stackhpc/ansible-slurm-appliance/release
 
     Note that some releases may not include new images. In this case use the image from the latest previous release with new images.
 
-1. If required, build an "extra" image with local modifications. See site-specific instructions in [docs/site/README.md](site/README.md).
+1. If required, build an "extra" image with local modifications. See [docs/image-build.md](./image-build.md) and site-specific instructions in [docs/site/README.md](site/README.md).
 
 1. Modify your environments to use this image, test it in your staging cluster, and push commits to the PR created above. See site-specific instructions in [docs/site/README.md](site/README.md).
 

From cadb6a3afd9be08d933e2a2be577f3273237c446 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Wed, 30 Oct 2024 10:15:32 +0000
Subject: [PATCH 3/5] address minor upgrade docs issues

---
 docs/upgrades.md | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/docs/upgrades.md b/docs/upgrades.md
index 7b22a40a9..e2c12ff88 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -12,8 +12,10 @@ This document assumes the deployment repository has:
 2. Branches:
     - `main` - following `origin/main`, the current site-specific code deployed to production.
     - `upstream` - following `stackhpc/main`, i.e. the upstream `main` branch from `stackhpc`.
-
-It also assumes the site has `staging` and `production` environments.
+3. The following environments:
+    - `$PRODUCTION`: a production environment, as defined by e.g. `environments/production/`.
+    - `$STAGING`: a staging environment, as defined by e.g. `environments/staging/`.
+    - `$SITE_ENV`: a base site-specific environment, as defined by e.g. `environments/mysite/`.
 
 **NB:** Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`.
 All other commands should be run on the Ansible deploy host.
@@ -51,26 +53,35 @@ All other commands should be run on the Ansible deploy host.
 
     Make changes as necessary.
 
-1. Download the relevant release image(s) using the link from the relevant [Slur
-m appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), e.g.:
+1. Identify image(s) from the relevant [Slurm appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), and download
+    using the link on the release plus the image name, e.g. for an image `openhpc-ofed-RL8-240906-1042-32568dbb`:
 
         wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb
 
     Note that some releases may not include new images. In this case use the image from the latest previous release with new images.
 
-1. If required, build an "extra" image with local modifications. See [docs/image-build.md](./image-build.md) and site-specific instructions in [docs/site/README.md](site/README.md).
+1. If required, build an "extra" image with local modifications; see [docs/image-build.md](./image-build.md).
+
+1. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/terraform/variables.tf`.
 
-1. Modify your environments to use this image, test it in your staging cluster, and push commits to the PR created above. See site-specific instructions in [docs/site/README.md](site/README.md).
+1. Test this in your staging cluster.
 
-1. Declare a future outage window to cluster users and create a [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) to prevent jobs running during that window, e.g.:
+1. Commit changes and push to the PR created above.
+
+1. Declare a future outage window to cluster users. A [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) can be
+    used to prevent jobs running during that window, e.g.:
 
         [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root
 
+    Note a reservation cannot be created if it may overlap with currently running jobs (defined by job or partition time limits).
+
 1. At the outage window, check there are no jobs running:
 
         [LOGIN]$ squeue
 
-1. Deploy the branch created above to production. See site-specific instructions in [docs/site/README.md](site/README.md).
+1. Deploy the branch created above to production, i.e. activate the production environment, run OpenTofu to reimage or
+delete/recreate instances with the new images (depending on how the root disk is defined), and run Ansible's `site.yml`
+playbook to reconfigure the cluster, e.g. as described in the main [README.md](../README.md).
 
 1. Check Slurm is up:
 

From e150fcff1cd22f9a887de4cfd29f65c19f1e5e20 Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Wed, 30 Oct 2024 11:00:13 +0000
Subject: [PATCH 4/5] fix upgrade merge tag command

---
 docs/upgrades.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/upgrades.md b/docs/upgrades.md
index e2c12ff88..40851d2d7 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -35,7 +35,7 @@ All other commands should be run on the Ansible deploy host.
 
 1. Merge the upstream code into your release branch:
 
-        git merge stackhpc/vX.Y
+        git merge vX.Y
 
     It is possible this will introduce merge conflicts; fix these following the usual git
     prompts. Generally merge conflicts should only exist where functionality which was added

From 51d5e739936fa6de8cd138d494abd026e08d066c Mon Sep 17 00:00:00 2001
From: Steve Brasier
Date: Tue, 5 Nov 2024 13:43:13 +0000
Subject: [PATCH 5/5] fix upgrade docs typo

---
 docs/upgrades.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/upgrades.md b/docs/upgrades.md
index 40851d2d7..6e398934e 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -24,7 +24,7 @@ All other commands should be run on the Ansible deploy host.
 
         git fetch stackhpc main --tags
 
-1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below this is shown as `vX.Y`, which is the
+1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below this release is shown as `vX.Y`.
 
 1. Ensure your local site branch is up to date and create a new branch from it for the
     site-specific release code: