@technowhizz
Contributor

No description provided.

bertiethorpe and others added 30 commits August 28, 2025 13:55
* Add filesystems docs

* Apply suggestions from code review

Co-authored-by: Steve Brasier <[email protected]>

* Update Ceph instructions for Manila integrations

* Update overview

* Update docs/filesystems.md

Co-authored-by: Steve Brasier <[email protected]>

* Update image build instructions for Manila

---------

Co-authored-by: Steve Brasier <[email protected]>
* pre-hook to copy requirements.yml.last

* remove mention of CI in comments
* First draft of production end-to-end docs

* Ubuntu Jammy is also supported

* Add TODOs

* Accomplish TODOs

* Mention networks docs

* NFS

* Clarify image

* Formatting changes

* Apply suggestions from code review

Co-authored-by: Steve Brasier <[email protected]>

* Suggestions from code review

* Update docs/production.md

Co-authored-by: Steve Brasier <[email protected]>

* Add git remote instructions

* Update cookiecutter info

* Link filesystems docs

* Move tofu into define and deploy infra section

* Reorganise configuration

* Move tofu note

---------

Co-authored-by: Steve Brasier <[email protected]>
Without any top-level inventory file, Ansible will fail with:

```
ERROR! Completely failed to parse inventory source /home/ubuntu/ansible-slurm-appliance/environments/$ENV/inventory
```
* WIP: refactor repos definitions

* add more repos and cope with CRB/PowerTools oddness

* add epel

* use pulp_server as a group

* add epel default

* wip: get pulp sync working

* fixed sync

* autodetect latest in adhoc script, refactored timestamps to allow gated ohpc repos, fixed pulp site

* fixed distributions + ohpc repos

* updated timestamps script + bumped rocky 9 timestamps

* removed pulp_repo_name fields

* updated docs, added gpg checks, simplified filters

* Added pulp systemd file + removed unused vars

* added READMEs + updated variable names

* disabled gpg checks for dnf_repos

* typo

* fixed disable repos task

* bump images

* remove dnf_repos extra index/key and make epel/openhpc special-cases simpler

* clarify pulp distro selection

* fixup sync vars

* fixup grafana vars

* revert latest timestamp changes for extra key level

* review suggestions

Co-authored-by: Steve Brasier <[email protected]>

* updated README

* docs tweaks

* regularised group names

* updated operations guide for functionality requiring additional installs

* review changes from docs

Co-authored-by: Steve Brasier <[email protected]>

* renamed timestamps.yml to dnf_repos_timestamps.yml

---------

Co-authored-by: Steve Brasier <[email protected]>
Co-authored-by: Steve Brasier <[email protected]>
* Reorder repositories alphabetically

* Bump Pulp snapshots for RL 9.6

* Bump CI image (RL9 only)
Bump CUDA to 13.0.1 and NVIDIA driver to 580.82.07
Make CaaS-specific role `persist_openhpc_secrets` idempotent
* add validation for tofu-templated vars

* update error message iaw review
* Add Github Actions for running code linters

* Fix linting issues.

The super-linter.env currently has the following additions that are to be addressed in the future:
VALIDATE_GITHUB_ACTIONS=false
VALIDATE_SHELL_SHFMT=false
VALIDATE_YAML=false

Most of the linting for the above has been addressed with just a single issue remaining that blocks the linter from being enabled.

* Update GH workflow so linting always runs before any other jobs

* Update GH workflow so linting always runs before any other jobs

* Fix linting issues on the merge of origin/main

* Fix linting issues on the merge of origin/main

* Use the head ref for workflow concurrency

* Output the path filter result of the workflow

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Tweak github action used to detect changed paths on push/pull request

* Troubleshooting: ansible.builtin.user

* Troubleshooting: debugging temporarily added

* Shift pylint invalid-name linting beyond python bang line

* Temporarily disable the ansible galaxy requirements validation

* Reverting changes made to ansible.builtin.user and ansible.builtin.group where the name parameter was added.
Reverting to ansible.builtin.group: <args>

because args aren't an expected label:

groupadd: '{'name': 'grafana', 'gid': 979}' is not a valid group name

* Arguments are dicts not labels

* Preserve file permissions on .ssh directory contents

* Wherever we use become_user set become: true, keeps the linter happy and maintains functionality

* Fix linting on merge of origin/main

* Fix linting on merge of origin/main

* Update cluster image - using fatimage built from ci/linting branch

* Add comments to workflow files detailing the CI workflow and enable these workflows

* Fix workflow execution:
 1. change trivvy to trivy
 2. extra, stackhpc, and trivyscan workflows should trigger on workflow_call and workflow_dispatch

* Fix linting issues from merge of origin/main

* Exclude 'ansible/roles/compute_init/files/compute-init.yml' from ansible lint.
The parser can't load the 'tasks/tuned.yml' ansible so fails with:

load-failure[filenotfounderror]: [Errno 2] No such file or directory: 'ansible-slurm-appliance/tasks/main.yml'
tasks/main.yml:1

This failure can't be skipped because it's the output of the parser that's fed to the linter where such exceptions are made.

* Temporarily disable Rocky 8 to speed up testing and reduce CI resources
Temporarily disable ansible-lint:

Run ansible/[email protected]
Run if [[ -n "" ]]; then
Run action_ref="${GH_ACTION_REF_INPUT:-${GITHUB_ACTION_REF:-main}}"
Using ansible-lint ref: main
Run reqs_file=$(git rev-parse --show-toplevel)/.git/ansible-lint-requirements.txt
--2025-09-09 14:51:58--  https://raw.githubusercontent.com/ansible/ansible-lint/main/.config/requirements-lock.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-09-09 14:51:58 ERROR 404: Not Found.

* Fix some bad ansible-lint line-length markup

* Fix ansible-lint markup for line-length

* Bump CI image - FOR RL9 ONLY TO CONSERVE CI RESOURCES

* Revert ansible.builtin.command to ansible.builtin.shell due to missed comment "need login shell for module command" and mask ansible-lint error

* Disable extra-build.yml workflow which has previously passed so we can focus on the stackhpc.yml workflow

* Disable concurrency to see if this is killing stackhpc.yml

* Remove concurrency from extra.yml, stackhpc.yml, and trivyscan.yml as they're all being triggered from main.yml which has its own concurrency check - the trivyscan concurrency was also killing stackhpc

* Enable ansible-lint

* Enable triggering of all workflows from the main CI workflow

* Bump CI image - FOR RL9 ONLY TO CONSERVE CI RESOURCES

* Fix bad ansible-lint markup affecting the bang line

* Reduce workflow CI resources whilst fixing test deploy and reimage workflow

* Bump CI image - FOR RL9 ONLY TO CONSERVE CI RESOURCES

* Enable Rocky Linux 8 - disabled to speed up testing

* Enable all CI workflows

* Bump CI image - FOR RL9 ONLY TO CONSERVE CI RESOURCES

* Remove empty line between ansible "when" and "block" added by ansible-lint --fix, it's not required by the linter.

* Enable check for ansible galaxy requirements

* Revert the ansible collections path to ansible/collections so we don't inadvertently break any existing checkouts.
Direct ansible-lint to use .ansible/collections so downloads are excluded from linting by our .ansible-lint.yml

* Bump CI image
It resolves some limitations with login subgroups, such as the difficulty of
binding the Open OnDemand service to a specific node when node names are
not predictable.

This replicates what is already done for compute subgroups.
* ignore port binding info; fixes tf when admin

* ignore port dhcp changes to fix networking-mlxn

* ignore port binding/dhcp options for caas

* fix TF linter errors
* Fix various comments in Ansible group files

* Expose vgpu group in site inventory
* wip: add TF remote state docs

* wip s3 remote state

* improve gitlab backend configuration

* automate s3 creds

* make s3 buckets clearer

* fix linting

* try to allow same headings at different levels in markdown

* fix tf lint errors

* fix prettier errors
…I) (#792)

* update dnf_repos_timestamps.yml

* bump Ark timestamps

* update again

* make it possible NOT to clean up packer builds

* fixup source repo path typo

* add missing RL8 PowerTools source repo

* correct RL8 source repo files

* update timestamps

* bump CI image

* disable Lustre for RL8 extrabuild tests due to kernel mismatch

---------

Co-authored-by: bertiethorpe <[email protected]>
* validate nodename groups

* add validation for nodegroup name clashes

* add validation for nodegroup name clashes

* fix linter whinges

* extend validation to cover additional_nodegroups

* fix TF linting

* fixup logic

* fix logic

* fix linter
sjpb and others added 15 commits September 30, 2025 19:18
Fix image sync workflow for new larger fat images
From access.conf(5):

    The second field, the users/group field, should be a list of one or
    more login names, group names, or ALL (which always matches). To
    differentiate user entries from group entries, group entries should
    be written with brackets, e.g. (group).
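
The bracket convention quoted above can be sketched in Python. This is an illustration only, not code from this pull request: the function name `classify_entries` and the sample entries are hypothetical, entries are assumed to be whitespace-separated, and only the cases named in the quoted paragraph (`ALL`, plain login names, and `(group)` entries) are handled.

```python
def classify_entries(field: str) -> list[tuple[str, str]]:
    """Label each entry of an access.conf users/group field.

    Hypothetical helper: covers only ALL, login names and (group) entries.
    """
    result = []
    for entry in field.split():  # assumption: whitespace-separated entries
        if entry == "ALL":
            result.append(("all", entry))
        elif entry.startswith("(") and entry.endswith(")") and len(entry) > 2:
            # Bracketed entries are groups; strip the brackets.
            result.append(("group", entry[1:-1]))
        else:
            # Anything else is treated as a plain login name in this sketch.
            result.append(("user", entry))
    return result


if __name__ == "__main__":
    print(classify_entries("root (wheel) alice"))
```

Without the brackets, an entry such as `wheel` is indistinguishable from a login name in this sketch, which is the ambiguity the man page excerpt describes.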
* support raid root disks in stackhpc-built images

* clarify image requirements

* bump CI image

* remove default build groups

* fixup doca/cuda inventory groups

* add fatimage inventory group

* update docs for image build

* minor docs tweaks

* fixup fatimage group definition

* fix build groups

* bump CI image

* minor docs tweak

* fix linter markdown error

* fix linter markdown error

* swap example site image build to normal case

* fix borked merge

* fixes after self-review

* bump CI image
* Expose FIPs in inventory hosts file

* adding output for "fip_address"

* changing 'fip_address' to 'nodegroup_fips'
* delete build VMs in CI nightly cleanup

* name build volumes and include in nightly cleanup

* simplify cleanup of volumes and include fatimage build VMs

---------

Co-authored-by: bertiethorpe <[email protected]>
Co-authored-by: bertiethorpe <[email protected]>
* export state directory to ondemand nodes for caas

* fixed caas config
@technowhizz technowhizz requested review from priteau and sjpb November 7, 2025 11:17
@technowhizz technowhizz self-assigned this Nov 7, 2025
Use image ID instead of name due to duplicate image names in OpenStack.
Increase volume size to 20GB for packer due to increased size of image.
Decrease the size of volumes to allow more labs.
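
As a rough illustration of why a name is not a safe image reference when names repeat, the sketch below finds names shared by more than one image. It makes no OpenStack API calls: the `images` records, their `id` and `name` values, and the helper `ambiguous_names` are all hypothetical.

```python
from collections import Counter


def ambiguous_names(images: list[dict]) -> list[str]:
    """Return, sorted, every image name that appears on more than one record."""
    counts = Counter(img["name"] for img in images)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical records; real IDs and names would come from the cloud.
images = [
    {"id": "id-1", "name": "example-image"},
    {"id": "id-2", "name": "example-image"},
    {"id": "id-3", "name": "other-image"},
]

if __name__ == "__main__":
    print(ambiguous_names(images))
```

Any name returned by this check matches more than one record, whereas each `id` value identifies exactly one.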
@technowhizz
Contributor Author

Working config from the last lab we ran

@technowhizz
Contributor Author

linter still failing but I think it's some misconfiguration

@sjpb
Collaborator

sjpb commented Nov 14, 2025

Jeeze. TBH if it works I'd just disable the linting and merge it!

Contributor

@elelaysh elelaysh left a comment

LGTM

@elelaysh elelaysh merged commit 5168c83 into training/leafcloud Dec 31, 2025
4 of 5 checks passed
10 participants