diff --git a/.github/workflows/package-build-ofed.yml b/.github/workflows/package-build-ofed.yml index 2df2462175..4dd23065f9 100644 --- a/.github/workflows/package-build-ofed.yml +++ b/.github/workflows/package-build-ofed.yml @@ -28,6 +28,11 @@ jobs: runs-on: arc-skc-host-image-builder-runner permissions: {} steps: + - name: Generate OFED tag + id: ofed_tag + run: | + echo "ofed_tag=$(date +%Y%m%dT%H%M%S)" >> $GITHUB_OUTPUT + - name: Install Package uses: ConorMacBride/install-package@main with: @@ -42,24 +47,6 @@ jobs: with: path: src/kayobe-config - - name: Determine OpenStack release - id: openstack_release - run: | - BRANCH=$(awk -F'=' '/defaultbranch/ {print $2}' src/kayobe-config/.gitreview) - echo "openstack_release=${BRANCH}" | sed -E "s,(stable|unmaintained)/,," >> $GITHUB_OUTPUT - - - name: Generate OFED tag - id: ofed_tag - run: | - echo "ofed_tag=$(date +%Y%m%dT%H%M%S)" >> $GITHUB_OUTPUT - - - name: Clone StackHPC Kayobe repository - uses: actions/checkout@v4 - with: - repository: stackhpc/kayobe - ref: refs/heads/stackhpc/${{ steps.openstack_release.outputs.openstack_release }} - path: src/kayobe - - name: Install Kayobe run: | mkdir -p venvs && @@ -67,7 +54,7 @@ jobs: python3 -m venv kayobe && source kayobe/bin/activate && pip install -U pip && - pip install ../src/kayobe + pip install -r ../src/kayobe-config/requirements.txt - name: Install terraform uses: hashicorp/setup-terraform@v2 diff --git a/.github/workflows/stackhpc-multinode.yml b/.github/workflows/stackhpc-multinode.yml index c9e8193da6..4869df6feb 100644 --- a/.github/workflows/stackhpc-multinode.yml +++ b/.github/workflows/stackhpc-multinode.yml @@ -56,7 +56,7 @@ name: Multinode jobs: multinode: name: Multinode - uses: stackhpc/stackhpc-openstack-gh-workflows/.github/workflows/multinode.yml@1.4.0 + uses: stackhpc/stackhpc-openstack-gh-workflows/.github/workflows/multinode.yml@1.4.1 with: multinode_name: ${{ inputs.multinode_name }} os_distribution: ${{ inputs.os_distribution }} diff 
--git a/doc/source/configuration/ci-cd.rst b/doc/source/configuration/ci-cd.rst index 435c114f7e..7865272ff1 100644 --- a/doc/source/configuration/ci-cd.rst +++ b/doc/source/configuration/ci-cd.rst @@ -5,14 +5,21 @@ CI/CD Concepts ======== -The CI/CD system developed for managing Kayobe based OpenStack clouds is composed of three main components; workflows, runners and kayobe automation. +The CI/CD system developed for managing Kayobe based OpenStack clouds is composed of four main components; workflows, runners, OpenBao and kayobe automation. + Firstly, the workflows are files which describe a series of tasks to be performed in relation to the deployed cloud. These workflows are executed on request, on schedule or in response to an event such as a pull request being opened. + The workflows are designed to carry out various day-to-day activites such as; running Tempest tests, configuring running services or displaying the change to configuration files if a pull request is merged. Secondly, in order for the workflows to run against a cloud we would need private runners present within the cloud positioned in such a way they can reach the internal network and public API. Deployment of private runners is supported by all major providers with the use of community developed Ansible roles. + +Thirdly, OpenBao is used to store secrets on the same virtual machine the runners are hosted within. +This provides a secure way of storing secrets and variables which can be accessed by the runners when executing workflows and ensures that secrets never have to leave the cloud. + Finally, due to the requirement that we support various different platforms tooling in the form of `Kayobe automation `__ was developed. This tooling is not tied to any single CI/CD platform as all tasks are a series of shell script and Ansible playbooks which are designed to run in a purpose build kayobe container. 
+ This is complemented by the use of an Ansible collection known as `stackhpc.kayobe_workflows `__ which aims to provide users with a quick and easy way of customising all workflows to fit within a customer's cloud. Currently we support the creation and deployment of workflows for GitHub with Gitlab support being actively worked upon. @@ -42,6 +49,12 @@ These services will listen for jobs which have been tagged appropriately and dis The runners will need to be deployed using existing roles and playbooks whereby the binary/package is downloaded and registered using a special token. In some deployments runner hosts can be shared between environments however this is not always true and dedicated hosts will need to be used for each environment you intend to deploy kayobe automation within. +OpenBao +------- + +OpenBao is recommended when deploying kayobe automation to achieve a simple and secure way of storing secrets. +OpenBao can easily be configured to hold the secrets for all environments and only permit access to the runners which require them utilising different authorisation mechanisms such as GitLab's JWT (JSON Web Token). + GitHub Actions ================= @@ -181,3 +194,201 @@ Sometimes the kayobe docker image must be rebuilt the reasons for this include b * Update kolla-ansible * UID/GID collision when deploying workflows to a new environment * Prior to deployment of new a OpenStack release + +GitLab Pipelines +================ + +To enable CI/CD where GitLab Pipelines is used please follow the steps described below starting with the deployment of the runners. + +Runner Deployment +----------------- + +1. Identify a suitable host for hosting the runners. + Ideally an infra-vm would be deployed to allow for easily compartmentalising the runners from the rest of the environment. + 8 VCPUs and 16GB of RAM is recommended for the guest machine however this may need to be adjusted for larger deployments. 
+ Whether the host is in an infra-vm or not it will need access to the :code:`admin_network` or :code:`provision_oc_network`, :code:`public_network` and the :code:`pulp registry` on the seed. + The steps will assume that an infra-vm will be used for the purpose of hosting the runners. + +2. Edit the environment's :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/hosts` to define the host(s) that will host the runners. + +.. code-block:: ini + + [gitlab-runners] + gitlab-runner-01 + +4. Provide all the relevant Kayobe :code:`group_vars` for :code:`gitlab-runners` under :code:`${KAYOBE_CONFIG_PATH}/environments/${KAYOBE_ENVIRONMENT}/inventory/group_vars/gitlab-runners` + * `infra-vms` ensuring all required `infra_vm_extra_network_interfaces` are defined + * `network-interfaces` + * `allocated IPs` + +5. Edit the ``${KAYOBE_CONFIG_PATH}/inventory/group_vars/gitlab-runners/runners.yml`` file which will contain the variables required to deploy a series of runners. + Below is an example of how GitLab runners can be configured for deployment. + In this example we have two runners, one for production and one for staging and will both be deployed on the same host. + This might not be possible for all deployments as multiple environments may require different runners as no single runner can serve all environments. + Note a GitLab runner can run multiple jobs concurrently so deploying a single runner per environment is recommended. + +.. 
code-block:: yaml + + --- + gitlab_runner_coordinator_url: "https://gitlab.example.com" + gitlab_runner_runners: + - name: "Kayobe Automation Runner [Production] #1" + executor: docker + docker_image: 'alpine' + token: "{{ secrets_gitlab_production_runner_token }}" + env_vars: + - "GIT_CONFIG_COUNT=1" + - "GIT_CONFIG_KEY_0=safe.directory" + - "GIT_CONFIG_VALUE_0=*" + tags: + - kayobe + - openstack + - production + docker_volumes: + - "/var/run/docker.sock:/var/run/docker.sock" + - "/opt/.docker/config.json:/root/.docker/config.json:ro" + - "/cache" + extra_configs: + runners.docker: + network_mode: host + - name: "Kayobe Automation Runner [Staging] #1" + executor: docker + docker_image: 'alpine' + token: "{{ secrets_gitlab_staging_runner_token }}" + env_vars: + - "GIT_CONFIG_COUNT=1" + - "GIT_CONFIG_KEY_0=safe.directory" + - "GIT_CONFIG_VALUE_0=*" + tags: + - kayobe + - openstack + - staging + docker_volumes: + - "/var/run/docker.sock:/var/run/docker.sock" + - "/opt/.docker/config.json:/root/.docker/config.json:ro" + - "/cache" + extra_configs: + runners.docker: + network_mode: host + +6. Obtain a runner token for each runner that is required for deployment. + This token can be obtained by visiting the GitLab project -> Settings -> CI/CD -> Runners -> New project runner -> Complete the form including any tags used by the runners such as kayobe, openstack and environment_name. + Once the token has been obtained, add it to :code:`secrets.yml` under :code:`secrets_gitlab_production_runner_token` and :code:`secrets_gitlab_staging_runner_token` + +7. Deploy the infra-vm + +.. code-block:: bash + + kayobe infra vm provision --limit gitlab-runner-01 + +8. Perform a host configure against the infra-vm + +.. code-block:: bash + + kayobe infra vm host configure --limit gitlab-runner-01 + +9. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/deploy-gitlab-runner.yml` + +10. 
Check runners have registered properly by visiting the repository's :code:`CI/CD` tab -> :code:`Runners` + +11. The contents of :code:`/opt/.docker/config.json` on the runner should be added to GitLab CI/CD settings as a secret variable if the GitLab version permits; otherwise a regular variable is fine. + This is required to allow the runners to pull images from the registry. + Visit the GitLab project -> Settings -> CI/CD -> Variables -> Add a new variable with the key :code:`DOCKER_AUTH_CONFIG` and the value of the contents of :code:`/opt/.docker/config.json` + +OpenBao Deployment +------------------ + +OpenBao must be installed on the same host as the runners. +If you have multiple environments that each have their own runners then OpenBao must be installed on each host. +However, if you have a single host that is shared between environments then OpenBao only needs to be installed once, which can be achieved by running the following playbook. + +.. code-block:: bash + + kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/deploy-openbao-kayobe-automation.yml + +.. note:: + + If you are sharing OpenBao between environments then you will need to rerun the playbook under each environment to ensure that the correct secrets are available to the runners. + You may use :code:`--tags add_secrets` to skip the deployment within other environments. + For this to work you will need to copy :code:`vault/kayobe-automation-keys.json` from the first environment to the other environments in addition to copying the host definition of the GitLab runner and network IP. + +Once the above playbook has been applied you need to grab the root token from :code:`vault/kayobe-automation-keys.json` as you will need this to enable JWT support. +This would also be an opportune time to encrypt the :code:`vault/kayobe-automation-keys.json` to protect the contents. + +.. 
code-block:: bash + + ansible-vault encrypt vault/kayobe-automation-keys.json --vault-password-file ~/.vault.password + +In order to enable JWT support the following steps must be carried out within the openbao container on the runner host. + +1. SSH into the runner host + +2. Run :code:`sudo docker exec -it bao sh` + +3. Run :code:`export BAO_ADDR=http://127.0.0.1:8200` + +4. Run :code:`bao login` and use the root token + +5. Run the following to enable and configure JWT support + +.. note:: + + The following steps are an example and should be adapted to suit your deployment. + For example project_id within the gitlab role will need the ID of the project that the runners are registered against. + This can be acquired by visiting the project -> Settings -> General -> General project settings -> Project ID. + +.. code-block:: bash + + bao auth enable jwt + bao policy write kayobe-automation - <`__. +Following the instructions in the documentation will allow you to customise the workflows to fit within your deployment. +If using multiple environments ensure that :code:`gitlab_kayobe_environments` is updated to reflect all environments present in the deployment. +Also consider the impact runbooks might have as the runbooks are designed with a particular cloud in mind and may not be suitable for all deployments such as hyperconverged deployments with Ceph on hypervisors. + +2. Run :code:`kayobe playbook run ${KAYOBE_CONFIG_PATH}/ansible/write-gitlab-pipelines.yml` + +3. Commit and push all newly generated pipelines found under the root of the repository. 
+ +Things to consider +================== + +- Adjust General Pipeline settings by visiting the project -> Settings -> CI/CD -> General pipelines + - Disable :code:`Public Pipelines` + - Disable :code:`Auto-cancel redundant pipelines` + - Disable :code:`Prevent outdated deployment jobs` + - Increase :code:`Timeout` to :code:`12h` + +- Disable Auto DevOps in the GitLab project settings by visiting the project -> Settings -> CI/CD -> Auto DevOps -> Disable Auto DevOps + +Sometimes the kayobe docker image must be rebuilt. The reasons for this include but are not limited to the following; + + * Change :code:`$KAYOBE_CONFIG_PATH/ansible/requirements.yml` + * Change to requirements.txt + * Update Kayobe + * Update kolla-ansible + * Prior to deployment of new a OpenStack release diff --git a/etc/kayobe/ansible/check-kayobe-version.yml b/etc/kayobe/ansible/check-kayobe-version.yml index b527fc5d8e..b893d806f0 100644 --- a/etc/kayobe/ansible/check-kayobe-version.yml +++ b/etc/kayobe/ansible/check-kayobe-version.yml @@ -29,18 +29,28 @@ register: kayobe_git_commit failed_when: kayobe_git_commit.stdout == "" + - name: Create a temporary directory to clone Kayobe into + ansible.builtin.tempfile: + state: directory + register: kayobe_temp_dir + - name: Clone Kayobe ansible.builtin.git: repo: https://github.com/stackhpc/kayobe.git - dest: /tmp/kayobe-git + dest: "{{ kayobe_temp_dir.path }}/kayobe-git" version: stackhpc/{{ openstack_release }} - name: Get tag from Kayobe commit ansible.builtin.command: cmd: git describe --tags {{ kayobe_git_commit.stdout }} - chdir: /tmp/kayobe-git + chdir: "{{ kayobe_temp_dir.path }}/kayobe-git" register: kayobe_current_version + - name: Clean up temporary directory + ansible.builtin.file: + state: absent + path: "{{ kayobe_temp_dir.path }}" + - name: Get latest Kayobe version ansible.builtin.shell: cmd: set -o pipefail && grep -o kayobe@stackhpc\/.*$ {{ requirements_path }} | cut -d @ -f 2 diff --git a/etc/kayobe/ansible/deploy-gitlab-runner.yml 
b/etc/kayobe/ansible/deploy-gitlab-runner.yml new file mode 100644 index 0000000000..44a1002b85 --- /dev/null +++ b/etc/kayobe/ansible/deploy-gitlab-runner.yml @@ -0,0 +1,24 @@ +--- +- name: Deploy GitLab runners + hosts: gitlab-runners + become: true + pre_tasks: + - name: Ensure /opt/.docker folder exists + ansible.builtin.file: + path: /opt/.docker + state: directory + + - name: Ensure docker/config.json exists for runner + ansible.builtin.copy: + content: | + { + "auths": { + "{{ pulp_url | regex_replace('^https?://|^http?://', '') }}": { + "auth": "{{ (pulp_username + ':' + pulp_password) | b64encode }}" + } + } + } + dest: /opt/.docker/config.json + mode: "0600" + roles: + - role: riemers.gitlab-runner diff --git a/etc/kayobe/ansible/deploy-openbao-kayobe-automation.yml b/etc/kayobe/ansible/deploy-openbao-kayobe-automation.yml new file mode 100644 index 0000000000..195f23add1 --- /dev/null +++ b/etc/kayobe/ansible/deploy-openbao-kayobe-automation.yml @@ -0,0 +1,78 @@ +--- +- name: Deploy OpenBao on the runners + any_errors_fatal: true + gather_facts: true + hosts: github-runners,gitlab-runners + tasks: + - name: Set a fact about the virtualenv on the remote system + ansible.builtin.set_fact: + virtualenv: "{{ ansible_python_interpreter | dirname | dirname }}" + when: + - ansible_python_interpreter is defined + - not ansible_python_interpreter.startswith('/bin/') + - not ansible_python_interpreter.startswith('/usr/bin/') + + - name: Ensure Python hvac module is installed + ansible.builtin.pip: + name: hvac + state: latest + extra_args: "{% if pip_upper_constraints_file %}-c {{ pip_upper_constraints_file }}{% endif %}" + virtualenv: "{{ virtualenv is defined | ternary(virtualenv, omit) }}" + become: "{{ virtualenv is not defined }}" + + - name: Ensure /opt/kayobe/vault exists + ansible.builtin.file: + path: /opt/kayobe/vault + state: directory + become: true + + - name: Ensure vault directory exists in environment + ansible.builtin.file: + path: "{{ 
kayobe_env_config_path }}/vault" + state: directory + become: true + + - name: Import OpenBao role + ansible.builtin.import_role: + name: stackhpc.hashicorp.openbao + vars: + openbao_config_dir: "/opt/kayobe/vault" + openbao_cluster_name: "kayobe-automation" + copy_self_signed_ca: false + openbao_write_keys_file: true + openbao_write_keys_file_path: "{{ kayobe_env_config_path }}/vault/kayobe-automation-keys.json" + + - name: Include OpenBao keys + ansible.builtin.include_vars: + file: "{{ kayobe_env_config_path }}/vault/kayobe-automation-keys.json" + name: openbao_keys + tags: always + + - name: Import Vault unseal role + ansible.builtin.import_role: + name: stackhpc.hashicorp.vault_unseal + vars: + vault_api_addr: "{{ openbao_api_addr }}" + vault_unseal_token: "{{ openbao_keys.root_token }}" + vault_unseal_keys: "{{ openbao_keys.keys_base64 }}" + vault_unseal_verify: false + environment: + https_proxy: '' + + - name: Create secret store + ansible.legacy.hashivault_secret_engine: + name: kayobe-automation + backend: kv + url: "{{ openbao_api_addr }}" + token: "{{ openbao_keys.root_token }}" + + - name: Ensure secret store is present + community.hashi_vault.vault_write: + url: "{{ openbao_api_addr }}" + token: "{{ openbao_keys.root_token }}" + path: kayobe-automation/{{ kayobe_environment }} + data: + kayobe_vault_password: "{{ kolla_ansible_vault_password }}" + kayobe_automation_ssh_private_key: "{{ lookup('ansible.builtin.file', '{{ ssh_private_key_path }}') }}" + kayobe_public_openrc: "{{ lookup('ansible.builtin.file', '{{ kolla_config_path }}/public-openrc.sh') }}" + tags: add_secrets diff --git a/etc/kayobe/ansible/deploy-radosgw-usage-exporter.yml b/etc/kayobe/ansible/deploy-radosgw-usage-exporter.yml index e7c0cf254e..d9db106bf8 100644 --- a/etc/kayobe/ansible/deploy-radosgw-usage-exporter.yml +++ b/etc/kayobe/ansible/deploy-radosgw-usage-exporter.yml @@ -114,7 +114,7 @@ ADMIN_ENTRY: admin ACCESS_KEY: "{{ ec2.Access }}" SECRET_KEY: "{{ ec2.Secret }}" - 
VIRTUAL_PORT: "{{ stackhpc_radosgw_usage_exporter_port | string }}" + VIRTUAL_PORT: "{{ stackhpc_radosgw_usage_exporter_backend_port | string }}" REQUESTS_CA_BUNDLE: "/etc/ssl/certs/ca-certificates.crt" entrypoint: "{{ ['python', '-u', './radosgw_usage_exporter.py', '--insecure'] if not stackhpc_radosgw_usage_exporter_verify else omit }}" vars: diff --git a/etc/kayobe/ansible/get-cloud-facts.yml b/etc/kayobe/ansible/get-cloud-facts.yml new file mode 100644 index 0000000000..e966f8acce --- /dev/null +++ b/etc/kayobe/ansible/get-cloud-facts.yml @@ -0,0 +1,87 @@ +--- +- name: Gather Cloud Facts + hosts: localhost + gather_facts: true + tasks: + - name: Write facts to file + vars: + cloud_facts: + ansible_control_host_distribution: "{{ ansible_facts.distribution }}" + ansible_control_host_distribution_release: "{{ ansible_facts.distribution_release }}" + openstack_release: "{{ openstack_release }}" + openstack_release_name: "{{ openstack_release_codename }}" + ansible_control_host_is_vm: "{{ ansible_facts.virtualization_role == 'guest' }}" + controller_count: "{{ groups['controllers'] | length }}" + hypervisor_count: "{{ groups['hypervisors'] | length }}" + monitoring_count: "{{ groups['monitoring'] | length }}" + osd_count: "{{ groups['osds'] | length }}" + compute_count: "{{ groups['compute'] | length }}" + baremetal_count: "{{ groups['baremetal-compute'] | length }}" + ceph_deployed: "{{ groups['ceph'] | length > 0 | bool }}" + ceph_count: "{{ groups['ceph'] | length }}" + ceph_release: "{{ cephadm_ceph_release }}" + storage_hyperconverged: "{{ groups['controllers'] | intersect(groups['osds']) | length > 0 | bool }}" + wazuh_enabled: "{{ groups['wazuh-agent'] | length > 0 | bool }}" + kayobe_managed_switches: "{{ groups['switches'] | length > 0 | bool }}" + proxy_configured: "{{ http_proxy | bool or https_proxy | bool }}" + bifrost_version: "{{ kolla_bifrost_source_version }}" + barbican_enabled: "{{ kolla_enable_barbican }}" + nova_enabled: "{{ kolla_enable_nova 
}}" + neutron_enabled: "{{ kolla_enable_neutron }}" + ovs_enabled: "{{ kolla_enable_openvswitch }}" + ovn_enabled: "{{ kolla_enable_ovn }}" + glance_enabled: "{{ kolla_enable_glance }}" + cinder_enabled: "{{ kolla_enable_cinder }}" + keystone_enabled: "{{ kolla_enable_keystone }}" + horizon_enabled: "{{ kolla_enable_horizon }}" + fluentd_enabled: "{{ kolla_enable_fluentd }}" + rabbitmq_enabled: "{{ kolla_enable_rabbitmq }}" + mariadb_enabled: "{{ kolla_enable_mariadb }}" + mariabackup_enabled: "{{ kolla_enable_mariabackup }}" + memcached_enabled: "{{ kolla_enable_memcached }}" + haproxy_enabled: "{{ kolla_enable_haproxy }}" + keepalived_enabled: "{{ kolla_enable_keepalived }}" + octavia_enabled: "{{ kolla_enable_octavia }}" + designate_enabled: "{{ kolla_enable_designate }}" + manila_enabled: "{{ kolla_enable_manila }}" + magnum_enabled: "{{ kolla_enable_magnum }}" + heat_enabled: "{{ kolla_enable_heat }}" + ironic_enabled: "{{ kolla_enable_ironic }}" + skyline_enabled: "{{ kolla_enable_skyline }}" + blazar_enabled: "{{ kolla_enable_blazar }}" + pulp_enabled: "{{ seed_pulp_container_enabled }}" + opensearch_enabled: "{{ kolla_enable_opensearch }}" + opensearch_dashboards_enabled: "{{ kolla_enable_opensearch_dashboards }}" + influxdb_enabled: "{{ kolla_enable_influxdb }}" + grafana_enabled: "{{ kolla_enable_grafana }}" + prometheus_enabled: "{{ kolla_enable_prometheus }}" + cloudkitty_enabled: "{{ kolla_enable_cloudkitty }}" + telegraf_enabled: "{{ kolla_enable_telegraf }}" + internal_tls_enabled: "{{ kolla_enable_tls_internal }}" + external_tls_enabled: "{{ kolla_enable_tls_external }}" + firewalld_enabled_all: >- + {{ + controller_firewalld_enabled and + compute_firewalld_enabled and + storage_firewalld_enabled and + monitoring_firewalld_enabled and + infra_vm_firewalld_enabled and + seed_firewalld_enabled and + seed_hypervisor_firewalld_enabled + }} + firewalld_enabled_any: >- + {{ + controller_firewalld_enabled or + compute_firewalld_enabled or + 
storage_firewalld_enabled or + monitoring_firewalld_enabled or + infra_vm_firewalld_enabled or + seed_firewalld_enabled or + seed_hypervisor_firewalld_enabled + }} + stackhpc_package_repos_enabled: "{{ stackhpc_repos_enabled }}" + pulp_tls_enabled: "{{ pulp_enable_tls }}" + kolla_image_tags: "{{ kolla_image_tags }}" + ansible.builtin.copy: + content: "{{ cloud_facts | to_nice_json(sort_keys=false) }}" + dest: ~/cloud-facts.json diff --git a/etc/kayobe/ansible/get-nvme-drives.yml b/etc/kayobe/ansible/get-nvme-drives.yml new file mode 100644 index 0000000000..1d2404d805 --- /dev/null +++ b/etc/kayobe/ansible/get-nvme-drives.yml @@ -0,0 +1,96 @@ +--- +- name: Gather unique NVMe disk models on all hosts + hosts: overcloud + gather_facts: no + tasks: + - name: Retrieve NVMe device information + ansible.builtin.command: "nvme list -o json" + register: nvme_list + changed_when: false + become: true + + - name: Parse NVMe device model names + ansible.builtin.set_fact: + nvme_models: "{{ nvme_models | default([]) + [item.ModelNumber] }}" + loop: "{{ nvme_list.stdout | from_json | json_query('Devices[].{ModelNumber: ModelNumber}') }}" + changed_when: false + + - name: Set unique NVMe models as host facts + ansible.builtin.set_fact: + unique_nvme_models: "{{ (nvme_models | default([])) | unique }}" + + - name: Show unique NVMe models per host + ansible.builtin.debug: + var: unique_nvme_models + +- name: Aggregate all unique NVMe models from all hosts + hosts: localhost + gather_facts: no + tasks: + - name: Aggregate unique NVMe models from all overcloud hosts + ansible.builtin.set_fact: + all_nvme_models: "{{ groups['overcloud'] | map('extract', hostvars, 'unique_nvme_models') | select('defined') | sum(start=[]) | unique }}" + + - name: Show all unique NVMe models + ansible.builtin.debug: + var: all_nvme_models + + - name: Ensure dwpd-ratings.yml exists + ansible.builtin.stat: + path: "{{ kayobe_env_config_path }}/dwpd-ratings.yml" + register: dwpd_ratings_stat + run_once: 
true + + - name: Load existing dwpd-ratings.yml + ansible.builtin.set_fact: + existing_dwpd_yml: "{{ lookup('file', kayobe_env_config_path ~ '/dwpd-ratings.yml') | from_yaml }}" + when: dwpd_ratings_stat.stat.exists + run_once: true + + - name: Convert existing YAML array into a dictionary + ansible.builtin.set_fact: + dwpd_lookup: "{{ dwpd_lookup | default({}) | combine({item.model_name: item.rated_dwpd}) }}" + loop: "{{ existing_dwpd_yml.stackhpc_dwpd_ratings | default([]) }}" + loop_control: + label: "{{ item.model_name }}" + run_once: true + + - name: Get list of existing model names + ansible.builtin.set_fact: + existing_model_names: "{{ existing_dwpd_yml.stackhpc_dwpd_ratings | default([]) | map(attribute='model_name') | list }}" + run_once: true + + - name: Identify new models not already in the configuration + ansible.builtin.set_fact: + new_models: "{{ all_nvme_models | default([]) | reject('in', existing_model_names | default([])) | list }}" + run_once: true + + - name: Create entry dictionary for new models + ansible.builtin.set_fact: + new_entries: "{{ new_entries | default([]) + [{'model_name': item, 'rated_dwpd': 1}] }}" + loop: "{{ new_models }}" + run_once: true + when: new_models | length > 0 + + - name: Build updated list for stackhpc_dwpd_ratings + ansible.builtin.set_fact: + new_dwpd_list: "{{ existing_dwpd_yml.stackhpc_dwpd_ratings | default([]) + (new_entries | default([])) }}" + run_once: true + + - name: Write updated dwpd-ratings.yml + ansible.builtin.copy: + content: "---\nstackhpc_dwpd_ratings:\n{% for item in new_dwpd_list %} - model_name: \"{{ item.model_name }}\"\n rated_dwpd: {{ item.rated_dwpd }}\n{% endfor %}" + dest: "{{ kayobe_env_config_path }}/dwpd-ratings.yml" + run_once: true + notify: Show updated dwpd-ratings.yml contents + when: new_dwpd_list is defined and new_dwpd_list | length > 0 + + handlers: + - name: Show updated dwpd-ratings.yml contents + ansible.builtin.debug: + msg: + - "Updated local dwpd-ratings.yml contents" + 
- "{{ {'stackhpc_dwpd_ratings': new_dwpd_list} | to_nice_yaml }}" + - "PLEASE REVIEW AND COMMIT {{ kayobe_env_config_path }}/dwpd-ratings.yml TO VERSION CONTROL." + run_once: true + changed_when: true diff --git a/etc/kayobe/ansible/requirements.yml b/etc/kayobe/ansible/requirements.yml index fe061e0c9c..3aceec6d43 100644 --- a/etc/kayobe/ansible/requirements.yml +++ b/etc/kayobe/ansible/requirements.yml @@ -11,7 +11,7 @@ collections: - name: stackhpc.hashicorp version: 2.7.1 - name: stackhpc.kayobe_workflows - version: 1.1.0 + version: 1.2.0 roles: - src: stackhpc.vxlan version: 1.1.0 @@ -32,3 +32,7 @@ roles: - name: geerlingguy.docker src: https://github.com/stackhpc/ansible-role-docker.git version: stackhpc/7.0.1.1 + # (jackhodgkiss) Update once patch is merged and released upstream. + - src: https://github.com/stackhpc/ansible-gitlab-runner + name: riemers.gitlab-runner + version: use-ansible-facts diff --git a/etc/kayobe/ansible/scripts/nvmemon.sh b/etc/kayobe/ansible/scripts/nvmemon.sh index 761e81b7da..40c2cb70f0 100644 --- a/etc/kayobe/ansible/scripts/nvmemon.sh +++ b/etc/kayobe/ansible/scripts/nvmemon.sh @@ -21,6 +21,43 @@ if ! command -v nvme >/dev/null 2>&1; then exit 1 fi +if ! command -v jq >/dev/null 2>&1; then + echo "${0##*/}: jq is required but not installed. Aborting." >&2 + exit 1 +fi + +# Path to the DWPD ratings JSON file +dwpd_file="/opt/kayobe/etc/monitoring/dwpd_ratings.json" + +declare -A rated_dwpd + +load_dwpd_ratings() { + if [[ -f "$dwpd_file" ]]; then + # Read the JSON; if it fails, default to empty array + dwpd_json="$(cat "$dwpd_file" 2>/dev/null | jq '.' 
|| echo '[]')" + + # We iterate over each array element in dwpd_json + while IFS= read -r line; do + key="$(echo "$line" | jq -r '.model_name')" + value="$(echo "$line" | jq -r '.rated_dwpd')" + + # Clean up trailing whitespace + key="${key%%[[:space:]]*}" + value="${value%%[[:space:]]*}" + + # If we have a valid key, store it in the dictionary + if [[ -n "$key" && "$key" != "null" ]]; then + rated_dwpd["$key"]="$value" + fi + done < <(echo "$dwpd_json" | jq -c '.[]') + else + echo "Warning: DWPD ratings file not found at '$dwpd_file'. Defaulting to rated_dwpd=1." >&2 + fi +} + + +load_dwpd_ratings + output_format_awk="$( cat <<'OUTPUTAWK' BEGIN { v = "" } @@ -44,58 +81,70 @@ format_output() { nvme_version="$(nvme version | awk '$1 == "nvme" {print $3}')" echo "nvmecli{version=\"${nvme_version}\"} 1" | format_output -# Get devices (DevicePath and PhysicalSize) -device_info="$(nvme list -o json | jq -c '.Devices[] | {DevicePath: .DevicePath, PhysicalSize: .PhysicalSize}')" +# Get devices (DevicePath, PhysicalSize and ModelNumber) +device_info="$(nvme list -o json | jq -c '.Devices[] | {DevicePath, PhysicalSize, ModelNumber, SerialNumber}')" + +# Convert device_info to an array +device_info_array=() +while IFS= read -r line; do + device_info_array+=("$line") +done <<< "$device_info" # Loop through the NVMe devices -echo "$device_info" | while read -r device_data; do - device=$(echo "$device_data" | jq -r '.DevicePath') +for device_data in "${device_info_array[@]}"; do + device="$(echo "$device_data" | jq -r '.DevicePath')" json_check="$(nvme smart-log -o json "${device}")" disk="${device##*/}" + model_name="$(echo "$device_data" | jq -r '.ModelNumber')" + serial_number="$(echo "$device_data" | jq -r '.SerialNumber')" - physical_size=$(echo "$device_data" | jq -r '.PhysicalSize') - echo "physical_size_bytes{device=\"${disk}\"} ${physical_size}" + physical_size="$(echo "$device_data" | jq -r '.PhysicalSize')" + echo 
"physical_size_bytes{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${physical_size}" # The temperature value in JSON is in Kelvin, we want Celsius value_temperature="$(echo "$json_check" | jq '.temperature - 273')" - echo "temperature_celsius{device=\"${disk}\"} ${value_temperature}" + echo "temperature_celsius{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_temperature}" + + # Get the rated DWPD from the dictionary or default to 1 if not found + value_rated_dwpd="${rated_dwpd[$model_name]:-1}" + echo "rated_dwpd{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_rated_dwpd}" value_available_spare="$(echo "$json_check" | jq '.avail_spare / 100')" - echo "available_spare_ratio{device=\"${disk}\"} ${value_available_spare}" + echo "available_spare_ratio{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_available_spare}" value_available_spare_threshold="$(echo "$json_check" | jq '.spare_thresh / 100')" - echo "available_spare_threshold_ratio{device=\"${disk}\"} ${value_available_spare_threshold}" + echo "available_spare_threshold_ratio{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_available_spare_threshold}" value_percentage_used="$(echo "$json_check" | jq '.percent_used / 100')" - echo "percentage_used_ratio{device=\"${disk}\"} ${value_percentage_used}" + echo "percentage_used_ratio{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_percentage_used}" value_critical_warning="$(echo "$json_check" | jq '.critical_warning')" - echo "critical_warning_total{device=\"${disk}\"} ${value_critical_warning}" + echo "critical_warning_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_critical_warning}" value_media_errors="$(echo "$json_check" | jq '.media_errors')" - echo "media_errors_total{device=\"${disk}\"} 
${value_media_errors}" + echo "media_errors_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_media_errors}" value_num_err_log_entries="$(echo "$json_check" | jq '.num_err_log_entries')" - echo "num_err_log_entries_total{device=\"${disk}\"} ${value_num_err_log_entries}" + echo "num_err_log_entries_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_num_err_log_entries}" value_power_cycles="$(echo "$json_check" | jq '.power_cycles')" - echo "power_cycles_total{device=\"${disk}\"} ${value_power_cycles}" + echo "power_cycles_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_power_cycles}" value_power_on_hours="$(echo "$json_check" | jq '.power_on_hours')" - echo "power_on_hours_total{device=\"${disk}\"} ${value_power_on_hours}" + echo "power_on_hours_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_power_on_hours}" value_controller_busy_time="$(echo "$json_check" | jq '.controller_busy_time')" - echo "controller_busy_time_seconds{device=\"${disk}\"} ${value_controller_busy_time}" + echo "controller_busy_time_seconds{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_controller_busy_time}" value_data_units_written="$(echo "$json_check" | jq '.data_units_written')" - echo "data_units_written_total{device=\"${disk}\"} ${value_data_units_written}" + echo "data_units_written_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_data_units_written}" value_data_units_read="$(echo "$json_check" | jq '.data_units_read')" - echo "data_units_read_total{device=\"${disk}\"} ${value_data_units_read}" + echo "data_units_read_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_data_units_read}" value_host_read_commands="$(echo "$json_check" | jq '.host_read_commands')" - echo 
"host_read_commands_total{device=\"${disk}\"} ${value_host_read_commands}" + echo "host_read_commands_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_host_read_commands}" value_host_write_commands="$(echo "$json_check" | jq '.host_write_commands')" - echo "host_write_commands_total{device=\"${disk}\"} ${value_host_write_commands}" + echo "host_write_commands_total{device=\"${disk}\",model=\"${model_name}\",serial_number=\"${serial_number}\"} ${value_host_write_commands}" done | format_output diff --git a/etc/kayobe/ansible/smartmon-tools.yml b/etc/kayobe/ansible/smartmon-tools.yml index c6fa35accf..f9024f6e3d 100644 --- a/etc/kayobe/ansible/smartmon-tools.yml +++ b/etc/kayobe/ansible/smartmon-tools.yml @@ -1,7 +1,6 @@ --- -- name: Install and set up smartmon-tools +- name: Install and set up SMART monitoring tools hosts: overcloud - tasks: - name: Ensure smartmontools, jq, nvme-cli and cron/cronie are installed ansible.builtin.package: @@ -13,11 +12,23 @@ state: present become: true - - name: Ensure Python 3, venv, and pip are installed - ansible.builtin.package: - name: > - {{ ['python3', 'python3-pip'] + (['python3-venv'] if ansible_facts['distribution'] == 'Ubuntu' else []) }} + - name: Ensure Python 3, venv, and pip are installed on Debian/Ubuntu + ansible.builtin.apt: + name: + - python3 + - python3-venv + - python3-pip + state: present + when: ansible_facts.os_family == 'Debian' + become: true + + - name: Ensure Python 3, and pip are installed on RedHat/CentOS + ansible.builtin.dnf: + name: + - python3 + - python3-pip state: present + when: ansible_facts.os_family == 'RedHat' become: true - name: Create smartmon Python virtual environment @@ -31,6 +42,7 @@ name: - prometheus_client - pySMART + state: present virtualenv: /opt/smartmon-venv virtualenv_python: python3 become: true @@ -98,3 +110,35 @@ path: /usr/local/bin/smartmon.sh state: absent become: true + +- name: Gather NVMe drives and generate dwpd ratings + 
import_playbook: get-nvme-drives.yml + when: create_dwpd_ratings | default(false) + +- name: Copy DWPD ratings to overcloud hosts + hosts: overcloud + gather_facts: false + tasks: + - name: Convert the stackhpc_dwpd_ratings variable to JSON + ansible.builtin.set_fact: + dwpd_ratings_json: "{{ stackhpc_dwpd_ratings | default([]) | to_json }}" + run_once: true + when: stackhpc_dwpd_ratings is defined + + - name: Ensure /opt/kayobe/etc/monitoring directory exists + ansible.builtin.file: + path: /opt/kayobe/etc/monitoring + state: directory + mode: '0755' + become: true + when: stackhpc_dwpd_ratings is defined + + - name: Copy JSON file to remote + ansible.builtin.copy: + content: "{{ dwpd_ratings_json }}" + dest: "/opt/kayobe/etc/monitoring/dwpd_ratings.json" + owner: root + group: root + mode: '0644' + become: true + when: stackhpc_dwpd_ratings is defined diff --git a/etc/kayobe/ansible/write-gitlab-pipelines.yml b/etc/kayobe/ansible/write-gitlab-pipelines.yml new file mode 100644 index 0000000000..c85e6adcdd --- /dev/null +++ b/etc/kayobe/ansible/write-gitlab-pipelines.yml @@ -0,0 +1,7 @@ +--- +- name: Write Kayobe Automation Pipeline for GitLab + hosts: gitlab-writer + vars: + gitlab_output_directory: "{{ kayobe_config_path }}/../../" + roles: + - stackhpc.kayobe_workflows.gitlab diff --git a/etc/kayobe/containers/pulp/post.yml b/etc/kayobe/containers/pulp/post.yml index 7a4e7e5957..34c25093f1 100644 --- a/etc/kayobe/containers/pulp/post.yml +++ b/etc/kayobe/containers/pulp/post.yml @@ -4,7 +4,7 @@ url: "{{ pulp_url }}/pulp/api/v3/status/" register: pulp_status until: pulp_status is success - retries: "{{ pulp_timeout_retries | default(30) }}" + retries: "{{ pulp_timeout_retries | default(120) }}" delay: "{{ pulp_timeout_delay | default(3) }}" - name: Set the Pulp admin password diff --git a/etc/kayobe/environments/ci-multinode/stackhpc-monitoring.yml b/etc/kayobe/environments/ci-multinode/stackhpc-monitoring.yml index 1d9514553e..93ce650b4f 100644 --- 
a/etc/kayobe/environments/ci-multinode/stackhpc-monitoring.yml +++ b/etc/kayobe/environments/ci-multinode/stackhpc-monitoring.yml @@ -1,3 +1,3 @@ --- # Path to a CA certificate file to trust in the OpenStack Capacity exporter. -stackhpc_os_capacity_openstack_cacert: "{{ kayobe_env_config_path }}/kolla/certificates/ca/openbao.crt" +stackhpc_os_capacity_openstack_cacert: "{{ kayobe_env_config_path }}/kolla/certificates/ca/vault.crt" diff --git a/etc/kayobe/environments/ci-multinode/tempest.yml b/etc/kayobe/environments/ci-multinode/tempest.yml index ae2d8f1325..0657946bb4 100644 --- a/etc/kayobe/environments/ci-multinode/tempest.yml +++ b/etc/kayobe/environments/ci-multinode/tempest.yml @@ -3,4 +3,4 @@ rally_no_sensitive_log: false # Add the Vault CA certificate to the rally container when running tempest. -tempest_cacert: "{{ kayobe_env_config_path }}/kolla/certificates/ca/openbao.crt" +tempest_cacert: "{{ kayobe_env_config_path }}/kolla/certificates/ca/vault.crt" diff --git a/etc/kayobe/inventory/group_vars/gitlab-runners/runners.yml b/etc/kayobe/inventory/group_vars/gitlab-runners/runners.yml new file mode 100644 index 0000000000..346acef278 --- /dev/null +++ b/etc/kayobe/inventory/group_vars/gitlab-runners/runners.yml @@ -0,0 +1,9 @@ +--- +# Configuration of GitLab runners using riemers.gitlab-runner should go here. +# See documentation for more information +# https://github.com/riemers/ansible-gitlab-runner +# https://stackhpc-kayobe-config.readthedocs.io/en/stackhpc-2024.1/configuration/ci-cd.html + +############################################################################### +# Dummy variable to allow Ansible to accept this file. 
+workaround_ansible_issue_8743: yes diff --git a/etc/kayobe/inventory/group_vars/gitlab-writer/writer.yml b/etc/kayobe/inventory/group_vars/gitlab-writer/writer.yml new file mode 100644 index 0000000000..be2ce1c7e4 --- /dev/null +++ b/etc/kayobe/inventory/group_vars/gitlab-writer/writer.yml @@ -0,0 +1,14 @@ +--- +# Configuration of GitLab pipelines generated with stackhpc.kayobe_workflows.gitlab should go here. +# See documentation for more information +# https://github.com/stackhpc/ansible-collection-kayobe-workflows/blob/main/roles/gitlab/README.md + +gitlab_output_directory: $KAYOBE_CONFIG_PATH/../../.gitlab/ + +gitlab_registry: "{{ pulp_url | regex_replace('^https?://', '') }}" + +gitlab_openstack_release: "{{ openstack_release }}" + +############################################################################### +# Dummy variable to allow Ansible to accept this file. +workaround_ansible_issue_8743: yes diff --git a/etc/kayobe/inventory/groups b/etc/kayobe/inventory/groups index dfaa264a41..3f4b7241a9 100644 --- a/etc/kayobe/inventory/groups +++ b/etc/kayobe/inventory/groups @@ -38,6 +38,12 @@ overcloud [github-writer] localhost +[gitlab-runners] +# Empty group to provide declaration of gitlab-runners group. + +[gitlab-writer] +localhost + ############################################################################### # Overcloud groups. @@ -90,6 +96,7 @@ network monitoring storage compute +gitlab-runners [docker-registry:children] # Hosts in this group will have a Docker Registry deployed.
This group should diff --git a/etc/kayobe/kolla-image-tags.yml b/etc/kayobe/kolla-image-tags.yml index 55f0c0970c..b4f9d28769 100644 --- a/etc/kayobe/kolla-image-tags.yml +++ b/etc/kayobe/kolla-image-tags.yml @@ -6,6 +6,9 @@ kolla_image_tags: openstack: rocky-9: 2025.1-rocky-9-20250616T133037 ubuntu-noble: 2025.1-ubuntu-noble-20250613T131221 + neutron_bgp_dragent: + rocky-9: 2025.1-rocky-9-20250715T140744 + ubuntu-noble: 2025.1-ubuntu-noble-20250715T140744 neutron_metadata_agent: rocky-9: 2025.1-rocky-9-20250626T074649 ubuntu-noble: 2025.1-ubuntu-noble-20250626T074649 diff --git a/etc/kayobe/kolla.yml b/etc/kayobe/kolla.yml index 5e6a8b9739..02e40dc76a 100644 --- a/etc/kayobe/kolla.yml +++ b/etc/kayobe/kolla.yml @@ -513,6 +513,8 @@ kolla_build_args: {} kolla_overcloud_inventory_pass_through_host_vars_extra: - stackhpc_gpu_data - gpu_group_map + - stackhpc_radosgw_usage_exporter_frontend_port + - stackhpc_radosgw_usage_exporter_backend_port # List of names of host variables to pass through from kayobe hosts to # kolla-ansible hosts, if set. 
See also diff --git a/etc/kayobe/kolla/config/grafana/dashboards/openstack/hardware_overview.json b/etc/kayobe/kolla/config/grafana/dashboards/openstack/hardware_overview.json index b27496136e..b305502223 100644 --- a/etc/kayobe/kolla/config/grafana/dashboards/openstack/hardware_overview.json +++ b/etc/kayobe/kolla/config/grafana/dashboards/openstack/hardware_overview.json @@ -1,5 +1,48 @@ {% raw %} { + "__inputs": [ + { + "name": "datasource", + "label": "Prometheus", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + "pluginName": "Prometheus" + } + ], + "__elements": {}, + "__requires": [ + { + "type": "grafana", + "id": "grafana", + "name": "Grafana", + "version": "11.4.0" + }, + { + "type": "datasource", + "id": "prometheus", + "name": "Prometheus", + "version": "1.0.0" + }, + { + "type": "panel", + "id": "stat", + "name": "Stat", + "version": "" + }, + { + "type": "panel", + "id": "table", + "name": "Table", + "version": "" + }, + { + "type": "panel", + "id": "timeseries", + "name": "Time series", + "version": "" + } + ], "annotations": { "list": [ { @@ -25,9 +68,22 @@ "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, + "id": null, "links": [], - "liveNow": false, "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 10, + "panels": [], + "title": "All Disks Metrics", + "type": "row" + }, { "datasource": { "type": "prometheus", @@ -56,15 +112,15 @@ "h": 7, "w": 6, "x": 0, - "y": 0 + "y": 1 }, - "hideTimeOverride": false, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", + "percentChangeColorMode": "standard", "reduceOptions": { "calcs": [ "lastNotNull" @@ -76,7 +132,7 @@ "textMode": "auto", "wideLayout": true }, - "pluginVersion": "11.0.0", + "pluginVersion": "11.4.0", "targets": [ { "datasource": { @@ -130,15 +186,15 @@ "h": 7, "w": 6, "x": 6, - "y": 0 + "y": 1 }, - "hideTimeOverride": false, "id": 5, 
"options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", + "percentChangeColorMode": "standard", "reduceOptions": { "calcs": [ "lastNotNull" @@ -150,7 +206,7 @@ "textMode": "auto", "wideLayout": true }, - "pluginVersion": "11.0.0", + "pluginVersion": "11.4.0", "targets": [ { "datasource": { @@ -199,15 +255,15 @@ "h": 7, "w": 6, "x": 12, - "y": 0 + "y": 1 }, - "hideTimeOverride": false, "id": 6, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", + "percentChangeColorMode": "standard", "reduceOptions": { "calcs": [ "lastNotNull" @@ -219,7 +275,7 @@ "textMode": "auto", "wideLayout": true }, - "pluginVersion": "11.0.0", + "pluginVersion": "11.4.0", "targets": [ { "datasource": { @@ -489,7 +545,7 @@ "h": 10, "w": 20, "x": 0, - "y": 7 + "y": 8 }, "id": 2, "options": { @@ -506,7 +562,7 @@ "showHeader": true, "sortBy": [] }, - "pluginVersion": "11.0.0", + "pluginVersion": "11.4.0", "targets": [ { "$$hashKey": "object:40", @@ -541,7 +597,7 @@ }, "editorMode": "code", "exemplar": false, - "expr": "smartmon_temperature_case_raw_value{instance=~\"$node\"} or smartmon_temperature_celsius_raw_value{instance=~\"$node\"}", + "expr": "smartmon_temperature{instance=~\"$node\"}", "format": "table", "hide": false, "instant": true, @@ -562,7 +618,12 @@ { "id": "organize", "options": { - "excludeByName": {}, + "excludeByName": { + "Time": true, + "__name__": true, + "job": true + }, + "includeByName": {}, "indexByName": { "Time 1": 3, "Time 2": 10, @@ -583,7 +644,12 @@ "type 1": 2, "type 2": 17 }, - "renameByName": {} + "renameByName": { + "disk": "Disk", + "instance": "Instance", + "job": "", + "type": "Type" + } } } ], @@ -607,6 +673,7 @@ "axisLabel": "Temperature (°C)", "axisPlacement": "auto", "barAlignment": 0, + "barWidthFactor": 0.6, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", @@ -653,9 +720,8 @@ "h": 13, "w": 9, "x": 0, - "y": 17 + "y": 18 }, - "hideTimeOverride": 
false, "id": 8, "options": { "legend": { @@ -665,11 +731,11 @@ "showLegend": true }, "tooltip": { - "maxHeight": 600, "mode": "single", "sort": "none" } }, + "pluginVersion": "11.4.0", "targets": [ { "datasource": { @@ -678,7 +744,7 @@ }, "editorMode": "code", "exemplar": false, - "expr": "avg_over_time(smartmon_temperature_case_raw_value{instance=~\"$node\"}[1h]) or avg_over_time(smartmon_temperature_celsius_raw_value{instance=~\"$node\"}[1h])", + "expr": "avg_over_time(smartmon_temperature{instance=~\"$node\"}[1h])", "instant": false, "interval": "", "legendFormat": "{{instance}} - {{disk}} - {{serial_number}}", @@ -686,9 +752,548 @@ "refId": "A" } ], - "title": "Disk Temperatures", + "title": "All Disk Temperatures", "type": "timeseries" }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 31 + }, + "id": 11, + "panels": [], + "title": "NVMe Metrics", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "description": "", + "fieldConfig": { + "defaults": { + "custom": { + "align": "center", + "cellOptions": { + "type": "auto" + }, + "filterable": false, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "job" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "unique_device" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Value #Health" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "mode": "basic", + "type": "color-background" + } + }, + { + "id": "custom.width" + }, + { + "id": "displayName", + "value": "Health" + }, + { + "id": "mappings", + "value": [ + { + "options": { + "0": { + "color": "green", + "index": 0, + "text": "Ok" + } + 
}, + "type": "value" + }, + { + "options": { + "from": 1, + "result": { + "color": "red", + "index": 1, + "text": "Bad" + }, + "to": 1000000000000000 + }, + "type": "range" + } + ] + } + ] + }, + { + "matcher": { + "id": "byRegexp", + "options": ".* 2" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Value #Temp" + }, + "properties": [ + { + "id": "displayName", + "value": "Temperature" + }, + { + "id": "unit", + "value": "celsius" + }, + { + "id": "noValue", + "value": "-" + }, + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + } + ] + }, + { + "matcher": { + "id": "byRegexp", + "options": ".* 1" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "__name__" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byRegexp", + "options": ".* 3" + }, + "properties": [ + { + "id": "custom.hidden", + "value": true + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Value #TBW" + }, + "properties": [ + { + "id": "unit", + "value": "deckbytes" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Value #TBR" + }, + "properties": [ + { + "id": "unit", + "value": "deckbytes" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Value #Capacity" + }, + "properties": [ + { + "id": "unit", + "value": "decbytes" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Device" + }, + "properties": [ + { + "id": "links", + "value": [ + { + "title": "", + "url": "/d/uesjf83hh/nvme-monitoring?var-serial_number=${__data.fields[\"Serial Number\"]}" + } + ] + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 20, + "x": 0, + "y": 32 + }, + "id": 12, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 1, + 
"showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "TBW" + } + ] + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "$$hashKey": "object:40", + "aggregation": "Last", + "alias": "Healthy", + "crit": 0, + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "decimals": 0, + "displayAliasType": "Warning / Critical", + "displayType": "Regular", + "displayValueWithAlias": "Never", + "editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_critical_warning_total{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\")", + "format": "table", + "instant": true, + "interval": "", + "legendFormat": "", + "range": false, + "refId": "Health", + "units": "none", + "valueHandler": "Number Threshold", + "warn": 0 + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_temperature_celsius{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\")", + "format": "table", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "__auto", + "range": false, + "refId": "Temp" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_data_units_written_total{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\") * 512", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "", + "range": false, + "refId": "TBW" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_data_units_read_total{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\") * 512", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "", + "range": false, + "refId": "TBR" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + 
"editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_physical_size_bytes{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\")", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "", + "range": false, + "refId": "Capacity" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "label_join(delta(nvme_data_units_written_total{instance=~\"$node\"}[24h])*512000,\"unique_device\", \"-\", \"instance\", \"device\")/label_join(nvme_physical_size_bytes{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\")", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "", + "range": false, + "refId": "DWPD" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "label_join(nvme_rated_dwpd{instance=~\"$node\"},\"unique_device\", \"-\", \"instance\", \"device\")", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "Rated DWPD" + } + ], + "title": "SMART Info", + "transformations": [ + { + "id": "seriesToColumns", + "options": { + "byField": "unique_device", + "mode": "outer" + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time 1": true, + "Time 2": true, + "Time 3": true, + "Time 4": true, + "Time 5": true, + "Time 6": true, + "Time 7": true, + "Value #Health": false, + "__name__ 1": true, + "__name__ 2": true, + "__name__ 3": true, + "__name__ 4": true, + "device 1": false, + "device 2": true, + "device 3": true, + "device 4": true, + "device 5": true, + "device 6": true, + "device 7": true, + "instance 1": false, + "instance 2": true, + "instance 3": true, + "instance 4": true, + "instance 5": true, + "instance 6": true, + "instance 7": true, + "job 1": true, + "job 2": true, + "job 3": true, + "job 4": true, + "job 5": true, + "job 6": 
true, + "job 7": true, + "model 2": true, + "model 3": true, + "model 4": true, + "model 5": true, + "model 6": true, + "model 7": true, + "original_device 1": true, + "original_device 2": true, + "original_device 3": true, + "original_device 4": true, + "original_device 5": true, + "original_device 6": true, + "original_device 7": true, + "serial_number 2": true, + "serial_number 3": true, + "serial_number 4": true, + "serial_number 5": true, + "serial_number 6": true, + "serial_number 7": true, + "unique_device": true + }, + "includeByName": {}, + "indexByName": { + "Time 1": 11, + "Time 2": 15, + "Time 3": 23, + "Time 4": 27, + "Time 5": 32, + "Time 6": 38, + "Time 7": 53, + "Value #Capacity": 6, + "Value #DWPD": 8, + "Value #Health": 2, + "Value #Rated DWPD": 7, + "Value #TBR": 5, + "Value #TBW": 4, + "Value #Temp": 3, + "__name__ 1": 12, + "__name__ 2": 16, + "__name__ 3": 37, + "__name__ 4": 54, + "device 1": 1, + "device 2": 21, + "device 3": 24, + "device 4": 28, + "device 5": 33, + "device 6": 39, + "device 7": 55, + "instance 1": 0, + "instance 2": 17, + "instance 3": 14, + "instance 4": 29, + "instance 5": 34, + "instance 6": 40, + "instance 7": 56, + "job 1": 13, + "job 2": 18, + "job 3": 25, + "job 4": 30, + "job 5": 35, + "job 6": 41, + "job 7": 57, + "model 1": 9, + "model 2": 43, + "model 3": 45, + "model 4": 47, + "model 5": 49, + "model 6": 51, + "model 7": 58, + "original_device 1": 20, + "original_device 2": 22, + "original_device 3": 26, + "original_device 4": 31, + "original_device 5": 36, + "original_device 6": 42, + "original_device 7": 59, + "serial_number 1": 10, + "serial_number 2": 44, + "serial_number 3": 46, + "serial_number 4": 48, + "serial_number 5": 50, + "serial_number 6": 52, + "serial_number 7": 60, + "unique_device": 19 + }, + "renameByName": { + "Time 1": "", + "Value #Capacity": "Disk Size", + "Value #DWPD": "DWPD", + "Value #Rated DWPD": "Rated DWPD", + "Value #TBR": "TBR", + "Value #TBW": "TBW", + "__name__ 1": "", + 
"device 1": "Device", + "instance 1": "Hostname", + "model 1": "Model Name", + "serial_number 1": "Serial Number" + } + } + } + ], + "transparent": true, + "type": "table" + }, { "datasource": { "type": "prometheus", @@ -707,6 +1312,7 @@ "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, + "barWidthFactor": 0.6, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", @@ -748,8 +1354,8 @@ "gridPos": { "h": 13, "w": 10, - "x": 9, - "y": 17 + "x": 0, + "y": 42 }, "id": 9, "options": { @@ -760,11 +1366,11 @@ "showLegend": true }, "tooltip": { - "maxHeight": 600, "mode": "single", "sort": "none" } }, + "pluginVersion": "11.4.0", "targets": [ { "datasource": { @@ -780,58 +1386,142 @@ ], "title": "DWPD", "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "Temperature (°C)", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 13, + "w": 9, + "x": 10, + "y": 42 + }, + "id": 13, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + 
"pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "avg_over_time(nvme_temperature_celsius{instance=~\"$node\"}[1h]) ", + "instant": false, + "interval": "", + "legendFormat": "{{instance}} - {{device}}", + "range": true, + "refId": "A" + } + ], + "title": "NVMe Temperatures", + "type": "timeseries" } ], "refresh": false, - "schemaVersion": 39, + "schemaVersion": 40, "tags": [], "templating": { "list": [ { + "baseFilters": [], "datasource": { "type": "prometheus", "uid": "PBFA97CFB590B2093" }, "filters": [], - "hide": 0, "name": "Filters", - "skipUrlSync": false, "type": "adhoc" }, { - "current": { - "selected": false, - "text": "Prometheus", - "value": "Prometheus" - }, - "hide": 0, + "current": {}, "includeAll": false, - "multi": false, "name": "datasource", "options": [], "query": "prometheus", - "queryValue": "", "refresh": 1, "regex": "", - "skipUrlSync": false, "type": "datasource" }, { "allValue": ".*", - "current": { - "selected": false, - "text": "All", - "value": "$__all" - }, + "current": {}, "datasource": { "type": "prometheus", "uid": "${datasource}" }, "definition": "label_values(node_cpu_seconds_total{job=\"node\"}, instance)", - "hide": 0, "includeAll": true, "label": "Host:", - "multi": false, "name": "node", "options": [], "query": { @@ -840,7 +1530,6 @@ }, "refresh": 1, "regex": "", - "skipUrlSync": false, "sort": 1, "type": "query" } @@ -850,12 +1539,11 @@ "from": "now-24h", "to": "now" }, - "timeRangeUpdatedDuringEditOrView": false, "timepicker": {}, "timezone": "", "title": "Hardware Overview", "uid": "TCN51Y25P", - "version": 1, + "version": 10, "weekStart": "" } {% endraw %} diff --git a/etc/kayobe/kolla/config/grafana/dashboards/openstack/nvme.json b/etc/kayobe/kolla/config/grafana/dashboards/openstack/nvme.json new file mode 100644 index 0000000000..1669b02a06 --- /dev/null +++ 
b/etc/kayobe/kolla/config/grafana/dashboards/openstack/nvme.json @@ -0,0 +1,1217 @@ +{% raw %} +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": 17197, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 2, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "text": {}, + "textMode": "name", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "expr": "nvme_data_units_written_total{serial_number=~\"$serial_number\"}", + "instant": true, + "legendFormat": "{{instance}} - {{device}} - {{serial_number}}", + "refId": "A" + } + ], + "title": "Device & Serial Number", + "type": "stat" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 3 + }, + "id": 26, + "panels": [], + "title": "Device Information", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "decbytes" + }, + 
"overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 0, + "y": 4 + }, + "id": 22, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_physical_size_bytes{serial_number=\"$serial_number\"}", + "legendFormat": "Physical Size", + "refId": "A" + } + ], + "title": "Physical Size", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "yellow", + "value": 65 + }, + { + "color": "red", + "value": 75 + } + ] + }, + "unit": "celsius" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 4, + "y": 4 + }, + "id": 6, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_temperature_celsius{serial_number=\"$serial_number\"}", + "legendFormat": "Temperature", + "refId": "A" + } + ], + "title": "Temperature", + "type": "gauge" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": 
[] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 8, + "y": 4 + }, + "id": 23, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_rated_dwpd{serial_number=\"$serial_number\"}", + "legendFormat": "Rated DWPD", + "refId": "A" + } + ], + "title": "Rated DWPD", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "orange", + "value": 50000 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 12, + "y": 4 + }, + "id": 17, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_power_on_hours_total{serial_number=\"$serial_number\"}", + "legendFormat": "Power Hours", + "refId": "A" + } + ], + "title": "Power-On Hours", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 
16, + "y": 4 + }, + "id": 16, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_power_cycles_total{serial_number=\"$serial_number\"}", + "legendFormat": "Power Cycles", + "refId": "A" + } + ], + "title": "Power Cycles", + "type": "stat" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 11 + }, + "id": 10, + "panels": [], + "title": "Health Indicators", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 0, + "y": 12 + }, + "id": 21, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_critical_warning_total{serial_number=\"$serial_number\"}", + "legendFormat": "Critical Warnings", + "refId": "A" + } + ], + "title": "Critical Warnings", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], 
+ "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "orange", + "value": 80 + }, + { + "color": "red", + "value": 90 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 4, + "y": 12 + }, + "id": 5, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_percentage_used_ratio{serial_number=\"$serial_number\"} * 100", + "legendFormat": "Percentage Used", + "refId": "A" + } + ], + "title": "Percentage Used", + "type": "gauge" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 8, + "y": 12 + }, + "id": 19, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_num_err_log_entries_total{serial_number=\"$serial_number\"}", + "legendFormat": "Error Log Entries", + "refId": "A" + } + ], + "title": "Error Log Entries", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" 
+ }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 12, + "y": 12 + }, + "id": 18, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_media_errors_total{serial_number=\"$serial_number\"}", + "legendFormat": "Media Errors", + "refId": "A" + } + ], + "title": "Media Errors", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "yellow", + "value": 20 + }, + { + "color": "green", + "value": 50 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 16, + "y": 12 + }, + "id": 4, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_available_spare_ratio{serial_number=\"$serial_number\"} * 100", + "legendFormat": "Available Spare", + "refId": "A" + } + ], + "title": "Available Spare Ratio", + "type": "gauge" + }, + { + "datasource": { 
+ "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 20, + "y": 12 + }, + "id": 15, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "expr": "nvme_available_spare_threshold_ratio{serial_number=\"$serial_number\"} * 100", + "legendFormat": "Spare Threshold", + "refId": "A" + } + ], + "title": "Spare Threshold Ratio", + "type": "stat" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 19 + }, + "id": 11, + "panels": [], + "title": "Performance Metrics", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } 
+ }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 12, + "x": 0, + "y": 20 + }, + "id": 7, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "expr": "rate(nvme_data_units_read_total{serial_number=\"$serial_number\"}[5m])*512000", + "legendFormat": "Data Read", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "expr": "rate(nvme_data_units_written_total{serial_number=\"$serial_number\"}[5m])*512000", + "legendFormat": "Data Written", + "range": true, + "refId": "B" + } + ], + "title": "Disk I/O", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "decbytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 12, + "y": 20 + }, + "id": 25, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": 
"nvme_data_units_written_total{serial_number=\"$serial_number\"} * 512000", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "All Time TBW", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "decbytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 18, + "y": 20 + }, + "id": 24, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "expr": "nvme_data_units_read_total{serial_number=\"$serial_number\"} * 512000", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "All Time TBR", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + 
"mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 12, + "x": 0, + "y": 27 + }, + "id": 20, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "11.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "editorMode": "code", + "expr": "irate(nvme_controller_busy_time_seconds{serial_number=\"$serial_number\"}[5m])", + "legendFormat": "Controller Busy Time", + "range": true, + "refId": "A" + } + ], + "title": "Controller Busy Time", + "type": "timeseries" + } + ], + "preload": false, + "refresh": "1m", + "schemaVersion": 40, + "tags": [], + "templating": { + "list": [ + { + "current": { + "text": "Prometheus", + "value": "PBFA97CFB590B2093" + }, + "description": "", + "label": "Datasource", + "name": "datasource", + "options": [], + "query": "prometheus", + "refresh": 1, + "regex": "", + "type": "datasource" + }, + { + "current": { + "text": "Z2M0A13LTCD8", + "value": "Z2M0A13LTCD8" + }, + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "definition": "label_values(nvme_data_units_read_total,serial_number)", + "includeAll": false, + "label": "Serial Number", + "name": "serial_number", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(nvme_data_units_read_total,serial_number)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + } + ] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "NVMe Monitoring", + "uid": "uesjf83hh", + "version": 1, + "weekStart": "" +} +{% endraw %} diff --git 
a/etc/kayobe/kolla/config/grafana/dashboards/openstack/rabbitmq.json b/etc/kayobe/kolla/config/grafana/dashboards/openstack/rabbitmq.json index 82008deb3e..010adfa2bf 100644 --- a/etc/kayobe/kolla/config/grafana/dashboards/openstack/rabbitmq.json +++ b/etc/kayobe/kolla/config/grafana/dashboards/openstack/rabbitmq.json @@ -3391,7 +3391,7 @@ "steppedLine": false, "targets": [ { - "expr": "sum(\n (rate(rabbitmq_channel_messages_delivered_total[60s]) * on(instance) group_left(rabbitmq_node, rabbitmq_node) rabbitmq_identity_info{rabbitmq_node=\"$rabbitmq_node\", namespace=\"$namespace\"}) +\n (rate(rabbitmq_channel_messages_delivered_ack_total[60s]) * on(instance) group_left(rabbitmq_node, rabbitmq_node) rabbitmq_identity_info{rabbitmq_node=\"$rabbitmq_node\", namespace=\"$namespace\"})\n) by(rabbitmq_node)", + "expr": "sum(\n (rate(rabbitmq_channel_messages_delivered_total[5m]) * on(instance) group_left(rabbitmq_node, rabbitmq_node) rabbitmq_identity_info{rabbitmq_node=\"$rabbitmq_node\", namespace=\"$namespace\"}) +\n (rate(rabbitmq_channel_messages_delivered_ack_total[5m]) * on(instance) group_left(rabbitmq_node, rabbitmq_node) rabbitmq_identity_info{rabbitmq_node=\"$rabbitmq_node\", namespace=\"$namespace\"})\n) by(rabbitmq_node)", "format": "time_series", "instant": false, "intervalFactor": 1, @@ -5887,9 +5887,9 @@ ] }, "timezone": "", - "title": "RabbitMQ-Overview - Update", + "title": "RabbitMQ-Overview", "uid": "Hz7D2_LGp", - "version": 1, + "version": 2, "weekStart": "" } {% endraw %} diff --git a/etc/kayobe/kolla/config/haproxy/services.d/radosgw_usage_exporter.cfg b/etc/kayobe/kolla/config/haproxy/services.d/radosgw_usage_exporter.cfg new file mode 100644 index 0000000000..50e55a2831 --- /dev/null +++ b/etc/kayobe/kolla/config/haproxy/services.d/radosgw_usage_exporter.cfg @@ -0,0 +1,25 @@ +{% if stackhpc_enable_radosgw_usage_exporter | bool %} +{% raw %} +frontend radosgw_usage_exporter_frontend + mode http + http-request del-header X-Forwarded-Proto + option 
httplog + option forwardfor + http-request set-header X-Forwarded-Proto https if { ssl_fc } +{% if kolla_enable_tls_internal | bool %} + bind {{ kolla_internal_vip_address }}:{{ stackhpc_radosgw_usage_exporter_frontend_port }} ssl crt /etc/haproxy/certificates/haproxy-internal.pem +{% else %} + bind {{ kolla_internal_vip_address }}:{{ stackhpc_radosgw_usage_exporter_frontend_port }} +{% endif %} + default_backend radosgw_usage_exporter_backend + +backend radosgw_usage_exporter_backend + mode http + +{% for host in groups['monitoring'] %} +{% set host_name = hostvars[host].ansible_facts.hostname %} +{% set host_ip = 'api' | kolla_address(host) %} + server {{ host_name }} {{ host_ip }}:{{ stackhpc_radosgw_usage_exporter_backend_port }} check inter 2000 rise 2 fall 5 +{% endfor %} +{% endraw %} +{% endif %} diff --git a/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/80-radosgw-exporter.yml b/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/80-radosgw-exporter.yml index 304736a80f..7c3a204bd0 100644 --- a/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/80-radosgw-exporter.yml +++ b/etc/kayobe/kolla/config/prometheus/prometheus.yml.d/80-radosgw-exporter.yml @@ -14,8 +14,9 @@ scrape_configs: regex: (.+) static_configs: - targets: - {% for host in groups['monitoring'] %} - - "{{ 'api' | kolla_address(host) | put_address_in_context('url') }}:{% endraw %}{{ stackhpc_radosgw_usage_exporter_port }}{% raw %}" - {% endfor %} + - "{{ kolla_internal_fqdn | put_address_in_context('url') }}:{{ stackhpc_radosgw_usage_exporter_frontend_port }}" +{% if kolla_enable_tls_internal | bool %} + scheme: https +{% endif %} {% endraw %} {% endif %} diff --git a/etc/kayobe/kolla/config/prometheus/smart.rules b/etc/kayobe/kolla/config/prometheus/smart.rules index 853d9268a1..cd7dbb3d6b 100644 --- a/etc/kayobe/kolla/config/prometheus/smart.rules +++ b/etc/kayobe/kolla/config/prometheus/smart.rules @@ -14,19 +14,19 @@ groups: description: "{{ $labels.instance }} is reporting unhealthy 
for the disk at {{ $labels.disk }}. Disk serial number is: {{ $labels.serial_number }}" - alert: DWPDTooHigh - expr: (delta(nvme_data_units_written_total[30d])*512000 / nvme_physical_size_bytes) / 30 > 1 + expr: (delta(nvme_data_units_written_total[30d])*512000 / nvme_physical_size_bytes) / 30 > nvme_rated_dwpd labels: severity: alert annotations: summary: "High 30-Day Average DWPD for {{ $labels.instance }}" - description: "The 30-Day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds 1 DWPD" + description: "The 30-Day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds the rated DWPD" - alert: DWPDTooHighWarning - expr: (delta(nvme_data_units_written_total[7d])*512000 / nvme_physical_size_bytes) / 7 > 1 + expr: (delta(nvme_data_units_written_total[7d])*512000 / nvme_physical_size_bytes) / 7 > nvme_rated_dwpd labels: severity: warning annotations: summary: "High 7-Day Average DWPD for {{ $labels.instance }}" - description: "The 7-day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds 1 DWPD" + description: "The 7-day average for Disk Writes Per Day for disk {{ $labels.device }} on {{ $labels.instance }} exceeds the rated DWPD" {% endraw %} diff --git a/etc/kayobe/ofed.yml b/etc/kayobe/ofed.yml index 3ca9201fb5..7867206f64 100644 --- a/etc/kayobe/ofed.yml +++ b/etc/kayobe/ofed.yml @@ -3,7 +3,7 @@ ############################################################################### # DOCA host version -stackhpc_pulp_doca_version: 2.9.1 +stackhpc_pulp_doca_version: "{{ '2.9.3' if stackhpc_pulp_repo_rocky_9_minor_version == '6' else '2.9.1' }}" ############################################################################### # Pulp configuration for DOCA OFED @@ -11,10 +11,14 @@ stackhpc_pulp_doca_version: 2.9.1 # Whether to sync OFED repositories into the local Pulp service stackhpc_pulp_sync_ofed: "{{ groups['mlnx'] | length > 0 }}" +# DOCA 
Snapshot lookup vars +doca_version_lookup_var: "stackhpc_pulp_repo_doca_{{ stackhpc_pulp_doca_version | replace('.', '_') }}_rhel9_{{ stackhpc_pulp_repo_rocky_9_minor_version }}_version" +doca_modules_version_lookup_var: "stackhpc_pulp_repo_doca_{{ stackhpc_pulp_doca_version | replace('.', '_') }}_rhel9_{{ stackhpc_pulp_repo_rocky_9_minor_version }}_modules_version" + # DOCA Snapshot versions. The defaults use the appropriate version from # pulp-repo-versions.yml -stackhpc_pulp_repo_rhel9_doca_version: "{{ lookup('vars', 'stackhpc_pulp_repo_rhel_9_{{ stackhpc_pulp_repo_rocky_9_minor_version }}_doca_version') }}" -stackhpc_pulp_repo_rhel9_doca_modules_version: "{{ lookup('vars', 'stackhpc_pulp_repo_rhel_9_{{ stackhpc_pulp_repo_rocky_9_minor_version }}_doca_modules_version') }}" +stackhpc_pulp_repo_rhel9_doca_version: "{{ lookup('vars', doca_version_lookup_var) }}" +stackhpc_pulp_repo_rhel9_doca_modules_version: "{{ lookup('vars', doca_modules_version_lookup_var) }}" ############################################################################### # Dummy variable to allow Ansible to accept this file. 
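The lookup variable names in this hunk are assembled from the DOCA version and the Rocky Linux 9 minor version. A minimal Python sketch of that string construction follows, assuming only the naming pattern visible in the diff. The function names `doca_lookup_var` and `doca_version_for` are hypothetical and are not part of kayobe-config.

```python
def doca_version_for(rocky_minor: str) -> str:
    """Mirror the conditional default: 2.9.3 for minor version '6', else 2.9.1."""
    # The comparison is against the string '6', as in the Jinja expression.
    return "2.9.3" if rocky_minor == "6" else "2.9.1"


def doca_lookup_var(doca_version: str, rocky_minor: str, modules: bool = False) -> str:
    """Build the name of the snapshot variable to look up (illustrative only)."""
    # Equivalent to the Jinja filter replace('.', '_') applied to the DOCA version.
    version_part = doca_version.replace(".", "_")
    suffix = "modules_version" if modules else "version"
    return f"stackhpc_pulp_repo_doca_{version_part}_rhel9_{rocky_minor}_{suffix}"


if __name__ == "__main__":
    minor = "6"
    print(doca_lookup_var(doca_version_for(minor), minor))
    print(doca_lookup_var(doca_version_for(minor), minor, modules=True))
```

Because the conditional compares against the string `'6'`, an integer value for the minor version would not match and would select `2.9.1`. This is presumably why the default in `pulp.yml` is quoted in this change.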
diff --git a/etc/kayobe/pulp-repo-versions.yml b/etc/kayobe/pulp-repo-versions.yml index a7315553dc..33fa847191 100644 --- a/etc/kayobe/pulp-repo-versions.yml +++ b/etc/kayobe/pulp-repo-versions.yml @@ -1,10 +1,17 @@ --- # This file is autogenerated by Ansible using the following workflow: # https://github.com/stackhpc/stackhpc-release-train/actions/workflows/package-update-kayobe.yml +stackhpc_pulp_repo_almalinux_9_proxysql_2_7_version: 20250627T134211 stackhpc_pulp_repo_centos_stream_9_docker_version: 20250531T002004 stackhpc_pulp_repo_centos_stream_9_nfv_openvswitch_version: 20250528T022338 stackhpc_pulp_repo_centos_stream_9_opstools_version: 20231213T031318 stackhpc_pulp_repo_centos_stream_9_storage_ceph_squid_version: 20250412T024303 +stackhpc_pulp_repo_doca_2_9_1_rhel9_4_version: 20241211T153620 +stackhpc_pulp_repo_doca_2_9_1_rhel9_4_modules_version: 20241213T112245 +stackhpc_pulp_repo_doca_2_9_1_rhel9_5_version: 20241211T171301 +stackhpc_pulp_repo_doca_2_9_1_rhel9_5_modules_version: 20250115T150314 +stackhpc_pulp_repo_doca_2_9_3_rhel9_6_version: 20250703T135021 +stackhpc_pulp_repo_doca_2_9_3_rhel9_6_modules_version: 20250714T141841 stackhpc_pulp_repo_docker_ce_ubuntu_noble_version: 20250604T001951 stackhpc_pulp_repo_elrepo_9_version: 20250610T235426 stackhpc_pulp_repo_epel_9_version: 20250615T000221 @@ -13,16 +20,7 @@ stackhpc_pulp_repo_opensearch_2_x_version: 20250430T014638 stackhpc_pulp_repo_opensearch_dashboards_2_x_version: 20250430T014638 stackhpc_pulp_repo_rhel9_rabbitmq_erlang_version: 20250607T003941 stackhpc_pulp_repo_rhel9_rabbitmq_server_version: 20250607T003941 -stackhpc_pulp_repo_rhel_9_4_doca_modules_version: 20241213T112245 -stackhpc_pulp_repo_rhel_9_4_doca_version: 20241211T153620 -stackhpc_pulp_repo_rhel_9_5_doca_modules_version: 20250115T150314 -stackhpc_pulp_repo_rhel_9_5_doca_version: 20241211T171301 -###### NOTE: Dummy variables, currently no RL9.6 DOCA -stackhpc_pulp_repo_rhel_9_6_doca_modules_version: 00000000T000000 
-stackhpc_pulp_repo_rhel_9_6_doca_version: 00000000T000000 -###### stackhpc_pulp_repo_rhel_9_influxdb_version: 20250529T023704 -stackhpc_pulp_repo_almalinux_9_proxysql_2_7_version: 20250627T134211 stackhpc_pulp_repo_rhel_9_mariadb_10_11_version: 20250523T014203 stackhpc_pulp_repo_rhel_9_rabbitmq_erlang_version: 20240711T091318 stackhpc_pulp_repo_rhel_9_rabbitmq_server_version: 20240711T091318 diff --git a/etc/kayobe/pulp.yml b/etc/kayobe/pulp.yml index 345850b261..6a1d5b3873 100644 --- a/etc/kayobe/pulp.yml +++ b/etc/kayobe/pulp.yml @@ -201,7 +201,7 @@ stackhpc_pulp_distribution_deb_production: >- # Whether to sync Rocky Linux 9 packages. stackhpc_pulp_sync_rocky_9: "{{ os_distribution == 'rocky' }}" # Rocky 9 minor version number. Supported values: 6. Default is 6 -stackhpc_pulp_repo_rocky_9_minor_version: 6 +stackhpc_pulp_repo_rocky_9_minor_version: '6' # Rocky 9 Snapshot versions. The defaults use the appropriate version from # pulp-repo-versions.yml for the selected minor release. stackhpc_pulp_repo_rocky_9_appstream_version: "{{ lookup('vars', 'stackhpc_pulp_repo_rocky_9_%s_appstream_version' % stackhpc_pulp_repo_rocky_9_minor_version) }}" @@ -495,6 +495,7 @@ stackhpc_pulp_images_kolla: - mariadb-clustercheck - mariadb-server - memcached + - neutron-bgp-dragent - neutron-dhcp-agent - neutron-l3-agent - neutron-metadata-agent @@ -683,7 +684,7 @@ stackhpc_pulp_repository_container_repos_openbao: policy: on_demand proxy_url: "{{ pulp_proxy_url }}" state: present - include_tags: "{{ overcloud_vault_docker_tag }}" + include_tags: "{{ overcloud_openbao_docker_tag }}" required: "{{ stackhpc_sync_openbao_images | bool }}" # List of OpenBao container image distributions. 
diff --git a/etc/kayobe/stackhpc-monitoring.yml b/etc/kayobe/stackhpc-monitoring.yml index a2a88b503e..f629b153b9 100644 --- a/etc/kayobe/stackhpc-monitoring.yml +++ b/etc/kayobe/stackhpc-monitoring.yml @@ -74,8 +74,11 @@ stackhpc_prometheus_openstack_exporter_interval: 300 # Prometheus scrape targets during deployment. stackhpc_enable_radosgw_usage_exporter: false -# Port to expose RADOS gateway usage exporter. Default is 9242 -stackhpc_radosgw_usage_exporter_port: 9242 +# Port to expose RADOS gateway usage exporter backend. Default is 9242 +stackhpc_radosgw_usage_exporter_backend_port: 9242 + +# Port to expose RADOS gateway usage exporter frontend (via HAProxy). Default is 9240 +stackhpc_radosgw_usage_exporter_frontend_port: 9240 # Path to a certificate for internal TLS in the RADOS gateway usage exporter. stackhpc_radosgw_usage_exporter_cacert: "" diff --git a/etc/kayobe/trivy/allowed-vulnerabilities.yml b/etc/kayobe/trivy/allowed-vulnerabilities.yml index a44e0508b4..4759862058 100644 --- a/etc/kayobe/trivy/allowed-vulnerabilities.yml +++ b/etc/kayobe/trivy/allowed-vulnerabilities.yml @@ -14,10 +14,12 @@ # - CVE-2023-31047 fluentd_allowed_vulnerabilities: - CVE-2024-27280 - grafana_allowed_vulnerabilities: - CVE-2024-8986 - +influxdb_allowed_vulnerabilities: + - CVE-2024-45337 +magnum_conductor_allowed_vulnerabilities: + - CVE-2024-45337 prometheus_blackbox_exporter_allowed_vulnerabilities: - CVE-2024-45337 prometheus_memcached_exporter_allowed_vulnerabilities: @@ -35,8 +37,6 @@ prometheus_libvirt_exporter_allowed_vulnerabilities: prometheus_cadvisor_allowed_vulnerabilities: - CVE-2024-41110 - CVE-2024-45337 -influxdb_allowed_vulnerabilities: - - CVE-2024-45337 ############################################################################### # Dummy variable to allow Ansible to accept this file. 
diff --git a/releasenotes/notes/doca-2-9-3-238838fb78e0c7d9.yaml b/releasenotes/notes/doca-2-9-3-238838fb78e0c7d9.yaml new file mode 100644 index 0000000000..054f460a07 --- /dev/null +++ b/releasenotes/notes/doca-2-9-3-238838fb78e0c7d9.yaml @@ -0,0 +1,5 @@ +--- +features: + - | + Added support for DOCA OFED on Rocky Linux 9.6 at version ``2.9.3``. The + package versions for Rocky 9.4 and 9.5 remain unchanged, using ``2.9.1``. diff --git a/releasenotes/notes/fix-blackbox-exporter-config-with-no-external-grafana-de1db02c540af6d8.yaml b/releasenotes/notes/fix-blackbox-exporter-config-with-no-external-grafana-de1db02c540af6d8.yaml new file mode 100644 index 0000000000..bcf30fa2e0 --- /dev/null +++ b/releasenotes/notes/fix-blackbox-exporter-config-with-no-external-grafana-de1db02c540af6d8.yaml @@ -0,0 +1,6 @@ +--- +fixes: + - | + Fixes an issue where the external Grafana endpoint would be added to + Prometheus Blackbox Exporter config, even when ``enable_grafana_external`` + was disabled. diff --git a/releasenotes/notes/fix-kayobe-version-checks-d1fb3e09391e4a3e.yaml b/releasenotes/notes/fix-kayobe-version-checks-d1fb3e09391e4a3e.yaml new file mode 100644 index 0000000000..f185977a45 --- /dev/null +++ b/releasenotes/notes/fix-kayobe-version-checks-d1fb3e09391e4a3e.yaml @@ -0,0 +1,5 @@ +--- +fixes: + - | + Fix Kayobe version checks that were failing on multiuser + Ansible control hosts. diff --git a/releasenotes/notes/fix-openbao-include-tag-dfef2a0e731674f0.yaml b/releasenotes/notes/fix-openbao-include-tag-dfef2a0e731674f0.yaml new file mode 100644 index 0000000000..0b3a889e5e --- /dev/null +++ b/releasenotes/notes/fix-openbao-include-tag-dfef2a0e731674f0.yaml @@ -0,0 +1,4 @@ +--- +fixes: + - | + Ensure that the correct tag is used for ``OpenBao`` repository in ``Pulp``. 
diff --git a/releasenotes/notes/increase-pulp-retries-79cc258da9aabb4f.yaml b/releasenotes/notes/increase-pulp-retries-79cc258da9aabb4f.yaml new file mode 100644 index 0000000000..b388672c70 --- /dev/null +++ b/releasenotes/notes/increase-pulp-retries-79cc258da9aabb4f.yaml @@ -0,0 +1,5 @@ +--- +features: + - | + Increase the number of retries when waiting for Pulp to become ready. + This is to avoid issues with Pulp taking longer than expected to start up. diff --git a/releasenotes/notes/radosgw-usage-exporter-behind-haproxy-e4371d2732f2a081.yaml b/releasenotes/notes/radosgw-usage-exporter-behind-haproxy-e4371d2732f2a081.yaml new file mode 100644 index 0000000000..338cde866b --- /dev/null +++ b/releasenotes/notes/radosgw-usage-exporter-behind-haproxy-e4371d2732f2a081.yaml @@ -0,0 +1,12 @@ +--- +features: + - | + The radosgw-usage-exporter is now put behind HAProxy. To facilitate this, + ``stackhpc_radosgw_usage_exporter_port`` has been renamed to + ``stackhpc_radosgw_usage_exporter_backend_port`` (it remains 9242) and + ``stackhpc_radosgw_usage_exporter_frontend_port`` (defaults to 9240) has + been introduced. +fixes: + - | + Fixes an issue where object storage metrics were missing from Prometheus by + putting the radosgw-usage-exporter behind HAProxy. diff --git a/releasenotes/notes/rated-dwpd-40526e85e24ef7ea.yaml b/releasenotes/notes/rated-dwpd-40526e85e24ef7ea.yaml new file mode 100644 index 0000000000..5b8eb50650 --- /dev/null +++ b/releasenotes/notes/rated-dwpd-40526e85e24ef7ea.yaml @@ -0,0 +1,8 @@ +--- +features: + - | + Add support for the operator supplying the rated DWPD value for NVMe drives. + There is a playbook ``get-nvme-drives.yml`` that will populate a new + section in the ``stackhpc-monitoring.yml`` file with drive model names for + NVMes in the cloud. The operator can then fill in the rated DWPD values for + each drive.
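The DWPD alerts in ``smart.rules`` compare a rolling average of drive writes per day against the rated value. The arithmetic behind that expression can be sketched in Python as below. This is an illustration under stated assumptions, not code from the repository: the function names are hypothetical, and the factor 512000 is taken from the alert expression (NVMe data units are reported in thousands of 512-byte blocks).

```python
BYTES_PER_DATA_UNIT = 512_000  # factor used in the alert expression


def average_dwpd(units_written_delta: float, physical_size_bytes: float, days: int) -> float:
    """Average drive writes per day over a window (hypothetical helper).

    units_written_delta: increase in nvme_data_units_written_total over the window
    physical_size_bytes: value of nvme_physical_size_bytes for the drive
    days: window length in days (30 for the alert, 7 for the warning)
    """
    bytes_written = units_written_delta * BYTES_PER_DATA_UNIT
    return bytes_written / physical_size_bytes / days


def dwpd_too_high(units_written_delta: float, physical_size_bytes: float,
                  days: int, rated_dwpd: float) -> bool:
    """True when the windowed average exceeds the operator-supplied rating."""
    return average_dwpd(units_written_delta, physical_size_bytes, days) > rated_dwpd


if __name__ == "__main__":
    # A 1 TB drive that wrote 60 TB in 30 days averages 2 drive writes per day.
    units = 60e12 / BYTES_PER_DATA_UNIT
    print(average_dwpd(units, 1e12, 30))
    print(dwpd_too_high(units, 1e12, 30, rated_dwpd=1.0))
```

With a fixed threshold of 1, a drive rated for 3 DWPD would alert well inside its rating. Comparing against the per-drive rated value avoids that.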
diff --git a/releasenotes/notes/release-neutron-bgp-dragent-tags-39b9c2595ae11872.yaml b/releasenotes/notes/release-neutron-bgp-dragent-tags-39b9c2595ae11872.yaml new file mode 100644 index 0000000000..bf9efd40a1 --- /dev/null +++ b/releasenotes/notes/release-neutron-bgp-dragent-tags-39b9c2595ae11872.yaml @@ -0,0 +1,4 @@ +--- +features: + - | + Add support for syncing the ``neutron-bgp-dragent`` image into ``Pulp``. diff --git a/releasenotes/notes/rmq-dashboard-5bb5eb36ff72100c.yaml b/releasenotes/notes/rmq-dashboard-5bb5eb36ff72100c.yaml new file mode 100644 index 0000000000..39fdbba236 --- /dev/null +++ b/releasenotes/notes/rmq-dashboard-5bb5eb36ff72100c.yaml @@ -0,0 +1,5 @@ +--- +fixes: + - | + Fixes a minor issue where a RabbitMQ dashboard panel showed no data with + the default scrape settings.