@jackhodgkiss
Contributor

The `RabbitMQNodeDown` alert rule assumed that every deployment has exactly three controllers. This is not always the case: we also support deployments with a single controller or with more than three.

Previously, this caused false alerts in deployments with a single controller, and concealed real alerts in deployments with more than three controllers.
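
To make the failure mode concrete, here is a minimal Python sketch of the comparison the rule performs. It assumes the old rule compared the number of reporting nodes against a hard-coded 3. The function and variable names are hypothetical and only model the logic; this is not the actual PromQL.

```python
# Hypothetical model of the alert condition: the alert fires when fewer
# nodes are reporting than the rule expects.
def node_down_alert_fires(nodes_up: int, expected_nodes: int) -> bool:
    return nodes_up < expected_nodes


HARD_CODED = 3  # assumed value of the old hard-coded threshold

# Healthy single-node deployment: the hard-coded rule fires anyway.
single_healthy = node_down_alert_fires(nodes_up=1, expected_nodes=HARD_CODED)

# Five-node deployment with one node down: the hard-coded rule stays silent.
five_one_down = node_down_alert_fires(nodes_up=4, expected_nodes=HARD_CODED)

# With the threshold equal to the real node count, both cases behave correctly.
single_fixed = node_down_alert_fires(nodes_up=1, expected_nodes=1)
five_fixed = node_down_alert_fires(nodes_up=4, expected_nodes=5)

print(single_healthy, five_one_down, single_fixed, five_fixed)
```

The first two results show the false alert and the concealed alert; the last two show the behaviour once the threshold tracks the deployment size.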

@jackhodgkiss jackhodgkiss self-assigned this Mar 17, 2025
@jackhodgkiss jackhodgkiss requested a review from a team as a code owner March 17, 2025 13:21
@product-auto-label product-auto-label bot added size: xs monitoring All things related to observability & telemetry labels Mar 17, 2025
The `RabbitMQNodeDown` alert assumed that all deployments have exactly
three RabbitMQ nodes. However, this is not always the case, as we
support deployments with a single node or more than three.

Previously, this caused false alerts in deployments with a single
RabbitMQ node, while concealing alerts in deployments with more
than three nodes.
@jackhodgkiss jackhodgkiss force-pushed the fix-rabbitmq-node-down-rule branch from 61b564c to e183052 Compare March 23, 2025 12:39
@jackhodgkiss jackhodgkiss requested review from MoteHue and jovial March 23, 2025 12:40
@jackhodgkiss jackhodgkiss changed the title fix: use controller length for RabbitMQNodeDown fix: use rabbitmq length for RabbitMQNodeDown Mar 24, 2025
Contributor

@MoteHue MoteHue left a comment

LGTM now, thanks!

MoteHue previously approved these changes Mar 24, 2025
@jackhodgkiss jackhodgkiss marked this pull request as draft March 24, 2025 22:54
@jackhodgkiss
Contributor Author

This fails to template correctly.

  - alert: RabbitMQNodeDown
    expr: sum(rabbitmq_build_info{instance!=""}) < {{ groups['rabbitmq'] | length }}
    for: 30m
    labels:

@Alex-Welsh
Member

Kolla-Ansible uses copy, not template, for rules files [1], so they can either be hard-coded or templated by Kayobe.

Possible Kayobe groups are: all, ungrouped, seed, seed-hypervisor, container-image-builders, hypervisors, infra-vms, wazuh-manager, wazuh-agent, github-runners, github-writer, controllers, network, monitoring, storage, compute-vgpu, compute, overcloud, vgpu, iommu, mlnx, docker, docker-registry, ntp, baremetal-compute, mgmt-switches, ctl-switches, hs-switches, switches, ceph, mons, mgrs, osds, rgws, cis-hardening, redfish_exporter_targets, fix-hostname, tempest_runner, controllers_with_ironic_enabled_False

Short term, I'd say we make a new variable in SKC and default it to the length of the controllers group, and have a backlog task to make the Prometheus rules files templatable in KA.
[1] https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/prometheus/tasks/config.yml#L38
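
As a rough illustration of that short-term idea, the Python sketch below models the intended behaviour: use an explicit override when one is set, otherwise fall back to the size of the `controllers` group, then substitute the result into the rule expression. The function names, the `override` parameter, and the host names are all hypothetical; the actual variable added to SKC and the way Kayobe renders it may differ.

```python
from typing import Dict, List, Optional


def expected_rabbitmq_nodes(
    groups: Dict[str, List[str]], override: Optional[int] = None
) -> int:
    """Return the override if given, else the size of the controllers group."""
    if override is not None:
        return override
    return len(groups.get("controllers", []))


def render_node_down_expr(expected: int) -> str:
    """Build the alert expression with a concrete threshold substituted in."""
    return f'sum(rabbitmq_build_info{{instance!=""}}) < {expected}'


# Hypothetical inventory with three controllers.
inventory = {"controllers": ["ctl0", "ctl1", "ctl2"]}

default_expr = render_node_down_expr(expected_rabbitmq_nodes(inventory))
override_expr = render_node_down_expr(expected_rabbitmq_nodes(inventory, override=5))

print(default_expr)
print(override_expr)
```

Keeping the override separate from the group lookup means a deployment that runs RabbitMQ on hosts other than the controllers can still set the correct count by hand.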

@seunghun1ee
Member

Good idea. Happy to +1 once it's in a ready-to-review state.

@jackhodgkiss jackhodgkiss marked this pull request as ready for review May 18, 2025 12:11
Member

@Alex-Welsh Alex-Welsh left a comment

Thanks Jack, I think this solution works well. We just need to update the release note.

@jackhodgkiss jackhodgkiss force-pushed the fix-rabbitmq-node-down-rule branch from e62f3fb to 747181f Compare May 19, 2025 13:17
@Alex-Welsh Alex-Welsh enabled auto-merge (squash) May 19, 2025 13:20
@Alex-Welsh Alex-Welsh merged commit 64da1b1 into stackhpc/2024.1 May 19, 2025
15 checks passed
@Alex-Welsh Alex-Welsh deleted the fix-rabbitmq-node-down-rule branch May 19, 2025 15:09

Labels

monitoring All things related to observability & telemetry size: s

6 participants