@rarescosma rarescosma commented Dec 11, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

Fixes a couple of issues found when running the end-to-end suite from the main branch while standing up a brand new CAPI development cluster.

#2892 introduced some fixes for the top_queries index, but it also left out a closing quote character in an alert rule, causing the prometheus-alerts chart to fail:

diff --git a/helmfile.d/charts/prometheus-alerts/templates/alerts/opensearch.yaml b/helmfile.d/charts/prometheus-alerts/templates/alerts/opensearch.yaml
index 37ebbe22..e6912931 100644
--- a/helmfile.d/charts/prometheus-alerts/templates/alerts/opensearch.yaml
+++ b/helmfile.d/charts/prometheus-alerts/templates/alerts/opensearch.yaml
@@ -66,7 +66,7 @@ spec:
         summary: Index {{`{{ $labels.index }}`}} is using {{`{{ $value }}`}} percent of max field limit
         runbook_url: {{ .Values.runbookUrls.opensearch.OpenSearchFieldLimit }}
     - alert: OpenSearchFieldLimit
-      expr: (sum(max_over_time(elasticsearch_indices_mappings_stats_fields{namespace="opensearch-system",index!~"top_queries.*"}[5m])) by (index) / sum(max_over_time(elasticsearch_indices_settings_total_fields{namespace="opensearch-system",index!~"top_queries.*}[5m])) by (index)) * 100 > 95
+      expr: (sum(max_over_time(elasticsearch_indices_mappings_stats_fields{namespace="opensearch-system",index!~"top_queries.*"}[5m])) by (index) / sum(max_over_time(elasticsearch_indices_settings_total_fields{namespace="opensearch-system",index!~"top_queries.*"}[5m])) by (index)) * 100 > 95
       for: 15m
       labels:
         severity: critical

The fix is a single character, but I also added a new unit test suite that runs promtool over all alerting rules to make sure they are valid.
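For readers unfamiliar with promtool, here is a rough sketch of the kind of check such a suite can perform. It is not the suite added in this PR. The function name `check_rules_dir`, the `PROMTOOL` override variable, and the directory layout are all hypothetical. It also assumes the chart has already been rendered and that each rule group has been written out as a plain Prometheus rule file (top-level `groups:`), since `promtool check rules` does not read PrometheusRule custom resources directly.

```shell
#!/usr/bin/env bash
# Sketch only: not the suite added in this PR.
# Assumes the directory holds plain Prometheus rule files (top-level "groups:"),
# e.g. the .spec of each rendered PrometheusRule written to its own file.

# check_rules_dir DIR: run "promtool check rules" on every *.yaml file in DIR.
# Returns non-zero if any file fails validation. PROMTOOL (hypothetical
# override) may name a different binary; it defaults to "promtool" on PATH.
check_rules_dir() {
  local dir="$1" failed=0 file
  for file in "$dir"/*.yaml; do
    [ -e "$file" ] || continue   # directory contained no *.yaml files
    if ! "${PROMTOOL:-promtool}" check rules "$file"; then
      echo "invalid rule file: $file" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_rules_dir ./rendered-rules` (a hypothetical path) would then fail on an expression like the one above with the unbalanced quote, because promtool parses every `expr` as PromQL. Checking all files before returning, rather than stopping at the first failure, reports every broken rule in one run.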

#2884 made some Velero configuration changes that caused the snapshotVolumes key to appear in specs, but the end-to-end specs were not updated to match.

Additionally, I added terminationGracePeriodSeconds: 1 to the velero test application manifests to speed up the teardown phase of the e2e suite.

Information to reviewers

To run the new suite, you'll need to rebuild the unit test Docker image:

make -C tests build-unit
make -C tests run-unit/general/alerting-rules.bats

If you're feeling adventurous, run the Velero e2e suite as well:

make -C tests run-end-to-end/velero

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@rarescosma rarescosma self-assigned this Dec 11, 2025
@rarescosma rarescosma requested review from a team as code owners December 11, 2025 08:48
@rarescosma rarescosma added the kind/bug Something isn't working label Dec 11, 2025

@AlbinB97 AlbinB97 left a comment

code LGTM, will test during apps patch release


@elastisys-staffan elastisys-staffan left a comment

Good stuff!


@aarnq aarnq left a comment

Very nice improvements.

@rarescosma rarescosma force-pushed the rar/upcoming-patch-fixes branch from 8fca74d to bd4ee25 Compare December 11, 2025 11:51
@rarescosma rarescosma merged commit 0b5ffda into main Dec 11, 2025
12 checks passed
@rarescosma rarescosma deleted the rar/upcoming-patch-fixes branch December 11, 2025 14:11
AlbinB97 pushed a commit that referenced this pull request Dec 17, 2025
AlbinB97 pushed a commit that referenced this pull request Dec 17, 2025
rarescosma added a commit that referenced this pull request Dec 19, 2025
rarescosma added a commit that referenced this pull request Dec 19, 2025