h1. Serverless Quality Gate Manual Approvals Analysis (Last 6 Months)

h2. Summary

This report analyzes serverless promotions from June 2025 to November 2025 (approximately six months) and identifies every quality gate check that required manual approval to proceed. Across the 30 promotions analyzed, *29 quality gate checks* in multiple environments required manual intervention, caused by transient errors, SLO violations, alert triggers, and infrastructure issues.

h2. Manual Approval Details

||Date||Environment||Build||Failure Details||Manual Approval Message||Buildkite Link||
|2025-11-10|staging|#319|Quality gate failure requiring manual review|transient 429s|[Build 319|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/319]|
|2025-11-10|qa|#319|Project SLO failure|Project of SLO recovered we need to revisit our SLO strategy for promotions here I think to make this less distracting but still reliable|[Build 319|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/319]|
|2025-11-05|staging|#318|Quality gate failure|https://elastic.slack.com/archives/C0631EPCFLP/p1762350012524689|[Build 318|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/318]|
|2025-11-05|qa|#318|Project SLO violation during recovery|As discussed in https://elastic.slack.com/archives/C05PJK7UZE1/p1762346054878269?thread_ts=1762283421.843679&cid=C05PJK7UZE1 we ignore the SLO for project cc5b2b7dab2b4b7b98a3a3eae71e8f98 as it's recuperating slowly|[Build 318|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/318]|
|2025-10-27|qa|#316|Quality gate threshold exceeded|https://elastic.slack.com/archives/C05PJK7UZE1/p1761570514143299?thread_ts=1761566557.196319&cid=C05PJK7UZE1|[Build 316|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/316]|
|2025-10-20|qa|#315|Troublesome project issue|Issue with troublesome project has been resolved. See https://elastic.slack.com/archives/C09NB1LNEPJ|[Build 315|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/315]|
|2025-10-16|production-canary|#314|Buildkite agent stopped with soft fail|Buildkite agent stopped with soft fail after waiting properly for 24h. All quality gates checks passed.|[Build 314|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/314]|
|2025-10-15|staging|#314|Projects in shutdown process|The three projects no longer exist and seemed in the process of shutting down - see https://elastic.slack.com/archives/C05PJK7UZE1/p1760529019169529?thread_ts=1760528319.123829&cid=C05PJK7UZE1|[Build 314|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/314]|
|2025-10-15|qa|#314|Red health due to known issue ES-13187|Checked red health was due to https://elasticco.atlassian.net/browse/ES-13187 - check https://elastic.slack.com/archives/C05PJK7UZE1/p1760516114291399?thread_ts=1760458971.389209&cid=C05PJK7UZE1|[Build 314|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/314]|
|2025-10-08|staging|#313|E2E test failure due to rotated API keys|e2e tests failed due to setup issue with rotated project api keys for staging|[Build 313|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/313]|
|2025-10-08|qa|#313|Red health percentage slightly over threshold|The red health percentage was 0.152, just over the threshold of 0.15. I have looked at the dashboards and nothing looks concerning.|[Build 313|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/313]|
|2025-10-02|staging|#312|Retry for transient issue|Retry for https://elastic.slack.com/archives/C09GW9DPXA6/p1759428637467299|[Build 312|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/312]|
|2025-09-25|production-canary|#36 (Emergency)|Error log rate from NPE exceptions|The error log rate is caused by NPE exceptions. Considering this as temporary issue because metrics are attempting to fetching cluster stats too early, when initial cluster state was not set yet. - https://elastic.slack.com/archives/C09GG7A3CQ7/p1758790369294919 - https://elasticco.atlassian.net/browse/ES-13022|[Build 36|https://buildkite.com/elastic/elasticsearch-serverless-promote-emergency-release/builds/36]|
|2025-09-25|staging|#36 (Emergency)|Manual verification required|Staging verified manually. See https://elastic.slack.com/archives/C05PJK7UZE1/p1758785399579619?thread_ts=1758783527.000659&cid=C05PJK7UZE1|[Build 36|https://buildkite.com/elastic/elasticsearch-serverless-promote-emergency-release/builds/36]|
|2025-09-25|staging|#36 (Emergency)|Quality gate issue|More details here: https://elastic.slack.com/archives/C05PJK7UZE1/p1758784646185769?thread_ts=1758783527.000659&cid=C05PJK7UZE1|[Build 36|https://buildkite.com/elastic/elasticsearch-serverless-promote-emergency-release/builds/36]|
|2025-09-16|staging|#310|Known 429 rate alert|known "Elasticsearch Search 429 Rate burn rate" alert fired|[Build 310|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/310]|
|2025-09-16|qa|#310|High red health rate general issue|the high red health rate is something general we dealing with in QA these days. It hasn't raised really by this promotion. Also error logs kept been low|[Build 310|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/310]|
|2025-09-08|staging|#308|Transient issues and known 429 alerts|Transient issues and known 429 alerts|[Build 308|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/308]|
|2025-09-01|staging|#307|RED health after restarts - shard unavailability|There were 6 instances where health was reported as RED. All occurred just after restarts due to shard unavailability, and all projects recovered themselves within a short period. We decided this is not a blocker for the promotion, so I'm restarting it|[Build 307|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/307]|
|2025-08-27|staging|#306|Spurious 429 error alerts|A few clusters experienced a momentary high rate of 429 errors that tripped alerts. They all auto resolved themselves shortly after alerting. These alerts aren't considered fine tuned yet and have been known to be spurious sometimes. Proceeding with the release.|[Build 306|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/306]|
|2025-08-27|qa|#306|Pre-existing indices bug affecting projects|The two projects (b2289f7bd808419abd85edeed7be1cce & c0b90ee5bb90441995ad9faf8b6b6c3d) that failed are now green. The source of the issue was a bug with pre-existing indices. The promotion changes should clean up the problem for new indices, but there could be some projects alerts until indices roll over. It has been agreed upon that the promotion should proceed.|[Build 306|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/306]|
|2025-08-18|staging|#304|Non-blocking failed alerts|Triaged failed alerts in https://elastic.slack.com/archives/C05PJK7UZE1/p1755535840885439. We've determined that both alerts are not blocking.|[Build 304|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/304]|
|2025-08-18|qa|#304|Red project spike resolved|Checked the overview dashboard for red projects which are now below the spike we had due to the promotion|[Build 304|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/304]|
|2025-08-12|staging|#303|Transient alerts investigated manually|Transient Alerts have been investigated manually. Discussion is slack at https://elastic.slack.com/archives/C05PJK7UZE1/p1754997331769909|[Build 303|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/303]|
|2025-08-12|qa|#303|Kibana FTR test incompatibility|Incompatibility issue with Kibana FTR Tests. Fixed by the kibana team and validated by rerunning faliing tests. now all green|[Build 303|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/303]|
|2025-08-05|staging|#302|Known serverless alerts|Two alerts are already known about in Serverless|[Build 302|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/302]|
|2025-07-18|production-canary|#299|Benign errors|We agreed in https://elastic.slack.com/archives/C05QR7WNXA4/p1752804112104249 that the errors are benign, we can continue the promotion process.|[Build 299|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/299]|
|2025-06-27|production-canary|#295|Baking failed but checks look good|baking failed. but nearly 24h done. checks look good. continuing|[Build 295|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/295]|
|2025-06-09|staging|#292|429 errors reviewed with team|429s - checked with Julio and he was good with proceeding. https://elastic.slack.com/archives/C05PJK7UZE1/p1749479606102719?thread_ts=1749472164.513829&cid=C05PJK7UZE1|[Build 292|https://buildkite.com/elastic/elasticsearch-serverless-promote-release/builds/292]|

h2. Failure Categories Analysis

h3. Most Common Issues

# *429 Rate Limiting Alerts* (7 occurrences) - Transient 429 errors and rate limiting alerts, often spurious or auto-resolving
# *Red Health Status* (5 occurrences) - Projects reporting red health due to shard unavailability after restarts, SLO violations, or threshold exceedances
# *Infrastructure/Buildkite Issues* (3 occurrences) - Agent failures, baking process issues, API key rotation problems
# *Known Bugs/Issues* (4 occurrences) - Pre-existing bugs, NPE exceptions, compatibility issues (e.g., ES-13187, ES-13022, Kibana FTR tests)
# *Transient/Spurious Alerts* (6 occurrences) - Various transient issues that auto-resolved
# *Project-Specific Issues* (4 occurrences) - Individual projects in shutdown, recovery, or with specific problems

h3. Environment Breakdown

* *QA*: 10 manual approvals
* *Staging*: 15 manual approvals
* *Production-Canary*: 4 manual approvals
* *Production-NonCanary*: 0 manual approvals

h2. Conclusions

# *High Manual Intervention Rate*: 18 of the 30 promotions analyzed (60%) required at least one manual approval, indicating that significant quality gate tuning is needed.

# *QA and Staging Most Affected*: The majority of manual approvals occur in QA and Staging environments, which is expected as these are earlier stages in the promotion pipeline. However, the high rate suggests thresholds may be too strict or tests too sensitive.

# *Recurring Issues*:
** 429 rate limiting alerts are the most common issue but are often deemed non-blocking after investigation
** Red health status frequently triggers gates but often resolves automatically
** Infrastructure issues (agent failures, baking problems) cause unnecessary delays

# *Alert Tuning Needed*: Multiple approval messages explicitly mention that alerts are "not fine tuned yet" or are "spurious", indicating a need for better alert thresholds and SLO definitions.

# *Emergency Releases*: One emergency release (build #36) required multiple manual approvals across staging and production-canary, highlighting the tension between speed and quality gates during incidents.

# *Documentation and Tracking*: Most manual approvals reference Slack discussions or JIRA tickets, showing good traceability but also indicating these issues require cross-team communication to resolve.

h2. Recommendations

# *Refine Quality Gate Thresholds*: Review and adjust SLO thresholds for red health percentage, 429 error rates, and other metrics to reduce false positives
# *Improve Alert Signal-to-Noise*: Filter out transient/auto-resolving alerts or implement grace periods before triggering quality gate failures (see the grace-period sketch after this list)
# *Automate Common Approval Patterns*: For known-safe issues (e.g., projects in shutdown, specific red health patterns after restarts), consider auto-approval logic
# *Infrastructure Reliability*: Address recurring Buildkite agent issues and baking process failures to reduce non-functional blockers
# *Better Pre-flight Checks*: Implement validation for infrastructure readiness (API keys, agent availability) before starting promotions (see the pre-flight sketch after this list)
# *SLO Strategy Review*: As mentioned in the most recent approval (Build #319), revisit the SLO strategy for promotions to make quality gates less distracting while maintaining reliability
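
h3. Sketch: Grace Period for Transient Alerts

To make the signal-to-noise recommendation concrete, below is a minimal sketch of how a grace period could be applied when evaluating quality gate alerts. The {{Alert}} record, the {{blocking_alerts}} helper, and the 10-minute grace value are illustrative assumptions, not the pipeline's actual data model or thresholds.

{code:python}
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Hypothetical alert record: field names are illustrative, not the real
# quality-gate data model. Timestamps are assumed to be timezone-aware UTC.
@dataclass
class Alert:
    name: str
    fired_at: datetime
    resolved_at: Optional[datetime] = None


def blocking_alerts(alerts: List[Alert],
                    grace: timedelta = timedelta(minutes=10),
                    now: Optional[datetime] = None) -> List[Alert]:
    """Return only the alerts that should fail the quality gate.

    Alerts that auto-resolved within the grace period, or that are still
    younger than the grace period, are treated as transient noise.
    """
    now = now or datetime.now(timezone.utc)
    blocking = []
    for alert in alerts:
        if alert.resolved_at is not None and alert.resolved_at - alert.fired_at <= grace:
            continue  # resolved quickly: the kind of spurious 429 spike seen above
        if alert.resolved_at is None and now - alert.fired_at <= grace:
            continue  # too young to judge: wait out the grace period before blocking
        blocking.append(alert)
    return blocking
{code}

A gate built this way would still surface sustained 429 bursts or persistent red health, but it would not stop a promotion for the short-lived, self-resolving spikes that account for many of the approvals above.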
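
h3. Sketch: Pre-flight Readiness Check

Along the same lines, the pre-flight recommendation could look roughly like the sketch below: fail fast if credentials are missing or an observability endpoint is unreachable before the long-running promotion starts. The environment variable names and the health URL are placeholders, not the pipeline's real configuration.

{code:python}
import os
import sys
import urllib.request

# Placeholders only; the real pipeline's credentials and endpoints differ.
REQUIRED_ENV_VARS = ["STAGING_PROJECT_API_KEY", "QA_PROJECT_API_KEY"]
HEALTH_URL = "https://observability.example.invalid/healthz"


def preflight_problems() -> list:
    """Collect cheap readiness failures before kicking off a promotion."""
    problems = []
    for var in REQUIRED_ENV_VARS:
        if not os.environ.get(var):
            problems.append("missing credential: " + var)
    try:
        # Quick reachability probe; a fuller check might also verify that the
        # API keys are accepted rather than merely present.
        urllib.request.urlopen(HEALTH_URL, timeout=5)
    except OSError as exc:
        problems.append("health endpoint unreachable: %s" % exc)
    return problems


if __name__ == "__main__":
    issues = preflight_problems()
    if issues:
        print("pre-flight failed:")
        for issue in issues:
            print("  - " + issue)
        sys.exit(1)
    print("pre-flight passed")
{code}

A fuller version that validates the keys, rather than only checking their presence, could have flagged the rotated API key issue seen in Build #313 before any quality gate was evaluated.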