ADR-019: Merge queue acceleration via optimistic non-overlapping multi-merge #5439

TGPSKI wants to merge 7 commits into app-sre:master
Conversation
Pull request overview
Adds ADR-019 documenting an approach to accelerate merge throughput in gitlab_housekeeping.py by optimistically merging multiple non-overlapping MRs per reconcile loop.
Changes:
- Added new ADR-019 describing the optimistic non-overlapping multi-merge strategy and implementation plan.
- Updated ADR index and category listing to include ADR-019.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/adr/README.md | Adds ADR-019 to the main ADR table and “Execution Patterns” category list. |
| docs/adr/ADR-019-merge-queue-acceleration.md | New ADR detailing motivation, decision, alternatives, and implementation guidelines for optimistic multi-merge. |
Force-pushed from cd57da3 to fdc1c78
ADR-019 Update: Addressing Review Feedback

Pushed a second commit incorporating feedback from @jfchevrette, @BumbleFeng, and @hemslo. Here's what changed:

1. Reference-aware overlap detection (@jfchevrette)

Problem: File-level overlap misses cross-file semantic dependencies via `$ref` links.

Fix: Each MR's changed paths are now expanded to include forward refs (and backrefs).
Made-with: Cursor
Addressed @hemslo's follow-up on `limit` semantics. The fix: redefined `limit` as a per-repo cap rather than a per-run counter; the changes in this push are reflected in the updated ADR.
…ne refs

Made-with: Cursor
Consistency review pass -- fixed several inaccuracies:
Current rendered markdown
chassing
left a comment
Well written ADR ❤️!
gitlab-housekeeping is a beast, not just because of the complex code but also because of the many edge cases that must be considered. Maybe a simpler approach (comparing labels) would be easier to implement and to maintain.
> 1. **Redefine `limit` as a per-repo cap, not per-run.** Currently `limit` controls how many MRs are rebased *per reconcile run*, with the counter resetting each invocation. Since `gitlab-housekeeping` runs frequently (~1 min cycles), a `limit: 2` still results in many MRs being rebased across runs, each triggering a pipeline -- most of which are wasted when only one MR merges per cycle. The fix: change the semantics so `limit` means "at most N MRs should be in a rebased-and-pipeline-pending state at any time for this repo." Before rebasing, count how many MRs already have running or recent pipelines and only rebase up to `limit - already_active`. Keep `limit` small (2-3) to minimize wasted CI. This is a prerequisite for Phase 1: with per-repo semantics, bumping `limit` to 6 for multi-merge becomes safe because it controls the *steady-state pipeline concurrency*, not the per-run rebase burst.
>
> 2. **Filter MRs with failed pipelines from the merge queue.** Move the pipeline-success check from `merge_merge_requests` into `preprocess_merge_requests` so MRs without a passing last pipeline are excluded before sorting and slot allocation. This prevents failed MRs from consuming rebase slots.
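The per-repo budget arithmetic in item 1 could be sketched roughly as follows. This is a minimal illustration, not the actual `gitlab_housekeeping` API: the function name, the `pipeline_status` field, and the status values are hypothetical.

```python
def rebase_budget(limit: int, merge_requests: list[dict]) -> int:
    """Sketch: how many MRs may still be rebased for this repo this run.

    Counts MRs already in a rebased-and-pipeline-pending state and allows
    rebasing only up to `limit - already_active`, so `limit` caps the
    steady-state pipeline concurrency instead of the per-run rebase burst.
    Field names and status values are illustrative assumptions.
    """
    already_active = sum(
        1
        for mr in merge_requests
        if mr.get("pipeline_status") in ("running", "pending")
    )
    return max(0, limit - already_active)


# Example: limit=2 with one MR already running a pipeline leaves one slot.
mrs = [{"pipeline_status": "running"}, {"pipeline_status": "failed"}]
print(rebase_budget(2, mrs))  # 1
```

With these semantics, a run where `limit` MRs already have pipelines in flight rebases nothing, which is exactly the wasted-CI case the item describes.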
This is very dangerous, and we need to monitor that or ping the IC. In case of flaky integrations, e.g., old non-qontract-api-based slack-usergroups, a single failed build would disable retries, and an MR would be stuck.
Agreed. We have too many transient errors as of now in MR checks. Retesting is crucial to move forward. Maybe at least have some amount of retry budget - but tracking that would need to happen through label or commit comment, increasing complexity.
Only using rebase to handle transient errors is not reliable. Two cases:
- A high-priority approved MR whose pipeline failed will stick at the top and keep rebasing; setting the block label and notifying the MR author is my major toil.
- With a transient error and no other MR merging (like during APAC hours), the failed MR can stay there for the whole day.

The proper way is to treat pipeline run status like a healthcheck probe: if the last N pipelines failed, the MR is dead -- move it to an error queue for investigation. When it's fixed with a successful run, move it back to the merge queue.

Optionally we can auto-/retest a failed pipeline N times, but that count tracking is not easy to distinguish from rebase-triggered runs: /retest reruns the last pipeline, while a rebase creates a new one. This can wait, since this kind of MR can wait longer; if it's urgent, the author will manually /retest.
Addressed in v4 (1c53883). Phase 0 now uses a healthcheck-probe model exactly as @hemslo described:
- Track consecutive pipeline failures per MR IID in `State`
- After N consecutive failures (configurable, default 3), apply the `merge-error` label and exclude the MR from the merge queue
- Auto-recover: when a new pipeline succeeds, reset the failure counter and remove the label
This balances retesting for flaky CI (@fishi0x01's point) with preventing zombie MRs from blocking the queue indefinitely (@chassing's concern). Simple pipeline-success filtering is explicitly called out as dangerous in the ADR for exactly the reasons you all identified.
The /retest auto-retry is deferred as @hemslo suggested -- count tracking across rebase vs retest pipelines adds complexity for limited benefit since urgent MRs will be manually retested.
A few simplifications on implementation:
- We already fetch all pipelines for a given MR via `gl.get_merge_request_pipelines(mr)`, and every rebase should create a new pipeline, so we can just check the last N pipelines for the healthcheck -- no need to maintain another state.
- The error queue is just a mental model: since we also have rendered markdown files / in-scope plugins to show the merge / review queues, it's easier to add a flag in the merge queue indicating that an MR has errors and won't be merged or considered until they are fixed. The IC can check errored MRs on the merge queue page.
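The stateless variant of the healthcheck probe could look like the sketch below. It assumes the pipeline list (e.g. from `gl.get_merge_request_pipelines(mr)`) is ordered newest-first and that each entry has a `status` field -- both are assumptions about the data shape, not confirmed details of the codebase.

```python
def is_dead(pipelines: list[dict], n: int = 3) -> bool:
    """Healthcheck-probe sketch: an MR is 'dead' when its last n completed
    pipelines all failed. Stateless -- recomputed from the pipeline list
    each reconcile run, so a single success inside the window clears it.

    Assumes `pipelines` is ordered newest-first and each dict carries a
    `status` key; skipped/running pipelines are ignored for the probe.
    """
    recent = [p for p in pipelines if p["status"] in ("success", "failed")][:n]
    return len(recent) == n and all(p["status"] == "failed" for p in recent)
```

Because a new pipeline appears after every rebase, rechecking the newest N entries each run gives the retry budget without tracking counters in a separate store.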
> - **Bundle-local:** Load the bundle JSON directly in `gitlab_housekeeping` and traverse datafiles to build the ref graph. This is self-contained but adds memory and startup cost.
> - **qontract-server query:** Query the GraphQL API for synthetic backref fields and forward refs. This reuses existing infrastructure but adds network calls.
> - **Pre-computed lookup table:** Build the reference graph as a periodic job (or as part of bundle validation) and store it as a JSON artifact. `gitlab_housekeeping` loads this artifact at startup. This is the most efficient at runtime.
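For the pre-computed lookup table, building the forward-ref side could be sketched as below. The bundle shape (a dict of datafile path to parsed content, with `$ref` keys pointing at other paths) is a simplifying assumption; the real bundle format may differ.

```python
def build_ref_graph(datafiles: dict[str, dict]) -> dict[str, list[str]]:
    """Sketch of the lookup-table build: map each datafile path to the
    paths it references via $ref (forward edges). Backrefs are obtained
    by inverting this map. Bundle shape here is simplified/hypothetical.
    """
    def collect_refs(node, acc: set) -> None:
        # Walk the parsed datafile recursively, collecting $ref targets.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "$ref" and isinstance(value, str):
                    acc.add(value)
                else:
                    collect_refs(value, acc)
        elif isinstance(node, list):
            for item in node:
                collect_refs(item, acc)

    graph: dict[str, list[str]] = {}
    for path, content in datafiles.items():
        refs: set[str] = set()
        collect_refs(content, refs)
        graph[path] = sorted(refs)
    return graph

# The periodic job would dump this as the JSON artifact, e.g.:
#   json.dump(build_ref_graph(bundle), fh)
# and gitlab_housekeeping would load it once at startup.
```

At runtime, expanding an MR's changed paths is then a dictionary lookup per path, which matches the "zero runtime overhead beyond a dictionary lookup" claim in the quoted text.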
TBH, I don't get it. gitlab-housekeeping has access to the prod qontract-server bundle with the state of master. How would this help to identify $refs for new files added in an MR? E.g. new user in MR-A, and MR-B changes roles/permissions.
My understanding is this is purely about saving computation effort. In terms of overlap identification accuracy, there is no difference in whether you use the qontract-server bundle or the precomputed table. The precomputed table is a transformation of the bundle to have a more efficient lookup structure.
Yes, that's clear, but the prod bundle doesn't know anything about the new files introduced by an MR.
This was the fatal flaw that killed the ref-graph approach. Addressed in v4 (1c53883) by pivoting entirely to label-based overlap detection.
@chassing you were right -- the prod bundle cannot know about $refs in new files from MR branches. Fixing it would require fetching each MR's branch content and rebuilding the graph per-MR, which negates all the simplicity advantages.
The ref-graph approach is preserved as Alternative 5 in the ADR with this fatal flaw documented. The selected approach now uses tenant-* labels (already applied by gitlab_labeler), which are computed per-MR from its actual changed files -- so they inherently capture what each MR touches regardless of what's on master.
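The label comparison described above reduces to a set intersection. The sketch below is illustrative only -- the function name and the conservative fallback for unlabeled MRs are assumptions, though the `tenant-` prefix comes from the `gitlab_labeler` labels mentioned in the ADR.

```python
def labels_overlap(labels_a: set[str], labels_b: set[str]) -> bool:
    """Sketch: two MRs overlap when their tenant-* label sets intersect.

    MRs without tenant labels fall back to the serial path, which we
    model conservatively here as "overlaps with everything".
    """
    tenants_a = {l for l in labels_a if l.startswith("tenant-")}
    tenants_b = {l for l in labels_b if l.startswith("tenant-")}
    if not tenants_a or not tenants_b:
        return True  # unlabeled MR: not eligible for multi-merge
    return bool(tenants_a & tenants_b)
```

Because each MR's tenant labels are derived from its own changed files, the check needs no bundle, no ref graph, and no extra API calls -- which is the simplicity argument made for the pivot.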
fishi0x01
left a comment
This is an awesome idea and write up! Thank you Tyler! ❤️
My main concern would be unleashing this instantly on all MRs. E.g., what about qr-bump MRs or any hack-script changes? It would very likely make sense to exclude some types of MRs from this. It might also make sense to initially include only certain types of MRs, like only self-serviceable MRs.
> The recommended approach is the pre-computed lookup table, built during bundle validation and stored alongside the bundle. This adds zero runtime overhead to the merge loop beyond a dictionary lookup.
>
> The expansion depth should be configurable (default: 1 hop). For most app-interface patterns, 1-hop expansion (direct refs and backrefs) is sufficient. Deeper transitive expansion increases safety but also increases the overlap rate, reducing throughput.
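The configurable expansion depth amounts to a bounded breadth-first walk over the reference graph. A minimal sketch, assuming `graph` maps each path to the union of its direct refs and backrefs (function and parameter names are hypothetical):

```python
def expand(paths: set[str], graph: dict[str, set[str]], hops: int = 1) -> set[str]:
    """Expand a changed-path set along the reference graph to `hops` depth.

    hops=1 adds direct refs/backrefs; deeper values add transitive
    neighbors, trading throughput (more overlap) for safety.
    """
    frontier, seen = set(paths), set(paths)
    for _ in range(hops):
        # Neighbors of the current frontier that we have not visited yet.
        frontier = {n for p in frontier for n in graph.get(p, set())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen
```

Two MRs would then be considered overlapping when their expanded path sets intersect, so each extra hop directly inflates the overlap rate the quoted text warns about.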
TBH I would tend to go the very conservative way and start with at least 2 hops, then observe MR statistics (https://grafana.app-sre.devshift.net/d/xNTPSl-Vk/appsre-overview?orgId=1&from=now-28d&to=now&timezone=utc) for performance. I would only tune down to fewer hops if performance requires it.
No longer applicable -- the ref-graph and its hop-depth config were dropped entirely in v4 (1c53883) in favor of label-based overlap detection. There's no expansion depth to tune since we're comparing tenant-* label sets, not traversing a reference graph.
Your conservative instinct was well-placed though -- it was one of the signals that the ref-graph approach was getting complex. The label approach sidesteps all of that.
Replace reference-graph overlap detection with tenant-* label comparison.

The ref-graph approach has a fatal flaw: the production bundle cannot capture $ref cross-references from new files introduced by MR branches. Labels are coarser but provably safe (service-level isolation), require zero API calls, and are ~50 lines of Python.

Key changes:
- Decision/Key Points: label-based overlap + self-serviceable gate
- Phase 0: healthcheck-probe model for pipeline failures (retry budget)
- Phase 1: label comparison replaces ref-graph code entirely
- Phase 2: change-type coverage refinement (conditional on metrics)
- Alternative 5: ref-graph preserved for future reference
- Alternative 6: change-type coverage overlap (mafriedm)
- Throughput projections revised for label-based approach
- Reviewers expanded with all feedback contributors

Assisted-by: Claude (Anthropic)
Made-with: Cursor
ADR-019 v4: Pivot to Label-Based Overlap Detection

Major revision based on feedback from @jfchevrette, @BumbleFeng, @hemslo, @chassing, @fishi0x01, @mafriedm, @kfischer.

What changed

Overlap detection: ref-graph → tenant-* labels

The reference-graph approach had a fatal flaw identified by @chassing: the production bundle doesn't contain files introduced by MR branches. If MR-A adds a new file with a `$ref`, the bundle cannot see that edge. The new approach uses `tenant-*` labels (already applied by `gitlab_labeler`), which are computed per-MR from its actual changed files.

Self-serviceable eligibility gate (per @fishi0x01)

Only MRs with the self-serviceable label are eligible for multi-merge.

Phase 0: Healthcheck-probe model (per @chassing, @fishi0x01, @hemslo)

Simple pipeline-failure filtering replaced with a retry budget: track consecutive failures per MR in `State`.

Phase 2: Change-type refinement (per @mafriedm, @kfischer)

Replaced "Speculative Stacking" with conditional change-type coverage overlap. If Phase 1 metrics show high same-service overlap, use change-type coverage to unlock finer-grained multi-merge. Only pursue if the metrics warrant it.

Ref-graph preserved as Alternative 5 for future reference, with the fatal flaw documented. Change-type overlap added as Alternative 6.

Section-by-section changes
The self-serviceable label was too restrictive -- many non-self-serviceable MRs are safe for optimistic multi-merge. The real gate is having tenant-* labels (so the overlap check is meaningful). MRs without labels fall back to serial; MRs touching many services are naturally serialized by overlap.

Assisted-by: Claude (Anthropic)
Made-with: Cursor
ADR-019 v5: Relax eligibility gate from `self-serviceable` to `tenant-*` labels
ADR-019 v6: Empirical Throughput Baseline from Production Pod Log

The throughput estimates in the ADR were previously based on theoretical calculations (~10 min/cycle → ~6 MRs/hour). This update anchors them to empirical measurements from a production pod log captured during ADR development.

Measurement Methodology

A full pod log from the gitlab-housekeeping integration was analyzed; merge events were extracted by grepping the log.

Key Findings

MR Starvation Case Study (MR 177008)

This is the exact starvation pattern the ADR addresses.

What Changed in the ADR

The rebase churn (2,142 events) and merge error count (267) further motivate the Phase 0 work (per-repo `limit` semantics).
…alysis

Made-with: Cursor
Please check kfischer - you certainly meant someone else :)
Summary

Adds ADR-019 documenting an optimistic non-overlapping multi-merge strategy for `gitlab_housekeeping.py` that merges multiple non-overlapping MRs per reconcile loop, increasing throughput from ~7 MRs/hour to ~15-30 MRs/hour with no infrastructure changes. The current serial gate (`is_rebased()` + `if rebase: return`) limits merges to 1 per ~10-minute cycle regardless of the `limit` config. This ADR addresses the structural problem by skipping the rebase check for subsequent MRs whose changed files don't overlap with already-merged MRs.

Changes

- `docs/adr/ADR-019-merge-queue-acceleration.md` -- new ADR
- `docs/adr/README.md` -- index and category updates
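The multi-merge selection described in the summary is a greedy walk over the priority-sorted queue. A minimal sketch under stated assumptions -- the function name, the `changed_files` field, and the greedy strategy are illustrative, not the ADR's actual implementation:

```python
def select_mergeable(mrs: list[dict]) -> list[dict]:
    """Greedy sketch of the multi-merge loop.

    Walk the priority-sorted merge queue and take each MR whose changed
    files don't overlap anything already selected this cycle. The rebase
    check is skipped for the later selections, since non-overlapping
    changes cannot conflict with the MRs merged ahead of them.
    """
    selected: list[dict] = []
    touched: set[str] = set()
    for mr in mrs:
        files = set(mr["changed_files"])
        if not files & touched:
            selected.append(mr)
            touched |= files
    return selected


# Example: MR 2 overlaps MR 1 on a.yml and is deferred to the next cycle.
queue = [
    {"iid": 1, "changed_files": ["a.yml"]},
    {"iid": 2, "changed_files": ["a.yml", "b.yml"]},
    {"iid": 3, "changed_files": ["c.yml"]},
]
print([m["iid"] for m in select_mergeable(queue)])  # [1, 3]
```

The same skeleton works for the label-based variant adopted in v4: replace the `changed_files` sets with each MR's tenant-* label set.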