
ADR-019: Merge queue acceleration via optimistic non-overlapping multi-merge#5439

Open
TGPSKI wants to merge 7 commits into app-sre:master from TGPSKI:tgpski-optimistic-multi-merge-adr

Conversation

@TGPSKI
Contributor

@TGPSKI TGPSKI commented Mar 5, 2026

Summary

  • Adds ADR-019 proposing an optimistic multi-merge strategy for gitlab_housekeeping.py that merges multiple non-overlapping MRs per reconcile loop, increasing throughput from ~7 MRs/hour to ~15-30 MRs/hour with no infrastructure changes.
  • The current serial bottleneck (is_rebased() + if rebase: return) limits merges to 1 per ~10-minute cycle regardless of the limit config. This ADR addresses the structural problem by skipping the rebase check for subsequent MRs whose changed files don't overlap with already-merged MRs.
  • Evaluates three alternatives (GitLab merge trains, DIY speculative train, optimistic multi-merge) and selects optimistic multi-merge for its low complexity and zero infrastructure overhead.
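The proposed control flow can be sketched as follows. This is an illustrative sketch only: `FakeMR` and `merge_non_overlapping` are hypothetical names, not the actual `gitlab_housekeeping.py` API, and the real code would read changed paths from the GitLab API.

```python
from dataclasses import dataclass


@dataclass
class FakeMR:
    """Stand-in for a GitLab MR; real code fetches changed paths via the API."""
    iid: int
    changed_paths: set[str]
    merged: bool = False

    def merge(self) -> None:
        self.merged = True


def merge_non_overlapping(mrs: list[FakeMR], merge_limit: int) -> int:
    """Merge multiple non-overlapping MRs in one reconcile loop.

    The first MR follows the normal serial path; subsequent MRs are merged
    optimistically only if their changed files are disjoint from everything
    merged so far this loop.
    """
    merged_paths: set[str] = set()
    merged_count = 0
    for mr in mrs:
        if merged_count >= merge_limit:
            break
        paths = set(mr.changed_paths)
        if merged_paths and (paths & merged_paths):
            # Overlap: this MR needs a fresh rebase + pipeline, so it is
            # left for the serial path on the next cycle.
            continue
        mr.merge()
        merged_paths |= paths
        merged_count += 1
    return merged_count
```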

Changes

  • docs/adr/ADR-019-merge-queue-acceleration.md -- new ADR
  • docs/adr/README.md -- index and category updates

@TGPSKI TGPSKI requested review from chassing, Copilot and hemslo March 5, 2026 20:37
Contributor

Copilot AI left a comment


Pull request overview

Adds ADR-019 documenting an approach to accelerate merge throughput in gitlab_housekeeping.py by optimistically merging multiple non-overlapping MRs per reconcile loop.

Changes:

  • Added new ADR-019 describing the optimistic non-overlapping multi-merge strategy and implementation plan.
  • Updated ADR index and category listing to include ADR-019.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/adr/README.md | Adds ADR-019 to the main ADR table and “Execution Patterns” category list. |
| docs/adr/ADR-019-merge-queue-acceleration.md | New ADR detailing motivation, decision, alternatives, and implementation guidelines for optimistic multi-merge. |


@TGPSKI TGPSKI force-pushed the tgpski-optimistic-multi-merge-adr branch from cd57da3 to fdc1c78 on March 6, 2026 04:24
@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

ADR-019 Update: Addressing Review Feedback

Pushed a second commit incorporating feedback from @jfchevrette, @BumbleFeng, and @hemslo. Here's what changed:

1. Reference-aware overlap detection (@jfchevrette)

Problem: File-level overlap misses cross-file semantic dependencies via $ref crossrefs. Two concrete examples:

  • MR-A changes a namespace file, MR-B changes the resource template it references — different files, but semantically coupled
  • MR-A modifies a role's permissions, MR-B adds a user ref to that role — reviewer approved MR-B based on the old permissions

Fix: Each MR's changed paths are now expanded to include forward refs ($ref targets) and backward refs (files that reference the changed file). Two MRs touching files connected by $ref crossrefs are treated as overlapping and fall back to serial merge. Added new section "Why file-level overlap is not enough" and "Reference-Aware Path Expansion" with implementation details including build_ref_graph(), expand_paths(), and configurable expansion depth.
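The expansion might look like the sketch below. The function names `build_ref_graph()` and `expand_paths()` come from the ADR, but the input shape (a mapping from datafile path to its `$ref` targets) is an assumption for illustration.

```python
from collections import defaultdict


def build_ref_graph(forward_refs: dict[str, list[str]]) -> dict[str, set[str]]:
    """forward_refs maps each datafile path to its $ref targets.

    Returns an undirected adjacency combining forward refs and backrefs,
    so traversal finds files coupled in either direction.
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for path, targets in forward_refs.items():
        for target in targets:
            graph[path].add(target)   # forward ref
            graph[target].add(path)   # backward ref
    return graph


def expand_paths(paths: set[str], graph: dict[str, set[str]], depth: int = 1) -> set[str]:
    """Expand an MR's changed paths by `depth` hops of $ref connectivity."""
    expanded = set(paths)
    frontier = set(paths)
    for _ in range(depth):
        frontier = {n for p in frontier for n in graph.get(p, set())} - expanded
        if not frontier:
            break
        expanded |= frontier
    return expanded
```

Two MRs would then be compared on their *expanded* path sets rather than their raw changed files.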

2. mr.rebase(skip_ci=True) before optimistic merge (@BumbleFeng)

Problem: App-interface uses fast-forward merge. After the first merge changes the target branch, subsequent MRs are not rebased — mr.merge() will be rejected by GitLab.

Fix: Optimistic MRs now call mr.rebase(skip_ci=True) before mr.merge(). This rebases onto the new target HEAD without triggering a redundant pipeline. GitlabMRRebaseError is caught separately from GitlabMergeError. Added checklist item to verify skip_ci parameter support in python-gitlab.
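A minimal sketch of the rebase-then-merge sequence. The exception classes here are local stand-ins so the sketch runs standalone; in the real integration they would come from `gitlab.exceptions` (a later consistency pass in this thread notes that `GitlabOperationError`, not `GitlabMergeError`, is the actual python-gitlab class, so that name is used here), and `skip_ci` support is itself a checklist item to verify.

```python
# Stand-ins so the sketch is self-contained; the real code would import
# GitlabMRRebaseError and GitlabOperationError from gitlab.exceptions.
class GitlabMRRebaseError(Exception): ...
class GitlabOperationError(Exception): ...


def optimistic_merge(mr) -> bool:
    """Rebase onto the moved target HEAD, then attempt the fast-forward merge.

    Returns True if the MR merged; False means it falls back to the
    serial path on the next reconcile cycle.
    """
    try:
        # skip_ci avoids triggering a redundant pipeline for the rebase;
        # verifying python-gitlab support for it is an ADR checklist item.
        mr.rebase(skip_ci=True)
    except GitlabMRRebaseError:
        return False  # caught separately from merge failures
    try:
        mr.merge()
    except GitlabOperationError:
        return False  # parent of GitlabMRClosedError, GitlabMRForbiddenError, ...
    return True
```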

3. Phase 0: Queue Hardening (@hemslo's operational improvements)

Problem: Several operational edge cases aren't handled (failed merges not tracked, failed-pipeline MRs consuming queue slots).

Fix: Phase 0 expanded from "Config Bump" to "Queue Hardening and Config Tuning" with four items:

  1. Bump limit from 2 to 6
  2. Filter MRs with failed pipelines from merge queue (move check to preprocess_merge_requests)
  3. Label MRs that fail to merge repeatedly (merge-error label → error queue for triage)
  4. Broaden merge exception handling to catch GitlabMergeError

This is now listed as Alternative 3: Serial Queue Hardening in the Alternatives section, with an explicit note that these improvements should be adopted regardless of which acceleration strategy is selected.

4. Adjusted projections

  • Throughput estimates revised downward to 10-20 MRs/hour (from 15-30) to account for the larger overlap surface from reference expansion
  • Non-overlapping rate assumption lowered to 60-70% (from 80%)
  • Implementation timeline adjusted to 2-3 weeks (from 1-2) to account for reference graph work
  • Still represents a ~3x improvement over the current ~6 MRs/hour ceiling

@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

Addressed @hemslo's follow-up on limit semantics:

The problem: limit currently controls rebases per run, but the counter resets every ~1 min reconcile cycle. With limit: 2, each run rebases up to 2 MRs — but across many rapid runs, far more MRs get rebased, each triggering a pipeline that's wasted when only 1 MR merges per cycle. Bumping to 6 per-run would make this worse.

The fix: Redefined limit as a per-repo cap — "at most N MRs in a rebased-and-pipeline-pending state at any time." Before rebasing, count how many MRs already have running/recent pipelines and only rebase up to limit - already_active. Keep limit small (2-3) for Phase 0 to minimize wasted CI.
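The per-repo cap can be sketched as a small budget calculation. The `pipeline_pending` attribute is a hypothetical shape for illustration; the real code would count running/recent pipelines via the GitLab API.

```python
def rebase_budget(open_mrs, limit: int) -> int:
    """Per-repo semantics for `limit`: at most N MRs may be in a
    rebased-and-pipeline-pending state at any time, across runs.

    `open_mrs` is any iterable of objects exposing a boolean
    `pipeline_pending` attribute (an assumed shape for this sketch).
    """
    already_active = sum(1 for mr in open_mrs if mr.pipeline_pending)
    return max(0, limit - already_active)
```

With two pipelines already pending and `limit: 3`, only one more MR gets rebased this run; with `limit: 2`, none do.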

Changes in this push:

  • Phase 0 item 1 rewritten from "bump limit to 6" to "redefine limit as per-repo cap"
  • Introduced merge_limit as a separate schema field controlling max merges per loop iteration (for multi-merge in Phase 1)
  • Updated throughput analysis, key points, consequences, and checklists throughout
  • Removed all stale rebase_limit references

@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

Consistency review pass — fixed several inaccuracies:

  • GitlabMergeError does not exist in python-gitlab. All references replaced with GitlabOperationError, which is the actual common parent of GitlabMRClosedError, GitlabMRForbiddenError, GitlabMROnBuildSuccessError, etc. Added explicit note in Phase 0 clarifying the class hierarchy.
  • Duplicate step numbering in Phase 1 (two "step 4" headers) — renumbered to 4–7
  • Typo on line 116: "(s)me safety" → "same safety"
  • Stale line references: get_merge_request_changed_paths is at :398 not :405; removed fragile line numbers for insist=False fallback call

@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

Current rendered markdown

Member

@chassing chassing left a comment


Well written ADR ❤️!

gitlab-housekeeping is a beast, not just because of the complex code but also because of the many edge cases that must be considered. Maybe a simpler approach (comparing labels) would be easier to implement and to maintain.


1. **Redefine `limit` as a per-repo cap, not per-run.** Currently `limit` controls how many MRs are rebased *per reconcile run*, with the counter resetting each invocation. Since `gitlab-housekeeping` runs frequently (~1 min cycles), a `limit: 2` still results in many MRs being rebased across runs, each triggering a pipeline -- most of which are wasted when only one MR merges per cycle. The fix: change the semantics so `limit` means "at most N MRs should be in a rebased-and-pipeline-pending state at any time for this repo." Before rebasing, count how many MRs already have running or recent pipelines and only rebase up to `limit - already_active`. Keep `limit` small (2-3) to minimize wasted CI. This is a prerequisite for Phase 1: with per-repo semantics, bumping `limit` to 6 for multi-merge becomes safe because it controls the *steady-state pipeline concurrency*, not the per-run rebase burst.

2. **Filter MRs with failed pipelines from the merge queue.** Move the pipeline-success check from `merge_merge_requests` into `preprocess_merge_requests` so MRs without a passing last pipeline are excluded before sorting and slot allocation. This prevents failed MRs from consuming rebase slots.
Member


This is very dangerous, and we need to monitor that or ping the IC. In case of flaky integrations, e.g., old non-qontract-api-based slack-usergroups, a single failed build would disable retries, and an MR would be stuck.

Contributor


Agreed. We have too many transient errors as of now in MR checks. Retesting is crucial to move forward. Maybe at least have some amount of retry budget - but tracking that would need to happen through label or commit comment, increasing complexity.

Contributor


Using rebase alone to handle transient errors is not reliable. Two cases:

  • A high-priority approved MR whose pipeline run failed will stick at the top and keep rebasing — setting a block label and notifying the MR author is my major toil
  • A transient error with no other MR merging (e.g., during APAC hours) can leave the failed MR stuck for the whole day

The proper way is to treat pipeline run status like a healthcheck probe: if the last N pipelines failed, the MR is dead — move it to an error queue for investigation. When it's fixed with a successful run, move it back to the merge queue.

Optionally we could auto-/retest a failed pipeline up to N times, but that count tracking is not easy: /retest reruns the last pipeline while a rebase creates a new one. This can wait, since such MRs can afford to wait longer — if it's urgent, the author will manually /retest.

Contributor Author


Addressed in v4 (1c53883). Phase 0 now uses a healthcheck-probe model exactly as @hemslo described:

  • Track consecutive pipeline failures per MR IID in State
  • After N consecutive failures (configurable, default 3), apply merge-error label and exclude from merge queue
  • Auto-recover: when a new pipeline succeeds, reset the failure counter and remove the label

This balances retesting for flaky CI (@fishi0x01's point) with preventing zombie MRs from blocking the queue indefinitely (@chassing's concern). Simple pipeline-success filtering is explicitly called out as dangerous in the ADR for exactly the reasons you all identified.

The /retest auto-retry is deferred as @hemslo suggested -- count tracking across rebase vs retest pipelines adds complexity for limited benefit since urgent MRs will be manually retested.
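The healthcheck-probe model described above can be sketched as a small state machine. The names and the `failures` dict are illustrative; the real counter would live in the integration's State.

```python
FAILURE_THRESHOLD = 3  # N consecutive failures before quarantine (configurable)


def update_probe(failures: dict[int, int], iid: int, pipeline_ok: bool) -> str:
    """Healthcheck-probe model for one MR's latest pipeline result.

    `failures` stands in for the per-MR-IID counter kept in State.
    Returns the queue action to take for this MR.
    """
    if pipeline_ok:
        failures[iid] = 0
        return "recover"      # reset counter, remove merge-error label
    failures[iid] = failures.get(iid, 0) + 1
    if failures[iid] >= FAILURE_THRESHOLD:
        return "quarantine"   # apply merge-error label, exclude from queue
    return "retry"            # still within the retry budget
```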

Contributor


A few simplifications on implementation:

  • We already fetch all pipelines for a given MR via gl.get_merge_request_pipelines(mr), and every rebase creates a new pipeline, so we can just check the last N pipelines for the healthcheck — no need to maintain separate state.
  • The error queue is just a mental model. Since we already have rendered markdown files / inscope plugins to show the merge and review queues, it's easier to add a flag in the merge queue indicating that an MR has errors and won't be merged or considered until they're fixed. The IC can check error MRs on the merge-queue page.
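The stateless variant suggested here might look like the following sketch, assuming pipeline statuses come back newest-first (e.g., from `gl.get_merge_request_pipelines(mr)` — the status strings and ordering are assumptions):

```python
def is_healthy(pipeline_statuses: list[str], n: int = 3) -> bool:
    """Stateless healthcheck: derive health from the last N pipelines
    (newest first) instead of maintaining a separate failure counter.

    An MR is unhealthy only once the last N pipelines have all failed;
    fewer than N pipelines means there is no basis to quarantine yet.
    """
    recent = pipeline_statuses[:n]
    return len(recent) < n or not all(s == "failed" for s in recent)
```

A fresh rebase pipeline that succeeds automatically restores health, since it displaces the failures in the last-N window.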


- **Bundle-local:** Load the bundle JSON directly in `gitlab_housekeeping` and traverse datafiles to build the ref graph. This is self-contained but adds memory and startup cost.
- **qontract-server query:** Query the GraphQL API for synthetic backref fields and forward refs. This reuses existing infrastructure but adds network calls.
- **Pre-computed lookup table:** Build the reference graph as a periodic job (or as part of bundle validation) and store it as a JSON artifact. `gitlab_housekeeping` loads this artifact at startup. This is the most efficient at runtime.
Member


TBH, I don't get it. gitlab-housekeeping has access to the prod qontract-server bundle with the state of master. How would this help to identify $refs for new files added in an MR? E.g. new user in MR-A, and MR-B changes roles/permissions.

Contributor


My understanding is this is purely about saving computation effort. In terms of overlap identification accuracy, there is no difference in whether you use the qontract-server bundle or the precomputed table. The precomputed table is a transformation of the bundle to have a more efficient lookup structure.

Member


Yes, that's clear, but the prod bundle doesn't know anything about the new files introduced by an MR.

Contributor Author


This was the fatal flaw that killed the ref-graph approach. Addressed in v4 (1c53883) by pivoting entirely to label-based overlap detection.

@chassing you were right -- the prod bundle cannot know about $refs in new files from MR branches. Fixing it would require fetching each MR's branch content and rebuilding the graph per-MR, which negates all the simplicity advantages.

The ref-graph approach is preserved as Alternative 5 in the ADR with this fatal flaw documented. The selected approach now uses tenant-* labels (already applied by gitlab_labeler), which are computed per-MR from its actual changed files -- so they inherently capture what each MR touches regardless of what's on master.

Contributor

@fishi0x01 fishi0x01 left a comment


This is an awesome idea and write up! Thank you Tyler! ❤️

My main concern would be unleashing this instantly on all MRs — e.g., what about qr-bump MRs or hack-script changes? It would very likely make sense to exclude some types of MRs from this. It might also make sense to only include certain types of MRs initially, like self-serviceable MRs.


The recommended approach is the pre-computed lookup table, built during bundle validation and stored alongside the bundle. This adds zero runtime overhead to the merge loop beyond a dictionary lookup.

The expansion depth should be configurable (default: 1 hop). For most app-interface patterns, 1-hop expansion (direct refs and backrefs) is sufficient. Deeper transitive expansion increases safety but also increases the overlap rate, reducing throughput.
Contributor


Tbh I would tend to go the conservative way and start with at least 2 hops, then observe MR statistics (https://grafana.app-sre.devshift.net/d/xNTPSl-Vk/appsre-overview?orgId=1&from=now-28d&to=now&timezone=utc) for performance. I would only tune down to fewer hops if performance requires it.

Contributor Author


No longer applicable -- the ref-graph and its hop-depth config were dropped entirely in v4 (1c53883) in favor of label-based overlap detection. There's no expansion depth to tune since we're comparing tenant-* label sets, not traversing a reference graph.

Your conservative instinct was well-placed though -- it was one of the signals that the ref-graph approach was getting complex. The label approach sidesteps all of that.

Replace reference-graph overlap detection with tenant-* label comparison.
The ref-graph approach has a fatal flaw: the production bundle cannot
capture $ref crossrefs from new files introduced by MR branches. Labels
are coarser but provably safe (service-level isolation), require zero
API calls, and are ~50 lines of Python.

Key changes:
- Decision/Key Points: label-based overlap + self-serviceable gate
- Phase 0: healthcheck-probe model for pipeline failures (retry budget)
- Phase 1: label comparison replaces ref-graph code entirely
- Phase 2: change-type coverage refinement (conditional on metrics)
- Alternative 5: ref-graph preserved for future reference
- Alternative 6: change-type coverage overlap (mafriedm)
- Throughput projections revised for label-based approach
- Reviewers expanded with all feedback contributors

Assisted-by: Claude (Anthropic)
Made-with: Cursor
@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

ADR-019 v4: Pivot to Label-Based Overlap Detection

Major revision based on feedback from @jfchevrette, @BumbleFeng, @hemslo, @chassing, @fishi0x01, @mafriedm, @kfischer.

What changed

Overlap detection: ref-graph → tenant-* labels

The reference-graph approach had a fatal flaw identified by @chassing: the production bundle doesn't contain files introduced by MR branches. If MR-A adds a new file with a $ref to an existing role, the ref graph built from master won't capture that forward ref. This undermines the core safety guarantee.

The new approach uses tenant-* labels (already applied by gitlab_labeler) as the overlap boundary. Two MRs with non-overlapping tenant-* labels touch different services and are safe to multi-merge. This is coarser (same-service MRs are serialized) but:

  • Zero API calls (labels already present)
  • Safe by construction (service-level isolation captures crossref deps)
  • ~50 lines of Python vs hundreds for ref-graph
  • Already available in production
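The "~50 lines of Python" claim is plausible because the check reduces to set intersection on label names. A minimal sketch (function names are illustrative; the fall-back-to-serial behavior for unlabeled MRs matches the eligibility refinement discussed later in this thread):

```python
def tenant_labels(labels: set[str]) -> set[str]:
    """tenant-* labels are already applied per-MR by gitlab_labeler."""
    return {label for label in labels if label.startswith("tenant-")}


def safe_to_multi_merge(mr_labels: set[str], merged_tenants: set[str]) -> bool:
    """An MR is safe to merge optimistically if it has at least one
    tenant-* label and that set is disjoint from the tenants already
    merged this loop.
    """
    mine = tenant_labels(mr_labels)
    if not mine:
        return False  # no basis for comparison -> serial fallback
    return not (mine & merged_tenants)
```

Because labels are computed from each MR's actual changed files, this works even for files that only exist on the MR branch — the gap that killed the ref-graph approach.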

Self-serviceable eligibility gate (per @fishi0x01)

Only MRs with self-serviceable label are eligible for optimistic merge. This excludes qr-bump, hack-script, and global changes from the multi-merge path.

Phase 0: Healthcheck-probe model (per @chassing, @fishi0x01, @hemslo)

Simple pipeline-failure filtering replaced with retry budget: track consecutive failures per MR in State. After N failures (default 3), move to error queue (merge-error label). Auto-recover when a new pipeline succeeds. Balances retesting for flaky CI with preventing zombie MRs.

Phase 2: Change-type refinement (per @mafriedm, @kfischer)

Replaced "Speculative Stacking" with conditional change-type coverage overlap. If Phase 1 metrics show high same-service overlap, use change-type coverage to unlock finer-grained multi-merge. Only pursue if optimistic_merge_rejected_total{reason="overlap"} > 30%.

Ref-graph preserved as Alternative 5 for future reference, with the fatal flaw documented. Change-type overlap added as Alternative 6.

Section-by-section changes

| Section | Change |
| --- | --- |
| Decision | tenant-* labels + self-serviceable gate |
| Key Points | Label-based overlap, healthcheck-probe, eligibility gate |
| Why file-level overlap is not enough | Reframed as motivation for labels (not ref-graph) |
| Rationale | Updated for label-based advantages |
| Relationship to Existing Analysis | gitlab_labeler is now primary mechanism |
| Reference-Aware Path Expansion | Removed (moved to Alt 5) |
| Alternative 4 (Selected) | Updated to label-based |
| Alternative 5 (New) | Ref-graph with documented fatal flaw |
| Alternative 6 (New) | Change-type coverage overlap |
| Consequences | Updated for label-based pros/cons |
| Phase 0 | Healthcheck-probe model added |
| Phase 1 | Label comparison replaces ref-graph code |
| Phase 2 | Change-type refinement replaces speculative stacking |
| Throughput | Revised: ~10-20 MRs/hr (conservative ~12, optimistic ~24) |
| Checklist | Updated for all three phases |
| References | Added gitlab_labeler, change_owners/, State |

The self-serviceable label was too restrictive -- many non-self-serviceable
MRs are safe for optimistic multi-merge. The real gate is having tenant-*
labels (so the overlap check is meaningful). MRs without labels fall back
to serial; MRs touching many services are naturally serialized by overlap.

Assisted-by: Claude (Anthropic)
Made-with: Cursor
@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

ADR-019 v5: Relax eligibility gate from self-serviceable to tenant-* labels

Small but important fix -- the self-serviceable label was too restrictive as an eligibility gate. Plenty of non-self-serviceable MRs are safe for optimistic multi-merge (e.g., a non-self-serviceable MR that only touches tenant-foo files).

What changed:

The eligibility gate is now simply: MR must have at least one tenant-* label. The overlap check handles the rest naturally:

  • MRs with no tenant-* labels (global infrastructure, cross-cutting refactors) have no basis for comparison → fall back to serial processing
  • MRs with many tenant-* labels (qr-bump touching dozens of services) have high overlap with everything → naturally serialized by the overlap check itself
  • MRs with 1-2 tenant-* labels (the common case) → eligible, overlap check determines independence

This is strictly more permissive than the previous self-serviceable gate while remaining safe. The self-serviceable concept is about approval routing, not merge safety -- those are orthogonal concerns.

@TGPSKI
Contributor Author

TGPSKI commented Mar 6, 2026

ADR-019 v6: Empirical Throughput Baseline from Production Pod Log

The throughput estimates in the ADR were previously based on theoretical calculations (~10 min/cycle → ~6 MRs/hour). This update anchors them to empirical measurements from a production pod log captured during ADR development.

Measurement Methodology

A full pod log from gitlab-housekeeping-4-68bff97fd6-dzls2-int was downloaded on 2026-03-03 and analyzed as part of the planning process. The log covers 14:28–21:21 UTC (6h 52m, 9,326 lines) during a period of sustained queue pressure.

Merge events were extracted by grepping for ['merge', 'app-interface', ...] log lines. Inter-merge intervals, reconcile loop cadence, and rebase churn were computed from timestamps.
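The interval computation might look like the sketch below. The log-line format here is an assumption for illustration; the actual pod log format may differ.

```python
import re
from datetime import datetime

# Assumed line shape: "[YYYY-MM-DD HH:MM:SS] ... merge ... app-interface ..."
MERGE_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*\bmerge\b.*app-interface"
)


def inter_merge_minutes(log_lines: list[str]) -> list[float]:
    """Extract merge-event timestamps and compute inter-merge intervals."""
    times = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        for line in log_lines
        if (m := MERGE_LINE.match(line))
    ]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
```

Median and range over the resulting list give the inter-merge statistics reported below; dividing total merges by the window length gives throughput.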

Key Findings

| Metric | Value |
| --- | --- |
| Total app-interface merges | 53 |
| Merge window | 6h 40m (14:38–21:19 UTC) |
| Measured throughput | ~7.9 MRs/hour |
| Median inter-merge interval | ~7.7 min (range: 4m–17m) |
| Reconcile loop cadence | ~1.5–2 min/cycle |
| "rebase limit reached" events | 2,142 (wasted rebase cycles) |
| "unable to merge" errors | 267 (unhandled edge cases) |

MR Starvation Case Study (MR 177008)

MR 177008 had an lgtm label and was eligible for merge. Over the 6h 52m observation window:

  • Appeared 12+ times as "rebase limit reached for this reconcile loop. will try next time"
  • First appeared at 17:30, still being deferred at 21:21
  • Never merged during the entire observation window
  • Meanwhile, 53 other MRs merged ahead of it

This is the exact starvation pattern the ADR addresses.

What Changed in the ADR

  • Throughput Analysis section now opens with the empirical measurement methodology and data
  • Context section updated from "~7 MRs/hour" to "~7.9 MRs/hour (empirically measured)" with starvation case reference
  • Projections re-anchored against ~8 MRs/hour measured baseline (conservative ~12, optimistic ~24, realistic ~15)
  • Alternative 3 serial ceiling updated to match measured rate
  • All references to "~6 MRs/hour" replaced with the measured ~8 MRs/hour where appropriate

The rebase churn (2,142 events) and merge error count (267) further motivate Phase 0 work (per-repo limit cap, error-queue labeling, broadened exception handling).

@kfischer

kfischer commented Mar 7, 2026

Please check kfischer - you certainly meant someone else :)
