Skip to content

Commit 22a94d9

Browse files
author
Tyler Pate
committed
ADR-019: optimistic non-overlapping multi-merge for gitlab-housekeeping
1 parent d800f4a commit 22a94d9

File tree

2 files changed

+311
-0
lines changed

2 files changed

+311
-0
lines changed
Lines changed: 309 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,309 @@
1+
# ADR-019: Merge Queue Acceleration via Optimistic Non-Overlapping Multi-Merge
2+
3+
**Status:** Proposed
4+
**Date:** 2026-03-03
5+
**Authors:** @TGPSKI
6+
7+
## Context
8+
9+
The `gitlab-housekeeping` integration manages merge request lifecycle for app-interface and other repositories. For repos with `rebase: true` (including app-interface), the current merge flow is strictly serial:
10+
11+
1. Rebase up to `limit` MRs onto the target branch
12+
2. Wait for the first pipeline to succeed (`insist` retry loop)
13+
3. Merge exactly one MR
14+
4. `return` immediately -- all other rebased MRs are now invalid
15+
16+
After merging MR-A, the target branch HEAD changes. The `is_rebased()` check compares the MR's SHA against the target branch HEAD, so every other rebased MR fails this check and must be re-rebased, triggering new pipelines. The result is one merge per pipeline-duration cycle (~10 minutes), regardless of the `limit` config.
17+
18+
With app-interface's `limit: 2` and ~8 minute pipelines, the maximum throughput is ~7 MRs/hour. During sustained load (10+ MRs queued), low-priority MRs (e.g. `lgtm` label) experience starvation -- they are perpetually pushed to future reconcile loops by higher-priority MRs that consume all rebase and merge slots.
19+
20+
### Why increasing the limit alone does not help
21+
22+
The serial dependency is structural, not configurational. Bumping `limit` from 2 to 8 means 8 MRs get rebased instead of 2, but after the first merge changes the target branch, the other 7 are invalidated and need re-rebase. The throughput remains 1 merge per cycle.
23+
24+
### How GitLab merge trains solve this
25+
26+
GitLab merge trains run MR-B's pipeline against `target + A + B` speculatively, in parallel with MR-A's pipeline. When MR-A merges, MR-B's pipeline is already validated against the correct base. This requires GitLab Premium licensing and GitLab CI runners, which would mean replacing the current Jenkins infrastructure.
27+
28+
## Decision
29+
30+
**Implement optimistic non-overlapping multi-merge in `gitlab_housekeeping.py`: after the first merge in a loop, merge subsequent MRs without re-rebase if their changed files do not overlap with any already-merged MR's changed files.**
31+
32+
### Key Points
33+
34+
- The first MR in each loop iteration still requires `is_rebased()` and a passing pipeline (unchanged safety)
35+
- Subsequent MRs skip the `is_rebased()` check if their changed file paths have zero intersection with the union of all previously merged MRs' changed file paths
36+
- The `if rebase: return` exit after a single merge is removed, allowing multiple merges per loop
37+
- `rebase_limit` is separated from `merge_limit` so more MRs can be kept pipeline-ready simultaneously
38+
- A circuit breaker prevents zombie MRs (repeated pipeline failures) from blocking the queue
39+
- New Prometheus metrics track optimistic merge success rate and overlap frequency
40+
41+
### Rationale
42+
43+
Most app-interface MRs modify independent YAML datafiles under different service directories. When MR-A changes `data/services/foo/app.yml` and MR-B changes `data/services/bar/deploy.yml`, MR-B's pipeline result against `old_master` is identical to its result against `old_master + MR-A`. The file contents, schema validation, and integration dry-runs are all independent.
44+
45+
This approach gets 70-80% of the throughput benefit of GitLab merge trains with no infrastructure changes and ~100 lines of code.
46+
47+
### Relationship to Existing Changed-Path Analysis
48+
49+
Several integrations already analyze which files an MR changes. This ADR's overlap detection reuses an existing utility rather than introducing a new mechanism:
50+
51+
- **`GitLabApi.get_merge_request_changed_paths()`** (`reconcile/utils/gitlab_api.py:405`) is the shared method that calls the GitLab MR changes API. It is already used by `gitlab_owners` (for CODEOWNERS-style approval routing), `gitlab_labeler` (for `tenant-*` label assignment), and `openshift_saas_deploy_change_tester` (for filtering saas-file diffs). This ADR's overlap detection reuses the same method.
52+
53+
- **`gitlab_labeler`** already applies `tenant-*` labels (e.g. `tenant-foo`) based on which `data/services/` directories an MR touches. These labels are present on the MR by the time `gitlab_housekeeping` runs. A coarse overlap check could compare `tenant-*` labels between MRs with zero additional API calls. However, tenant labels are too imprecise: two MRs touching different files under the same service directory (e.g. `app.yml` vs `deploy.yml`) would incorrectly appear to overlap.
54+
55+
- **change-owners** uses a different mechanism entirely: the qontract-server `/diff` endpoint compares two bundle SHAs. This provides deep semantic diff information (per-field changes, change-type coverage), but requires a running qontract-server with loaded bundles and is not available in the housekeeping merge loop context.
56+
57+
The decision is to use `get_merge_request_changed_paths()` directly for exact file-level overlap detection. This adds one GitLab API call per MR candidate in the merge loop (at most `merge_limit` calls, typically 6). If API call volume becomes a concern, a two-tier approach could first check `tenant-*` labels (zero API calls, coarse filter) and only call the API for MRs with overlapping tenant labels.
58+
59+
## Alternatives Considered
60+
61+
### Alternative 1: GitLab Merge Trains (Native)
62+
63+
Migrate from Jenkins CI to GitLab CI runners and enable GitLab merge trains. GitLab speculatively validates each MR against the cumulative changes of all preceding MRs.
64+
65+
**Pros:**
66+
67+
- Maximum throughput (50+ MRs/hour)
68+
- Zero risk of broken master (full speculative validation)
69+
- No custom merge logic to maintain
70+
71+
**Cons:**
72+
73+
- Requires GitLab Premium licensing
74+
- Requires migrating CI from Jenkins to GitLab CI runners
75+
- Requires managing a private GitLab runner fleet to replace Jenkins EC2 spot autoscaling
76+
- 4-8 week implementation timeline for infrastructure migration
77+
78+
### Alternative 2: DIY Speculative Merge Train
79+
80+
Build a full merge train in `gitlab_housekeeping.py` by creating temporary branches that stack MRs (`train/pos-0` = master + A, `train/pos-1` = master + A + B) and triggering Jenkins jobs against those branches.
81+
82+
**Pros:**
83+
84+
- Full speculative validation (same safety as GitLab merge trains)
85+
- Uses existing Jenkins infrastructure
86+
87+
**Cons:**
88+
89+
- Significant complexity: temporary branch lifecycle, train state persistence across reconcile loops, cascade invalidation when mid-train MRs fail
90+
- Requires Jenkins job parameterization to accept arbitrary branches
91+
- Requires changes to `pr_check.sh` to validate against speculative branches instead of master
92+
- High ongoing maintenance burden
93+
94+
### Alternative 3: Optimistic Non-Overlapping Multi-Merge (Selected)
95+
96+
After the first merge, allow subsequent non-overlapping MRs to merge without re-rebase.
97+
98+
**Pros:**
99+
100+
- No infrastructure changes (keeps Jenkins, no licensing changes)
101+
- Minimal code changes (~100 lines in `gitlab_housekeeping.py`)
102+
- Uses existing `GitLabApi.get_merge_request_changed_paths()` API
103+
- Safe fallback: overlapping MRs are deferred to re-rebase (no regression)
104+
- Projected 15-30 MRs/hour for typical non-overlapping workloads
105+
- 1-2 week implementation timeline
106+
- New metrics provide data to justify further investment if needed
107+
108+
**Cons:**
109+
110+
- Overlapping MRs fall back to serial behavior
111+
- **Mitigation:** Most app-interface MRs are non-overlapping; the `optimistic_merge_rejected_total` metric quantifies overlap rate
112+
- Theoretical risk of broken master when non-overlapping MRs have semantic dependencies (e.g. schema change + new datafile using that schema)
113+
- **Mitigation:** This is rare in app-interface. The post-merge master pipeline catches it immediately. This is the same risk model as "merge when pipeline succeeds" workflows.
114+
115+
## Consequences
116+
117+
### Positive
118+
119+
- Throughput increase from ~7 MRs/hour to ~15-30 MRs/hour for typical workloads (see throughput analysis below)
120+
- Eliminates priority starvation: lower-priority MRs progress within the same loop iteration
121+
- Separated `rebase_limit` / `merge_limit` gives operators fine-grained control over queue behavior
122+
- Circuit breaker prevents zombie MRs from blocking the queue
123+
- New metrics provide visibility into merge queue health and overlap rates
124+
- No infrastructure changes, licensing costs, or CI migration
125+
126+
### Negative
127+
128+
- Optimistic merges introduce a small risk of broken master for semantically dependent but file-disjoint changes
129+
- **Mitigation:** Post-merge master pipeline detects this. The risk is equivalent to any "merge when pipeline succeeds" workflow and is very low for app-interface's YAML-file-per-service structure.
130+
- Additional GitLab API calls per MR (`get_merge_request_changed_paths`) in the merge loop
131+
- **Mitigation:** One API call per MR candidate. With `merge_limit: 6`, this is at most 6 additional calls per loop.
132+
- If overlap rate exceeds 30%, the approach degrades toward serial behavior
133+
- **Mitigation:** The `optimistic_merge_rejected_total` metric tracks this. If overlap rate is high, escalate to Alternative 2 (speculative train).
134+
135+
## Implementation Guidelines
136+
137+
### Phase 0: Config Bump (immediate)
138+
139+
Increase `limit` from 2 to 6 in app-interface `app.yml`. While this alone doesn't break the serial dependency, it keeps more MRs rebased in the queue, reducing wait time for the first merge each loop. This is a prerequisite for Phase 1 to have enough pipeline-ready MRs to merge in batch.
140+
141+
### Phase 1: Optimistic Multi-Merge (1-2 weeks)
142+
143+
#### 1. Add `rebase_limit` to the housekeeping schema in qontract-schemas
144+
145+
The schema must support an optional `rebase_limit` field alongside the existing `limit`. When absent, `rebase_limit` defaults to `limit`.
146+
147+
#### 2. Implement overlap detection helper
148+
149+
```python
150+
def get_changed_paths(mr: ProjectMergeRequest, gl: GitLabApi) -> set[str]:
151+
return set(gl.get_merge_request_changed_paths(mr))
152+
153+
def has_overlapping_changes(
154+
mr_paths: set[str], merged_paths: set[str]
155+
) -> bool:
156+
return bool(mr_paths & merged_paths)
157+
```
158+
159+
#### 3. Modify `merge_merge_requests()` merge loop
160+
161+
This is the most complex change. There are three interacting concerns:
162+
163+
**The `@retry` / `InsistOnPipelineError` interaction.** The current function is decorated with `@retry(max_attempts=10)`. When `insist=True` and a pipeline is still running, it raises `InsistOnPipelineError`, which restarts the entire function from the top. Any local state (`merged_paths`, `first_merge_done`) is lost on retry. The `insist` mechanism is effectively a polling loop that waits for the *first* rebased MR's pipeline to complete.
164+
165+
The solution: the `insist` wait-and-retry behavior only applies *before* the first merge. Once we have merged at least one MR, we stop waiting for running pipelines and only consider MRs whose pipelines have already completed successfully. This avoids the retry blowing away the multi-merge state, because we never raise `InsistOnPipelineError` after the first merge. Concretely, `wait_for_pipeline` with `insist` is only checked when `first_merge_done is False`.
166+
167+
**Merge rejection on non-rebased MRs.** After the first merge changes the target branch, subsequent MRs are no longer rebased. Calling `mr.merge()` on a non-rebased MR may succeed (merge commit projects) or fail (fast-forward projects). The python-gitlab library raises `GitlabMergeError` (or a subclass) on rejection. The exception handler must catch broadly -- not just `GitlabMRClosedError` but any `GitlabMergeError` -- and treat rejection as a skip (the MR will be re-rebased next loop).
168+
169+
**Caching changed paths.** `get_merge_request_changed_paths()` calls the GitLab API. The overlap check and the `merged_paths` update both need the paths for the same MR. Cache the result to avoid a redundant API call.
170+
171+
```python
172+
merged_paths: set[str] = set()
173+
first_merge_done = False
174+
paths_cache: dict[int, set[str]] = {}
175+
176+
for merge_request in merge_requests:
177+
mr = merge_request["mr"]
178+
179+
if rebase:
180+
if first_merge_done:
181+
if mr.iid not in paths_cache:
182+
paths_cache[mr.iid] = get_changed_paths(mr, gl)
183+
if has_overlapping_changes(paths_cache[mr.iid], merged_paths):
184+
optimistic_merge_rejected.labels(
185+
project_id=mr.target_project_id, reason="overlap"
186+
).inc()
187+
continue
188+
else:
189+
if not is_rebased(mr, gl):
190+
continue
191+
192+
pipelines = gl.get_merge_request_pipelines(mr)
193+
if not pipelines:
194+
continue
195+
196+
# Pipeline timeout cleanup (unchanged) ...
197+
198+
if not first_merge_done and wait_for_pipeline:
199+
running_pipelines = [
200+
p for p in pipelines if p.status == PipelineStatus.RUNNING
201+
]
202+
if running_pipelines:
203+
if insist:
204+
reload_toggle.reload = True
205+
raise InsistOnPipelineError(...)
206+
continue
207+
elif first_merge_done:
208+
# After first merge, only consider MRs with completed pipelines.
209+
# Do NOT raise InsistOnPipelineError -- that would reset our state.
210+
if pipelines[0].status == PipelineStatus.RUNNING:
211+
continue
212+
213+
last_pipeline_result = pipelines[0].status
214+
if last_pipeline_result != PipelineStatus.SUCCESS:
215+
continue
216+
217+
if not dry_run and merges < merge_limit:
218+
try:
219+
squash = (
220+
gl.project.squash_option == SQUASH_OPTION_ALWAYS
221+
) or mr.squash
222+
mr.merge(squash=squash)
223+
if mr.iid not in paths_cache:
224+
paths_cache[mr.iid] = get_changed_paths(mr, gl)
225+
merged_paths.update(paths_cache[mr.iid])
226+
if first_merge_done:
227+
optimistic_merges.labels(mr.target_project_id).inc()
228+
first_merge_done = True
229+
merges += 1
230+
# ... existing metrics (merged_merge_requests, time_to_merge) ...
231+
except gitlab.exceptions.GitlabMRClosedError as e:
232+
logging.error(f"unable to merge {mr.iid}: {e}")
233+
except gitlab.exceptions.GitlabMergeError as e:
234+
# GitLab rejected the merge (e.g. fast-forward not possible
235+
# because MR is not rebased onto current target). Skip this MR;
236+
# it will be re-rebased next loop.
237+
logging.warning(
238+
f"optimistic merge rejected for {mr.iid}: {e}"
239+
)
240+
optimistic_merge_rejected.labels(
241+
project_id=mr.target_project_id, reason="merge_rejected"
242+
).inc()
243+
244+
merge_batch_size_histogram.labels(gl.project.id).observe(merges)
245+
```
246+
247+
**Key design constraints:**
248+
249+
- `InsistOnPipelineError` is only raised before the first merge (`not first_merge_done`). After the first merge, running pipelines are silently skipped. This ensures the retry decorator never resets multi-merge state.
250+
- The `insist=False` fallback call (lines 739-753 in `run()`) works correctly: it never raises `InsistOnPipelineError` regardless, so the multi-merge loop runs without interruption.
251+
- `GitlabMergeError` is the base class for merge-related failures in python-gitlab. Catching it covers fast-forward rejection, pipeline-gated rejection, and other server-side refusals.
252+
- `paths_cache` avoids calling `get_merge_request_changed_paths()` twice for the same MR (once for the overlap check, once for updating `merged_paths` after merge).
253+
254+
#### 4. Separate `rebase_limit` from `merge_limit` in `run()`
255+
256+
```python
257+
merge_limit = hk.get("limit") or default_limit
258+
rebase_limit = hk.get("rebase_limit") or merge_limit
259+
```
260+
261+
#### 5. Add circuit breaker for zombie MRs
262+
263+
Track consecutive pipeline failures per MR IID in `State`. After 3 consecutive failures, skip the MR for the current loop and log a warning. Reset the counter when the MR's pipeline succeeds or a new commit is pushed.
264+
265+
#### 6. Add Prometheus metrics
266+
267+
- `optimistic_merges_total` (Counter, labels: `project_id`) -- MRs merged via the optimistic non-overlapping path
268+
- `optimistic_merge_rejected_total` (Counter, labels: `project_id`, `reason`) -- MRs skipped due to file overlap or GitLab merge rejection
269+
- `merge_batch_size` (Histogram, labels: `project_id`) -- number of MRs merged per loop iteration
270+
271+
### Throughput Analysis
272+
273+
**Current state:** 1 merge per cycle. Cycle time = pipeline duration (~8 min) + insist retry overhead (~1 min) + reconcile sleep (~1 min) = ~10 min. Throughput: ~6 MRs/hour.
274+
275+
**With optimistic multi-merge:** The first merge still takes ~10 min (waiting for pipeline via insist). After the first merge, additional MRs whose pipelines have already completed can merge immediately (seconds per merge). Since all rebased MRs start pipelines at roughly the same time, most finish within a narrow window around the 8-minute mark.
276+
277+
With `rebase_limit: 10` and an 80% non-overlapping rate (typical for app-interface), expect ~8 MRs eligible for optimistic merge per cycle. Not all will have completed pipelines -- pipeline times vary and some may still be running when the first MR merges. Conservatively assume 3-5 optimistic merges succeed per cycle.
278+
279+
- **Conservative estimate:** 1 (first) + 3 (optimistic) = 4 merges per 10-min cycle = **~24 MRs/hour**
280+
- **Optimistic estimate:** 1 + 5 = 6 per cycle = **~36 MRs/hour**
281+
- **Realistic midpoint:** ~15-30 MRs/hour, depending on pipeline time variance and overlap rate
282+
283+
These projections assume Jenkins has sufficient executor capacity for 10 parallel pipelines. Jenkins spot autoscaling already handles this.
284+
285+
### Phase 2: Speculative Stacking (conditional)
286+
287+
Only pursue if Phase 1 metrics show overlap rate exceeding 30%. This phase involves temporary branch creation, Jenkins job parameterization, and cascade invalidation logic. Defer design until Phase 1 data is available.
288+
289+
### Checklist
290+
291+
- [ ] Add `rebase_limit` to housekeeping schema in qontract-schemas
292+
- [ ] Implement `get_changed_paths()` and `has_overlapping_changes()` in `gitlab_housekeeping.py`
293+
- [ ] Replace the single-merge `return` with overlap-aware multi-merge loop
294+
- [ ] Separate `rebase_limit` from `merge_limit` in `run()` (backward compatible)
295+
- [ ] Add circuit breaker with `State` tracking
296+
- [ ] Add `optimistic_merges_total`, `optimistic_merge_rejected_total`, `merge_batch_size` metrics
297+
- [ ] Add unit tests for overlap detection and multi-merge behavior
298+
- [ ] Set `limit: 6`, `rebase_limit: 10` in app-interface `app.yml`
299+
- [ ] Monitor `optimistic_merge_rejected_total` for overlap rate after rollout
300+
301+
## References
302+
303+
- Implementation: `reconcile/gitlab_housekeeping.py` -- merge loop, rebase logic, priority sorting
304+
- Implementation: `reconcile/utils/gitlab_api.py:405` -- `get_merge_request_changed_paths()`
305+
- Implementation: `reconcile/utils/state.py` -- `State` for circuit breaker persistence
306+
- App-interface config: `data/services/app-interface/app.yml` -- `gitlabHousekeeping.limit`
307+
- CI pipeline: `hack/pr_check.sh`, `hack/manual_reconcile.sh`, `hack/select-integrations.py`
308+
- External: [GitLab merge trains](https://docs.gitlab.com/ci/pipelines/merge_trains/)
309+
- External: [GitLab merge_ref API](https://docs.gitlab.com/api/merge_requests/#merge-to-default-merge-ref-path)

docs/adr/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ An Architecture Decision Record (ADR) documents an important architecture decisi
2828
| [ADR-016](ADR-016-two-tier-cache.md) | Two-Tier Cache Architecture (Memory + Redis) | Accepted |
2929
| [ADR-017](ADR-017-factory-pattern.md) | Factory Pattern | Accepted |
3030
| [ADR-018](ADR-018-event-driven-communication.md) | Event-Driven Communication Pattern | Accepted |
31+
| [ADR-019](ADR-019-merge-queue-acceleration.md) | Merge Queue Acceleration via Optimistic Multi-Merge | Proposed |
3132

3233
## ADR Categories
3334

@@ -41,6 +42,7 @@ An Architecture Decision Record (ADR) documents an important architecture decisi
4142
### Execution Patterns
4243

4344
- [ADR-003](ADR-003-async-only-api-with-blocking-get.md) - Async-only API with blocking GET pattern
45+
- [ADR-019](ADR-019-merge-queue-acceleration.md) - Optimistic non-overlapping multi-merge for gitlab-housekeeping
4446

4547
### Cross-Cutting Concerns
4648

0 commit comments

Comments
 (0)