| 1 | +# Add pytest-xdist Support to Flaky Detection |
| 2 | + |
| 3 | +**Linear:** MRGFY-6296 |
| 4 | +**Status:** Approved |
| 5 | +**Date:** 2026-03-19 |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +The flaky detection system does not support `pytest-xdist`: |
| 10 | + |
| 11 | +1. `flaky_detector._test_metrics` lives in in-process memory, but xdist runs tests in separate worker processes, so metrics recorded on a worker never reach the controller.
| 12 | +2. `pytest_collection_finish` does not run on the controller under xdist. |
| 13 | + |
| 14 | +## Decision Summary |
| 15 | + |
| 16 | +- **Approach:** Controller-orchestrated with pre-computed per-test deadlines. |
| 17 | +- **IPC:** xdist built-in `workerinput`/`workeroutput`. |
| 18 | +- **Budget model:** Global budget, static per-test allocation under xdist. Dynamic deadlines preserved for non-xdist. |
| 19 | +- **Scheduling:** Target `load` (default) mode. Other modes should not crash. Under `each` mode (every test runs on every worker), flaky detection is disabled to avoid duplicated budgets. |
| 20 | + |
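The scheduling rule can be expressed as a small predicate over the distribution mode. This is a minimal sketch, assuming the plugin reads the mode as a string (xdist stores the `--dist` value under the `dist` option); the function name `flaky_detection_enabled` is hypothetical.

```python
def flaky_detection_enabled(dist_mode: str) -> bool:
    """Return False only for `each`, where every test runs on every worker."""
    # Under `each`, N workers would each spend the full per-test budget,
    # so the global budget would be consumed N times over.
    return dist_mode != "each"


for mode in ("no", "load", "loadscope", "each"):
    print(f"{mode}: {flaky_detection_enabled(mode)}")
```

Modes other than `load` and `each` stay enabled here only because the decision is that they "should not crash"; whether their budgets behave well is outside this sketch.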
| 21 | +## Architecture |
| 22 | + |
| 23 | +``` |
| 24 | +Controller Workers (gw0, gw1, ...) |
| 25 | +──────────────────────────────── ──────────────────────────────── |
| 26 | +fetch flaky context from API |
| 27 | + │ |
| 28 | + ├─── workerinput ──────────► receive context as plain dict |
| 29 | + │ build FlakyDetector (no API call) |
| 30 | + │ collect tests (same list) |
| 31 | + │ compute budget (same result) |
| 32 | + │ run tests + reruns |
| 33 | + │ ◄── workeroutput ───────────┤ |
| 34 | +aggregate metrics |
| 35 | +print terminal summary |
| 36 | +``` |
| 37 | + |
| 38 | +All workers collect the same full test list (xdist verifies this). Budget computation is deterministic, so each worker independently arrives at the same global budget and per-test allocation. No mid-run coordination is needed.
| 39 | + |
| 40 | +## Controller Responsibilities |
| 41 | + |
| 42 | +### 1. Fetch context and distribute (`pytest_configure_node`) |
| 43 | + |
| 44 | +- Fetch `_FlakyDetectionContext` from the API **once** (cache it).
| 45 | +- Serialize as plain dict into `node.workerinput["flaky_detection_context"]`. |
| 46 | +- Also set `node.workerinput["flaky_detection_mode"]`. |
| 47 | + |
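The fetch-once-and-distribute step could look roughly like the sketch below. `pytest_configure_node` and `node.workerinput` are xdist's; everything else (`ControllerPlugin`, the abridged `Context` stand-in for `_FlakyDetectionContext`, and the convention that a failed fetch returns `None`) is an assumption made for illustration.

```python
from dataclasses import asdict, dataclass, field
from types import SimpleNamespace
from typing import Callable, Optional


@dataclass
class Context:
    """Abridged stand-in for _FlakyDetectionContext (hypothetical)."""

    budget_ratio_for_new_tests: float
    existing_test_names: list = field(default_factory=list)


class ControllerPlugin:
    """Hypothetical controller-side plugin object."""

    def __init__(self, fetch: Callable[[], Optional[Context]], mode: str) -> None:
        self._fetch = fetch
        self._mode = mode
        self._fetched = False
        self._context_dict: Optional[dict] = None

    def _context(self) -> Optional[dict]:
        # Fetch at most once, even though the hook fires once per worker.
        if not self._fetched:
            self._fetched = True
            context = self._fetch()
            self._context_dict = asdict(context) if context is not None else None
        return self._context_dict

    def pytest_configure_node(self, node) -> None:
        context = self._context()
        if context is None:
            return  # fetch failed: workers receive no context and skip detection
        # Only plain builtins go into workerinput, since xdist serializes it.
        node.workerinput["flaky_detection_context"] = context
        node.workerinput["flaky_detection_mode"] = self._mode


# Demo with fake nodes; real nodes are created by xdist.
calls = []


def fake_fetch() -> Context:
    calls.append(1)
    return Context(0.1, ["tests/test_foo.py::test_bar"])


plugin = ControllerPlugin(fake_fetch, mode="new")
nodes = [SimpleNamespace(workerinput={}) for _ in range(2)]
for node in nodes:
    plugin.pytest_configure_node(node)
print(len(calls), nodes[1].workerinput["flaky_detection_mode"])
```

In a real plugin the hook would likely need `@pytest.hookimpl(optionalhook=True)` so that pytest accepts it when xdist is not installed.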
| 48 | +### 2. Collect worker metrics (`pytest_testnodedown`) |
| 49 | + |
| 50 | +- Read `node.workeroutput["flaky_detection_metrics"]`. |
| 51 | +- Merge into controller-side aggregated metrics dict. |
| 52 | +- Workers run distinct tests under `load` scheduling, so no overlap. |
| 53 | + |
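A sketch of the merge step follows. `pytest_testnodedown(node, error)` is the xdist hook; `MetricsAggregator` is a hypothetical name. Two assumptions are baked in: a crashed worker may lack `workeroutput` entirely, and `over_length_tests` may repeat across workers because every worker collects the full list, so it is deduplicated.

```python
from types import SimpleNamespace


class MetricsAggregator:
    """Hypothetical controller-side holder for merged worker metrics."""

    def __init__(self) -> None:
        self.test_metrics: dict = {}
        self.over_length_tests: list = []
        self.debug_logs: list = []

    def pytest_testnodedown(self, node, error) -> None:
        # A crashed worker may never have sent its output.
        output = getattr(node, "workeroutput", None) or {}
        metrics = output.get("flaky_detection_metrics")
        if metrics is None:
            return
        # Under `load`, each test runs on one worker, so keys do not collide.
        self.test_metrics.update(metrics.get("test_metrics", {}))
        for name in metrics.get("over_length_tests", []):
            if name not in self.over_length_tests:
                self.over_length_tests.append(name)
        self.debug_logs.extend(metrics.get("debug_logs", []))


def fake_node(test_name: str) -> SimpleNamespace:
    payload = {
        "test_metrics": {test_name: {"rerun_count": 3}},
        "over_length_tests": ["tests/test_long.py::test_name_too_long"],
        "debug_logs": [],
    }
    return SimpleNamespace(workeroutput={"flaky_detection_metrics": payload})


aggregator = MetricsAggregator()
aggregator.pytest_testnodedown(fake_node("tests/test_a.py::test_a"), error=None)
aggregator.pytest_testnodedown(fake_node("tests/test_b.py::test_b"), error=None)
aggregator.pytest_testnodedown(SimpleNamespace(), error=None)  # crashed worker
print(sorted(aggregator.test_metrics))
```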
| 54 | +### 3. Terminal summary (`pytest_terminal_summary`) |
| 55 | + |
| 56 | +- Build the report from aggregated metrics using the same format as today.
| 57 | + |
| 58 | +## Worker Responsibilities |
| 59 | + |
| 60 | +### 1. Initialization |
| 61 | + |
| 62 | +- Read `config.workerinput["flaky_detection_context"]` if present. |
| 63 | +- Construct `FlakyDetector` via new `from_context()` classmethod (skips API call). |
| 64 | + |
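Reading the context on the worker side might look like this. The presence of `config.workerinput` is how xdist workers are usually told apart from the controller; the helper name `read_worker_context` is hypothetical.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def read_worker_context(config) -> Optional[Tuple[dict, str]]:
    """Return (context, mode) on a worker that was sent a context, else None."""
    # Only xdist workers carry `workerinput`; the controller and plain runs do not.
    workerinput = getattr(config, "workerinput", None)
    if workerinput is None:
        return None
    context = workerinput.get("flaky_detection_context")
    mode = workerinput.get("flaky_detection_mode")
    if context is None or mode is None:
        return None  # controller had nothing to send: skip flaky detection
    return context, mode


worker = SimpleNamespace(
    workerinput={
        "flaky_detection_context": {"max_test_execution_count": 5},
        "flaky_detection_mode": "new",
    }
)
print(read_worker_context(worker))
print(read_worker_context(SimpleNamespace()))  # not a worker
```

The returned pair is what `from_context()` would receive.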
| 65 | +### 2. Session preparation (`pytest_collection_finish`) |
| 66 | + |
| 67 | +- Call `prepare_for_session(session)` as today. |
| 68 | + |
| 69 | +### 3. Test execution (`pytest_runtest_protocol`) |
| 70 | + |
| 71 | +- Identical to current logic: initial run, set deadline, rerun loop. |
| 72 | +- `set_test_deadline` uses static allocation: `total_budget / global_num_tests_to_process` where the denominator is the **global** count of tests to process (computed from the full collection, not from the worker's assigned subset). Workers don't know upfront which tests they'll run (xdist dispatches dynamically), but the per-test budget is the same regardless. |
| 73 | + |
| 74 | +### 4. Metrics export (`pytest_sessionfinish`) |
| 75 | + |
| 76 | +- Serialize `_test_metrics`, `_over_length_tests`, `_debug_logs` into `config.workeroutput["flaky_detection_metrics"]`. |
| 77 | + |
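The export step could be sketched as below, assuming per-test metrics are held in a dataclass (here the hypothetical `WorkerTestMetrics`, with fields taken from the `workeroutput` schema) and that `_over_length_tests` is a set or list of names.

```python
from dataclasses import asdict, dataclass
from types import SimpleNamespace


@dataclass
class WorkerTestMetrics:
    """Hypothetical per-test record; field names follow the workeroutput schema."""

    rerun_count: int = 0
    total_duration_ms: float = 0.0
    initial_setup_duration_ms: float = 0.0
    initial_call_duration_ms: float = 0.0
    initial_teardown_duration_ms: float = 0.0
    prevented_timeout: bool = False


def export_metrics(config, test_metrics, over_length_tests, debug_logs) -> None:
    """Store plain builtins only, since xdist serializes `workeroutput`."""
    if not hasattr(config, "workeroutput"):
        return  # not an xdist worker: nothing to export
    config.workeroutput["flaky_detection_metrics"] = {
        "test_metrics": {name: asdict(m) for name, m in test_metrics.items()},
        "over_length_tests": sorted(over_length_tests),
        "debug_logs": list(debug_logs),
    }


config = SimpleNamespace(workeroutput={})
export_metrics(
    config,
    {"tests/test_foo.py::test_bar": WorkerTestMetrics(rerun_count=2, total_duration_ms=31.5)},
    over_length_tests=set(),
    debug_logs=[],
)
print(config.workeroutput["flaky_detection_metrics"]["test_metrics"])
```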
| 78 | +## Data Flow |
| 79 | + |
| 80 | +### workerinput (controller -> worker) |
| 81 | + |
| 82 | +```python |
| 83 | +node.workerinput["flaky_detection_context"] = { |
| 84 | + "budget_ratio_for_new_tests": float, |
| 85 | + "budget_ratio_for_unhealthy_tests": float, |
| 86 | + "existing_test_names": list[str], |
| 87 | + "existing_tests_mean_duration_ms": int, |
| 88 | + "unhealthy_test_names": list[str], |
| 89 | + "max_test_execution_count": int, |
| 90 | + "max_test_name_length": int, |
| 91 | + "min_budget_duration_ms": int, |
| 92 | + "min_test_execution_count": int, |
| 93 | +} |
| 94 | +node.workerinput["flaky_detection_mode"] = str  # either "new" or "unhealthy"
| 95 | +``` |
| 96 | + |
| 97 | +### workeroutput (worker -> controller) |
| 98 | + |
| 99 | +```python |
| 100 | +config.workeroutput["flaky_detection_metrics"] = { |
| 101 | + "test_metrics": { |
| 102 | + "tests/test_foo.py::test_bar": { |
| 103 | + "rerun_count": int, |
| 104 | + "total_duration_ms": float, |
| 105 | + "initial_setup_duration_ms": float, |
| 106 | + "initial_call_duration_ms": float, |
| 107 | + "initial_teardown_duration_ms": float, |
| 108 | + "prevented_timeout": bool, |
| 109 | + }, |
| 110 | + }, |
| 111 | + "over_length_tests": list[str], |
| 112 | + "debug_logs": list[dict], |
| 113 | +} |
| 114 | +``` |
| 115 | + |
| 116 | +The three initial duration sub-fields are needed because `make_report` uses `initial_duration` (their sum) and `is_test_too_slow` compares it against remaining time. Serializing them separately preserves full fidelity. |
| 117 | + |
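As a small worked example, the combined initial duration can be recovered from one serialized entry by summing the three sub-fields. The helper name `initial_duration_ms` is hypothetical and the numbers are made up.

```python
def initial_duration_ms(metrics: dict) -> float:
    """Sum the setup, call, and teardown durations of the initial run."""
    return (
        metrics["initial_setup_duration_ms"]
        + metrics["initial_call_duration_ms"]
        + metrics["initial_teardown_duration_ms"]
    )


entry = {
    "rerun_count": 4,
    "total_duration_ms": 640.0,
    "initial_setup_duration_ms": 5.0,
    "initial_call_duration_ms": 120.0,
    "initial_teardown_duration_ms": 2.5,
    "prevented_timeout": False,
}
print(initial_duration_ms(entry))  # 127.5
```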
| 118 | +## FlakyDetector Changes |
| 119 | + |
| 120 | +### New classmethod |
| 121 | + |
| 122 | +`FlakyDetector.from_context(context_dict, mode)` is a `@classmethod` that constructs a `FlakyDetector` from a serialized context dict, skipping `_fetch_context()`. It sets `token`, `url`, and `full_repository_name` to empty strings (the dataclass fields remain required, but these values are unused on workers). The `_context` field is populated directly from the dict. |
| 123 | + |
| 124 | +On the controller side, `FlakyDetector` is **not** instantiated. The controller only holds the raw context dict (for `workerinput`) and aggregated metrics (from `workeroutput`). The report is generated via `make_report_from_aggregated`, which is a standalone function that operates on plain dicts. |
| 125 | + |
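One way `from_context()` could skip the fetch is sketched below on an abridged stand-in class. It assumes the regular constructor triggers `_fetch_context()` from `__post_init__`, and it keeps `_context` as a plain dict, whereas the real class presumably rebuilds a `_FlakyDetectionContext`. Bypassing `__init__` via `cls.__new__` is only one possible design.

```python
from dataclasses import dataclass, field


@dataclass
class FlakyDetector:
    """Abridged stand-in for illustration, not the real class."""

    token: str
    url: str
    full_repository_name: str
    mode: str
    _context: dict = field(init=False)

    def __post_init__(self) -> None:
        self._context = self._fetch_context()

    def _fetch_context(self) -> dict:
        raise RuntimeError("network call; must not run on workers")

    @classmethod
    def from_context(cls, context_dict: dict, mode: str) -> "FlakyDetector":
        # Bypass __init__/__post_init__ so _fetch_context() is never called.
        detector = cls.__new__(cls)
        detector.token = ""
        detector.url = ""
        detector.full_repository_name = ""
        detector.mode = mode
        detector._context = dict(context_dict)
        return detector


detector = FlakyDetector.from_context({"max_test_execution_count": 5}, "new")
print(detector.mode, detector._context)
```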
| 126 | +### Deadline computation |
| 127 | + |
| 128 | +- **Non-xdist (unchanged):** Dynamic `remaining_budget / remaining_tests`. |
| 129 | +- **xdist:** Static `total_budget / global_num_tests_to_process`, using the global count from the full collection.
| 130 | + |
| 131 | +Branch via a single `if` in `set_test_deadline`. |
| 132 | + |
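The two branches can be sketched as a pure function. The name `per_test_budget` and its parameters are hypothetical, and the `max(..., 1)` guard for an empty run is an added assumption covering the "no tests to process" edge case.

```python
from datetime import timedelta


def per_test_budget(
    *,
    is_xdist: bool,
    total_budget: timedelta,
    num_tests_to_process: int,
    remaining_budget: timedelta,
    remaining_tests: int,
) -> timedelta:
    """Return the time a single test may spend on reruns."""
    if is_xdist:
        # Static: every worker derives the same share from the full collection.
        return total_budget / max(num_tests_to_process, 1)
    # Dynamic: budget left unspent by earlier tests flows to later ones.
    return remaining_budget / max(remaining_tests, 1)


static = per_test_budget(
    is_xdist=True,
    total_budget=timedelta(seconds=60),
    num_tests_to_process=4,
    remaining_budget=timedelta(seconds=50),
    remaining_tests=2,
)
dynamic = per_test_budget(
    is_xdist=False,
    total_budget=timedelta(seconds=60),
    num_tests_to_process=4,
    remaining_budget=timedelta(seconds=50),
    remaining_tests=2,
)
print(static.total_seconds(), dynamic.total_seconds())
```

The static branch gives up the redistribution of unspent budget, which is the price of avoiding mid-run coordination between workers.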
| 133 | +### Report from aggregated data |
| 134 | + |
| 135 | +`make_report_from_aggregated(context, mode, metrics, over_length_tests, debug_logs)` runs on the controller from deserialized worker data. |
| 136 | + |
| 137 | +## Error Handling |
| 138 | + |
| 139 | +- **Worker crash:** `workeroutput` may be missing. The controller skips that worker's data and shows a partial report.
| 140 | +- **Context fetch fails:** No context is sent to workers, so workers skip flaky detection. Same as today.
| 141 | +- **No context in workerinput:** Worker skips flaky detection gracefully. |
| 142 | + |
| 143 | +## Testing Strategy |
| 144 | + |
| 145 | +### Unit tests |
| 146 | + |
| 147 | +- `from_context()` construction from plain dict. |
| 148 | +- Static deadline computation. |
| 149 | +- `make_report_from_aggregated()` output from deserialized metrics. |
| 150 | + |
| 151 | +### Integration tests |
| 152 | + |
| 153 | +- `pytester` with `-n 2`: end-to-end flaky detection under xdist. |
| 154 | +- Metrics aggregation across workers (check terminal summary). |
| 155 | +- Budget respected across workers. |
| 156 | + |
| 157 | +### Edge cases |
| 158 | + |
| 159 | +- Single worker (`-n 1`). |
| 160 | +- Worker crash: partial report, no controller crash. |
| 161 | +- No tests to process. |
| 162 | +- xdist not installed: no import errors. |
| 163 | + |
| 164 | +### Regression |
| 165 | + |
| 166 | +All existing non-xdist tests must keep passing unchanged. |