Commit df13d57

Harden MATLAB parity retries and add parity runbook
1 parent 2ee1c6a commit df13d57

6 files changed: 240 additions, 2 deletions

.github/workflows/matlab-parity-gate.yml

Lines changed: 3 additions & 0 deletions
@@ -25,8 +25,11 @@ jobs:
       ACTIONS_RUNNER_SVC: "1"
       NSTAT_MATLAB_EXTRA_ARGS: -maca64 -nodisplay -noFigureWindows -softwareopengl
       NSTAT_FORCE_M_HELP_SCRIPTS: "1"
+      NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS: "2"
       NSTAT_PARITY_RETRY_TIMEOUT_BLOCKS: "1"
       NSTAT_PARITY_TIMEOUT_RETRY_BLOCKS: timeout_front
+      NSTAT_PARITY_RETRY_RECOVERABLE_BLOCKS: "1"
+      NSTAT_PARITY_RECOVERABLE_RETRY_BLOCKS: graphics_mid,heavy_tail
     steps:
       - name: Prepare runner directories
         run: |
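
Note on the two new workflow variables: they gate block-level retries in `run_parity_ladder.sh`, where `block_recoverable_retry_enabled` checks a flag and a comma-separated block list. The following is a minimal Python sketch of that membership check, not the script's code; the function name `recoverable_retry_enabled` is hypothetical, and the values are copied from the workflow env in this commit.

```python
def recoverable_retry_enabled(block: str, flag: str, blocks_csv: str) -> bool:
    """Return True only when the flag is exactly "1" and `block` is listed."""
    if flag != "1":
        return False
    # Commas act as separators, similar to ${VAR//,/ } plus shell word splitting.
    return block in blocks_csv.replace(",", " ").split()


# Values taken from the workflow env in this commit.
FLAG = "1"
BLOCKS = "graphics_mid,heavy_tail"

for name in ("core_smoke", "timeout_front", "graphics_mid", "heavy_tail"):
    print(name, recoverable_retry_enabled(name, FLAG, BLOCKS))
```

With these values only `graphics_mid` and `heavy_tail` qualify for a recoverable-failure retry; `timeout_front` is still covered by the separate timeout-only retry list.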

python/README.md

Lines changed: 4 additions & 0 deletions
@@ -86,8 +86,12 @@ Notes:
   `python/tools/run_parity_ladder.sh core_smoke timeout_front`.
 - Ladder writes retry telemetry to `python/reports/parity_retry_summary.json` (block, attempt count, retry reason, timeout-topic list).
 - Retry behavior is controlled by `NSTAT_PARITY_RETRY_TIMEOUT_BLOCKS` and `NSTAT_PARITY_TIMEOUT_RETRY_BLOCKS`.
+- Set `NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS=2` to retry per-topic MATLAB timeouts/crashes once before failing.
+- Set `NSTAT_PARITY_RETRY_RECOVERABLE_BLOCKS=1` and `NSTAT_PARITY_RECOVERABLE_RETRY_BLOCKS` to retry block failures caused by recoverable MATLAB failures (timeouts/crash signatures).
 - Preflight topic selection can be overridden with `NSTAT_PARITY_PREFLIGHT_STAGEB_TOPICS`.
 
+See `python/docs/parity_runbook.rst` for the exact locally validated parity command set.
+
 Use targeted blocks to debug delays locally before running remote CI:
 
 ```bash

python/docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ Standalone Python port of nSTAT with MATLAB-help topic coverage and executable n
 
    api
    help_topics
+   parity_runbook

python/docs/parity_runbook.rst

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+Parity Runbook
+==============
+
+Use this runbook for reproducible local/CI MATLAB parity checks on this machine.
+
+Validated environment
+---------------------
+
+- MATLAB binary: ``/Applications/MATLAB_R2025b.app/bin/matlab``
+- MATLAB args: ``-maca64 -nodisplay -noFigureWindows -softwareopengl``
+- Runner service mode: ``ACTIONS_RUNNER_SVC=1``
+- Force ``.m`` help scripts: ``NSTAT_FORCE_M_HELP_SCRIPTS=1``
+- Per-topic MATLAB retries: ``NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS=2``
+- Ladder timeout-only retry: ``NSTAT_PARITY_RETRY_TIMEOUT_BLOCKS=1``
+- Ladder recoverable retry: ``NSTAT_PARITY_RETRY_RECOVERABLE_BLOCKS=1``
+
+Preflight + staged parity
+-------------------------
+
+Run Stage A + selected Stage B preflight:
+
+.. code-block:: bash
+
+   ACTIONS_RUNNER_SVC=1 \
+   NSTAT_FORCE_M_HELP_SCRIPTS=1 \
+   NSTAT_MATLAB_EXTRA_ARGS='-maca64 -nodisplay -noFigureWindows -softwareopengl' \
+   NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS=2 \
+   python/tools/run_parity_preflight.sh
+
+Run the staged ladder (core -> timeout -> graphics -> heavy-tail):
+
+.. code-block:: bash
+
+   ACTIONS_RUNNER_SVC=1 \
+   NSTAT_FORCE_M_HELP_SCRIPTS=1 \
+   NSTAT_MATLAB_EXTRA_ARGS='-maca64 -nodisplay -noFigureWindows -softwareopengl' \
+   NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS=2 \
+   NSTAT_PARITY_RETRY_TIMEOUT_BLOCKS=1 \
+   NSTAT_PARITY_TIMEOUT_RETRY_BLOCKS=timeout_front \
+   NSTAT_PARITY_RETRY_RECOVERABLE_BLOCKS=1 \
+   NSTAT_PARITY_RECOVERABLE_RETRY_BLOCKS='graphics_mid,heavy_tail' \
+   python/tools/run_parity_ladder.sh core_smoke timeout_front graphics_mid heavy_tail
+
+Full Stage C gate
+-----------------
+
+Run full-suite parity gate report:
+
+.. code-block:: bash
+
+   ACTIONS_RUNNER_SVC=1 \
+   NSTAT_FORCE_M_HELP_SCRIPTS=1 \
+   NSTAT_MATLAB_EXTRA_ARGS='-maca64 -nodisplay -noFigureWindows -softwareopengl' \
+   NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS=2 \
+   python3 python/tools/verify_python_vs_matlab_similarity.py \
+     --enforce-gate \
+     --matlab-max-attempts 2 \
+     --report-path python/reports/python_vs_matlab_similarity_report.json
+
+Useful outputs
+--------------
+
+- Full report: ``python/reports/python_vs_matlab_similarity_report.json``
+- Ladder retry telemetry: ``python/reports/parity_retry_summary.json``
+- Block reports: ``python/reports/parity_block_*.json``
+- Summary helper:
+
+.. code-block:: bash
+
+   python3 python/tools/summarize_parity_report.py \
+     python/reports/python_vs_matlab_similarity_report.json --json
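
Note on the retry telemetry listed under "Useful outputs": the header fields of `parity_retry_summary.json` can be inspected with a few lines of Python. This is a rough sketch only. The field names are the ones written by `init_retry_summary` in this commit; the helper name `describe_retry_summary` is hypothetical, and the per-event schema is not visible in this diff, so events are only counted.

```python
import json
import tempfile
from pathlib import Path


def describe_retry_summary(path: Path) -> dict:
    """Summarize the retry configuration recorded in the ladder telemetry file."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    timeout_on = bool(payload.get("retry_timeout_blocks_enabled"))
    recoverable_on = bool(payload.get("retry_recoverable_blocks_enabled"))
    return {
        "generated_at_utc": payload.get("generated_at_utc", ""),
        # A block list only matters when its feature flag is enabled.
        "timeout_retry_blocks": payload.get("timeout_retry_blocks", []) if timeout_on else [],
        "recoverable_retry_blocks": payload.get("recoverable_retry_blocks", []) if recoverable_on else [],
        # Event entries are treated as opaque because their schema is not shown here.
        "event_count": len(payload.get("events", [])),
    }


if __name__ == "__main__":
    # Illustrative payload; the timestamp and block lists are made-up sample values.
    sample = {
        "generated_at_utc": "2025-01-01T00:00:00Z",
        "retry_timeout_blocks_enabled": True,
        "timeout_retry_blocks": ["timeout_front"],
        "retry_recoverable_blocks_enabled": True,
        "recoverable_retry_blocks": ["graphics_mid", "heavy_tail"],
        "events": [],
    }
    with tempfile.TemporaryDirectory() as tmp:
        summary_path = Path(tmp) / "parity_retry_summary.json"
        summary_path.write_text(json.dumps(sample), encoding="utf-8")
        print(describe_retry_summary(summary_path))
```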

python/tools/run_parity_ladder.sh

Lines changed: 68 additions & 2 deletions
@@ -10,6 +10,8 @@ SET_ACTIONS_RUNNER_SVC="${NSTAT_SET_ACTIONS_RUNNER_SVC:-1}"
 RUNTIME_MULTIPLIER="${NSTAT_PARITY_RUNTIME_MULTIPLIER:-2.5}"
 RETRY_TIMEOUT_BLOCKS="${NSTAT_PARITY_RETRY_TIMEOUT_BLOCKS:-0}"
 TIMEOUT_RETRY_BLOCKS="${NSTAT_PARITY_TIMEOUT_RETRY_BLOCKS:-timeout_front}"
+RETRY_RECOVERABLE_BLOCKS="${NSTAT_PARITY_RETRY_RECOVERABLE_BLOCKS:-1}"
+RECOVERABLE_RETRY_BLOCKS="${NSTAT_PARITY_RECOVERABLE_RETRY_BLOCKS:-graphics_mid,heavy_tail,full_suite}"
 RETRY_SUMMARY_PATH="${NSTAT_PARITY_RETRY_SUMMARY_PATH:-python/reports/parity_retry_summary.json}"
 
 DEFAULT_BLOCKS=(core_smoke timeout_front graphics_mid heavy_tail full_suite)
@@ -40,6 +42,16 @@ block_retry_enabled() {
   return 1
 }
 
+block_recoverable_retry_enabled() {
+  local block="$1"
+  [[ "${RETRY_RECOVERABLE_BLOCKS}" == "1" ]] || return 1
+  local token
+  for token in ${RECOVERABLE_RETRY_BLOCKS//,/ }; do
+    [[ "${token}" == "${block}" ]] && return 0
+  done
+  return 1
+}
+
 is_timeout_only_regression() {
   local report_path="$1"
   "${PYTHON_BIN}" - "${report_path}" <<'PY'
@@ -108,8 +120,52 @@ raise SystemExit(0)
 PY
 }
 
+retryable_failure_topics_csv() {
+  local report_path="$1"
+  "${PYTHON_BIN}" - "${report_path}" <<'PY'
+import json
+import sys
+from pathlib import Path
+
+path = Path(sys.argv[1])
+if not path.exists():
+    raise SystemExit(1)
+payload = json.loads(path.read_text(encoding="utf-8"))
+rows = payload.get("helpfile_similarity", {}).get("rows", [])
+if not rows:
+    raise SystemExit(1)
+failed = [r for r in rows if not bool(r.get("matlab_ok"))]
+if not failed:
+    raise SystemExit(1)
+
+markers = (
+    "matlab_timeout",
+    "matlab is exiting because of fatal error",
+    "fatal error",
+    "mathworkscrashreporter",
+    "crash report has been saved",
+    "libmwhandle_graphics",
+)
+
+def retryable(err: str) -> bool:
+    e = (err or "").strip().lower()
+    if e == "matlab_timeout":
+        return True
+    return any(m in e for m in markers)
+
+if not all(retryable(str(r.get("matlab_error", ""))) for r in failed):
+    raise SystemExit(1)
+
+topics = [str(r.get("topic", "")).strip() for r in failed if str(r.get("topic", "")).strip()]
+if not topics:
+    raise SystemExit(1)
+print(",".join(topics))
+raise SystemExit(0)
+PY
+}
+
 init_retry_summary() {
-  "${PYTHON_BIN}" - "${RETRY_SUMMARY_ABS}" "${RETRY_TIMEOUT_BLOCKS}" "${TIMEOUT_RETRY_BLOCKS}" <<'PY'
+  "${PYTHON_BIN}" - "${RETRY_SUMMARY_ABS}" "${RETRY_TIMEOUT_BLOCKS}" "${TIMEOUT_RETRY_BLOCKS}" "${RETRY_RECOVERABLE_BLOCKS}" "${RECOVERABLE_RETRY_BLOCKS}" <<'PY'
 import json
 import sys
 from datetime import datetime, timezone
@@ -121,6 +177,8 @@ payload = {
     "generated_at_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
     "retry_timeout_blocks_enabled": sys.argv[2] == "1",
     "timeout_retry_blocks": [b for b in sys.argv[3].replace(",", " ").split() if b],
+    "retry_recoverable_blocks_enabled": sys.argv[4] == "1",
+    "recoverable_retry_blocks": [b for b in sys.argv[5].replace(",", " ").split() if b],
     "events": [],
 }
 path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
@@ -175,6 +233,7 @@ echo "[ladder] matlab args: ${MATLAB_EXTRA_ARGS}"
 echo "[ladder] blocks: ${BLOCKS[*]}"
 echo "[ladder] runtime multiplier: ${RUNTIME_MULTIPLIER} (<=0 disables runtime regression checks)"
 echo "[ladder] retry timeout-only blocks: ${RETRY_TIMEOUT_BLOCKS} (blocks: ${TIMEOUT_RETRY_BLOCKS})"
+echo "[ladder] retry recoverable-failure blocks: ${RETRY_RECOVERABLE_BLOCKS} (blocks: ${RECOVERABLE_RETRY_BLOCKS})"
 echo "[ladder] retry summary path: ${RETRY_SUMMARY_PATH}"
 
 for block in "${BLOCKS[@]}"; do
@@ -186,7 +245,7 @@ for block in "${BLOCKS[@]}"; do
   echo "[ladder] running block: ${block}"
   report_path="${REPO_ROOT}/python/reports/parity_block_${block}.json"
   max_attempts=1
-  if block_retry_enabled "${block}"; then
+  if block_retry_enabled "${block}" || block_recoverable_retry_enabled "${block}"; then
     max_attempts=2
   fi
   attempt=1
@@ -295,6 +354,13 @@ PY
       attempt=$((attempt + 1))
       continue
     fi
+    if [[ "${rc}" -eq 10 ]] && [[ "${attempt}" -lt "${max_attempts}" ]] && retry_topics_csv="$(retryable_failure_topics_csv "${report_path}")"; then
+      echo "[ladder] retrying block ${block} after recoverable MATLAB failures (attempt ${attempt}/${max_attempts}); topics=${retry_topics_csv}"
+      append_retry_summary_event "retry_scheduled" "${block}" "${attempt}" "${max_attempts}" "retry" "${rc}" "recoverable_matlab_failures" "${retry_topics_csv}"
+      warmup_matlab
+      attempt=$((attempt + 1))
+      continue
+    fi
     reason="block_failure"
     if [[ "${rc}" -eq 10 ]]; then
       reason="regression_gate_failure"
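
Note on `retryable_failure_topics_csv`: the block-level retry fires only when every failed row in the block report looks recoverable. The sketch below restates that all-or-nothing rule as a standalone Python function so it can be exercised without a report file. The function name `retryable_topics` is hypothetical; the marker strings and the row keys (`matlab_ok`, `matlab_error`, `topic`) are copied from the heredoc in this commit.

```python
from __future__ import annotations

MARKERS = (
    "matlab_timeout",
    "matlab is exiting because of fatal error",
    "fatal error",
    "mathworkscrashreporter",
    "crash report has been saved",
    "libmwhandle_graphics",
)


def retryable_topics(rows: list[dict]) -> list[str] | None:
    """Return failed topic names if every failure is recoverable, else None."""
    failed = [r for r in rows if not r.get("matlab_ok")]
    if not failed:
        return None  # nothing failed, so there is nothing to retry
    for row in failed:
        err = str(row.get("matlab_error", "")).strip().lower()
        if not any(marker in err for marker in MARKERS):
            return None  # one non-recoverable failure blocks the retry
    topics = [str(r.get("topic", "")).strip() for r in failed]
    topics = [t for t in topics if t]
    return topics or None


rows = [
    {"topic": "TopicA", "matlab_ok": True},
    {"topic": "TopicB", "matlab_ok": False, "matlab_error": "matlab_timeout"},
]
print(retryable_topics(rows))
```

In the shell script the same decision is communicated through the exit status (0 with a CSV on stdout, 1 otherwise), which is what lets it sit directly inside the `if` condition.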

python/tools/verify_python_vs_matlab_similarity.py

Lines changed: 93 additions & 0 deletions
@@ -66,6 +66,17 @@
     "nSTATPaperExamples",
 }
 DEFAULT_HELP_TOPIC_TIMEOUT_S = 120
+try:
+    DEFAULT_MATLAB_MAX_ATTEMPTS = max(1, int(os.environ.get("NSTAT_MATLAB_TOPIC_MAX_ATTEMPTS", "1")))
+except ValueError:
+    DEFAULT_MATLAB_MAX_ATTEMPTS = 1
+CRASH_ERROR_MARKERS = (
+    "matlab is exiting because of fatal error",
+    "fatal error",
+    "mathworkscrashreporter",
+    "crash report has been saved",
+    "libmwhandle_graphics",
+)
 DEFAULT_TOPIC_TIMEOUT_OVERRIDES: dict[str, int] = {
     "SignalObjExamples": 180,
     "CovariateExamples": 180,
@@ -128,6 +139,32 @@ def _cleanup_runner_matlab_processes() -> None:
     time.sleep(0.5)
 
 
+def _matlab_warmup(timeout_s: int = 90) -> None:
+    if not MATLAB_BIN.exists():
+        return
+    try:
+        _run_matlab_batch_logged("disp(version); exit", timeout_s=timeout_s)
+    except Exception:
+        return
+
+
+def _is_retryable_matlab_failure(payload: dict[str, Any]) -> bool:
+    if bool(payload.get("ok")):
+        return False
+    error = str(payload.get("error", "")).strip()
+    if error == "matlab_timeout":
+        return True
+    combined = " ".join(
+        [
+            error,
+            str(payload.get("error_report", "")),
+            str(payload.get("fallback_error", "")),
+            str(payload.get("fallback_error_report", "")),
+        ]
+    ).lower()
+    return any(marker in combined for marker in CRASH_ERROR_MARKERS)
+
+
 def _kill_process_group(pid: int) -> None:
     try:
         os.killpg(pid, signal.SIGKILL)
@@ -494,6 +531,10 @@ def run_script_path(path: Path, timeout: int, source_label: str | None = None) -
         "end; exit(0);"
     )
 
+    # In runner-service mode, enforce a clean MATLAB process slate before each topic.
+    if _runner_service_mode():
+        _cleanup_runner_matlab_processes()
+
     t0 = time.time()
     try:
         run = _run_matlab_batch_logged(cmd, timeout)
@@ -615,6 +656,7 @@ def _help_similarity(
     topics: list[tuple[str, str]],
     default_timeout_s: int = DEFAULT_HELP_TOPIC_TIMEOUT_S,
     topic_timeout_overrides: dict[str, int] | None = None,
+    matlab_max_attempts: int = DEFAULT_MATLAB_MAX_ATTEMPTS,
 ) -> dict[str, Any]:
     rows: list[dict[str, Any]] = []
 
@@ -649,7 +691,41 @@ def _help_similarity(
 
         py = _run_python_topic(stem)
         timeout_s = topic_timeouts.get(stem, default_timeout_s)
+        ml_attempt_history: list[dict[str, Any]] = []
         ml = _run_matlab_help_script(script_rel, timeout_s=timeout_s)
+        ml_attempt_history.append(
+            {
+                "attempt": 1,
+                "ok": bool(ml.get("ok")),
+                "error": str(ml.get("error", "")),
+                "runtime_s": float(ml.get("runtime_s") or 0.0),
+                "script_used": str(ml.get("script_used", script_rel)),
+            }
+        )
+        attempt = 1
+        while (
+            attempt < matlab_max_attempts
+            and not bool(ml.get("ok"))
+            and _is_retryable_matlab_failure(ml)
+        ):
+            next_attempt = attempt + 1
+            print(
+                f"[help retry {next_attempt}/{matlab_max_attempts}] {stem} "
+                f"after retryable MATLAB failure: {ml.get('error', '')}",
+                flush=True,
+            )
+            _matlab_warmup()
+            ml = _run_matlab_help_script(script_rel, timeout_s=timeout_s)
+            ml_attempt_history.append(
+                {
+                    "attempt": next_attempt,
+                    "ok": bool(ml.get("ok")),
+                    "error": str(ml.get("error", "")),
+                    "runtime_s": float(ml.get("runtime_s") or 0.0),
+                    "script_used": str(ml.get("script_used", script_rel)),
+                }
+            )
+            attempt = next_attempt
 
         if py.get("ok"):
             summary["python_ok"] += 1
@@ -697,6 +773,9 @@ def _help_similarity(
                 "matlab_fallback_script_used": ml.get("fallback_script_used", ""),
                 "matlab_runtime_s": ml.get("runtime_s"),
                 "matlab_timeout_s": timeout_s,
+                "matlab_attempts": len(ml_attempt_history),
+                "matlab_retry_applied": len(ml_attempt_history) > 1,
+                "matlab_attempt_history": ml_attempt_history,
                 "matlab_timeout_snapshot_before_cleanup": ml.get("timeout_process_snapshot_before_cleanup", ""),
                 "matlab_timeout_snapshot_after_cleanup": ml.get("timeout_process_snapshot_after_cleanup", ""),
                 "matlab_runner_service_cleanup": bool(ml.get("runner_service_cleanup", False)),
@@ -878,6 +957,15 @@ def _parse_args(argv: list[str] | None = None) -> argparse.Namespace:
         default=[],
         help="Override per-topic MATLAB timeout using TOPIC=SECONDS (repeatable).",
     )
+    parser.add_argument(
+        "--matlab-max-attempts",
+        type=int,
+        default=DEFAULT_MATLAB_MAX_ATTEMPTS,
+        help=(
+            "Maximum MATLAB attempts per help topic for retryable failures "
+            f"(default: {DEFAULT_MATLAB_MAX_ATTEMPTS})."
+        ),
+    )
     parser.add_argument(
         "--report-path",
         default="python/reports/python_vs_matlab_similarity_report.json",
@@ -892,6 +980,9 @@ def main(argv: list[str] | None = None) -> int:
     if args.default_topic_timeout <= 0:
         print("--default-topic-timeout must be positive", file=sys.stderr)
         return 2
+    if args.matlab_max_attempts <= 0:
+        print("--matlab-max-attempts must be positive", file=sys.stderr)
+        return 2
     try:
         requested_topics = _parse_topics_arg(args.topics)
         topics = _resolve_topics(requested_topics)
@@ -910,6 +1001,7 @@ def main(argv: list[str] | None = None) -> int:
         "default_timeout_s": args.default_topic_timeout,
         "topic_timeout_overrides": topic_timeout_overrides,
         "force_m_help_scripts": FORCE_M_HELP_SCRIPTS,
+        "matlab_max_attempts": args.matlab_max_attempts,
     }
 
     print("[class] running Python/MATLAB class checks", flush=True)
@@ -934,6 +1026,7 @@ def main(argv: list[str] | None = None) -> int:
         topics=topics,
         default_timeout_s=args.default_topic_timeout,
         topic_timeout_overrides=topic_timeout_overrides,
+        matlab_max_attempts=args.matlab_max_attempts,
     )
     contract_topics = None if full_suite else set(selected_topic_stems)
     report["parity_contract"] = _evaluate_parity_contract(report["helpfile_similarity"]["rows"], topics_filter=contract_topics)
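
Note on the per-topic retry in `_help_similarity`: the loop runs MATLAB once, then repeats only while attempts remain and the failure is classified as retryable, recording each attempt. The sketch below isolates that control flow so it can be run without MATLAB. It is an illustration under assumptions, not the script's code: `run_with_retries`, `run_once` and `before_retry` are hypothetical names, `before_retry` stands in for `_matlab_warmup`, and the classifier is reduced to the `error` field rather than the four fields the real `_is_retryable_matlab_failure` combines.

```python
from __future__ import annotations

from typing import Any, Callable

CRASH_ERROR_MARKERS = (
    "matlab is exiting because of fatal error",
    "fatal error",
    "mathworkscrashreporter",
    "crash report has been saved",
    "libmwhandle_graphics",
)


def is_retryable(payload: dict[str, Any]) -> bool:
    """Simplified classifier: timeouts and crash signatures are retryable."""
    if payload.get("ok"):
        return False
    error = str(payload.get("error", "")).strip()
    if error == "matlab_timeout":
        return True
    return any(marker in error.lower() for marker in CRASH_ERROR_MARKERS)


def run_with_retries(
    run_once: Callable[[], dict[str, Any]],
    max_attempts: int,
    before_retry: Callable[[], None] = lambda: None,
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Run once, then retry retryable failures up to max_attempts in total."""
    result = run_once()
    history = [{"attempt": 1, "ok": bool(result.get("ok")), "error": str(result.get("error", ""))}]
    attempt = 1
    while attempt < max_attempts and is_retryable(result):
        before_retry()  # stand-in for the MATLAB warm-up between attempts
        attempt += 1
        result = run_once()
        history.append(
            {"attempt": attempt, "ok": bool(result.get("ok")), "error": str(result.get("error", ""))}
        )
    return result, history


# Stub runner: first call times out, second call succeeds.
outcomes = iter([{"ok": False, "error": "matlab_timeout"}, {"ok": True}])
final, attempts = run_with_retries(lambda: next(outcomes), max_attempts=2)
print(final["ok"], len(attempts))
```

A non-retryable error (for example an ordinary MATLAB script error) ends the loop after the first attempt, which matches the commit's intent of retrying only timeouts and crash signatures.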
