Mercury CLI Architecture (1.0.0-beta.1 Pre-Release Scope, Rust-First Repair + Scoped Experimental TypeScript Support)
This document describes the current runtime and trust boundaries for Mercury CLI in the 1.0.0-beta.1 pre-release branch scope. Repair quality is centered on Rust, TypeScript support is experimental and scoped to selected verifier paths, and the current observability and hardening surfaces are covered as well.
Mercury CLI is not a generic autonomous coding shell. The implemented product wedge is a Rust direct cargo verifier repair beta:
- start from a failing direct allowlisted verifier command, with Rust as the primary repair-quality target and selected experimental TypeScript commands supported in scoped `fix` and CI flows
- attempt bounded repair with Mercury models
- verify locally in isolation before acceptance
- emit a reviewable evidence bundle
- optionally open or update a draft PR through GitHub Actions
mercury-cli fix "<goal>" runs a repair loop with planning, candidate generation, verifier execution, and artifact emission under .mercury/runs/<run-id>/.
mercury-cli watch "<command>" --repair auto-repair is intentionally limited to direct Rust verifier commands:
- `cargo test ...`
- `cargo check ...`
- `cargo clippy ...`
- optional env-prefix variants that still resolve directly to those commands
Composed shell commands (&&, pipes, redirection, shell wrappers like make test or just test) are rejected at watch command parsing time and do not start a watch cycle.
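The acceptance rule above can be sketched as a small predicate. This is a minimal illustration, not Mercury's actual parser: the function names, the list of shell metacharacters, and the `NAME=value` reading of "env-prefix" are all assumptions made for the example.

```rust
/// Characters treated as shell composition (chaining, pipes, redirection,
/// substitution). Hypothetical list chosen for this sketch.
const SHELL_META: &[char] = &['&', '|', ';', '<', '>', '`', '$', '(', ')'];

/// Cargo subcommands the document lists as repairable in watch mode.
const ALLOWED_SUBCOMMANDS: &[&str] = &["test", "check", "clippy"];

/// Returns true if `token` looks like a `NAME=value` environment prefix
/// (assumed form of the "env-prefix variants").
fn is_env_assignment(token: &str) -> bool {
    match token.split_once('=') {
        Some((name, _)) => {
            !name.is_empty()
                && !name.starts_with(|c: char| c.is_ascii_digit())
                && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

/// Accepts `cargo test|check|clippy ...`, optionally preceded by env
/// assignments, and rejects anything containing shell composition.
pub fn is_direct_rust_verifier(command: &str) -> bool {
    if command.contains(SHELL_META) {
        return false;
    }
    // Skip leading NAME=value tokens, then require `cargo <allowed>`.
    let mut tokens = command
        .split_whitespace()
        .skip_while(|t| is_env_assignment(t));
    matches!(
        (tokens.next(), tokens.next()),
        (Some("cargo"), Some(sub)) if ALLOWED_SUBCOMMANDS.contains(&sub)
    )
}
```

The sketch is deliberately conservative: it works on raw text, so a quoted argument that happens to contain `|` or `$` is rejected too. Wrappers such as `make test` fail the `cargo` check rather than the metacharacter check.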
mercury-cli status --live provides candidate-level runtime observability with configurable refresh (--interval-ms, minimum 250).
- In a TTY it renders the existing heatmap/agent/budget dashboard and appends a rolling live-event pane for candidate launches, status changes, phase activation, and runtime-state updates.
- When stdout is piped, it emits the same redacted runtime feed as JSONL so CI and local tooling can inspect candidate/phase/runtime events without scraping the terminal dashboard.
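The two output modes and the interval floor can be sketched as follows. The type and function names are hypothetical, and the sketch assumes that an interval below the minimum is raised to 250 ms; the document only states the minimum, so the real CLI may reject such values instead.

```rust
use std::io::IsTerminal;
use std::time::Duration;

/// Minimum refresh interval stated for `--interval-ms`.
const MIN_INTERVAL_MS: u64 = 250;

/// Which surface `status --live` writes to (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum LiveOutput {
    /// Heatmap/agent/budget dashboard plus rolling event pane.
    Dashboard,
    /// One redacted JSON event per line.
    Jsonl,
}

/// Raises too-small intervals to the minimum (assumed clamping behavior).
pub fn refresh_interval(requested_ms: u64) -> Duration {
    Duration::from_millis(requested_ms.max(MIN_INTERVAL_MS))
}

/// Pure selection logic, kept separate so it can be tested without a TTY.
pub fn select_output(stdout_is_tty: bool) -> LiveOutput {
    if stdout_is_tty {
        LiveOutput::Dashboard
    } else {
        LiveOutput::Jsonl
    }
}

/// Detects the mode from the real stdout handle.
pub fn detect_output() -> LiveOutput {
    select_output(std::io::stdout().is_terminal())
}
```

Keeping the TTY probe in `detect_output` and the decision in `select_output` means the mode switch can be unit-tested without a terminal.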
For CI-safe logs, fix and watch also support --noninteractive, and the CI workflow invokes fix --noninteractive.
.github/workflows/repair.yml (Mercury CI Auto-Repair Draft PR) performs:
- checkout and build `mercury-cli`
- isolated baseline failure reproduction in a detached git worktree
- Mercury repair attempt (`fix`)
- post-repair verifier rerun
- evidence bundle validation and upload
- draft PR creation/update only when repair is verified, `dry_run != true`, and the workflow can push to the same repository with required permissions; verified reruns targeting the same base ref and failure command reuse the same repair branch/PR head
- final workflow status is blocking for orchestration failures (`baseline_not_reproduced`, `missing_api_key`, `setup_failed`, `internal_error`) even though artifacts are still uploaded
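The last two workflow rules can be sketched as two small predicates. The function names and the idea of passing the failure reason as a string are assumptions for illustration; only the four reason labels and the three draft-PR conditions come from the list above.

```rust
/// Failure reasons the document lists as blocking for the final workflow status.
const ORCHESTRATION_FAILURES: &[&str] = &[
    "baseline_not_reproduced",
    "missing_api_key",
    "setup_failed",
    "internal_error",
];

/// True when the reason should fail the workflow, even though artifacts
/// are still uploaded.
pub fn is_blocking_failure(reason: &str) -> bool {
    ORCHESTRATION_FAILURES.contains(&reason)
}

/// Draft PR creation/update requires a verified repair, a non-dry run,
/// and push access to the same repository.
pub fn should_open_draft_pr(repair_verified: bool, dry_run: bool, can_push: bool) -> bool {
    repair_verified && !dry_run && can_push
}
```

In this sketch any reason outside the four listed labels is treated as non-blocking; the document does not say how other outcomes affect the final status.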
The 1.0.0-beta.1 pre-release safety boundary is workflow-first and evidence-first.
Repair attempts run in disposable repo-copy/worktree isolation (.mercury/worktrees/ locally, detached worktree in CI workflow). This is not a container/VM sandbox claim.
Rejected candidates are discarded with their isolated repo copy/worktree. Accepted candidates are copied back after verification gates succeed.
No patch is considered CI draft-PR eligible unless all are true:
- baseline failure reproduced
- run metadata indicates final bundle verification
- repair marked applied
- post-repair verifier exit is zero
- non-empty non-`.mercury` diff exists
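The five conditions can be read as one conjunction. The sketch below is illustrative only: the struct, its field names, and the way the diff is inspected (scanning `diff --git` headers for paths outside `.mercury/`) are assumptions, not the workflow's actual implementation.

```rust
/// Evidence flags gathered from a run (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct RepairEvidence {
    pub baseline_reproduced: bool,
    pub final_bundle_verified: bool,
    pub repair_applied: bool,
    pub post_repair_exit_code: i32,
}

/// True if a unified git diff touches at least one path outside `.mercury/`.
/// Reads only the `diff --git a/<path> b/<path>` header lines; paths that
/// contain spaces are not handled by this sketch.
pub fn has_non_mercury_changes(diff: &str) -> bool {
    diff.lines()
        .filter_map(|line| line.strip_prefix("diff --git a/"))
        .filter_map(|rest| rest.split_whitespace().next())
        .any(|path| !path.starts_with(".mercury/"))
}

/// All five gates must hold before a patch is draft-PR eligible.
pub fn draft_pr_eligible(evidence: &RepairEvidence, diff: &str) -> bool {
    evidence.baseline_reproduced
        && evidence.final_bundle_verified
        && evidence.repair_applied
        && evidence.post_repair_exit_code == 0
        && has_non_mercury_changes(diff)
}
```

An empty diff, or one that only changes files under `.mercury/`, fails the last gate even when every other flag is set.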
Runs are expected to emit inspectable evidence for replay and audit for the execution path that actually ran.
Every run bundle now includes audit.log with JSONL event records (run start, plan readiness, execution result, completion, and watch-cycle milestones).
Runtime output written into artifacts is redacted for known API-key markers and configured API-key env names.
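A minimal sketch of value-based redaction follows, assuming the configured env names are resolved to their values and those values are masked wherever they appear. The function names and the `[REDACTED]` placeholder are hypothetical, and marker-based detection of key formats is not shown.

```rust
/// Collects the non-empty values of the given environment variable names.
pub fn secrets_from_env(names: &[&str]) -> Vec<String> {
    names
        .iter()
        .filter_map(|name| std::env::var(*name).ok())
        .filter(|value| !value.is_empty())
        .collect()
}

/// Replaces every occurrence of each secret with a fixed placeholder.
pub fn redact(text: &str, secrets: &[String]) -> String {
    // Skip empty strings: replacing "" would insert the placeholder
    // between every character.
    let mut ordered: Vec<&str> = secrets
        .iter()
        .map(String::as_str)
        .filter(|s| !s.is_empty())
        .collect();
    // Mask longer secrets first so a secret containing another is fully hidden.
    ordered.sort_by_key(|s| std::cmp::Reverse(s.len()));

    let mut out = text.to_string();
    for secret in ordered {
        out = out.replace(secret, "[REDACTED]");
    }
    out
}
```

A caller would build the secret list once with `secrets_from_env` and pass every log line through `redact` before writing it into the bundle.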
The workflow validates a minimum artifact contract before summary publishing:
- `summary.md`
- `decision.json`
- `environment.json`
- `pr-body.md`
- `repair.diff`
- `repair.diffstat.txt`
- `logs/baseline.stdout.log`
- `logs/baseline.stderr.log`
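Checking this contract amounts to confirming that each listed path exists inside the bundle directory. The sketch below is one hedged way to express that; the function name and the idea of returning the missing paths are assumptions, and the real workflow may also validate file contents.

```rust
use std::path::Path;

/// Minimum artifact contract listed in the document, relative to the bundle root.
const REQUIRED_ARTIFACTS: &[&str] = &[
    "summary.md",
    "decision.json",
    "environment.json",
    "pr-body.md",
    "repair.diff",
    "repair.diffstat.txt",
    "logs/baseline.stdout.log",
    "logs/baseline.stderr.log",
];

/// Returns the required artifacts that are not present as regular files.
/// An empty result means the minimum contract is satisfied.
pub fn missing_artifacts(bundle_dir: &Path) -> Vec<&'static str> {
    REQUIRED_ARTIFACTS
        .iter()
        .copied()
        .filter(|rel| !bundle_dir.join(rel).is_file())
        .collect()
}
```

Returning the list of missing paths, rather than a boolean, lets the caller report exactly which artifact broke the contract.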
When repair executes, bundle logs also include repair and post-repair verifier outputs; setup/init logs are included when those steps run.
If a nested Mercury run directory is available, it is copied into mercury-run/ inside the uploaded bundle.
- Workflow decision/environment payloads are JSON with stable keys used by docs/tests.
- Eval harnesses (`evals/v0` for Rust and `evals/v1_typescript` for TypeScript lane) are manifest-driven and emit schema/version metadata in reports.
- Rust repository analysis is parser-backed; the current TypeScript lane relies on token-aware repository scanning plus failure parsing rather than a real TypeScript parser.
- Planner critique text remains advisory prose and should not be treated as a strict machine contract.
- Verifier allowlist enforces direct Rust cargo verifier commands and selected direct experimental TypeScript verifier invocations by default (including supported env-prefix forms).
- Shell composition in verifier commands is blocked unless `MERCURY_ALLOW_UNSAFE_VERIFIER_COMMANDS=1` is set explicitly.
- Noninteractive mode is available for CI-oriented output surfaces.
- End-to-end `fix` and CI repair targeting support allowlisted Rust direct verifier commands plus scoped selected experimental TypeScript direct verifier commands; local `watch --repair` remains Rust-only.
- Local `watch --repair` remains Rust-only.
- `--max-agents` materially affects phased runtime dispatch and isolated candidate fanout. The repo now publishes scoped Rust benchmark speedup data under `docs/benchmarks/`, but it still does not claim broad overlapping-edit convergence from that setting.
- TypeScript support is intentionally scoped and experimental: selected direct verifier commands are supported in `fix`/CI flows, while watch-based auto-repair and broader command classes are still limited; the repo does not ship a real TypeScript parser, so this is not parity with the Rust repair surface.
- Live observability now exposes candidate/phase/runtime events, but it is still not a full conflict-telemetry or merge-decision explanation surface.
- CI automation is draft-PR oriented, not autonomous merge.
- Public benchmark reporting now exists under `docs/benchmarks/` for the selected Rust corpus, emitted by the dedicated repair benchmark workflow and publisher. Those checked-in numbers are bounded to the exact run ids and corpus selection documented there, with published repair outcome, tier, verifier-class, candidate-lineage, failure-attribution, and execution-diagnostics slices for that lane.
- TypeScript harness fixtures currently validate deterministic expected-red script outputs; this is useful for corpus/reporter contract checks but not a replacement for full benchmark-backed repair reporting.
Reproducible operator flows are documented in:
- `docs/case-studies/local-red-to-green.md`
- `docs/case-studies/ci-draft-pr-flow.md`
Treat those files as the primary runbooks. This architecture document describes invariants and boundaries they rely on.