diff --git a/rfcs/0007-e2e-qa-lab-scorecard-consolidation.md b/rfcs/0007-e2e-qa-lab-scorecard-consolidation.md
new file mode 100644
index 0000000..599a8b1
--- /dev/null
+++ b/rfcs/0007-e2e-qa-lab-scorecard-consolidation.md
@@ -0,0 +1,331 @@
+---
+title: E2E and QA Lab scorecard consolidation
+authors:
+  - Dallin Romney
+created: 2026-06-07
+last_updated: 2026-06-10
+status: accepted
+issue: https://github.com/openclaw/openclaw/issues/91883
+rfc_pr: https://github.com/openclaw/rfcs/pull/10
+---
+
+# Proposal: E2E and QA Lab scorecard consolidation
+
+## Summary
+
+OpenClaw already has the pieces for serious e2e coverage: Vitest e2e shards, QA Lab scenarios, Docker lanes, live transport checks, and a few sibling repos that have explored release-style proof. This RFC makes QA Lab the place where product-level e2e evidence is collected: normal CI gets a small deterministic suite with mock services and mock model responses, release CI runs the same scenario classes against real providers, real channel credentials, package installs, and upgrade paths, and maturity scorecards consume the resulting summaries.
+
+## Motivation
+
+The maturity scorecard process defines product surfaces and category-level maturity. Its rendered scorecard says promotion toward Stable needs release gates, troubleshooting paths, repeated real-world proof, and scenario proof across expected environments. OpenClaw now has a checked-in maturity taxonomy snapshot in `taxonomy.yaml` and a score snapshot in `docs/maturity-scores.yaml`. This RFC adds the executable layer on top of that taxonomy: `taxonomy-mappings.yaml`, coverage IDs, profile membership, evidence summaries, CI gates, and release artifacts.
+
+OpenClaw already has many valuable non-unit checks:
+
+- `pnpm test:e2e` runs Vitest e2e shards across `test/**/*.e2e.test.ts`, `src/**/*.e2e.test.ts`, `packages/**/*.e2e.test.ts`, selected Gateway integration tests, bundled-plugin e2e globs, and Control UI e2e tests.
+- `pnpm test:live` runs the live Vitest shard for source, test, and bundled-plugin live tests. It covers provider, agent runtime, gateway, media, plugin, and native/live integration paths that QA Lab does not own directly.
+- `qa/scenarios/**` contains YAML-backed user scenarios with coverage IDs, docs refs, code refs, config patches, runtime parity metadata, and executable `qa-flow` blocks.
+- `extensions/qa-lab` can start isolated QA Lab runs, a synthetic `qa-channel`, mock providers, live provider modes, runtime-pair parity, live transport lanes for Telegram, Discord, Slack, and WhatsApp, Mantis before/after live verification, and local channel runs backed by the `openclaw/crabline` messaging SDK.
+- `extensions/qa-matrix` is a dedicated live transport runner for Matrix. It provisions a disposable Tuwunel homeserver in Docker, runs the real Matrix plugin in a child Gateway, and has release-gate lanes for transport, media, and E2EE coverage.
+- `scripts/e2e/**`, Docker lanes, Parallels lanes, and release scripts exercise install, package, upgrade, plugin, live model, MCP, cross-OS, and release journeys.
+- `test/scripts/**` covers script-level release tooling, Docker plans, package checks, perf/RTT helpers, live proof helpers, and CI guards. Some of those tests are product-flow candidates; many are still script/tooling tests.
+- PTY/TUI tests cover a fake-backend terminal loop and an opt-in local TUI smoke, but the scoped TUI guide explicitly says the fake-backend PTY lane does not prove Gateway transport, embedded backend runtime, providers, session persistence, or live streaming.
+- Sibling repos add useful concepts: Kova has release-shaped scenarios and performance attribution; plugin-inspector has offline plugin compatibility and synthetic contract probes; Crabpot has a many-plugin fixture corpus; Kitchen Sink is the credential-free plugin fixture; openclaw-rtt stores normalized timing history; Crabbox provides remote runner capacity; and openclaw/releases keeps release evidence.
+
+Today there is no single summary entry that says: this product surface was exercised like a user would use it, on this ref, with this model provider, channel or UI surface, runner substrate, and live/mock upstream state, and the result counts for this scorecard category. CI can pass while a maturity category has no realistic user-flow proof. It can also cover a plugin contract without covering the visible workflow that depends on it.
+
+## Goals
+
+- Map scorecard surfaces to executable coverage IDs and publish run summaries that CI can enforce.
+- Provide a harness-neutral QA wrapper so CI can request evidence by surface and category instead of knowing whether the proof comes from Vitest e2e, QA Lab scenarios, live transport lanes, Docker, the `openclaw/crabline` channel SDK, or release scripts.
+- Keep normal PR CI deterministic and credential-free while still exercising realistic user flows.
+- Run release gates against real providers, real channel credentials, package installs, upgrade paths, and supported platforms.
+- Keep scenario behavior portable across model providers, channels, runner substrates, and live/mock upstreams.
+- Move script-level checks only when a clearer e2e, QA Lab, integration, release-helper, or tooling-test home exists.
+
+## Non-Goals
+
+- This RFC does not add every missing test.
+- Historical RTT storage, broad external plugin fixtures, and durable release ledgers stay in their existing repos.
+- Normal PR CI does not require live credentials.
+- Unit tests and low-level integration tests remain useful, but they do not count as maturity evidence unless they exercise a user-visible path.
+
+## Proposal
+
+### 1. Standardize the evidence artifact
+
+Add a scorecard-aware e2e evidence model in `openclaw/openclaw`. QA Lab does not have to run every check. Its summary format covers the checks that matter for release confidence: e2e shards, Docker lanes, live transport checks, release jobs, and scorecard reports.
+
+Each run writes a summary JSON artifact into its QA artifact directory: the directory passed with `--output-dir`, or the command default under `.artifacts/qa-e2e/-`. `qa suite` writes `qa-suite-summary.json`; local `openclaw/crabline` channel runs write the same summary shape; live transport lanes write lane-specific summaries such as `telegram-qa-summary.json`, `discord-qa-summary.json`, `slack-qa-summary.json`, `whatsapp-qa-summary.json`, and `matrix-qa-summary.json`. Release CI uploads those files as workflow artifacts first, then can mirror or import the same summaries into `openclaw/releases` once the schema settles.
+
+Each summary includes one scenario entry per executable user-flow scenario with:
+
+- scenario id
+- coverage IDs
+- scorecard surface and category IDs
+- profile: `smoke-ci` or `release`
+- model provider, model live mode, and provider fixture or auth mode, for example `provider: openai`, `model_live: false`, `provider_fixture: tool-call-streaming`
+- channel or surface, channel live mode, channel SDK/driver, and runner substrate, for example `channel: telegram`, `channel_live: false`, `channel_driver: crabline`, `runner: host`
+- OpenClaw ref, package spec, OS, Node version, runner, and artifact paths
+- pass/fail/blocked status plus failure reason
+- optional timing fields: p50/p95 when repeated, single-run RTT otherwise
+
+Evidence summaries do not copy taxonomy provenance.
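The scenario-entry fields listed above could be sketched as a TypeScript type with a small consistency check. This is an illustration only, not the landed schema: `ScenarioEntry`, `validateEntry`, and most property names are hypothetical (only `provider`, `model_live`, `provider_fixture`, `channel`, `channel_live`, `channel_driver`, and `runner` appear in the text), the ref/OS/timing fields are omitted for brevity, and the three validation rules are assumptions about how the fields might relate.

```typescript
// Hypothetical sketch of one scenario entry in a QA summary artifact.
// Property names other than provider/model_live/provider_fixture/channel/
// channel_live/channel_driver/runner are illustrative assumptions.
type Profile = "smoke-ci" | "release";
type Status = "pass" | "fail" | "blocked";

interface ScenarioEntry {
  scenario_id: string;
  coverage_ids: string[];
  surface_id: string;
  category_id: string;
  profile: Profile;
  provider: string;
  model_live: boolean;
  provider_fixture?: string;
  channel: string;
  channel_live: boolean;
  channel_driver: string;
  runner: string;
  status: Status;
  failure_reason?: string;
}

// Returns a list of problems; an empty list means the entry looks usable.
// The rules below are assumed for illustration, not taken from the RFC.
function validateEntry(e: ScenarioEntry): string[] {
  const problems: string[] = [];
  if (e.coverage_ids.length === 0) problems.push("no coverage IDs");
  if (e.status !== "pass" && !e.failure_reason) {
    problems.push("non-passing entry needs a failure reason");
  }
  if (e.profile === "smoke-ci" && (e.model_live || e.channel_live)) {
    problems.push("smoke-ci must not use live upstreams");
  }
  return problems;
}

const example: ScenarioEntry = {
  scenario_id: "group-visible-reply-tool",
  coverage_ids: ["channels.telegram.mock"],
  surface_id: "channels.telegram",
  category_id: "channels.telegram.mock",
  profile: "smoke-ci",
  provider: "openai",
  model_live: false,
  provider_fixture: "tool-call-streaming",
  channel: "telegram",
  channel_live: false,
  channel_driver: "crabline",
  runner: "host",
  status: "pass",
};

console.log(validateEntry(example));
```

A check of this kind would let a summary writer reject inconsistent rows before CI tries to join them to scorecard categories.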
`taxonomy-mappings.yaml` sits next to `taxonomy.yaml` and records the taxonomy source/version metadata, executable category IDs, profile membership, coverage IDs, docs refs, code refs, scenario refs, and release/advisory status used to join summary entries back to the maturity taxonomy.
+
+The profile is a named selector layered onto the taxonomy mapping that maps scorecard surfaces and categories to runnable lanes. `smoke-ci` is the deterministic PR and merge-gate profile: model providers use mock fixtures, channels use synthetic or local SDK-backed upstreams, and no live external service is required. `release` is the full Stable/LTS profile: provider, channel, package, upgrade, and platform claims need live proof where the claim depends on a real upstream or release artifact. Focused local runs can still filter a profile down to one surface or category. CI should read that profile mapping from `taxonomy-mappings.yaml` rather than carrying a second list in workflow YAML. Non-blocking evidence can appear as advisory rows in reports, but it is not a separate test profile until maintainers explicitly promote it.
+
+The existing QA scenario `coverage.primary` and `coverage.secondary` fields are the seed for this model. Start by extending `openclaw qa coverage` and the existing `qa-suite-summary.json` shape; then normalize live transport summaries to the same schema and join that output to `taxonomy.yaml` through `taxonomy-mappings.yaml`. The landed taxonomy is the product/maturity source of truth; RFC 0007's follow-up work should add executable mappings and profiles without creating a competing scorecard taxonomy.
+
+[0007/example-scorecard-checklist.md](0007/example-scorecard-checklist.md) shows a complete illustrative Stable/LTS checklist and evidence mapping.
It is not the final taxonomy, but it shows the expected implementation shape: every release-blocking checklist item has a category ID, evidence requirement, and machine-readable mapping to one or more runnable lanes. The authoritative requirement-to-test mapping should land in `openclaw/openclaw` as checked-in mapping data; this RFC's example remains a shape reference.
+
+Maturity test docs should also live in `openclaw/openclaw` alongside `taxonomy.yaml`, `taxonomy-mappings.yaml`, and runnable coverage IDs. OpenClaw CI should not depend on external process notes to understand what a release-blocking requirement means, how it maps to code paths, or how to rerun/troubleshoot the proof.
+
+### 2. Add a harness-neutral QA wrapper
+
+Keep the harness-specific commands for local debugging and focused work, but add a wrapper that lets CI and maintainers request evidence by product surface and scorecard category:
+
+```sh
+pnpm openclaw qa run \
+  --profile smoke-ci \
+  --surface channels.telegram \
+  --category channels.telegram.mock \
+  --provider-mode mock-openai \
+  --transport telegram \
+  --output-dir .artifacts/qa-e2e/channels-telegram-smoke-ci
+```
+
+The wrapper can dispatch to the right implementation: Vitest e2e shard, QA Lab scenario pack, live transport command, Matrix runner, Docker/package lane, `openclaw/crabline` channel SDK lane, Control UI browser run, TUI lane, or release helper. The caller should not need to know that `runtime.gateway.startup` is currently a Vitest shard while `channels.telegram.mock` is an SDK-backed QA Lab run.
+
+The wrapper should still write the same summary artifacts described above. A profile can map to many surface/category pairs, a category can map to multiple lanes, and the wrapper can either run all required lanes or fail early with a clear "missing runnable mapping" error. `--surface` and `--category` are filters over the selected profile, not a separate source of truth.
+
+### 3. Split the suites by when they run
+
+#### `smoke-ci` profile in normal CI
+
+The `smoke-ci` profile runs on every PR or normal merge gate. It is deterministic and credential-free. It still needs to hit user paths, not only package checks or internal contracts.
+
+Required components:
+
+- Gateway e2e Vitest shard for protocol, auth, sessions, and core runtime.
+- Control UI e2e shard for browser/Gateway workflows with mocked Gateway and selective real WebSocket flows.
+- QA Lab scenario subset through `qa-channel` and mock model providers.
+- Runtime-pair parity for the standard profile, currently OpenClaw vs Codex where the scenario declares parity.
+- `openclaw/crabline`-backed messaging-channel smoke lane with mock AI responses. Run it on every PR, even while coverage is still thin, so regressions force the lane to mature instead of leaving channel e2e as occasional release-only work.
+- TUI fake-backend PTY lane plus an explicit local-backend smoke on the platforms where it is stable enough for CI.
+- Plugin fixture smoke for built-in plugins using plugin-inspector concepts and Kitchen Sink conformance mode, but without executing arbitrary external plugin code.
+
+The `smoke-ci` profile needs an explicit configuration path. The current CLI spells deterministic OpenAI-compatible model fixtures as `--provider-mode mock-openai`; the proposed evidence schema records that as `provider: openai` and `model_live: false`.
+
+```sh
+pnpm openclaw qa run \
+  --profile smoke-ci \
+  --provider-mode mock-openai \
+  --transport qa-channel \
+  --output-dir .artifacts/qa-e2e/smoke-ci
+
+pnpm openclaw qa run \
+  --profile smoke-ci \
+  --channel-driver crabline \
+  --provider-mode mock-openai \
+  --transport telegram \
+  --output-dir .artifacts/qa-e2e/smoke-ci-crabline-telegram
+```
+
+#### Broader release-track e2e
+
+Some release-profile coverage can run before a release candidate on scheduled, maintainer-triggered, and changed-surface gates.
These runs can spend more time on deterministic services and broader matrices without creating a third profile:
+
+- Docker install, package, plugin lifecycle, MCP, upgrade, and release-user journey lanes.
+- Kova-like release-shaped runtime scenarios when they test behavior that belongs in `openclaw/openclaw`.
+- Control UI browser flows across at least Chromium and mobile viewport emulation.
+- TUI local-backend smoke and PTY rendering stress tests.
+- More QA Lab YAML scenarios across memory, automation, media, provider, workspace, and plugin surfaces.
+
+#### Release and maturity scorecard suite
+
+Release CI runs the full suite with real model providers, real upstream services, and service credentials:
+
+- real OpenAI, Anthropic, Google, OpenRouter, and selected long-tail providers for provider-path maturity categories
+- real Telegram bots and user-driver flows
+- real Discord, Slack, and WhatsApp canaries
+- real package install/update/upgrade journeys from npm tags
+- Crabbox/Testbox, Docker, Linux, macOS, Windows/WSL2, and platform-specific lanes when the surface claims release support
+- scorecard coverage export for each release candidate
+
+Release CI fails closed for scorecard categories. For the first Stable/LTS gate, anything important enough to appear on the scorecard blocks release; advisory checks can stay outside the scorecard until they are ready to be enforced. Messaging channels need live upstream proof to count as release evidence. That live proof can be added channel by channel, but a local shim alone does not satisfy a Stable/LTS claim for a channel listed as supported.
+
+### 4. Keep scenarios portable across providers, channels, and runners
+
+Make each high-level scenario portable across providers, channels, channel drivers, and runner substrates. The scenario describes the user behavior.
The selected driver supplies the channel upstream behavior, while the runner supplies the execution environment, provider fixture or credentials, and package source.
+
+Model provider dimensions:
+
+- `provider`: `openai`, `anthropic`, `google`, `openrouter`, `local`, or another provider ID.
+- `model_live`: `false` for deterministic provider fixtures and `true` for real provider credentials.
+- `provider_fixture`: the deterministic behavior to use when `model_live` is false, such as OpenAI-compatible streaming, planned tool calls, provider timeout, malformed response, or rate-limit behavior.
+- `package_source`: source checkout, packed tarball, npm tag, or release artifact.
+
+Use "frontier" only as informal shorthand for commercial model providers; OpenAI belongs in that group. The reason current QA Lab commands name `mock-openai` is practical: the OpenAI-compatible path is the default mock server and covers the most common chat/tools/streaming contract. Anthropic, Google, OpenRouter, and other providers should get provider-specific fixtures too, but the schema should encode them as provider IDs plus live/mock state rather than a separate `mock-frontier` mode.
+
+Channel and surface dimensions:
+
+- `channel`: `qa-channel`, `telegram`, `discord`, `slack`, `whatsapp`, `matrix`, or another channel ID.
+- `channel_live`: `false` for synthetic channels or deterministic local upstream shims, and `true` for real upstream credentials.
+- `channel_driver`: `native`, `crabline`, or another channel driver. `crabline` means the `openclaw/crabline` messaging SDK.
+- `surface`: `gateway-rpc`, `control-ui`, `tui-pty`, `cli`, `docker-package`, `package-install`, or another non-channel surface.
+- `runner`: `host`, `docker`, `crabbox`, or a release workflow runner.
+
+This keeps channel identity separate from liveness. A Telegram scenario can run as `channel: telegram, channel_live: false` in PR CI and as `channel: telegram, channel_live: true` in release CI.
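A minimal sketch of that separation, assuming a hypothetical helper `channelRunFor` that derives the channel dimensions from the profile. Treating every `release` run as live is a simplification for illustration; the RFC only requires live proof where a claim depends on a real upstream.

```typescript
// Hypothetical helper: channel identity stays fixed, liveness follows the profile.
type Profile = "smoke-ci" | "release";

interface ChannelRun {
  channel: string;
  channel_live: boolean;
}

function channelRunFor(profile: Profile, channel: string): ChannelRun {
  // Assumption: release runs use the real upstream, smoke-ci runs use a local shim.
  return { channel, channel_live: profile === "release" };
}

const prRun = channelRunFor("smoke-ci", "telegram");
const releaseRun = channelRunFor("release", "telegram");
console.log(prRun, releaseRun);
```

Because the scenario only names `telegram`, the same scenario file can be reused for both runs.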
The same pattern applies to model providers with `provider: openai, model_live: false` versus `provider: openai, model_live: true`.
+
+Suggested env/flag contract:
+
+```text
+OPENCLAW_QA_PROFILE=smoke-ci|release
+OPENCLAW_QA_PROVIDER=openai|anthropic|google|openrouter|local|...
+OPENCLAW_QA_MODEL_LIVE=0|1
+OPENCLAW_QA_PROVIDER_FIXTURE=openai-tools-streaming|timeout|rate-limit|...
+OPENCLAW_QA_CHANNEL=qa-channel|telegram|discord|slack|whatsapp|matrix|...
+OPENCLAW_QA_CHANNEL_LIVE=0|1
+OPENCLAW_QA_CHANNEL_DRIVER=native|crabline|...
+OPENCLAW_QA_SURFACE=gateway-rpc|control-ui|tui-pty|cli|docker-package|...
+OPENCLAW_QA_RUNNER=host|docker|crabbox
+OPENCLAW_QA_SCORECARD=1
+```
+
+Prefer CLI flags for local runs. CI can pass env vars through reusable workflows.
+
+### 5. Use `openclaw/crabline` for deterministic channel behavior
+
+The Crabline work in this RFC refers to the `openclaw/crabline` messaging SDK. It should provide deterministic local upstream shims for channel behavior so OpenClaw can test Telegram, Discord, Slack, WhatsApp, and similar transports without real service credentials in normal CI.
+
+Add a channel conformance runner with this shape:
+
+```text
+qa suite --channel-driver crabline --runner host --transport telegram --profile smoke-ci --output-dir .artifacts/qa-e2e/smoke-ci-crabline-telegram
+  uses a deterministic provider fixture so channel failures are not mixed with live model-provider failures
+  starts the selected channel's local Crabline upstream shim
+  starts OpenClaw Gateway with selected channel plugin enabled
+  injects inbound DM/group/thread/media/action events
+  waits for Gateway/agent/channel reply
+  asserts outbound adapter payload and user-visible transcript
+  writes scenario entries into the summary artifact
+```
+
+First-wave mock channel adapters:
+
+- Telegram: DM, group mention, forum topic/thread, inline button approval, media/location input, reconnect.
+- Discord: DM, guild channel mention, thread, slash/native command callback, media, reaction/action.
+- Slack: DM, channel thread, Socket Mode event, slash command, button approval, file attachment.
+- WhatsApp: DM, group activation, media/voice, native reaction/approval, reconnect.
+
+Second-wave adapters:
+
+- iMessage, Matrix, Google Chat, Microsoft Teams, Signal.
+- Mattermost, LINE, IRC, Nextcloud Talk, Nostr, Twitch, Tlon, Synology Chat.
+- Regional channels after their setup and credential constraints are documented.
+
+Use Kova's channel capability vocabulary as an input, not a second harness: durable final text/media/payload, reply-to/thread behavior, ack after dispatch, native platform actions, retries, source-visible delivery, and no self-trigger. Turn the useful capabilities into OpenClaw QA coverage IDs and channel mock driver assertions.
+
+### 6. Keep hard-coded tests and YAML scenarios
+
+Use hard-coded Vitest e2e when the invariant is low-level, fast, and closer to an API contract than to a user journey. Examples:
+
+- Gateway protocol and WebSocket handshake behavior
+- session history ordering and idempotency
+- provider response normalization
+- plugin loader failure handling
+- security gates and SSRF boundaries
+- deterministic TUI rendering primitives
+
+Use YAML-driven QA Lab scenarios when the behavior is a product workflow:
+
+- "user sends message, agent replies visibly in same channel"
+- "group mention does not leak to unrelated room"
+- "approval button resolves one pending tool call"
+- "provider timeout recovers with visible failure"
+- "cron reminder arrives once"
+- "media input is staged, summarized, and reply media is sent"
+
+The existing `qa/scenarios/channels/group-visible-reply-tool.md` is the right shape: it has coverage IDs, docs/code refs, a config patch, and a flow that asserts both mock-provider tool planning and visible outbound transcript state.
The plan is to expand that pattern across scorecard surfaces and let the wrapper choose mock, live, host, Docker, Crabbox, or `openclaw/crabline` channel-driver paths.
+
+### 7. Turn plugin contract coverage into user-flow coverage
+
+Bring plugin-inspector-style checks into OpenClaw for built-in and bundled plugin confidence. Keep them labeled as contract evidence. They are useful, but they are not the same as a user workflow.
+
+For each built-in plugin:
+
+1. Keep static manifest/SDK/import/contract checks.
+2. Add a Kitchen Sink or plugin-specific deterministic runtime smoke when the plugin exposes a provider/tool/channel/service surface.
+3. Add at least one QA Lab user-flow scenario for the product behavior that depends on the plugin, such as a channel message, provider turn, media tool, telemetry export, or service lifecycle.
+4. Emit one summary entry for the contract check and one for the user-flow check so scorecard Coverage can distinguish them.
+
+External plugin ecosystem coverage stays in Crabpot. OpenClaw release CI can consume Crabpot's summary as an advisory compatibility signal. Core OpenClaw CI avoids cloning or downloading broad external plugin corpora.
+
+### 8. Normalize performance evidence
+
+Reuse the timing approaches from Kova and openclaw-rtt without moving their data ledgers into core.
+
+Bring into `openclaw/openclaw`:
+
+- a normalized RTT/evidence schema for QA Lab and e2e summaries
+- per-sample attempt counts and resource metrics when a lane already measures them
+- direct Gateway RPC and Control UI timing measurement helpers when they prove release regressions
+- p50/p95 aggregation for repeated release checks
+
+Keep separate:
+
+- `openclaw-rtt` dashboard data and historical result rows
+- `openclaw/releases` release evidence ledger
+- Kova's broader OCM-controlled lab reports unless a scenario is promoted into the core OpenClaw suite
+
+### 9. Script disposition plan
+
+`test/scripts` currently contains 262 script tests.
Most stay where they are: they protect scripts, release tooling, package checks, and CI plumbing. A smaller set can move when the replacement has a clearer home and emits the same or stronger evidence.
+
+Use this rule for each file:
+
+- If the test proves a user journey, make it a QA Lab or e2e scenario.
+- If it proves a helper, runner, planner, or report contract, move it beside that code as an integration test.
+- If it proves release machinery, keep it as a script test unless the release lane itself starts emitting equivalent scenario evidence.
+- If it has no explicit package-script reference, do not treat that as removal proof; Vitest discovers `*.test.ts` files.
+
+The initial file-level inventory lives in [0007/script-test-inventory.md](0007/script-test-inventory.md). Keep that list as migration input, not as the RFC contract. The contract is the rule above: move tests only when the new home preserves or improves the old failure signal, and remove old script tests only after the replacement emits equivalent summary evidence.
+
+Pick the first conversion batch by readiness and coverage gap. A useful batch should cover different risk shapes: one Gateway/runtime smoke, one telemetry path, one package or release path, one provider or tool path, and one messaging channel.
+
+### 10. Pull from other repos
+
+| Repo | Decision |
+| --- | --- |
+| `openclaw/kova` | Pull concepts, not the whole harness. Bring release-shaped scenario hierarchy, mock/live auth policy, process role attribution, repeated-sample p50/p95 gates, and channel capability proof vocabulary into OpenClaw e2e planning. Keep Kova as the OCM-backed runtime validation lab and broader report system. |
+| `openclaw/plugin-inspector` | Pull built-in plugin inspection concepts into OpenClaw CI: static manifest/SDK/import checks, runtime capture with mocked SDK, synthetic probes, and stable finding codes. Keep the publishable package separate for external plugin authors and Crabpot. |
+| `openclaw/openclaw-rtt` | Keep as data/dashboard repo. Pull only normalized timing schema and importer expectations into OpenClaw QA summaries. New channel timing sources should originate in OpenClaw and then be imported into `openclaw-rtt`. |
+| `openclaw/crabbox` | No consolidation. Use it as a runner substrate for remote, cross-OS, niche, heavy, and visual proof lanes. |
+| `openclaw/crabpot` | Keep separate. It pins many external plugins and consumes plugin-inspector reports. Pull only fixture categorization, contract probe backlog ideas, and summary shape for optional release advisory gates. |
+| `openclaw/kitchen-sink` | Keep separate as a published example and fixture plugin. Use it in OpenClaw e2e as the canonical credential-free external plugin for conformance, adversarial, and live-provider-routing scenarios. |
+| `openclaw/releases` | Keep separate. Release CI should publish normalized evidence as workflow artifacts first, then mirror or import the same summaries into `openclaw/releases` once the artifact shape settles. |
+
+### 11. Implementation plan
+
+The dependency-oriented PR plan lives in [0007/implementation-plan.md](0007/implementation-plan.md). It splits work into PRs that can start independently and PRs that should stack on earlier schema, runner, or release-artifact work. It also calls out cross-repo work in `openclaw/kova`, `openclaw/openclaw-rtt`, `openclaw/crabpot`, and `openclaw/releases`.
+
+## Rationale
+
+This keeps the repo boundaries straightforward:
+
+- Core product behavior and executable test harnesses stay in `openclaw/openclaw`.
+- Scorecard policy, executable taxonomy, maturity test docs, and CI-owned requirement metadata live in `openclaw/openclaw`.
+- The design decision stays in `openclaw/rfcs`.
+- Timing history and dashboards stay in `openclaw/openclaw-rtt`.
+- Durable release evidence can stay in `openclaw/releases` after CI artifacts establish the schema.
+- Broad external plugin compatibility stays in `openclaw/crabpot`.
+- Remote execution capacity stays in `openclaw/crabbox`.
+
+Normal CI stays affordable. Mock services and `openclaw/crabline` channel shims catch many regressions without credentials. Live release lanes still cover upstream API and real-account behavior before users get a release.
+
+Without this consolidation, the likely path is more isolated Docker scripts, live workflows, and script tests. Those checks can be useful, but the maturity scorecard still has to infer whether the product surface was actually covered.
+
+The machine-readable summaries are necessary for CI, but they should not become the only maintainer interface. A fast-follow report should start from scorecard categories, show blocking/missing/stale/advisory status, link to artifacts and rerun commands, and keep raw scenario rows behind the main view. Otherwise the system risks producing large reports that technically contain the answer but are hard to review.
+
+## Unresolved questions
+
+None at the moment.
diff --git a/rfcs/0007/example-scorecard-checklist.md b/rfcs/0007/example-scorecard-checklist.md
new file mode 100644
index 0000000..9a356f5
--- /dev/null
+++ b/rfcs/0007/example-scorecard-checklist.md
@@ -0,0 +1,121 @@
+# Example Stable/LTS Scorecard Checklist
+
+This is a complete example, not the final OpenClaw Stable/LTS taxonomy. OpenClaw now has the maturity taxonomy snapshot in `taxonomy.yaml`, the executable overlay in `taxonomy-mappings.yaml`, and the score snapshot in `docs/maturity-scores.yaml`; this example shows the RFC 0007 mapping shape on top of that data. Every checklist row has a category ID, a blocking rule, evidence requirements, and at least one example mapping to executable evidence. The real mapping can rename categories, split rows, or add surfaces.
Whatever the final mapping contains, it should preserve the same property: no release-blocking checklist item exists without a machine-readable evidence mapping.
+
+## Example checklist
+
+| ID | Surface | Category | Stable/LTS requirement | Blocks release | Evidence required |
+| --- | --- | --- | --- | --- | --- |
+| runtime.gateway.startup | Runtime and Gateway | Startup and protocol readiness | Gateway starts from source and package builds, exposes protocol endpoints, authenticates clients, and rejects bad handshakes predictably. | yes | Passing host e2e and package/Docker e2e on target ref. |
+| runtime.gateway.restart | Runtime and Gateway | Restart and in-flight recovery | Gateway restart preserves or safely terminates active runs with visible terminal state and no duplicate final reply. | yes | Host e2e plus Docker/package restart lane. |
+| runtime.agent.turns | Runtime and Gateway | Agent turn lifecycle | A normal user turn streams, calls tools when needed, finishes once, and records the final transcript. | yes | Core QA Lab scenario plus runtime e2e shard. |
+| runtime.context.compaction | Runtime and Gateway | Context and compaction recovery | Long sessions compact or recover without replaying unsafe mutating work. | yes | QA Lab long-context scenario plus focused runtime e2e. |
+| runtime.tools.core | Tools | Core tool execution | File, shell, patch, search, and session tools execute through the production tool path with correct result rendering. | yes | QA Lab tool scenarios plus host e2e shard. |
+| runtime.tools.approval | Tools | Approval and denial flows | Approval prompts, denial, retry, and terminal stop paths work through portable actions. | yes | QA Lab approval scenario plus channel action scenario. |
+| runtime.observability.trace | Observability | Trace and run diagnostics | User-visible runs emit trace/run diagnostics with enough IDs to debug release failures without leaking secrets. | yes | QA Lab telemetry scenario plus OTEL smoke. |
+| runtime.performance.budget | Observability | Startup and request latency | Gateway startup, core turn RTT, and channel reply RTT stay inside release budget or produce an explicit waiver. | yes | Repeated RTT lane plus startup benchmark summary. |
+| providers.openai | Model providers | OpenAI provider path | OpenAI auth, catalog, streaming, tool calls, web search, media-adjacent calls, and error handling work with live credentials. | yes | Live provider lane and provider/tool e2e. |
+| providers.anthropic | Model providers | Anthropic provider path | Anthropic auth, catalog, streaming, tool calls, long-context behavior, and error handling work with live credentials. | yes | Live provider lane and runtime parity scenario. |
+| providers.google | Model providers | Google provider path | Google auth, catalog, text turn, tool behavior where supported, and error handling work with live credentials. | yes | Live provider lane. |
+| providers.openrouter | Model providers | OpenRouter provider path | OpenRouter routing, model selection, tool support detection, and error handling work with live credentials. | yes | Live provider lane. |
+| providers.local | Model providers | Local provider path | Local provider setup, catalog, text turn, and failure diagnostics work without cloud credentials. | yes | Host e2e or Docker local-provider lane. |
+| channels.qa | Channels | Synthetic QA channel | The synthetic channel still covers portable channel contract behavior used by normal PR CI. | yes | Core QA Lab `qa-channel` lane. |
+| channels.telegram.mock | Channels | Telegram deterministic channel proof | Telegram adapter handles DM, group mention, thread/topic, approval button, media metadata, reconnect, and outbound transcript with local upstream shims. | yes | `openclaw/crabline` SDK-backed mock-channel lane. |
+| channels.telegram.live | Channels | Telegram live upstream proof | Telegram bot and user-driver flows work against live Telegram for every release that claims Telegram support. | yes | Live Telegram lane. |
+| channels.discord.mock | Channels | Discord deterministic channel proof | Discord adapter handles DM, guild mention, thread, native callback, reaction/action, media metadata, and outbound transcript with local upstream shims. | yes | `openclaw/crabline` SDK-backed mock-channel lane. |
+| channels.discord.live | Channels | Discord live upstream proof | Discord canary works against live Discord for every release that claims Discord support. | yes | Live Discord lane. |
+| channels.slack.mock | Channels | Slack deterministic channel proof | Slack adapter handles DM, channel thread, Socket Mode event, slash command, button approval, file attachment, and outbound transcript with local upstream shims. | yes | `openclaw/crabline` SDK-backed mock-channel lane. |
+| channels.slack.live | Channels | Slack live upstream proof | Slack canary works against live Slack for every release that claims Slack support. | yes | Live Slack lane. |
+| channels.whatsapp.mock | Channels | WhatsApp deterministic channel proof | WhatsApp adapter handles DM, group activation, media/voice metadata, native reaction/approval, reconnect, and outbound transcript with local upstream shims. | yes | `openclaw/crabline` SDK-backed mock-channel lane. |
+| channels.whatsapp.live | Channels | WhatsApp live upstream proof | WhatsApp canary works against live WhatsApp for every release that claims WhatsApp support. | yes | Live WhatsApp lane. |
+| channels.matrix.live | Channels | Matrix live release lane | Matrix transport, media, and E2EE lanes pass against the disposable Matrix release runner. | yes | Matrix QA release lane. |
+| memory.recall | Memory and sessions | Memory recall | User preference and session memory recall work through normal conversation flow and do not cross thread/channel boundaries. | yes | QA Lab memory scenarios. |
+| memory.failure | Memory and sessions | Memory failure behavior | Memory store failure produces visible fallback behavior without losing the active turn. | yes | QA Lab memory failure scenario. |
+| sessions.persistence | Memory and sessions | Session persistence | Session identity, transcript ordering, and resume state survive restart and package lanes. | yes | Runtime e2e plus package restart lane. |
+| automation.cron | Automation | Cron lifecycle | Natural-language cron creation, due-run execution, duplicate prevention, delivery, and run history work end to end. | yes | QA Lab scheduling scenarios. |
+| automation.reminders | Automation | Reminders and commitments | Reminder or commitment flows create the right scheduled work and deliver once to the expected target. | yes | QA Lab personal reminder and heartbeat scenarios. |
+| automation.webhooks | Automation | Webhook and hook ingress | HTTP/plugin hook ingress validates auth, size, idempotency, routing, and visible run dispatch. | yes | Hook integration plus QA Lab hook scenario. |
+| plugins.manifest | Plugins | Manifest and SDK contract | Bundled plugin manifests, SDK imports, public barrels, and static contracts pass without core-boundary leaks. | yes | Plugin inspection summary. |
+| plugins.runtime | Plugins | Runtime plugin behavior | Built-in provider/tool/channel/service plugins have at least one user-flow scenario for the behavior users depend on. | yes | QA Lab plugin scenarios plus Kitchen Sink conformance. |
+| plugins.external.compat | Plugins | External plugin compatibility | External plugin corpus remains advisory unless the release explicitly claims ecosystem compatibility for that package set. | no | Crabpot/plugin-inspector advisory summary.
| +| media.input | Media | Media input understanding | Image/audio/file input is staged, summarized, redacted where needed, and usable in a normal agent turn. | yes | QA Lab media scenarios plus live media lane. | +| media.output | Media | Media generation output | Image/TTS/media output reaches the requested channel or artifact path with usable metadata. | yes | QA Lab media generation scenario plus channel media lane. | +| ui.control | UI and CLI | Control UI workflow | Control UI can start, observe, and recover Gateway-backed runs with browser proof on supported viewports. | yes | Control UI e2e shard. | +| ui.tui | UI and CLI | TUI workflow | TUI fake-backend PTY and local-backend smoke cover rendering, input, streaming, and terminal state. | yes | TUI PTY lane plus local-backend smoke where stable. | +| cli.commands | UI and CLI | CLI setup and command surface | `openclaw` setup, qa, doctor, version, and package commands work from source and package installs. | yes | CLI/package e2e lane. | +| install.npm | Install and upgrade | npm install and update | Fresh npm install, update, and package candidate resolution work on supported Node versions. | yes | Package acceptance workflow. | +| install.docker | Install and upgrade | Docker package workflow | Docker build, install, gateway startup, plugin lifecycle, and release-user journey work from the package artifact. | yes | Docker e2e lane. | +| install.desktop | Install and upgrade | Desktop and OS-specific packaging | macOS, Windows/WSL2, and Linux package smoke lanes pass for platforms claimed by the release. | yes | Parallels/Testbox/Crabbox platform lanes. | +| upgrade.config | Install and upgrade | Upgrade and config migration | Existing supported config/state upgrades through `doctor --fix` or package update without runtime fallback shims. | yes | Upgrade survivor and doctor migration lanes. 
| +| security.secrets | Security | Secret redaction | Logs, traces, summaries, transcripts, diagnostics, and PR artifacts do not leak credentials. | yes | QA Lab redaction scenario plus secret scanning gate. | +| security.network | Security | Network and SSRF boundaries | Gateway, media, webhook, and tool paths enforce network policy and unsafe source rejection. | yes | Security e2e/integration shard. | +| security.permissions | Security | Approval and permission boundaries | Tool approval, trusted tools, plugin hooks, and channel callback actions stay distinguishable before transport encoding. | yes | QA Lab approval and plugin hook scenarios. | +| docs.troubleshooting | Docs and operations | Troubleshooting path | Every release-blocking surface has a public or maintainer troubleshooting path that explains likely failures and rerun commands. | yes | Docs check plus release checklist summary. | +| release.artifacts | Release operations | Evidence artifact publication | Release CI publishes normalized summaries, scorecard report, artifact manifest, and known proof gaps without raw secrets. | yes | Release workflow artifact step. | +| release.waivers | Release operations | Waiver handling | Upstream-provider outages fail loudly, classify likely upstream cause, and require human maintainer or release-owner waiver. | yes | Release scorecard gate. | + +## Example mapping + +`Required profiles` uses only the initial `smoke-ci` and `release` profiles. `none` means the row is advisory evidence outside the blocking profile set unless maintainers later promote it. 
+ +| Checklist ID | Coverage IDs | Example scenarios or lanes | Required profiles | Required live proof | Freshness rule | +| --- | --- | --- | --- | --- | --- | +| runtime.gateway.startup | `runtime.gateway.startup`, `runtime.protocol.auth` | `pnpm test:e2e` Gateway shard; Docker gateway smoke; package acceptance workflow | smoke-ci, release | package lane for release | target ref and release package | +| runtime.gateway.restart | `runtime.gateway.restart`, `runtime.run.recovery` | `qa/scenarios/jsonl-replay/gateway-restart-recovery.jsonl`; `qa/scenarios/runtime/gateway-restart-inflight-run.md` | smoke-ci, release | package lane for release | target ref and release package | +| runtime.agent.turns | `runtime.agent.turn`, `runtime.streaming.final` | `qa/scenarios/channels/dm-chat-baseline.md`; `qa/scenarios/runtime/streaming-final-integrity.md` | smoke-ci | no | target ref | +| runtime.context.compaction | `runtime.context.compaction`, `runtime.replay.safe` | `qa/scenarios/runtime/compaction-retry-mutating-tool.md`; `qa/scenarios/runtime/long-context-progress-watchdog.md` | smoke-ci, release | no | target ref | +| runtime.tools.core | `runtime.tools.core`, `runtime.tools.files`, `runtime.tools.exec` | `qa/scenarios/runtime/tools/fs-read.md`; `qa/scenarios/runtime/tools/fs-write.md`; `qa/scenarios/runtime/tools/exec.md`; `qa/scenarios/runtime/tools/apply-patch.md` | smoke-ci | no | target ref | +| runtime.tools.approval | `runtime.tools.approval`, `runtime.actions.approval` | `qa/scenarios/runtime/approval-turn-tool-followthrough.md`; `qa/scenarios/personal/approval-denial-stop.md` | smoke-ci, release | live channel action for channel claims | target ref and release package | +| runtime.observability.trace | `runtime.observability.otel`, `runtime.trace.visibility` | `qa/scenarios/runtime/otel-trace-smoke.md`; `qa/scenarios/runtime/qa-bus-tool-trace-visibility.md` | smoke-ci, release | no | target ref | +| runtime.performance.budget | `runtime.performance.rtt`, 
`runtime.performance.startup` | RTT harness summary; CLI/Gateway startup benchmark summary | release | live channel RTT for channel claims | release candidate | +| providers.openai | `providers.openai.live`, `providers.openai.tools`, `providers.openai.web_search` | `qa/scenarios/models/openai-native-web-search-live.md`; OpenAI tools client e2e | release | yes | release candidate | +| providers.anthropic | `providers.anthropic.live`, `providers.anthropic.long_context` | `qa/scenarios/models/anthropic-opus-api-key-smoke.md`; `qa/scenarios/models/anthropic-opus-setup-token-smoke.md` | release | yes | release candidate | +| providers.google | `providers.google.live` | Live provider shard for Google | release | yes | release candidate | +| providers.openrouter | `providers.openrouter.live`, `providers.routing.catalog` | Live provider shard for OpenRouter | release | yes | release candidate | +| providers.local | `providers.local.live`, `providers.local.diagnostics` | Local-provider Docker or host e2e lane | release | local runtime, no cloud | target ref and release package | +| channels.qa | `channels.qa.baseline`, `channels.portable.contract` | `qa/scenarios/channels/channel-chat-baseline.md`; `qa/scenarios/channels/dm-chat-baseline.md` | smoke-ci | no | target ref | +| channels.telegram.mock | `channels.telegram.mock`, `channels.actions.approval`, `channels.media.metadata` | `openclaw/crabline` Telegram mock upstream lane | smoke-ci | no | target ref | +| channels.telegram.live | `channels.telegram.live`, `channels.telegram.user_driver` | Telegram live QA lane; Telegram user-driver Crabbox proof | release | yes | release candidate | +| channels.discord.mock | `channels.discord.mock`, `channels.threading`, `channels.actions.native` | `openclaw/crabline` Discord mock upstream lane | smoke-ci | no | target ref | +| channels.discord.live | `channels.discord.live` | Discord live QA lane | release | yes | release candidate | +| channels.slack.mock | `channels.slack.mock`, 
`channels.threading`, `channels.file_attachment` | `openclaw/crabline` Slack mock upstream lane | smoke-ci | no | target ref | +| channels.slack.live | `channels.slack.live` | Slack live QA lane | release | yes | release candidate | +| channels.whatsapp.mock | `channels.whatsapp.mock`, `channels.media.voice`, `channels.reconnect` | `openclaw/crabline` WhatsApp mock upstream lane | smoke-ci | no | target ref | +| channels.whatsapp.live | `channels.whatsapp.live` | WhatsApp live QA lane | release | yes | release candidate | +| channels.matrix.live | `channels.matrix.live`, `channels.matrix.e2ee`, `channels.matrix.media` | Matrix QA transport/media/E2EE release lane | release | yes | release candidate | +| memory.recall | `memory.recall`, `memory.thread_isolation` | `qa/scenarios/memory/memory-recall.md`; `qa/scenarios/memory/thread-memory-isolation.md`; `qa/scenarios/personal/memory-preference-recall.md` | smoke-ci | no | target ref | +| memory.failure | `memory.failure.fallback` | `qa/scenarios/memory/memory-failure-fallback.md` | smoke-ci | no | target ref | +| sessions.persistence | `sessions.persistence`, `sessions.resume` | `qa/scenarios/jsonl-replay/recovery-partial-session.jsonl`; runtime e2e shard | smoke-ci, release | package lane for release | target ref and release package | +| automation.cron | `automation.cron.lifecycle`, `automation.cron.dedupe` | `qa/scenarios/scheduling/cron-natural-fire-no-duplicate.md`; `qa/scenarios/scheduling/cron-single-run-no-duplicate.md`; `qa/scenarios/scheduling/cron-one-minute-ping.md` | smoke-ci, release | live delivery for claimed live channels | target ref and release package | +| automation.reminders | `automation.reminders`, `automation.heartbeat` | `qa/scenarios/personal/reminder-roundtrip.md`; `qa/scenarios/memory/commitments-heartbeat-target-none.md` | smoke-ci | no | target ref | +| automation.webhooks | `automation.webhooks.ingress`, `automation.hooks.dispatch` | Hook integration shard; future QA Lab webhook 
scenario | release | live tunnel only if release claims hosted ingress | release candidate | +| plugins.manifest | `plugins.manifest.contract`, `plugins.sdk.boundary` | Plugin inspection summary; plugin contract integration tests | smoke-ci | no | target ref | +| plugins.runtime | `plugins.runtime.user_flow`, `plugins.kitchen_sink.conformance` | `qa/scenarios/plugins/kitchen-sink-live-openai.md`; `qa/scenarios/plugins/mcp-plugin-tools-call.md`; `qa/scenarios/plugins/plugin-manifest-contract-health.md` | smoke-ci, release | live provider only for provider plugin claims | target ref and release package | +| plugins.external.compat | `plugins.external.compat.advisory` | Crabpot/plugin-inspector advisory summary | none | no | latest advisory run | +| media.input | `media.input.image`, `media.input.attachment` | `qa/scenarios/media/image-understanding-attachment.md`; live media shard | smoke-ci, release | live provider for release | release candidate | +| media.output | `media.output.image`, `media.output.tts` | `qa/scenarios/media/image-generation-roundtrip.md`; `qa/scenarios/media/native-image-generation.md`; `qa/scenarios/runtime/tools/tts.md` | release | live provider/channel where claimed | release candidate | +| ui.control | `ui.control.gateway`, `ui.control.browser` | Control UI e2e shard; mobile viewport browser run | smoke-ci, release | no | target ref | +| ui.tui | `ui.tui.pty`, `ui.tui.local_backend` | TUI fake-backend PTY lane; local-backend smoke where stable | smoke-ci, release | no | target ref | +| cli.commands | `cli.commands.package`, `cli.doctor`, `cli.qa` | Package acceptance workflow; CLI startup and doctor smoke | smoke-ci, release | no | target ref and release package | +| install.npm | `install.npm.fresh`, `install.npm.update` | Package acceptance workflow; npm update smoke | release | package registry or release artifact | release candidate | +| install.docker | `install.docker.package`, `install.docker.gateway` | Docker e2e lane; Docker package 
gateway smoke | release | package artifact | release candidate | +| install.desktop | `install.desktop.macos`, `install.desktop.windows`, `install.desktop.linux` | Parallels macOS/Windows/Linux smoke; Crabbox/Testbox platform lanes | release | platform lane | release candidate | +| upgrade.config | `upgrade.config.doctor`, `upgrade.state.migration` | `qa/scenarios/config/config-apply-restart-wakeup.md`; `qa/scenarios/runtime/auth-profile-doctor-migration-safety.md`; upgrade survivor lane | release | package upgrade lane | release candidate | +| security.secrets | `security.secrets.redaction`, `security.artifacts.redaction` | `qa/scenarios/security/secret-redaction-tool-logs.md`; `qa/scenarios/personal/redaction-no-secret-leak.md`; secret scanning gate | smoke-ci, release | no | target ref and release package | +| security.network | `security.network.boundaries`, `security.media.ssrfsafe` | Security e2e/integration shard; media unsafe-source checks | smoke-ci, release | no | target ref | +| security.permissions | `security.permissions.actions`, `security.plugin_hooks.policy` | `qa/scenarios/personal/tool-safety-followthrough.md`; plugin hook health scenario | smoke-ci, release | live channel action for channel claims | target ref and release package | +| docs.troubleshooting | `docs.troubleshooting.surface`, `docs.release.rerun` | Docs check; release checklist summary with rerun commands | release | no | release candidate | +| release.artifacts | `release.artifacts.summary`, `release.artifacts.manifest` | Release workflow artifact upload step; scorecard report artifact | release | release workflow | release candidate | +| release.waivers | `release.waiver.upstream`, `release.failure.classification` | Scorecard gate failure classification; maintainer waiver record | release | live upstream outage path | release candidate | + +## Example generated gap report + +The scorecard gate can generate a compact report from the mapping above and the collected summary artifacts. 
Example shape: + +| Checklist ID | Status | Evidence artifact | Detail | +| --- | --- | --- | --- | +| runtime.gateway.startup | pass | `.artifacts/qa-e2e/smoke-ci/qa-suite-summary.json` | host e2e and package lane passed for target ref | +| channels.telegram.live | pass | `.artifacts/qa-e2e/telegram-20260607/telegram-qa-summary.json` | live Telegram bot and user-driver proof passed | +| providers.google | fail | `.artifacts/qa-e2e/live-providers/qa-suite-summary.json` | live Google provider scenario failed with provider outage classification | +| plugins.external.compat | advisory | `crabpot-summary.json` | advisory-only compatibility signal, not release-blocking | +| release.waivers | pending | `release-scorecard-summary.json` | maintainer or release-owner waiver required before promotion | diff --git a/rfcs/0007/implementation-plan.md b/rfcs/0007/implementation-plan.md new file mode 100644 index 0000000..9efd1c7 --- /dev/null +++ b/rfcs/0007/implementation-plan.md @@ -0,0 +1,256 @@ +# Implementation Plan + +This sidecar is the implementation work plan for the RFC. It is organized by dependency shape, not PR number. The point is to make it clear what can start immediately, what should stack on earlier schema or runner work, and which repos need follow-up work. + +## Initial or Independent PR Candidates + +These can start before the full scorecard gate exists. Some benefit from coordination, but they do not need to wait for the final release-gating path. + +### Evidence schema and coverage inventory + +- Repo: `openclaw/openclaw` +- Depends on: none +- Active PR: `openclaw/openclaw#91484` adds normalized QA evidence summaries. It emits `profile`, surface/category IDs, live/mock state, channel driver, runner substrate, artifact paths, failure class, and timing fields. It does not copy taxonomy provenance into each run artifact; evidence joins through the checked-in mapping and taxonomy files. 
+- Work: + - Extend `qa-suite-summary.json` with scorecard surface/category IDs, profile, provider ID, model live mode, provider fixture, channel ID, channel live mode, surface ID, runner, package source, artifact paths, failure class, and timing fields. + - Extend `openclaw qa coverage` to report coverage IDs, source paths, runtime parity metadata, docs refs, code refs, and declared maturity categories. +- Result: existing QA runs produce richer summary entries without changing release policy yet. + +### Executable taxonomy profile and coverage mapping + +- Repo: `openclaw/openclaw` +- Depends on: landed maturity taxonomy +- Landed PR: `openclaw/openclaw#91512` adds the root `taxonomy.yaml` maturity taxonomy snapshot and `docs/maturity-scores.yaml` score snapshot. +- Active PR: `openclaw/openclaw#91500` adds QA coverage taxonomy validation through root `taxonomy-mappings.yaml`, next to `taxonomy.yaml`. It layers executable coverage IDs and profile membership on top of the landed taxonomy, rather than creating a competing scorecard source of truth. It uses only the RFC's initial `smoke-ci` and `release` profiles; advisory compatibility remains report-only, not its own test profile. +- Work: + - Treat `taxonomy.yaml` as the authoritative maturity surface/category/level snapshot. + - Store the executable overlay in `taxonomy-mappings.yaml` so other OpenClaw docs, tools, and scorecard workflows can reuse the same mapping. + - Add or derive stable executable category IDs where needed for QA evidence joins. + - Map current `qa/scenarios/**` coverage IDs to taxonomy category IDs. + - Define profile membership in the executable taxonomy layer, starting with only `smoke-ci` and `release` mappings over surface/category IDs. + - Mark release-blocking categories separately from advisory categories. +- Result: CI and reports can join QA evidence to the landed maturity taxonomy without maintaining a second scorecard inventory. 
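The overlay described above can be sketched as a small join between coverage IDs, category IDs, and profile membership. This is a minimal sketch under stated assumptions: the `CategoryMapping` shape, its field names, and both helper functions are hypothetical and are not the schema in `taxonomy-mappings.yaml`. The IDs are copied from the example mapping in this RFC.

```typescript
// Hypothetical shapes for illustration only; the real taxonomy-mappings.yaml
// schema may differ.
type Profile = "smoke-ci" | "release";

interface CategoryMapping {
  categoryId: string; // stable executable category ID
  coverageIds: string[]; // coverage IDs declared by qa/scenarios/**
  profiles: Profile[]; // empty means advisory, report-only
  blocksRelease: boolean;
}

// Sample rows taken from the RFC's example mapping table.
const exampleMappings: CategoryMapping[] = [
  {
    categoryId: "runtime.agent.turns",
    coverageIds: ["runtime.agent.turn", "runtime.streaming.final"],
    profiles: ["smoke-ci"],
    blocksRelease: true,
  },
  {
    categoryId: "providers.openai",
    coverageIds: ["providers.openai.live", "providers.openai.tools"],
    profiles: ["release"],
    blocksRelease: true,
  },
  {
    categoryId: "plugins.external.compat",
    coverageIds: ["plugins.external.compat.advisory"],
    profiles: [],
    blocksRelease: false,
  },
];

// Category IDs that a given profile run is expected to produce evidence for.
function categoriesForProfile(all: CategoryMapping[], profile: Profile): string[] {
  return all
    .filter((m) => m.profiles.indexOf(profile) !== -1)
    .map((m) => m.categoryId);
}

// Coverage IDs declared by scenarios that no taxonomy category claims yet.
function unmappedCoverageIds(all: CategoryMapping[], declared: string[]): string[] {
  const known: string[] = [];
  all.forEach((m) => {
    m.coverageIds.forEach((id) => {
      known.push(id);
    });
  });
  return declared.filter((id) => known.indexOf(id) === -1);
}
```

Keeping advisory rows in the same table with an empty profile list is one way to make them visible in reports without adding a third test profile. A validation step could fail when `unmappedCoverageIds` returns a non-empty list.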
+ +### Maturity test docs in OpenClaw + +- Repo: `openclaw/openclaw` +- Depends on: landed `taxonomy.yaml`; docs can still land before every executable mapping exists +- Active PR: `openclaw/openclaw#91483` adds the first maturity test docs skeleton in OpenClaw. It should explain how the landed `taxonomy.yaml`, `docs/maturity-scores.yaml`, and the executable profile/coverage mapping relate to each other. It should stay aligned with the two-profile model: `smoke-ci` for deterministic no-live proof and `release` for the full live/package/platform gate. +- Work: + - Move or recreate the maturity test docs in OpenClaw, near the taxonomy and QA docs. + - Document what each requirement means, why it blocks or does not block release, what code paths it covers, and how to rerun or troubleshoot the proof. + - Treat policy/process notes as separate from the public executable contract in OpenClaw. + - Coordinate with the scorecard mapping workflow so `taxonomy.yaml`, `taxonomy-mappings.yaml`, maturity docs, and generated scorecard views stay aligned. +- Result: OpenClaw contains the docs needed to understand and maintain its own release-blocking test requirements. + +### Harness-neutral QA wrapper + +- Repo: `openclaw/openclaw` +- Depends on: evidence schema and executable taxonomy profile mapping are useful, but the first wrapper can start with a small mapping table +- Active PR: `openclaw/openclaw#91587` adds `pnpm openclaw qa run --profile smoke-ci|release` on top of the scorecard evidence map and dispatches mapped categories through the existing `qa suite` runner. +- Work: + - Add a wrapper command such as `pnpm openclaw qa run --profile `, with optional `--surface ` and `--category ` filters. + - Dispatch by mapping rather than by caller knowledge of the harness. 
A profile expands to surface/category mappings, and a category can route to Vitest e2e, QA Lab scenarios, live transport lanes, Matrix, Docker/package lanes, the `openclaw/crabline` channel SDK, Control UI, TUI, or release helper scripts. + - Write the same summary artifacts as direct harness runs. + - Fail with a clear missing-mapping error when a requested surface/category has no runnable lane. + - Keep `qa/AGENTS.md` as the agent-readable contract for QA scenarios, profile dispatch, evidence shape, and focused validation commands. +- Result: CI and maintainers can ask for evidence by product/maturity concept without knowing which test harness currently owns that proof. + +### `smoke-ci` QA Lab profile + +- Repo: `openclaw/openclaw` +- Depends on: useful after the evidence schema, but the first cut can start before it +- Work: + - Add `qa run --profile smoke-ci` as the profile selector over deterministic mapped scenarios, dispatching to `qa suite` for the first cut. + - Include Gateway, `qa-channel`, provider/tool, memory/session, automation, plugin, telemetry, and Control UI coverage. + - Wire a normal CI command only after the lane is reliable enough not to create broad flake noise. +- Result: a credential-free `smoke-ci` QA Lab lane with deterministic provider fixtures. + +### Crabline channel SDK foundation + +- Repos: `openclaw/crabline`, `openclaw/openclaw` +- Depends on: none +- Active PRs: `openclaw/crabline#1` adds the first SDK-backed local channel driver; `openclaw/openclaw#91502` adds the OpenClaw QA Lab channel-driver seam that can consume it. #91502 should use Crabline naming for the channel driver before merge. +- Work: + - Build or expose deterministic local upstream drivers in the `openclaw/crabline` messaging SDK. + - Add a QA Lab channel-driver seam such as `--channel-driver crabline` that can select one SDK-backed channel by channel ID. + - Start with one channel per run rather than an opaque `all-mock-channels` mode. 
+ - Use deterministic provider fixtures so failures isolate channel, Gateway, adapter, event/action, transcript, and SDK behavior. + - VM, Docker, Crabbox, or Testbox runners can execute the lane later, but they are not the channel coverage mechanism. + - Wire the first stable channel path into normal PR CI as soon as it is deterministic, even if initial coverage is thin. +- Result: an SDK-backed Crabline channel lane that can prove a single messaging channel with local upstream shims on every PR. + +### Crabline coverage expansion + +- Repo: `openclaw/openclaw` +- Depends on: Crabline channel SDK foundation +- Work: + - Expand from the first Telegram/Discord paths to Slack and WhatsApp. + - Add richer scenario coverage per channel: DM, group or guild mention, thread/topic, media metadata, reconnect, native approval/action, and outbound transcript assertions. + - Add a coverage matrix so missing channel capabilities are visible instead of hidden behind a green smoke. +- Result: `openclaw/crabline` becomes an actual channel maturity lane, not just a transport smoke. + +### Live transport summary normalization + +- Repo: `openclaw/openclaw` +- Depends on: evidence schema +- Work: + - Normalize Telegram, Discord, Slack, WhatsApp, and Matrix live summary entries to the same scorecard evidence schema as the mock transport lane. + - Preserve credential leasing and redaction rules. +- Result: live transport artifacts can be consumed by the same scorecard report as mock lanes. + +### Script migration and cleanup slices + +- Repo: `openclaw/openclaw` +- Depends on: none, though the evidence schema helps when replacing e2e-like scripts +- Work: + - Use [script-test-inventory.md](script-test-inventory.md) as the detailed file-level migration input; do not duplicate that script list here. + - Migrate or clean up scripts one file at a time or in small batches when the new home is obvious. + - Move user-flow scripts to QA Lab or e2e scenarios. 
+ - Move helper, runner, planner, and report tests beside their code. + - Keep package, release, CI, and tooling scripts under `test/scripts` unless the script itself moves. + - Remove an old script test only after the replacement preserves the old failure signal and emits equivalent evidence when relevant. +- Result: incremental cleanup without a giant migration PR. + +### Built-in plugin inspection checks + +- Repo: `openclaw/openclaw` +- Depends on: none +- Work: + - Bring plugin-inspector-style static checks into OpenClaw for bundled plugins. + - Cover manifest, SDK imports, public barrels, runtime capture where useful, and stable finding codes. + - Keep this labeled as contract evidence. +- Result: bundled plugin contract confidence in core CI. + +### Built-in plugin user-flow scenarios + +- Repo: `openclaw/openclaw` +- Depends on: `smoke-ci` QA Lab profile helps +- Work: + - Add user-flow scenarios for plugin families users experience directly: provider, tool, channel, diagnostics, service, media, memory, and web. + - Add Kitchen Sink conformance install/run as a credential-free external plugin fixture. +- Result: scorecard reporting can distinguish plugin contract evidence from user-flow evidence. + +### Kova scenario vocabulary extraction + +- Repos: `openclaw/kova`, `openclaw/openclaw` +- Depends on: none +- Work: + - In Kova, document or export the scenario concepts that OpenClaw should consume: release-shaped scenario hierarchy, process role attribution, repeated-sample p50/p95 gates, and channel capability vocabulary. + - In OpenClaw, import the vocabulary as coverage IDs and planning terms, not as a second harness. +- Result: OpenClaw gets reusable concepts without moving the Kova harness. + +### openclaw-rtt importer shape check + +- Repos: `openclaw/openclaw-rtt`, `openclaw/openclaw` +- Depends on: evidence schema +- Work: + - Update or document the RTT importer expectations for normalized QA summary timing fields. 
+ - In OpenClaw, make new channel timing sources originate in QA summaries before import. +- Result: timing history can consume new evidence without owning OpenClaw runtime lanes. + +### Crabpot advisory summary shape + +- Repos: `openclaw/crabpot`, `openclaw/openclaw` +- Depends on: built-in plugin inspection checks are useful but not required +- Work: + - Define the advisory external-plugin compatibility summary shape that release CI can optionally consume. + - Keep broad fixture corpus ownership in Crabpot. +- Result: external plugin compatibility remains advisory and separate from core release blockers unless explicitly promoted. + +## Stacked PR Candidates + +These should wait for one or more initial pieces. Landing them too early would hard-code temporary shapes or produce reports that no lane can satisfy. + +### Scorecard gap report + +- Repo: `openclaw/openclaw` +- Stacks on: evidence schema, executable taxonomy profile mapping, maturity test docs +- Work: + - Add a report command that joins summary artifacts to the taxonomy. + - Print pass, fail, missing, and advisory rows. + - Detect stale evidence by target ref, release package, live proof requirement, and freshness rule. +- Result: maintainers can see exactly which scorecard items have no fresh evidence. + +### Release scorecard artifact + +- Repo: `openclaw/openclaw` +- Stacks on: scorecard gap report, live transport summary normalization +- Work: + - Add release workflow steps that upload normalized summaries, a scorecard report, an artifact manifest, known proof gaps, and redacted failure classifications. + - Keep raw logs, prompts, transcripts, and credentials out of release artifacts. +- Result: release CI publishes reviewable evidence before any releases-repo handoff. 
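The scorecard gap report described above is essentially a join between checklist requirements and collected evidence. The following is a minimal sketch, assuming hypothetical `Requirement`, `Evidence`, and `GapRow` shapes and a simplified freshness rule that only compares the target ref. The real report would also need to consider release package, live proof requirement, and per-row freshness rules.

```typescript
// Hypothetical shapes for illustration; not the actual qa-suite-summary.json schema.
type RowStatus = "pass" | "fail" | "missing" | "stale" | "advisory";

interface Requirement {
  checklistId: string;
  blocksRelease: boolean;
}

interface Evidence {
  checklistId: string;
  passed: boolean;
  targetRef: string; // ref the evidence was produced for
  artifactPath: string;
}

interface GapRow {
  checklistId: string;
  status: RowStatus;
  artifacts: string[];
}

// Join requirements to evidence. Freshness here means "same target ref" only,
// which is a simplifying assumption.
function buildGapReport(
  requirements: Requirement[],
  evidence: Evidence[],
  targetRef: string
): GapRow[] {
  return requirements.map((req) => {
    const matching = evidence.filter((e) => e.checklistId === req.checklistId);
    const fresh = matching.filter((e) => e.targetRef === targetRef);
    let status: RowStatus;
    if (!req.blocksRelease) {
      status = "advisory"; // reported, never blocking
    } else if (matching.length === 0) {
      status = "missing"; // no lane produced evidence at all
    } else if (fresh.length === 0) {
      status = "stale"; // evidence exists, but for another ref
    } else if (fresh.every((e) => e.passed)) {
      status = "pass";
    } else {
      status = "fail";
    }
    return {
      checklistId: req.checklistId,
      status: status,
      artifacts: matching.map((e) => e.artifactPath),
    };
  });
}
```

Separating `missing` from `stale` keeps the two remediation paths distinct: a missing row needs a lane to be written or mapped, while a stale row only needs a rerun against the target ref.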
+ +### Release-blocking scorecard gate + +- Repo: `openclaw/openclaw` +- Stacks on: release scorecard artifact +- Work: + - Make release CI fail when a release-blocking scorecard category has no fresh passing evidence for the target ref/package. + - Keep advisory rows outside the blocking scorecard until they are intentionally promoted. +- Result: Stable/LTS release gates become enforceable. + +### Human-readable scorecard report + +- Repo: `openclaw/openclaw` +- Stacks on: scorecard gap report; can start after the basic report shape is stable +- Priority: fast follow, not required for v1 gating +- Work: + - Add a maintainer-readable report view that starts from scorecard category status instead of raw scenario rows. + - Group evidence by blocking, failed, missing, stale, advisory, and informational status. + - Include artifact links, selected profile, surface/category filters, live/mock state, runner, and rerun commands. + - Keep the raw JSON and full row detail available, but do not make that the primary human interface. + - Consider a static HTML report or lightweight local web UI once the category-level report has settled. +- Result: release readiness and coverage gaps are reviewable without reading giant machine-oriented reports. + +### Waiver classification and release-owner override + +- Repo: `openclaw/openclaw` +- Stacks on: release-blocking scorecard gate +- Work: + - Add failure classification for likely upstream provider or messaging-service outages. + - Require a maintainer or release-owner waiver record before promotion when a blocking live lane fails for likely upstream reasons. +- Result: live upstream outages fail loudly but have an explicit human-controlled release path. 
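The gate and waiver rules above can be combined into one decision function. This is a sketch under stated assumptions: the `BlockingRow`, `Waiver`, and `GateDecision` shapes, the two failure classes, and the `evaluateGate` name are all hypothetical. It assumes advisory rows have already been filtered out, and that only failures classified as likely upstream outages are eligible for a waiver.

```typescript
// Hypothetical shapes for illustration; the real gate may classify failures differently.
type FailureClass = "product" | "upstream-outage";

interface BlockingRow {
  checklistId: string;
  status: "pass" | "fail" | "missing" | "stale";
  failureClass?: FailureClass; // only meaningful when status is "fail"
}

interface Waiver {
  checklistId: string;
  approvedBy: string; // maintainer or release owner who recorded the waiver
}

interface GateDecision {
  allowed: boolean;
  blockers: string[]; // rows that stop the release
  waived: string[]; // upstream failures covered by a human waiver
}

function evaluateGate(rows: BlockingRow[], waivers: Waiver[]): GateDecision {
  const blockers: string[] = [];
  const waived: string[] = [];
  rows.forEach((row) => {
    if (row.status === "pass") {
      return;
    }
    // Only likely-upstream failures can be waived; missing or stale evidence
    // and product failures always block.
    const waivable = row.status === "fail" && row.failureClass === "upstream-outage";
    const hasWaiver = waivers.some(
      (w) => w.checklistId === row.checklistId && w.approvedBy.trim().length > 0
    );
    if (waivable && hasWaiver) {
      waived.push(row.checklistId);
    } else {
      blockers.push(row.checklistId);
    }
  });
  return { allowed: blockers.length === 0, blockers: blockers, waived: waived };
}
```

In this sketch a waiver never converts a row to `pass`. The decision keeps waived rows in a separate list so the published report still shows that a live lane failed and who accepted the risk.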
+ +### Releases ledger handoff + +- Repos: `openclaw/releases`, `openclaw/openclaw` +- Stacks on: release scorecard artifact exercised in normal releases +- Work: + - Import or mirror release scorecard summaries from CI artifacts into `openclaw/releases` after the schema has survived real release validation. + - Avoid raw secrets and raw transcripts. +- Result: durable release evidence ledger without making `openclaw/releases` define the CI taxonomy. + +### Performance and RTT release convergence + +- Repos: `openclaw/openclaw`, `openclaw/openclaw-rtt` +- Stacks on: evidence schema, openclaw-rtt importer shape check +- Work: + - Add repeated-sample aggregation to selected QA Lab lanes. + - Add direct Gateway RPC timing where it proves release behavior. + - Add Control UI timing where useful. + - Emit importable channel RTT summaries. +- Result: performance evidence becomes part of release confidence instead of a separate ad hoc dashboard. + +### Full Stable/LTS category closure + +- Repo: `openclaw/openclaw` +- Stacks on: scorecard gap report, harness-neutral QA wrapper, `smoke-ci` profile, Crabline channel SDK expansion, live transport normalization, plugin scenarios, release artifacts +- Work: + - Iterate through missing scorecard categories until every release-blocking category has at least one fresh executable evidence path. + - Add a troubleshooting or rerun path for every release-blocking category. + - Use the scorecard requirement-to-test mapping against `taxonomy.yaml` as the source of truth when deciding whether a category is actually covered. + - Land this as focused PRs, not one giant closure PR. +- Result: the example checklist shape becomes a real, enforced OpenClaw taxonomy. + +## Cross-Repo Coordination Notes + +- `openclaw/openclaw` owns the maturity taxonomy snapshot, root `taxonomy-mappings.yaml` executable overlay, QA summaries, CI gates, and release workflow artifact generation. 
+- `openclaw/rfcs` owns the accepted design and sidecar examples only. +- `openclaw/openclaw` is the source of truth for scorecard policy that CI gates on, including the checked-in taxonomy, executable mappings, evidence summaries, and rerun/troubleshooting docs. +- The scorecard mapping workflow owns the authoritative mapping from maturity requirements to executable tests against the landed taxonomy. The example mapping in this RFC is only a scaffold for shape and review. +- `openclaw/kova` keeps the broader OCM-backed validation lab. Only vocabulary, scenario shape, and aggregation ideas should move into OpenClaw. +- `openclaw/openclaw-rtt` keeps timing history and dashboards. OpenClaw should emit normalized timing evidence that openclaw-rtt imports. +- `openclaw/crabpot` keeps the broad external plugin corpus. OpenClaw consumes its summary as advisory unless a release explicitly promotes that compatibility set to blocking. +- `openclaw/crabbox` remains runner capacity. It should not own the scorecard taxonomy. +- `openclaw/releases` receives durable release evidence after the CI artifact shape has proven stable. diff --git a/rfcs/0007/script-test-inventory.md b/rfcs/0007/script-test-inventory.md new file mode 100644 index 0000000..11ca406 --- /dev/null +++ b/rfcs/0007/script-test-inventory.md @@ -0,0 +1,49 @@ +# Script Test Inventory + +This sidecar keeps the detailed `test/scripts` disposition list out of the main RFC. The RFC text owns the policy and migration sequence; this file preserves the initial inventory that informed that plan. + +## Convert to QA Lab or e2e + +These files can become or feed executable user-flow scenarios. Remove the old `test/scripts` test after the new lane covers the behavior. 
+ +| Target | Files | +| --- | --- | +| QA Lab scenario or release scenario | `qa-e2e.test.ts`, `qa-otel-smoke.test.ts`, `qa-lab-up.test.ts`, `openclaw-e2e-instance.test.ts`, `gateway-smoke.test.ts`, `tool-search-gateway-e2e.test.ts`, `e2e-agent-turn-output.test.ts`, `e2e-websocket-open.test.ts`, `e2e-run-with-pty.test.ts`, `webchat-auto-tts-live-proof.test.ts` | +| Messaging channel live/mock e2e | `npm-telegram-live.test.ts`, `npm-telegram-rtt-driver.test.ts`, `telegram-bot-api.test.ts`, `telegram-user-credential.test.ts`, `telegram-user-crabbox-proof.test.ts`, `test-device-pair-telegram.test.ts`, `mcp-channel-limits.test.ts`, `mcp-channels-harness.test.ts`, `gateway-network-client.test.ts` | +| Provider/tool e2e | `openai-chat-tools-client.test.ts`, `openai-image-auth-docker-client.test.ts`, `openai-web-search-minimal-client.test.ts`, `openai-web-search-minimal-assertions.test.ts`, `codex-media-path-client.test.ts`, `mock-openai-http.test.ts`, `test-live-media.test.ts`, `test-live.test.ts`, `test-live-shard.test.ts`, `test-live-cli-backend-docker.test.ts`, `test-live-codex-harness-docker.test.ts` | +| Plugin/package e2e | `kitchen-sink-plugin-assertions.test.ts`, `kitchen-sink-rpc-walk.test.ts`, `plugin-gateway-gauntlet.test.ts`, `plugin-lifecycle-probe.test.ts`, `plugin-lifecycle-measure.test.ts`, `plugin-update-unchanged-docker.test.ts`, `bundled-plugin-install-uninstall-probe.test.ts`, `live-plugin-tool-assertions.test.ts`, `plugins-assertions.test.ts` | +| Release/install/upgrade e2e | `release-user-journey-assertions.test.ts`, `release-media-memory-scenario.test.ts`, `release-scenarios-assertions.test.ts`, `upgrade-survivor-assertions.test.ts`, `upgrade-survivor-baselines.test.ts`, `upgrade-survivor-config-recipe.test.ts`, `upgrade-survivor-probe-gateway.test.ts`, `test-install-sh-docker.test.ts`, `package-openclaw-for-docker.test.ts`, `package-acceptance-workflow.test.ts` | +| Docker/MCP/service e2e | `docker-e2e-helper-cli.test.ts`, 
`docker-e2e-observability.test.ts`, `docker-e2e-plan.test.ts`, `docker-stats-resource-ceiling.test.ts`, `docker-build-helper.test.ts`, `cron-mcp-cleanup-docker-client.test.ts`, `mcp-code-mode-gateway-client.test.ts`, `mcp-connect-timeout.test.ts`, `mcp-websocket-open.test.ts`, `openwebui-probe.test.ts` | +| Performance evidence | `bench-cli-startup.test.ts`, `bench-gateway-child-test-support.ts`, `bench-gateway-restart.test.ts`, `bench-gateway-startup.test.ts`, `bench-test-changed.test.ts`, `cli-startup-bench-spawner.test.ts`, `limit-edge-case-live-proof.test.ts`, `measure-rpc-rtt.test.ts`, `rtt-harness.test.ts`, `openclaw-performance-source-summary.test.ts`, `openclaw-performance-workflow.test.ts`, `test-perf-budget.test.ts`, `ci-run-timings.test.ts` | + +## Convert to integration tests + +These files are closer to helper or runner integration than product e2e. Move them beside the code they exercise, or into focused integration suites, when that makes the test easier to maintain. + +| Target | Files | +| --- | --- | +| Script helper integration | `arg-utils.test.ts`, `bounded-response.test.ts`, `check-file-utils.test.ts`, `fixture-config.test.ts`, `fixtures-workspace.test.ts`, `format-generated-module.test.ts`, `managed-child-process.test.ts`, `source-file-scan-cache.test.ts`, `test-report-utils.test.ts`, `ts-guard-utils.test.ts` | +| Runner integration | `npm-runner.test.ts`, `pnpm-runner.test.ts`, `run-with-env.test.ts`, `run-vitest.test.ts`, `run-vitest-profile.test.ts`, `run-tsgo.test.ts`, `run-oxlint.test.ts`, `tsdown-build.test.ts`, `vitest-local-scheduling.test.ts`, `vitest-process-group.test.ts`, `vitest-shard-timings.test.ts` | +| CI planner integration | `changed-lanes.test.ts`, `ci-node-test-plan.test.ts`, `test-projects.test.ts`, `targeted-docker-lane-groups.test.ts`, `test-group-report.test.ts`, `test-hotspots.test.ts`, `test-force.test.ts`, `local-heavy-check-runtime.test.ts` | +| Artifact/report integration | `kova-ci-summary.test.ts`, 
`kova-report-gate.test.ts`, `mantis-build-telegram-evidence.test.ts`, `mantis-build-telegram-desktop-proof-evidence.test.ts`, `mantis-publish-pr-evidence.test.ts`, `real-behavior-proof-check.test.ts`, `real-behavior-proof-policy.test.ts`, `release-workflow-matrix-plan.test.ts` | +| Plugin contract integration | `channel-contract-test-plan.test.ts`, `plugin-contract-test-plan.test.ts`, `plugin-prerelease-test-plan.test.ts`, `plugin-boundary-report.test.ts`, `plugin-sdk-surface-report.test.ts`, `transitive-manifest-risk-report.test.ts`, `root-dependency-ownership-audit.test.ts`, `dependency-ownership-surface-report.test.ts` | + +## Keep as script, CI, package, release, or tooling tests + +These tests protect scripts and release tooling. Keep them under `test/scripts` unless the script they exercise moves. + +| Area | Files | +| --- | --- | +| Build, package, install, publish | `build-all.test.ts`, `build-and-run-mac.test.ts`, `build-diffs-viewer-runtime.test.ts`, `check-openclaw-package-tarball.test.ts`, `codesign-mac-app.test.ts`, `create-dmg.test.ts`, `ensure-cli-startup-build.test.ts`, `ensure-extension-memory-build.test.ts`, `ensure-playwright-chromium.test.ts`, `generate-npm-shrinkwrap.test.ts`, `install-cli.test.ts`, `install-ps1.test.ts`, `install-sh.test.ts`, `ios-node-e2e.test.ts`, `ios-pin-version.test.ts`, `ios-pull-gateway-log.test.ts`, `ios-team-id.test.ts`, `ios-version.test.ts`, `notarize-mac-artifact.test.ts`, `package-changelog.test.ts`, `package-mac-app.test.ts`, `package-mac-dist.test.ts`, `package-root-args.test.ts`, `resolve-openclaw-package-candidate.test.ts`, `resolve-openclaw-package-candidate-ip-bypass.test.ts`, `restart-mac.test.ts`, `runtime-postbuild.test.ts`, `runtime-postbuild-stamp.test.ts`, `stage-bundled-plugin-runtime.test.ts`, `verify-docker-attestations.test.ts`, `verify-plugin-npm-published-runtime.test.ts`, `verify.test.ts`, `write-cli-startup-metadata.test.ts` | +| CI and workflow guards | `audit-seams.test.ts`, 
`barnacle-auto-response.test.ts`, `channel-message-flows.test.ts`, `check-changelog-attributions.test.ts`, `check-cli-startup-memory.test.ts`, `check-composite-action-input-interpolation.test.ts`, `check-deadcode-unused-files.test.ts`, `check-dependency-pins.test.ts`, `check-deprecated-api-usage.test.ts`, `check-docs-i18n-glossary.test.ts`, `check-docs-mdx.test.ts`, `check-dynamic-import-warts.test.ts`, `check-extension-package-tsc-boundary.test.ts`, `check-extension-wildcard-reexports.test.ts`, `check-gateway-cpu-scenarios.test.ts`, `check-gateway-watch-regression.test.ts`, `check-memory-fd-repro.test.ts`, `check-no-conflict-markers.test.ts`, `check-no-random-messaging-tmp.test.ts`, `check-no-raw-window-open.test.ts`, `check-opengrep-rule-metadata.test.ts`, `check-package-patches.test.ts`, `check-plugin-sdk-wildcard-reexports.test.ts`, `check-release-metadata-only.test.ts`, `check-runtime-sidecar-loaders.test.ts`, `check-workflows.test.ts`, `check.test.ts`, `ci-docker-pull-retry.test.ts`, `ci-workflow-guards.test.ts`, `claude-auth-status.test.ts`, `codex-app-server-protocol-source.test.ts`, `config-reload-log-scanner.test.ts`, `dependency-changes-report.test.ts`, `dependency-guard-script.test.ts`, `dependency-guard-workflow.test.ts`, `dependency-vulnerability-gate.test.ts`, `docs-sync-publish.test.ts`, `extension-source-classifier.test.ts`, `firecrawl-compare.test.ts`, `full-release-validation-at-sha.test.ts`, `gh-read.test.ts`, `github-activity-helper.test.ts`, `label-open-issues.test.ts`, `lint-suppressions.test.ts`, `list-prod-store-packages.test.ts`, `merge-head-diff-base.test.ts`, `oxlint-config.test.ts`, `pnpm-audit-prod.test.ts`, `postinstall-bundled-plugins.test.ts`, `preinstall-package-manager-warning.test.ts`, `prepare-extension-package-boundary-artifacts.test.ts`, `prepare-git-hooks.test.ts`, `profile-extension-memory.test.ts`, `prompt-snapshots.test.ts`, `report-cli-helpers.test.ts`, `root-package-overrides.test.ts`, 
`run-additional-boundary-checks.test.ts`, `run-android-gradle.test.ts`, `run-opengrep.test.ts`, `sandbox-common-smoke-workflow.test.ts`, `secret-provider-integrations.test.ts`, `secret-scanning-maintainer.test.ts`, `session-log-mentions.test.ts`, `setup-pnpm-store-cache-ensure-node.test.ts`, `summarize-cpuprofile.test.ts`, `test-built-status-message-runtime.test.ts`, `test-env-mutation-report.test.ts`, `test-extension.test.ts`, `test-helpers.ts`, `ts-topology.test.ts`, `ui.test.ts`, `watch-node.test.ts`, `website-installer-sync-workflow.test.ts`, `zai-fallback-repro.test.ts` | +| Release workflow scripts | `openclaw-cross-os-release-checks.test.ts`, `openclaw-cross-os-release-workflow.test.ts`, `release-beta-smoke.test.ts`, `release-beta-verifier.test.ts`, `release-candidate-checklist.test.ts`, `release-check.test.ts`, `generate-dependency-release-evidence.test.ts`, `live-docker-auth.test.ts`, `live-docker-stage.test.ts`, `mantis-telegram-desktop-proof-workflow.test.ts`, `parallels-npm-update-smoke.test.ts`, `parallels-package-log-progress-extract.test.ts`, `parallels-smoke-model.test.ts`, `parallels-update-job-timeout.test.ts`, `print-cli-backend-live-metadata.test.ts`, `clawdock-helpers.test.ts`, `crabbox-wrapper.test.ts`, `dev-tooling-safety.test.ts`, `close-duplicate-prs-after-merge.test.ts`, `committer.test.ts` | +| Bundled plugin and static asset scripts | `bundled-plugin-assets.test.ts`, `bundled-plugin-build-entries.test.ts`, `bundled-plugin-source-utils.test.ts`, `check-channel-agnostic-boundaries.test.ts`, `check-cli-bootstrap-imports.test.ts`, `control-ui-i18n.test.ts`, `dependency-ownership-surface-report.test.ts`, `docker-all-scheduler.test.ts`, `e2e-helper-env-limits.test.ts`, `e2e-mock-config-limits.test.ts`, `e2e-shell-tempfiles.test.ts`, `e2e-temp-state-dir.test.ts`, `embedded-run-abort-leak.test.ts`, `gateway-frame-payload.test.ts`, `gateway-ws-client.test.ts`, `ios-version.test-support.ts`, `npm-onboard-channel-agent-assertions.test.ts`, 
`npm-verify-exec.test.ts`, `onboard-log-contains.test.ts`, `openclaw-test-state.test.ts`, `plugin-npm-package-manifest-args.test.ts`, `plugin-npm-runtime-build-args.test.ts` | + +## Remove only after replacement + +The likely removal list is conditional on the conversions above landing first: + +- Remove QA smoke scripts after `openclaw qa suite --profile smoke-ci` covers the same behavior with summary evidence. +- Remove telemetry smoke scripts after telemetry scenarios run through the normal QA Lab profile and emit scorecard evidence. +- Remove assertion-only Docker and release helpers after the Docker/release e2e lanes emit native scenario evidence and the assertions move beside the scenario runner. +- Remove old RTT-specific script tests only if their importer and historical data remain covered in `openclaw-rtt`; keep OpenClaw runtime measurement tests in `openclaw/openclaw`.