Release 0.2.0: artifact-creation API, queue layer, threads (T###), wandb decoupling, validator strictness by KadenMc · Pull Request #11 · KadenMc/agentic-experiments

KadenMc · 2026-04-25T18:02:30Z

Summary

Cuts 0.2.0 — turns aexp from "installed-but-minimal" (0.1.x shipped the H→E→F chain enforcement, signac run store, and wandb adapter) into an actually-usable research harness. 12 stacked commits, ~290 → 305 tests passing, 14 new slash commands, one new artifact kind (T###), one new CLI subcommand group (queue), and material breaking changes to undocumented surfaces — see Migration below.

Driven primarily by real-world friction reports from the electricrag-side consumer agent over 2026-04-22 → 2026-04-25.

What's in this release

Seven themes, in roughly the order they were built:

First-class artifact creation API (0a5f11b). aexp.artifacts ships new_hypothesis / new_experiment / new_finding / (later) new_thread with automatic parent-backlink patching. Templates rendered from the package, not from local copies (see 51729e7). Closes the "agents will get backlinks wrong under deadline pressure" gap.
Wandb init-ownership decoupled (561b08c). Three first-class modes: tracked_run (managed), prepare_tracker + ctx.bind(run) (BYO init), and the existing bind_tracker adapter path. Consumers like electricrag's loop.py (which already calls wandb.init per ECG) can adopt aexp's discipline without refactoring their training code.
Slash-command UX cleanup (80906d6). Renamed close-run / close-batch / new-finding → finding-from-run / finding-from-batch / finding-placeholder so the distinguishing axis (what the finding cites) is self-explaining. Added show-run, show-batch. Tracker docs reframed.
Queue + runner-script materialization (2b17ae6, 5e0b58c, 8fd2a98). Declarative pending-run registration (aexp queue add), Cartesian sweep grammar (--sweep "K=a|b, SEED=0..3"), runner-script emission (shell / slurm / manual), and aexp queue run for in-batch-script iteration. Designed for agent-on-laptop, training-on-cluster workflows. Includes sp resolution via named conditions: blocks in experiment frontmatter — drift-proof provenance.
Install safety (6307908). aexp install --force preserves user-authored kb/ + templates/ content while still refreshing tooling files. Reported as the CHALLENGE.md-clobber bug after a real user lost committed mission content; regression-guarded with a new preserved_user_modified action kind.
Validator strictness (9b3965c, 51729e7). kb_validate.validate_required_headers enforces every shipped template ## header is present (missing_template_header issue code). Templates grew ## Caveats (top-level, both E and F) and ## Intent (dual-mode pre-registered vs. exploratory, kills the fabricate-retroactive-thresholds trap). Old ## Expected Outcome → ## Intent; old ## Analysis → ## Outcome Summary with explicit E-run-level ↔ F-generalizable-claim boundary docs.
Threads (T###) — new artifact kind (cb3faa0, 3f1b363). Forward-looking research concerns broader than a single hypothesis. Parallel to H/E/F but deliberately outside the H→E→F enforcement chain. Lifecycle: PROPOSED → EXPLORING → PROMOTED | CLOSED. Hypotheses gain optional thread: frontmatter for promotion (aexp new-hypothesis --thread T###). Two real threads salvaged on the consumer side under the new schema; doc refinements landed in 3f1b363 based on that first-use friction.

Migration / breaking changes

These break undocumented surfaces. No documented surface breaks.

Slash-command renames. close-run / close-batch / new-finding → finding-from-run / finding-from-batch / finding-placeholder. Old files removed from src/aexp/slash_commands/. Consumer repos with stale copies of the old files in .claude/commands/ should re-run aexp install --yes --force --dev and manually rm the three old filenames (the install doesn't garbage-collect deleted slash commands).
Validator strictness. Existing artifacts that don't include every shipped template ## header will now fail aexp validate with missing_template_header. Fix: add the missing sections (placeholder content is fine — _None._ for Caveats on a fully-instrumented run, etc.). The validator is doing the right thing per Kaden's "stick to the templates precisely" directive.
Templates source-of-truth. aexp.artifacts._load_template reads vendored only — local templates/<kind>.md is now a reference copy, not an override. Any consumer relying on local-template overrides (none documented, none expected) will find them silently ignored. Future override mechanism likely shape: --template-file <path> flag, gated on real demand.
## Expected Outcome / ## Analysis renames in the experiment template. Existing E artifacts will fail the header check until migrated to ## Intent / ## Outcome Summary. The electricrag agent did this migration on its E001 without issue.

Testing

Full suite: 305 passed, 3 skipped (mcp + wandb extras + 1 Windows-cmd shell-escape test gated to POSIX shells; cluster target is unaffected).
Test count growth: ~50 net new tests across test_artifacts.py (threads, sp resolution, header checks), test_validate.py (template-header validator), test_queue.py (new file, 45 cases covering CRUD/sweep/sp resolution/drift-proof/render placeholders/materialize emission/run-queued lifecycle/validator schema), test_trackers_context.py (new file, 14 cases covering the three wandb modes), and test_install.py (preservation under --force).
test_package_version_matches_pyproject confirms aexp.__version__ reads 0.2.0 from package metadata after poetry install re-sync.
2 skipped tests are pre-existing extras gates (mcp, wandb); the 3rd is a deliberate Windows-only skip for a POSIX-shell-quoting edge case with a platform-independent invariant covered by an adjacent unit test.

Per-commit breakdown (for reviewers reading commit-by-commit)

Top of branch first:

392eca5 — Release 0.2.0 (version bump + CHANGELOG cut)
3f1b363 — Threads: doc refinements from first real-world salvage
cb3faa0 — Add threads (T###) — new artifact kind
51729e7 — Templates: artifact creation reads vendored only (single source of truth)
9b3965c — Templates + validator: stick to the templates precisely
6307908 — Preserve user-authored kb/ + templates/ content under aexp install --force
8fd2a98 — Clarify: --tag is pure metadata, no temporal semantics
5e0b58c — Add aexp queue run + demote slurm materialize to honest starter template
2b17ae6 — Add aexp queue: pending-run registration, runner materialization, sp resolution
80906d6 — Slash-command UX cleanup: show-run/show-batch, finding rename, tracker reframe
561b08c — Decouple wandb init ownership — add tracked_run, prepare_tracker
0a5f11b — Add H/E/F artifact creation API with automatic backlink patching

Each commit is self-contained, has a substantive message describing the bug or design driver, and was reported back to the consumer agent (electricrag-side) for real-world testing before the next stacked commit landed.

Where to read more

CHANGELOG.md [0.2.0] section — full per-feature breakdown
docs/threads.md — threads model, lifecycle, linkage, idioms
docs/queue.md — queue layer, runner-script emitters, sp resolution, cross-machine sync
docs/tracker-adapters.md — three-mode comparison table for wandb, when each fits

Test plan

CI green on Ubuntu + Windows × Py 3.11/3.12/3.13
Local smoke: aexp install --yes --force --dev in a scratch repo; confirm 18 slash commands land, kb/research/threads/ exists, validator passes
Spot-check the new artifact-creation flow: aexp new-hypothesis → aexp new-experiment --hypothesis H001 → confirm ## Caveats and ## Intent are present in the rendered E
Smoke the queue: aexp queue add --experiment E001 --sweep "condition=full|cls,seed=0..3" --tag smoke → aexp queue list → aexp queue materialize --runner shell --output run.sh → bash run.sh → confirm 8 jobs run

🤖 Generated with Claude Code

Closes the single biggest UX gap on the electricrag side: creating a finding that cites H001 + E001 required hand-editing H001 and E001 to add `- [[F001]]` to their ## Links sections or kb_validate's bidirectional link check would flag everything on the next run. Agents got this wrong under deadline pressure and there was no mechanical helper. New surface: - `aexp.artifacts.new_hypothesis` / `new_experiment` / `new_finding` — allocate the smallest unused H### / E### / F###, render the shipped template (preferring a repo-local override at `<repo>/templates/<kind>.md` when present), write atomically, and patch every parent artifact's ## Links section via `aexp.backlinks.add_backlink`. - `aexp.backlinks.add_backlink` — idempotent patcher that tolerates anchored (`[[X#sec]]`) and aliased (`[[X|alt]]`) wikilinks, and creates a ## Links section at end of file when absent. - CLI verbs `new-hypothesis`, `new-experiment`, `new-finding` and matching MCP tools, so all three agent-facing surfaces (CLI, MCP, slash) route through the same Python API. - Six new slash commands: `/aexp-new-hypothesis`, `/aexp-new-experiment`, `/aexp-new-finding`, `/aexp-status`, `/aexp-list-runs`, `/aexp-validate`. Also silence the SessionStart hook's `WARNING: kb/mission/CHALLENGE.md not found` line. Consumer repos may legitimately defer authoring the mission file; `aexp validate` still reports the absence as a structured issue when it actually matters. 32 new tests (23 unit + end-to-end `validate_kb` clean on H->E->F chain, 9 CLI/install coverage). Full suite: 202 passed, 2 skipped (mcp, wandb extras not installed in this env). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aexp.bind_tracker used to own wandb.init: consumers who already had their own wandb.init call (e.g. electricrag's loop.py::_log_to_wandb, which inits a short-lived run per ECG with a caller-specific name and custom tables) had to either replace that call with aexp's or tolerate two parallel run trees per experiment. Neither was acceptable under an active paper deadline. Three modes are now first-class: - Managed: `with aexp.tracked_run(job, project="foo") as run: ...` — aexp calls wandb.init + binds + finishes. Yields a real wandb.Run; full wandb API (log_artifact, Table, define_metric, summary, sweeps) is available. aexp is not a wrapper. - Bring your own init: `ctx = aexp.prepare_tracker(job, project="foo"); run = wandb.init(**ctx.init_kwargs, name=..., job_type=...); ctx.bind(run)` — caller owns init and finish; aexp provides the disciplined payload (deterministic group slug, auto-tags, curated Limina frame, flattened sp, workspace co-location) and stamps job.doc["tracker"]. - Legacy adapter: `bind_tracker(job, adapter, ...)` unchanged. Signature and side effects preserved. Now delegates internally to a shared `_derive_tracker_payload` used by prepare_tracker too, so both paths compute identical metadata. The TrackerAdapter ABC stays — it's still the right path for NoopAdapter and backend-agnostic code. The docs stop framing the adapter as the canonical wandb surface: docs/tracker-adapters.md rewritten with a three-mode comparison table at the top, tracked_run/prepare_tracker leading, adapter demoted to "noop / backend-agnostic." docs/quickstart.md and README updated to match. 14 new tests in tests/test_trackers_context.py cover prepare_tracker shape (6), TrackerContext.bind duck-typing (3), and tracked_run lifecycle + merge rules + ImportError handling (5). Fake wandb via sys.modules monkeypatch, so tests run locally without the [wandb] extra. Full suite: 216 passed, 2 skipped (mcp + wandb extras). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r reframe Three post-critique fixes landing together: **Tier 1: Tracker doc/slash framing honesty.** docs/tracker-adapters.md previously labeled `bind_tracker` / `WandbAdapter` as "legacy" and implied `tracked_run` / `prepare_tracker` were the only modern path. That's wrong — the CLI-adapter path is the right surface for subprocess / cluster workflows where the training script can't reach Python. Four modes now documented as equally legitimate, distinguished by who owns the `wandb.init` call site. /aexp-new-run's "next step" text rewritten to point at all three reachable paths (tracked_run, prepare_tracker, CLI bind-tracker + resume) with correct use-case framing. **Tier 2: Slash-command parity — show-run / show-batch.** Previously /aexp-list-runs existed but no /aexp-show-run for single-run detail. Added both /aexp-show-run and /aexp-show-batch as guided read-only flows over the existing CLI verbs. Brings total to 11 slash commands. **Tier 3: Finding-creation commands renamed.** The "new vs close" split across three sibling commands (new-finding / close-run / close-batch) conflated lifecycle framing with what the finding cites, forcing users to decode the distinction from doc text. Renamed to a parallel finding-from-<source> form, self-explaining: /aexp-new-finding → /aexp-finding-placeholder (no runs cited) /aexp-close-run → /aexp-finding-from-run (one job cited) /aexp-close-batch → /aexp-finding-from-batch (batch selector cited) Each command now routes through `aexp new-finding` (shipped earlier in this Unreleased cycle) for id allocation + automatic parent-backlink patching — the old flows hand-wrote the markdown and manually edited backlinks. Old close-run-warns-backlinks test replaced with a positive invariant: every finding command must invoke `aexp new-finding`. Cross-references updated: README (slash-command row, stats blurb), docs/cli.md (install-slash-commands listing), docs/quickstart.md (workflow example), src/aexp/cli.py (new_finding_cmd docstring), tests/test_install.py + tests/test_e2e_smoke.py (slash-command filenames). Full suite: 216 passed, 2 skipped (mcp + wandb extras). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…resolution Two problems solved together: **Cross-machine batch execution.** The agent (MCP host, often a laptop) and the compute environment (often an HPC cluster the agent can't reach) are different machines. Registering 8 pending jobs via `create_run` in a loop doesn't help when the agent can't execute them. The queue lets the agent declare intent on one machine and materialize a runner script (shell / slurm / manual) that a user submits wherever execution belongs. Status reconciliation happens through signac's on-disk `job.doc` — git is the sync substrate. **Drift-proof provenance for condition labels.** Bare `--sp condition=full` leaks provenance when training code changes — two batches queued weeks apart with the same label can be running different experiments while wandb shows identical config. Fix: experiments declare named `conditions:` blocks in their frontmatter; aexp resolves the label against the block at queue-time and merges the full config into sp before signac freezes it. Later edits to `conditions.full` cannot retroactively change what queued-then-ran. API surface: aexp queue add [--sp K=V,...] [--sweep "K=V|V,K=a..b"] [--tag T] aexp queue list [--experiment E###] [--tag T] aexp queue remove <job_id> aexp queue clear [--tag T] [--yes] aexp queue materialize [--runner shell|slurm|manual] [--output PATH] [--slurm-time T] [--slurm-mem M] [--slurm-gpus N] aexp run-queued <job_id> [--force] [--dry-run] Sweep grammar: `condition=full|classify, seed=0..3` → Cartesian product. MCP tools mirror the CLI except `run_queued` (deliberately not exposed — execution on the MCP host is almost always the wrong env). Runner command templates support `{key}` (any sp field), `{sp_json}` (full resolved sp as compact JSON), and `{job_id}` placeholders. `AEXP_JOB_ID` and `AEXP_JOB_WORKSPACE` are injected into the subprocess env so training scripts can find their own job without argv threading. Runner scripts are idempotent: `run-queued` skips jobs in a terminal state unless `--force`. Failed jobs capture stderr tail + returncode into `job.doc["queue"]["last_error"]` for later forensics via `queue list --include-terminal`. Validator extension: `conditions_schema` issue flags malformed `conditions:` blocks (non-dict values, non-primitive leaves). Experiments without a `conditions:` field are unaffected. Full suite: 272 passed, 3 skipped (mcp/wandb extras + 1 Windows-shell escape test that relies on POSIX single-quote grouping). New file `docs/queue.md` is the canonical reference; `docs/quickstart.md` and `docs/cli.md` cross-link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…plate You can't write a drag-and-drop slurm script when you have no idea what the user's partition, account, module loads, env activation, or container setup is. The previous `--runner slurm` output pretended to be turn-key (baked #SBATCH directives, hardcoded job ids in a bash array) while knowing nothing about the user's site. That's dishonest. **New primitive: `aexp queue run`** — iterates the pending queue (filtered by --experiment / --tag) and executes each matching job via `run_queued` semantics. Designed to live *inside* whatever batch script the user already has working for their cluster. aexp owns the iteration and status reconciliation; the user's script owns everything else. Two shapes: # Sequential (single-node): aexp queue run --tag overnight # Array-parallel (one queued job per slurm array task): #SBATCH --array=0-7 aexp queue run --tag overnight --index "$SLURM_ARRAY_TASK_ID" Jobs enumerate in stable order (list_queue-ordered), so `--index N` picks deterministically. `--continue-on-failure` keeps iterating past errors; default is fail-fast with SubprocessFailed. **Slurm template reframed as starter.** `materialize --runner slurm` now emits a script with explicit `# TODO` placeholders for the #SBATCH directives the user must fill in and commented-out setup lines (`# module load cuda/12.1`, `# source ~/...activate <env>`). The aexp-specific line is a single `exec aexp queue run --tag <tag> --index "$SLURM_ARRAY_TASK_ID"` — job resolution happens at task-launch time rather than materialize-time, so re-queueing between submit and execute stays coherent. Users who already have a working site script are encouraged to skip `materialize` entirely and just add the `queue run` line. Also added: `aexp.run_queue(...)` exported at the package root; docs/queue.md reframed to lead with `queue run` as the primary cluster primitive; docs/cli.md, docs/quickstart.md, CHANGELOG updated. Full suite: 285 passed, 3 skipped (mcp + wandb extras + 1 Windows-cmd shell-escape test). 11 new test cases cover `run_queue` sequential / index / out-of-range / empty-filter / fail-fast vs continue-on-failure / dry-run behavior, plus CLI integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

You asked what `--tag overnight` meant and noted the implied timing semantics were misleading. They were — `tag` is just a user-chosen label aexp stores in `job.doc["queue"]["tag"]` and filters by. Nothing else. No scheduling, no deadlines, no wall-clock awareness. Renamed all example tags in the docs from `overnight` to descriptive batch identifiers (`paper-ablation`) that tag by *what the batch is*, not *when it runs*. The previous examples risked training both users and agents to read temporal semantics into a namespacing primitive that has none. Added a dedicated "What `--tag` is (and isn't)" section in `docs/queue.md` that spells out the pure-metadata nature of the field and the naming convention. Also updated the `/aexp-queue-add` slash command to include the same guidance so agents see it inline. The three remaining `overnight` mentions across the repo are all in explicit "don't do this" passages — kept deliberately as warnings. 91 queue + CLI tests pass; 1 skipped (Windows-cmd shell-escape, unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…force `aexp install --force` was clobbering user-authored scaffold content — notably `kb/mission/CHALLENGE.md`, `kb/ACTIVE.md`, `kb/DASHBOARD.md`, and `templates/*.md`. The electricrag-side agent reported this on 2026-04-24 after a committed CHALLENGE.md was reverted to the blank stub by a routine re-install. The user had to `git checkout` to recover; uncommitted work would have been lost. Root cause: `_copy_file` treated every file the same under `--force`. The verbatim trees (`_TREES_VERBATIM = ("kb", "templates")`) ship as *editable scaffold* — the consumer is meant to author or customize these — but they shared the overwrite path with tooling files (slash commands, skills, hooks) that legitimately need to refresh. Fix: - New `preserve_user_modifications` kwarg on `_copy_file` + `_copy_tree`. When set, a target that exists and differs from the shipped source is preserved (skipped with a new `preserved_user_modified` action), even under `--force`. Files that byte-match the shipped default still refresh (trivially, via the existing `skipped_identical` path). - `install_limina` opts the `_TREES_VERBATIM` tree-copy into preservation. Slash commands, skills, settings.json, .mcp.json all continue to honor `--force` — refreshing them is the whole point of the flag. - `_print_actions` shows a per-file `preserved_user_modified <path>: kept your content; shipped default not applied` line and a count in the install summary. Behaviour is now visible. - `--force` flag help text + heads-up text + docs/cli.md updated to spell out the tooling-vs-scaffold distinction. Escape hatch: `rm <file>` before re-installing if you genuinely want to reset. Tests: 5 new in `tests/test_install.py` covering kb/ preservation (no-force + --force), templates/ preservation (--force), skipped_identical when content matches, and the invariant that slash commands still refresh under --force. Two pre-existing tests that captured the old (buggy) behaviour were repurposed to assert the new semantics. Full suite: 288 passed, 3 skipped (mcp + wandb extras + Windows cmd shell-escape — unrelated). Reported-by: Claude (electricrag-side), 2026-04-24 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four issues from the electricrag-side agent's 2026-04-24 follow-up, all from real backfill friction: 1. **Validator now enforces required template headers.** New `validate_required_headers` in kb_validate parses the shipped template for each kind and checks every top-level `## ` header is present. Issue code: `missing_template_header` (error-level, not warning — Kaden's "stick to templates precisely" directive made warning-level too soft). H/E/F enforced; L/CR/SR not yet. `## Links` excluded (covered comprehensively by the existing validate_links). 2. **E's `## Analysis` renamed to `## Outcome Summary`** with the experiment-level vs claim-level boundary spelled out in template body and slash-command guidance. `## Outcome Summary` reports what happened in this specific run; generalizable claims belong in the linked Finding's prose. Resolves the semantic overlap where E's Analysis and F's claim prose competed for the same content. 3. **`## Caveats` added top-level to E and F.** Distinct sections, distinct semantics: - E: between `## Procedure` and `## Intent`. Body says `_None._` is valid for fully-instrumented runs but most experiments accumulate something worth recording at the top level. - F: between `## Evidence` and `## What Improved For Real`. Body spells out the boundary vs `## Remaining Debt`: caveats limit interpretation of the finding; remaining debt is what's still a workaround in the system. The agent had been burying caveats under `## Setup` as a sub-section — top-level visibility is the point. 4. **Smoke-test framing trap eliminated.** - H's `## Test Plan` and E's old `## Expected Outcome` (now renamed to `## Intent`) both ship as dual-mode: pre-registered sub-block (confirm/reject thresholds) + exploratory sub-block (one-line purpose, no fabricated thresholds). Authors pick the framing that's actually true and delete the other. - Template body explicitly warns: "Don't fabricate retroactive thresholds." Kaden caught the agent doing exactly this on H001 on 2026-04-24; the new templates make the trap structurally hard to fall into without ignoring an explicit warning. - Section name change `Expected Outcome` → `Intent` because the old name presupposed a confirm/reject frame that smoke tests don't fit. Slash-command guidance for /aexp-new-hypothesis, /aexp-new-experiment, /aexp-finding-from-run, /aexp-finding-from-batch, /aexp-finding-placeholder all updated to reference the new section names and document the caveats-vs-remaining-debt + outcome-summary-vs-finding-prose boundaries inline. docs/quickstart.md updated to match. Tests: - `_write_artifact` helper in test_validate.py auto-injects required template headers based on the artifact's `type` so existing tests don't break under the new check. - 3 new tests: `test_validate_flags_missing_template_header_on_experiment` (calls `kb_validate.validate_kb` directly so per-check codes survive), `test_validate_repo_surfaces_missing_template_header_in_bundled_message` (asserts the outer `validate_repo` bundle still mentions the missing header in its message), `test_validate_passes_when_all_template_headers_present` (the artifacts API renders the full template — its outputs pass). - `test_e2e_fresh_repo_full_happy_path` updated to use the artifacts API for H001 + E001 creation; the previous inline yaml-write predates that API and was missing the now-required template headers. Reported-by: Claude (electricrag-side), 2026-04-24 Full suite: 291 passed, 3 skipped (mcp + wandb extras + Windows-cmd shell-escape). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uth) `aexp.artifacts._load_template` previously preferred the consumer's repo-local `templates/<kind>.md` and fell back to vendored. The `missing_template_header` validator (added in 9b3965c) always read vendored. When local fell behind shipped — the default state any time package templates evolve — creation rendered the stale skeleton while validation rejected it for missing new shipped headers. The electricrag-side agent reported this on 2026-04-24: `mcp__aexp__new_experiment` and `mcp__aexp__new_finding` produced skeletons missing `## Caveats` / `## Intent` / `## Outcome Summary`, the PostToolUse hook then rejected the writes, and the agent had to manually rewrite each skeleton's body to add the new sections. Three artifacts in their backfill (E002, F001, F002) hit it. The install-preserve fix from `6307908` was doing the right thing (preserving on-disk templates). The bug was that creation acted on those preserved-but-stale templates while validation didn't. Fix: `_load_template` always reads vendored. Local `templates/` becomes a pure reference copy — preserved across installs, never consulted by the artifact-creation API. This is a breaking change to undocumented behavior — any consumer relying on local-template overrides for customization will find them silently ignored. No such consumers are known; the harness is pre-release. Future override mechanism likely shape: `--template-file <path>` CLI flag + `template_path=` Python kwarg, gated on real demand. 3 new tests in tests/test_artifacts.py: - `test_new_hypothesis_ignores_local_template_override` — bogus local content doesn't leak into rendered artifacts. - `test_new_experiment_skeleton_passes_validator_unmodified` — fresh E skeleton passes `validate_kb` without post-edits. - `test_new_finding_skeleton_passes_validator_unmodified` — same for F. Full suite: 294 passed, 3 skipped (mcp + wandb extras + Windows-cmd). Reported-by: Claude (electricrag-side), 2026-04-24 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reported by a second electricrag-side agent on 2026-04-24: two real research directions (hierarchy-aware scoring; long-tail rare-label training) had no home in kb/. Hypothesis form was too narrow — the agent wasn't ready to commit to a falsifiable claim with a test plan. Session notes were too ephemeral — they rot as new notes accrue. The resulting drafts lived in `drafts/aexp_pending_threads/` awaiting a design call here. Design: - T### is a new artifact kind parallel to H/E/F, living at kb/research/threads/T###-<slug>.md. - Lifecycle: PROPOSED -> EXPLORING -> PROMOTED (>=1 hypothesis spawned; thread persists as parent context) | CLOSED (decided not to pursue). Status transitions are manual — no implicit state machinery. - Required template sections (all enforced via missing_template_header): ## Statement, ## Sub-questions, ## Promotion criteria, ## Open links, ## Notes, ## Conclusion, ## Links. The ## Notes section is the running-journal that distinguishes threads from session notes — it lives WITH the thread, not in a date-stamped file that rots. - H gains an optional `thread:` frontmatter field. When set, the hypothesis must link back to a real T### (enforced by the PreToolUse enforce_hef_chain hook + kb_validate's reference check); the thread's ## Links is auto-patched with the hypothesis backlink. Threads themselves are NOT in the H->E->F chain — hypotheses without a thread parent are still fine. - Promotion: `aexp new-hypothesis --thread T###` (or the MCP equivalent). No separate promote-command — it's just creation with a parent. Surface: - aexp.artifacts: new_thread, close_thread (CLOSED/PROMOTED transitions + conclusion rewrite), new_hypothesis(thread_id=...). - CLI: new-thread, list-threads, show-thread, close-thread, plus --thread on new-hypothesis. - MCP: new_thread, list_threads, show_thread, close_thread, plus thread_id on new_hypothesis. list_threads returns a table; show_thread returns frontmatter + body. - Slash commands: /aexp-new-thread, /aexp-list-threads, /aexp-show-thread, /aexp-close-thread. /aexp-new-hypothesis updated to mention --thread. (14 -> 18 total slash commands.) - Validator: T added to CORE_ARTIFACTS with ("Status", "Created"). Required-template-header check covers all seven thread sections. H's optional `Thread` reference validated for existence and bidirectional linkage. - Docs: new docs/threads.md (canonical reference). README + docs/cli.md + CHANGELOG updated. Schema validated against the two parked drafts in electricrag: - Thread A (hierarchy-aware scoring, 4 sub-questions) fits cleanly — the 4 sub-questions map to future H###; ## Notes hosts the running journal; ## Promotion criteria names which sub-question to spawn first. - Thread B (long-tail rare-label training, technique-choice question) fits — narrower scope but same shape; ## Open links captures the KED-paper + code-path references. 11 new tests in tests/test_artifacts.py covering new_thread skeleton shape, next_artifact_id(T), new_hypothesis(thread_id=...) with backlink patching + validation gates, the full T->H->E->F chain validating clean, close_thread transitions to CLOSED/PROMOTED with conclusion rewrite, input-validation errors. Install + e2e + CLI tests extended for the new slash-command filenames. Full suite: 305 passed, 3 skipped (mcp + wandb extras + Windows-cmd shell-escape). Reported-by: Claude (electricrag-side, second agent), 2026-04-24 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The electricrag-side agent salvaged the two parked thread drafts under the shipped schema (electricrag commit 906809e, 2026-04-25). Schema held; three doc-level clarifications landed based on friction they reported back: 1. Status semantics made stricter and explicit. The agent defaulted both salvaged threads to EXPLORING because "the work had been thought about before parking." Kaden corrected to PROPOSED. Clarified distinction: - PROPOSED = thread exists but no concrete work is underway. Writing it down, occasionally thinking about it, salvaging from drafts — none of these promote to EXPLORING. - EXPLORING = someone is actively running a baseline, reviewing literature, pursuing a promotion criterion — work in flight, not intent. Clarifications added to: - Thread template blockquote metadata (status-discipline lines) - docs/threads.md lifecycle section (expanded) - docs/threads.md Idioms ("Default to PROPOSED on creation") - /aexp-new-thread slash-command flow guidance 2. Cross-linked-artifact ordering pattern documented. The agent hit a kb_write_guard block when T001's first write included [[T002]] before T002 existed. Workable in practice (write both skeletons first, then cross-link in a second pass) but undocumented. New "Cross-linked artifacts: create-then-link ordering" section in docs/threads.md; inline flag in /aexp-new-thread flow. Applies to any cross-linked artifact pair, not just threads. 3. Sub-questions vs. Promotion criteria edge case. When a sub-question has its own prerequisite ("run a baseline first"), it blurs with the thread-wide promotion-criteria section. New "Section boundary" section in docs/threads.md: prerequisite-to-any-promotion -> Promotion criteria; prerequisite-to-this-specific-sub-question -> stays with the sub-question. Minor — both salvaged threads surfaced the pattern, neither was blocked by it. No code changes; template + two doc files + one slash command. Full suite: 305 passed, 3 skipped (unchanged). Reported-by: Claude (electricrag-side), 2026-04-25 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bumps `pyproject.toml` version 0.1.1 -> 0.2.0 and cuts the CHANGELOG [Unreleased] section to [0.2.0] - 2026-04-25 with a release summary at the top. Adds a fresh empty [Unreleased] heading for future work. 0.2.0 in one line: turns aexp from "installed-but-minimal" into an actually-usable research harness — artifact-creation API (H/E/F/T), queue + runner-script materialization, wandb init-ownership decoupled, validator strictness, install safety, and slash commands 3 -> 18. Full detail in the CHANGELOG release-summary section and the sub-sections below it. No code changes. Version-pinning test (test_package_version_matches_pyproject) confirms `aexp.__version__` == `pyproject.toml.version` after `poetry install` re-sync. Full suite: 305 passed, 3 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI failed on `poetry run ruff check .` (18 errors). Ruff was never run locally during this stack — the local pre-commit was only `pytest`. Two categories: - 14 auto-fixable: import-block sorting (I001), unused imports (F401), deprecated-import (UP035), `lru_cache(maxsize=None)` -> `cache` (UP033). Applied via `poetry run ruff check . --fix`. - 4 B008 (function-call-in-default-argument): all from `param: list = typer.Option([], ...)` patterns in CLI verbs. typer's documented API uses `OptionInfo` as a metadata sentinel — ruff's generic mutable-default check is a false positive here. Exempted via: [tool.ruff.lint.flake8-bugbear] extend-immutable-calls = ["typer.Option", "typer.Argument"] Touched files (all auto-fix from ruff except pyproject.toml): - pyproject.toml — B008 exemption. - src/aexp/__init__.py, src/aexp/cli.py, src/aexp/mcp_server.py, src/aexp/queue.py, src/aexp/kb_validate.py — import-block reformat, unused imports removed, `functools.lru_cache(maxsize=None)` -> `functools.cache` in kb_validate. - tests/test_cli.py, tests/test_queue.py, tests/test_trackers_context.py — import-block reformat, unused imports removed. Lint and tests both green locally: $ poetry run ruff check . All checks passed! $ poetry run pytest 305 passed, 3 skipped in 418.93s (3 skipped: mcp + wandb extras not installed locally; 1 deliberately skipped on Windows-cmd shell-escape.) On CI the wandb + mcp tests will run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

KadenMc and others added 13 commits April 23, 2026 17:53

KadenMc merged commit 0c920f3 into main Apr 25, 2026
6 checks passed

KadenMc deleted the aexp-release-0.2.0 branch April 25, 2026 18:19

KadenMc mentioned this pull request Apr 25, 2026

README: catch up to 0.2.0 #12

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 0.2.0: artifact-creation API, queue layer, threads (T###), wandb decoupling, validator strictness#11

Release 0.2.0: artifact-creation API, queue layer, threads (T###), wandb decoupling, validator strictness#11
KadenMc merged 13 commits intomainfrom
aexp-release-0.2.0

KadenMc commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

KadenMc commented Apr 25, 2026

Summary

What's in this release

Migration / breaking changes

Testing

Per-commit breakdown (for reviewers reading commit-by-commit)

Where to read more

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant