Conversation
Closes the single biggest UX gap on the electricrag side: creating a finding that cites H001 + E001 required hand-editing H001 and E001 to add `- [[F001]]` to their ## Links sections or kb_validate's bidirectional link check would flag everything on the next run. Agents got this wrong under deadline pressure and there was no mechanical helper. New surface: - `aexp.artifacts.new_hypothesis` / `new_experiment` / `new_finding` — allocate the smallest unused H### / E### / F###, render the shipped template (preferring a repo-local override at `<repo>/templates/<kind>.md` when present), write atomically, and patch every parent artifact's ## Links section via `aexp.backlinks.add_backlink`. - `aexp.backlinks.add_backlink` — idempotent patcher that tolerates anchored (`[[X#sec]]`) and aliased (`[[X|alt]]`) wikilinks, and creates a ## Links section at end of file when absent. - CLI verbs `new-hypothesis`, `new-experiment`, `new-finding` and matching MCP tools, so all three agent-facing surfaces (CLI, MCP, slash) route through the same Python API. - Six new slash commands: `/aexp-new-hypothesis`, `/aexp-new-experiment`, `/aexp-new-finding`, `/aexp-status`, `/aexp-list-runs`, `/aexp-validate`. Also silence the SessionStart hook's `WARNING: kb/mission/CHALLENGE.md not found` line. Consumer repos may legitimately defer authoring the mission file; `aexp validate` still reports the absence as a structured issue when it actually matters. 32 new tests (23 unit + end-to-end `validate_kb` clean on H->E->F chain, 9 CLI/install coverage). Full suite: 202 passed, 2 skipped (mcp, wandb extras not installed in this env). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aexp.bind_tracker used to own wandb.init: consumers who already had their own wandb.init call (e.g. electricrag's loop.py::_log_to_wandb, which inits a short-lived run per ECG with a caller-specific name and custom tables) had to either replace that call with aexp's or tolerate two parallel run trees per experiment. Neither was acceptable under an active paper deadline. Three modes are now first-class: - Managed: `with aexp.tracked_run(job, project="foo") as run: ...` — aexp calls wandb.init + binds + finishes. Yields a real wandb.Run; full wandb API (log_artifact, Table, define_metric, summary, sweeps) is available. aexp is not a wrapper. - Bring your own init: `ctx = aexp.prepare_tracker(job, project="foo"); run = wandb.init(**ctx.init_kwargs, name=..., job_type=...); ctx.bind(run)` — caller owns init and finish; aexp provides the disciplined payload (deterministic group slug, auto-tags, curated Limina frame, flattened sp, workspace co-location) and stamps job.doc["tracker"]. - Legacy adapter: `bind_tracker(job, adapter, ...)` unchanged. Signature and side effects preserved. Now delegates internally to a shared `_derive_tracker_payload` used by prepare_tracker too, so both paths compute identical metadata. The TrackerAdapter ABC stays — it's still the right path for NoopAdapter and backend-agnostic code. The docs stop framing the adapter as the canonical wandb surface: docs/tracker-adapters.md rewritten with a three-mode comparison table at the top, tracked_run/prepare_tracker leading, adapter demoted to "noop / backend-agnostic." docs/quickstart.md and README updated to match. 14 new tests in tests/test_trackers_context.py cover prepare_tracker shape (6), TrackerContext.bind duck-typing (3), and tracked_run lifecycle + merge rules + ImportError handling (5). Fake wandb via sys.modules monkeypatch, so tests run locally without the [wandb] extra. Full suite: 216 passed, 2 skipped (mcp + wandb extras). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r reframe Three post-critique fixes landing together: **Tier 1: Tracker doc/slash framing honesty.** docs/tracker-adapters.md previously labeled `bind_tracker` / `WandbAdapter` as "legacy" and implied `tracked_run` / `prepare_tracker` were the only modern path. That's wrong — the CLI-adapter path is the right surface for subprocess / cluster workflows where the training script can't reach Python. Four modes now documented as equally legitimate, distinguished by who owns the `wandb.init` call site. /aexp-new-run's "next step" text rewritten to point at all three reachable paths (tracked_run, prepare_tracker, CLI bind-tracker + resume) with correct use-case framing. **Tier 2: Slash-command parity — show-run / show-batch.** Previously /aexp-list-runs existed but no /aexp-show-run for single-run detail. Added both /aexp-show-run and /aexp-show-batch as guided read-only flows over the existing CLI verbs. Brings total to 11 slash commands. **Tier 3: Finding-creation commands renamed.** The "new vs close" split across three sibling commands (new-finding / close-run / close-batch) conflated lifecycle framing with what the finding cites, forcing users to decode the distinction from doc text. Renamed to a parallel finding-from-<source> form, self-explaining: /aexp-new-finding → /aexp-finding-placeholder (no runs cited) /aexp-close-run → /aexp-finding-from-run (one job cited) /aexp-close-batch → /aexp-finding-from-batch (batch selector cited) Each command now routes through `aexp new-finding` (shipped earlier in this Unreleased cycle) for id allocation + automatic parent-backlink patching — the old flows hand-wrote the markdown and manually edited backlinks. Old close-run-warns-backlinks test replaced with a positive invariant: every finding command must invoke `aexp new-finding`. Cross-references updated: README (slash-command row, stats blurb), docs/cli.md (install-slash-commands listing), docs/quickstart.md (workflow example), src/aexp/cli.py (new_finding_cmd docstring), tests/test_install.py + tests/test_e2e_smoke.py (slash-command filenames). Full suite: 216 passed, 2 skipped (mcp + wandb extras). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resolution
Two problems solved together:
**Cross-machine batch execution.** The agent (MCP host, often a laptop)
and the compute environment (often an HPC cluster the agent can't reach)
are different machines. Registering 8 pending jobs via `create_run` in a
loop doesn't help when the agent can't execute them. The queue lets the
agent declare intent on one machine and materialize a runner script
(shell / slurm / manual) that a user submits wherever execution belongs.
Status reconciliation happens through signac's on-disk `job.doc` — git
is the sync substrate.
**Drift-proof provenance for condition labels.** Bare `--sp condition=full`
leaks provenance when training code changes — two batches queued weeks
apart with the same label can be running different experiments while
wandb shows identical config. Fix: experiments declare named
`conditions:` blocks in their frontmatter; aexp resolves the label
against the block at queue-time and merges the full config into sp
before signac freezes it. Later edits to `conditions.full` cannot
retroactively change what queued-then-ran.
API surface:
aexp queue add [--sp K=V,...] [--sweep "K=V|V,K=a..b"] [--tag T]
aexp queue list [--experiment E###] [--tag T]
aexp queue remove <job_id>
aexp queue clear [--tag T] [--yes]
aexp queue materialize [--runner shell|slurm|manual] [--output PATH]
[--slurm-time T] [--slurm-mem M] [--slurm-gpus N]
aexp run-queued <job_id> [--force] [--dry-run]
Sweep grammar: `condition=full|classify, seed=0..3` → Cartesian product.
MCP tools mirror the CLI except `run_queued` (deliberately not exposed —
execution on the MCP host is almost always the wrong env).
Runner command templates support `{key}` (any sp field), `{sp_json}`
(full resolved sp as compact JSON), and `{job_id}` placeholders.
`AEXP_JOB_ID` and `AEXP_JOB_WORKSPACE` are injected into the subprocess
env so training scripts can find their own job without argv threading.
Runner scripts are idempotent: `run-queued` skips jobs in a terminal
state unless `--force`. Failed jobs capture stderr tail + returncode
into `job.doc["queue"]["last_error"]` for later forensics via
`queue list --include-terminal`.
Validator extension: `conditions_schema` issue flags malformed
`conditions:` blocks (non-dict values, non-primitive leaves).
Experiments without a `conditions:` field are unaffected.
Full suite: 272 passed, 3 skipped (mcp/wandb extras + 1 Windows-shell
escape test that relies on POSIX single-quote grouping). New file
`docs/queue.md` is the canonical reference; `docs/quickstart.md` and
`docs/cli.md` cross-link.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plate You can't write a drag-and-drop slurm script when you have no idea what the user's partition, account, module loads, env activation, or container setup is. The previous `--runner slurm` output pretended to be turn-key (baked #SBATCH directives, hardcoded job ids in a bash array) while knowing nothing about the user's site. That's dishonest. **New primitive: `aexp queue run`** — iterates the pending queue (filtered by --experiment / --tag) and executes each matching job via `run_queued` semantics. Designed to live *inside* whatever batch script the user already has working for their cluster. aexp owns the iteration and status reconciliation; the user's script owns everything else. Two shapes: # Sequential (single-node): aexp queue run --tag overnight # Array-parallel (one queued job per slurm array task): #SBATCH --array=0-7 aexp queue run --tag overnight --index "$SLURM_ARRAY_TASK_ID" Jobs enumerate in stable order (list_queue-ordered), so `--index N` picks deterministically. `--continue-on-failure` keeps iterating past errors; default is fail-fast with SubprocessFailed. **Slurm template reframed as starter.** `materialize --runner slurm` now emits a script with explicit `# TODO` placeholders for the #SBATCH directives the user must fill in and commented-out setup lines (`# module load cuda/12.1`, `# source ~/...activate <env>`). The aexp-specific line is a single `exec aexp queue run --tag <tag> --index "$SLURM_ARRAY_TASK_ID"` — job resolution happens at task-launch time rather than materialize-time, so re-queueing between submit and execute stays coherent. Users who already have a working site script are encouraged to skip `materialize` entirely and just add the `queue run` line. Also added: `aexp.run_queue(...)` exported at the package root; docs/queue.md reframed to lead with `queue run` as the primary cluster primitive; docs/cli.md, docs/quickstart.md, CHANGELOG updated. Full suite: 285 passed, 3 skipped (mcp + wandb extras + 1 Windows-cmd shell-escape test). 11 new test cases cover `run_queue` sequential / index / out-of-range / empty-filter / fail-fast vs continue-on-failure / dry-run behavior, plus CLI integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
You asked what `--tag overnight` meant and noted the implied timing semantics were misleading. They were — `tag` is just a user-chosen label aexp stores in `job.doc["queue"]["tag"]` and filters by. Nothing else. No scheduling, no deadlines, no wall-clock awareness. Renamed all example tags in the docs from `overnight` to descriptive batch identifiers (`paper-ablation`) that tag by *what the batch is*, not *when it runs*. The previous examples risked training both users and agents to read temporal semantics into a namespacing primitive that has none. Added a dedicated "What `--tag` is (and isn't)" section in `docs/queue.md` that spells out the pure-metadata nature of the field and the naming convention. Also updated the `/aexp-queue-add` slash command to include the same guidance so agents see it inline. The three remaining `overnight` mentions across the repo are all in explicit "don't do this" passages — kept deliberately as warnings. 91 queue + CLI tests pass; 1 skipped (Windows-cmd shell-escape, unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…force
`aexp install --force` was clobbering user-authored scaffold content —
notably `kb/mission/CHALLENGE.md`, `kb/ACTIVE.md`, `kb/DASHBOARD.md`,
and `templates/*.md`. The electricrag-side agent reported this on
2026-04-24 after a committed CHALLENGE.md was reverted to the blank
stub by a routine re-install. The user had to `git checkout` to
recover; uncommitted work would have been lost.
Root cause: `_copy_file` treated every file the same under `--force`.
The verbatim trees (`_TREES_VERBATIM = ("kb", "templates")`) ship as
*editable scaffold* — the consumer is meant to author or customize
these — but they shared the overwrite path with tooling files
(slash commands, skills, hooks) that legitimately need to refresh.
Fix:
- New `preserve_user_modifications` kwarg on `_copy_file` + `_copy_tree`.
When set, a target that exists and differs from the shipped source
is preserved (skipped with a new `preserved_user_modified` action),
even under `--force`. Files that byte-match the shipped default
still refresh (trivially, via the existing `skipped_identical`
path).
- `install_limina` opts the `_TREES_VERBATIM` tree-copy into
preservation. Slash commands, skills, settings.json, .mcp.json all
continue to honor `--force` — refreshing them is the whole point of
the flag.
- `_print_actions` shows a per-file `preserved_user_modified <path>:
kept your content; shipped default not applied` line and a count in
the install summary. Behaviour is now visible.
- `--force` flag help text + heads-up text + docs/cli.md updated to
spell out the tooling-vs-scaffold distinction. Escape hatch:
`rm <file>` before re-installing if you genuinely want to reset.
Tests: 5 new in `tests/test_install.py` covering kb/ preservation
(no-force + --force), templates/ preservation (--force),
skipped_identical when content matches, and the invariant that
slash commands still refresh under --force. Two pre-existing tests
that captured the old (buggy) behaviour were repurposed to assert
the new semantics. Full suite: 288 passed, 3 skipped (mcp + wandb
extras + Windows cmd shell-escape — unrelated).
Reported-by: Claude (electricrag-side), 2026-04-24
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four issues from the electricrag-side agent's 2026-04-24 follow-up,
all from real backfill friction:
1. **Validator now enforces required template headers.** New
`validate_required_headers` in kb_validate parses the shipped
template for each kind and checks every top-level `## ` header is
present. Issue code: `missing_template_header` (error-level, not
warning — Kaden's "stick to templates precisely" directive made
warning-level too soft). H/E/F enforced; L/CR/SR not yet. `## Links`
excluded (covered comprehensively by the existing validate_links).
2. **E's `## Analysis` renamed to `## Outcome Summary`** with the
experiment-level vs claim-level boundary spelled out in template
body and slash-command guidance. `## Outcome Summary` reports what
happened in this specific run; generalizable claims belong in the
linked Finding's prose. Resolves the semantic overlap where E's
Analysis and F's claim prose competed for the same content.
3. **`## Caveats` added top-level to E and F.** Distinct sections,
distinct semantics:
- E: between `## Procedure` and `## Intent`. Body says `_None._`
is valid for fully-instrumented runs but most experiments
accumulate something worth recording at the top level.
- F: between `## Evidence` and `## What Improved For Real`. Body
spells out the boundary vs `## Remaining Debt`: caveats limit
interpretation of the finding; remaining debt is what's still a
workaround in the system.
The agent had been burying caveats under `## Setup` as a sub-section
— top-level visibility is the point.
4. **Smoke-test framing trap eliminated.**
- H's `## Test Plan` and E's old `## Expected Outcome` (now
renamed to `## Intent`) both ship as dual-mode: pre-registered
sub-block (confirm/reject thresholds) + exploratory sub-block
(one-line purpose, no fabricated thresholds). Authors pick the
framing that's actually true and delete the other.
- Template body explicitly warns: "Don't fabricate retroactive
thresholds." Kaden caught the agent doing exactly this on H001
on 2026-04-24; the new templates make the trap structurally hard
to fall into without ignoring an explicit warning.
- Section name change `Expected Outcome` → `Intent` because the
old name presupposed a confirm/reject frame that smoke tests
don't fit.
Slash-command guidance for /aexp-new-hypothesis, /aexp-new-experiment,
/aexp-finding-from-run, /aexp-finding-from-batch, /aexp-finding-placeholder
all updated to reference the new section names and document the
caveats-vs-remaining-debt + outcome-summary-vs-finding-prose boundaries
inline. docs/quickstart.md updated to match.
Tests:
- `_write_artifact` helper in test_validate.py auto-injects required
template headers based on the artifact's `type` so existing tests
don't break under the new check.
- 3 new tests: `test_validate_flags_missing_template_header_on_experiment`
(calls `kb_validate.validate_kb` directly so per-check codes survive),
`test_validate_repo_surfaces_missing_template_header_in_bundled_message`
(asserts the outer `validate_repo` bundle still mentions the missing
header in its message), `test_validate_passes_when_all_template_headers_present`
(the artifacts API renders the full template — its outputs pass).
- `test_e2e_fresh_repo_full_happy_path` updated to use the artifacts
API for H001 + E001 creation; the previous inline yaml-write predates
that API and was missing the now-required template headers.
Reported-by: Claude (electricrag-side), 2026-04-24
Full suite: 291 passed, 3 skipped (mcp + wandb extras + Windows-cmd
shell-escape).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uth) `aexp.artifacts._load_template` previously preferred the consumer's repo-local `templates/<kind>.md` and fell back to vendored. The `missing_template_header` validator (added in 9b3965c) always read vendored. When local fell behind shipped — the default state any time package templates evolve — creation rendered the stale skeleton while validation rejected it for missing new shipped headers. The electricrag-side agent reported this on 2026-04-24: `mcp__aexp__new_experiment` and `mcp__aexp__new_finding` produced skeletons missing `## Caveats` / `## Intent` / `## Outcome Summary`, the PostToolUse hook then rejected the writes, and the agent had to manually rewrite each skeleton's body to add the new sections. Three artifacts in their backfill (E002, F001, F002) hit it. The install-preserve fix from `6307908` was doing the right thing (preserving on-disk templates). The bug was that creation acted on those preserved-but-stale templates while validation didn't. Fix: `_load_template` always reads vendored. Local `templates/` becomes a pure reference copy — preserved across installs, never consulted by the artifact-creation API. This is a breaking change to undocumented behavior — any consumer relying on local-template overrides for customization will find them silently ignored. No such consumers are known; the harness is pre-release. Future override mechanism likely shape: `--template-file <path>` CLI flag + `template_path=` Python kwarg, gated on real demand. 3 new tests in tests/test_artifacts.py: - `test_new_hypothesis_ignores_local_template_override` — bogus local content doesn't leak into rendered artifacts. - `test_new_experiment_skeleton_passes_validator_unmodified` — fresh E skeleton passes `validate_kb` without post-edits. - `test_new_finding_skeleton_passes_validator_unmodified` — same for F. Full suite: 294 passed, 3 skipped (mcp + wandb extras + Windows-cmd). Reported-by: Claude (electricrag-side), 2026-04-24 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported by a second electricrag-side agent on 2026-04-24: two real
research directions (hierarchy-aware scoring; long-tail rare-label
training) had no home in kb/. Hypothesis form was too narrow — the
agent wasn't ready to commit to a falsifiable claim with a test plan.
Session notes were too ephemeral — they rot as new notes accrue. The
resulting drafts lived in `drafts/aexp_pending_threads/` awaiting a
design call here.
Design:
- T### is a new artifact kind parallel to H/E/F, living at
kb/research/threads/T###-<slug>.md.
- Lifecycle: PROPOSED -> EXPLORING -> PROMOTED (>=1 hypothesis spawned;
thread persists as parent context) | CLOSED (decided not to pursue).
Status transitions are manual — no implicit state machinery.
- Required template sections (all enforced via missing_template_header):
## Statement, ## Sub-questions, ## Promotion criteria, ## Open
links, ## Notes, ## Conclusion, ## Links. The ## Notes section is
the running-journal that distinguishes threads from session notes —
it lives WITH the thread, not in a date-stamped file that rots.
- H gains an optional `thread:` frontmatter field. When set, the
hypothesis must link back to a real T### (enforced by the PreToolUse
enforce_hef_chain hook + kb_validate's reference check); the
thread's ## Links is auto-patched with the hypothesis backlink.
Threads themselves are NOT in the H->E->F chain — hypotheses without
a thread parent are still fine.
- Promotion: `aexp new-hypothesis --thread T###` (or the MCP
equivalent). No separate promote-command — it's just creation with
a parent.
Surface:
- aexp.artifacts: new_thread, close_thread (CLOSED/PROMOTED transitions
+ conclusion rewrite), new_hypothesis(thread_id=...).
- CLI: new-thread, list-threads, show-thread, close-thread, plus
--thread on new-hypothesis.
- MCP: new_thread, list_threads, show_thread, close_thread, plus
thread_id on new_hypothesis. list_threads returns a table;
show_thread returns frontmatter + body.
- Slash commands: /aexp-new-thread, /aexp-list-threads,
/aexp-show-thread, /aexp-close-thread. /aexp-new-hypothesis updated
to mention --thread. (14 -> 18 total slash commands.)
- Validator: T added to CORE_ARTIFACTS with ("Status", "Created").
Required-template-header check covers all seven thread sections.
H's optional `Thread` reference validated for existence and
bidirectional linkage.
- Docs: new docs/threads.md (canonical reference). README +
docs/cli.md + CHANGELOG updated.
Schema validated against the two parked drafts in electricrag:
- Thread A (hierarchy-aware scoring, 4 sub-questions) fits cleanly —
the 4 sub-questions map to future H###; ## Notes hosts the running
journal; ## Promotion criteria names which sub-question to
spawn first.
- Thread B (long-tail rare-label training, technique-choice question)
fits — narrower scope but same shape; ## Open links captures the
KED-paper + code-path references.
11 new tests in tests/test_artifacts.py covering new_thread skeleton
shape, next_artifact_id(T), new_hypothesis(thread_id=...) with
backlink patching + validation gates, the full T->H->E->F chain
validating clean, close_thread transitions to CLOSED/PROMOTED with
conclusion rewrite, input-validation errors. Install + e2e + CLI
tests extended for the new slash-command filenames. Full suite: 305
passed, 3 skipped (mcp + wandb extras + Windows-cmd shell-escape).
Reported-by: Claude (electricrag-side, second agent), 2026-04-24
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The electricrag-side agent salvaged the two parked thread drafts
under the shipped schema (electricrag commit 906809e, 2026-04-25).
Schema held; three doc-level clarifications landed based on friction
they reported back:
1. Status semantics made stricter and explicit. The agent defaulted
both salvaged threads to EXPLORING because "the work had been
thought about before parking." Kaden corrected to PROPOSED.
Clarified distinction:
- PROPOSED = thread exists but no concrete work is underway.
Writing it down, occasionally thinking about it, salvaging from
drafts — none of these promote to EXPLORING.
- EXPLORING = someone is actively running a baseline, reviewing
literature, pursuing a promotion criterion — work in flight,
not intent.
Clarifications added to:
- Thread template blockquote metadata (status-discipline lines)
- docs/threads.md lifecycle section (expanded)
- docs/threads.md Idioms ("Default to PROPOSED on creation")
- /aexp-new-thread slash-command flow guidance
2. Cross-linked-artifact ordering pattern documented. The agent hit
a kb_write_guard block when T001's first write included [[T002]]
before T002 existed. Workable in practice (write both skeletons
first, then cross-link in a second pass) but undocumented. New
"Cross-linked artifacts: create-then-link ordering" section in
docs/threads.md; inline flag in /aexp-new-thread flow. Applies
to any cross-linked artifact pair, not just threads.
3. Sub-questions vs. Promotion criteria edge case. When a
sub-question has its own prerequisite ("run a baseline first"),
it blurs with the thread-wide promotion-criteria section. New
"Section boundary" section in docs/threads.md:
prerequisite-to-any-promotion -> Promotion criteria;
prerequisite-to-this-specific-sub-question -> stays with the
sub-question. Minor — both salvaged threads surfaced the pattern,
neither was blocked by it.
No code changes; template + two doc files + one slash command. Full
suite: 305 passed, 3 skipped (unchanged).
Reported-by: Claude (electricrag-side), 2026-04-25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps `pyproject.toml` version 0.1.1 -> 0.2.0 and cuts the CHANGELOG [Unreleased] section to [0.2.0] - 2026-04-25 with a release summary at the top. Adds a fresh empty [Unreleased] heading for future work. 0.2.0 in one line: turns aexp from "installed-but-minimal" into an actually-usable research harness — artifact-creation API (H/E/F/T), queue + runner-script materialization, wandb init-ownership decoupled, validator strictness, install safety, and slash commands 3 -> 18. Full detail in the CHANGELOG release-summary section and the sub-sections below it. No code changes. Version-pinning test (test_package_version_matches_pyproject) confirms `aexp.__version__` == `pyproject.toml.version` after `poetry install` re-sync. Full suite: 305 passed, 3 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed on `poetry run ruff check .` (18 errors). Ruff was never run
locally during this stack — the local pre-commit was only `pytest`.
Two categories:
- 14 auto-fixable: import-block sorting (I001), unused imports (F401),
deprecated-import (UP035), `lru_cache(maxsize=None)` -> `cache`
(UP033). Applied via `poetry run ruff check . --fix`.
- 4 B008 (function-call-in-default-argument): all from
`param: list = typer.Option([], ...)` patterns in CLI verbs. typer's
documented API uses `OptionInfo` as a metadata sentinel — ruff's
generic mutable-default check is a false positive here. Exempted
via:
[tool.ruff.lint.flake8-bugbear]
extend-immutable-calls = ["typer.Option", "typer.Argument"]
Touched files (all auto-fix from ruff except pyproject.toml):
- pyproject.toml — B008 exemption.
- src/aexp/__init__.py, src/aexp/cli.py, src/aexp/mcp_server.py,
src/aexp/queue.py, src/aexp/kb_validate.py — import-block reformat,
unused imports removed, `functools.lru_cache(maxsize=None)` ->
`functools.cache` in kb_validate.
- tests/test_cli.py, tests/test_queue.py, tests/test_trackers_context.py
— import-block reformat, unused imports removed.
Lint and tests both green locally:
$ poetry run ruff check .
All checks passed!
$ poetry run pytest
305 passed, 3 skipped in 418.93s
(3 skipped: mcp + wandb extras not installed locally; 1 deliberately
skipped on Windows-cmd shell-escape.) On CI the wandb + mcp tests
will run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cuts 0.2.0 — turns aexp from "installed-but-minimal" (0.1.x shipped the H→E→F chain enforcement, signac run store, and wandb adapter) into an actually-usable research harness. 12 stacked commits, ~290 → 305 tests passing, 14 new slash commands, one new artifact kind (
T###), one new CLI subcommand group (queue), and material breaking changes to undocumented surfaces — see Migration below.Driven primarily by real-world friction reports from the electricrag-side consumer agent over 2026-04-22 → 2026-04-25.
What's in this release
Seven themes, in roughly the order they were built:
0a5f11b).aexp.artifactsshipsnew_hypothesis/new_experiment/new_finding/ (later)new_threadwith automatic parent-backlink patching. Templates rendered from the package, not from local copies (see51729e7). Closes the "agents will get backlinks wrong under deadline pressure" gap.561b08c). Three first-class modes:tracked_run(managed),prepare_tracker+ctx.bind(run)(BYO init), and the existingbind_trackeradapter path. Consumers like electricrag'sloop.py(which already callswandb.initper ECG) can adopt aexp's discipline without refactoring their training code.80906d6). Renamedclose-run/close-batch/new-finding→finding-from-run/finding-from-batch/finding-placeholderso the distinguishing axis (what the finding cites) is self-explaining. Addedshow-run,show-batch. Tracker docs reframed.2b17ae6,5e0b58c,8fd2a98). Declarative pending-run registration (aexp queue add), Cartesian sweep grammar (--sweep "K=a|b, SEED=0..3"), runner-script emission (shell / slurm / manual), andaexp queue runfor in-batch-script iteration. Designed for agent-on-laptop, training-on-cluster workflows. Includes sp resolution via namedconditions:blocks in experiment frontmatter — drift-proof provenance.6307908).aexp install --forcepreserves user-authoredkb/+templates/content while still refreshing tooling files. Reported as the CHALLENGE.md-clobber bug after a real user lost committed mission content; regression-guarded with a newpreserved_user_modifiedaction kind.9b3965c,51729e7).kb_validate.validate_required_headersenforces every shipped template##header is present (missing_template_headerissue code). Templates grew## Caveats(top-level, both E and F) and## Intent(dual-mode pre-registered vs. exploratory, kills the fabricate-retroactive-thresholds trap). Old## Expected Outcome→## Intent; old## Analysis→## Outcome Summarywith explicit E-run-level ↔ F-generalizable-claim boundary docs.T###) — new artifact kind (cb3faa0,3f1b363). Forward-looking research concerns broader than a single hypothesis. Parallel to H/E/F but deliberately outside the H→E→F enforcement chain. Lifecycle: PROPOSED → EXPLORING → PROMOTED | CLOSED. Hypotheses gain optionalthread:frontmatter for promotion (aexp new-hypothesis --thread T###). Two real threads salvaged on the consumer side under the new schema; doc refinements landed in3f1b363based on that first-use friction.Migration / breaking changes
These break undocumented surfaces. No documented surface breaks.
close-run/close-batch/new-finding→finding-from-run/finding-from-batch/finding-placeholder. Old files removed fromsrc/aexp/slash_commands/. Consumer repos with stale copies of the old files in.claude/commands/should re-runaexp install --yes --force --devand manuallyrmthe three old filenames (the install doesn't garbage-collect deleted slash commands).##header will now failaexp validatewithmissing_template_header. Fix: add the missing sections (placeholder content is fine —_None._for Caveats on a fully-instrumented run, etc.). The validator is doing the right thing per Kaden's "stick to the templates precisely" directive.aexp.artifacts._load_templatereads vendored only — localtemplates/<kind>.mdis now a reference copy, not an override. Any consumer relying on local-template overrides (none documented, none expected) will find them silently ignored. Future override mechanism likely shape:--template-file <path>flag, gated on real demand.## Expected Outcome/## Analysisrenames in the experiment template. Existing E artifacts will fail the header check until migrated to## Intent/## Outcome Summary. The electricrag agent did this migration on its E001 without issue.Testing
test_artifacts.py(threads, sp resolution, header checks),test_validate.py(template-header validator),test_queue.py(new file, 45 cases covering CRUD/sweep/sp resolution/drift-proof/render placeholders/materialize emission/run-queued lifecycle/validator schema),test_trackers_context.py(new file, 14 cases covering the three wandb modes), andtest_install.py(preservation under --force).test_package_version_matches_pyprojectconfirmsaexp.__version__reads0.2.0from package metadata afterpoetry installre-sync.mcp,wandb); the 3rd is a deliberate Windows-only skip for a POSIX-shell-quoting edge case with a platform-independent invariant covered by an adjacent unit test.Per-commit breakdown (for reviewers reading commit-by-commit)
Top of branch first:
392eca5— Release 0.2.0 (version bump + CHANGELOG cut)3f1b363— Threads: doc refinements from first real-world salvagecb3faa0— Add threads (T###) — new artifact kind51729e7— Templates: artifact creation reads vendored only (single source of truth)9b3965c— Templates + validator: stick to the templates precisely6307908— Preserve user-authored kb/ + templates/ content under aexp install --force8fd2a98— Clarify:--tagis pure metadata, no temporal semantics5e0b58c— Addaexp queue run+ demote slurm materialize to honest starter template2b17ae6— Add aexp queue: pending-run registration, runner materialization, sp resolution80906d6— Slash-command UX cleanup: show-run/show-batch, finding rename, tracker reframe561b08c— Decouple wandb init ownership — add tracked_run, prepare_tracker0a5f11b— Add H/E/F artifact creation API with automatic backlink patchingEach commit is self-contained, has a substantive message describing the bug or design driver, and was reported back to the consumer agent (electricrag-side) for real-world testing before the next stacked commit landed.
Where to read more
CHANGELOG.md[0.2.0]section — full per-feature breakdowndocs/threads.md— threads model, lifecycle, linkage, idiomsdocs/queue.md— queue layer, runner-script emitters, sp resolution, cross-machine syncdocs/tracker-adapters.md— three-mode comparison table for wandb, when each fitsTest plan
aexp install --yes --force --devin a scratch repo; confirm 18 slash commands land,kb/research/threads/exists, validator passesaexp new-hypothesis→aexp new-experiment --hypothesis H001→ confirm## Caveatsand## Intentare present in the rendered Eaexp queue add --experiment E001 --sweep "condition=full|cls,seed=0..3" --tag smoke→aexp queue list→aexp queue materialize --runner shell --output run.sh→bash run.sh→ confirm 8 jobs run🤖 Generated with Claude Code