feat(compare): fan one prompt across model tiers (local + cloud + Cookbook), labeled or blind by AVADSA25 · Pull Request #175 · AVADSA25/codec

AVADSA25 · 2026-06-01T10:49:56Z

Summary

Phase 2 of the locked sequence (Cookbook → Compare → Email triage → Image). Compare is the thin fan-out you described: one prompt → the canonical tiers + anything Cookbook is serving → collect → return labeled or blind. It reuses Cookbook's registry (serve.list_served()) and the canonical callers (codec_llm.call, codec_ava_client) — no new HTTP.

Endpoint set

| tier | source |
| --- | --- |
| `local` | config `llm_base_url`/`llm_model` (Qwen @ 8083) → `codec_llm.call` |
| `cloud-balanced` | `gemini-2.5-flash` via AVA (only when license/proxy ready) |
| `cloud-pro` | `gemini-2.5-pro` via AVA |
| `cookbook-<id>` | each healthy model from `serve.list_served()` on its 811x port |
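
The discovery rules in the table can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `Endpoint`, `discover_endpoints`, the `ava_ready` flag, the shape of the served entries (`id`, `model`, `port`, `healthy`), and the `http://127.0.0.1:<port>/v1` base URL are all assumptions made for the example. The registry is passed in as a callable so the sketch runs without Cookbook installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Endpoint:
    tier: str                       # "local", "cloud-pro", "cookbook-<id>", ...
    kind: str                       # "openai" (codec_llm.call) or "ava" (codec_ava_client)
    model: str
    base_url: Optional[str] = None  # only meaningful for OpenAI-compatible endpoints


def discover_endpoints(
    config: dict,
    cloud_tiers: list[dict],
    ava_ready: bool,
    list_served: Callable[[], list[dict]],
) -> list[Endpoint]:
    # The local tier is always present.
    endpoints = [
        Endpoint("local", "openai", config["llm_model"], config["llm_base_url"])
    ]
    # Cloud tiers are gated on the AVA license/proxy being ready.
    if ava_ready:
        for tier in cloud_tiers:
            endpoints.append(Endpoint(tier["tier"], "ava", tier["model"]))
    # Cookbook models: unhealthy entries are skipped (entry shape is assumed).
    for served in list_served():
        if served.get("healthy"):
            endpoints.append(
                Endpoint(
                    f"cookbook-{served['id']}",
                    "openai",
                    served["model"],
                    f"http://127.0.0.1:{served['port']}/v1",  # assumed URL layout
                )
            )
    return endpoints
```

Passing `list_served` as a parameter mirrors how the tests mock the Cookbook registry: discovery can be exercised offline with a stub.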

Behavior

Concurrent fan-out (ThreadPoolExecutor), per-endpoint timed, and failure-isolated — one endpoint erroring (license expired, model down) is captured as {ok: False, error} and never sinks the others.
Blind mode anonymizes display labels to Model A/B/… and returns the anon→real mapping separately; the skill prints answers first, then a "judge first, then peek" key.
The skill is thin (it imports codec_compare + re, so it passes the AST gate); SKILL_MCP_EXPOSE=True (it is a query skill with no process/file mutation).
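
A minimal sketch of the fan-out and blind-mode behavior described above, assuming each endpoint can be represented as a plain `callable(prompt) -> str`. The name `_query_one` appears in the commit message, but its signature here, the `fan_out` helper, and the result keys `label`, `text`, and `seconds` are hypothetical and chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def _query_one(label, call, prompt):
    """Run one endpoint, timing it and capturing any failure as data."""
    start = time.perf_counter()
    try:
        result = {"label": label, "ok": True, "text": call(prompt)}
    except Exception as exc:  # one endpoint failing must not sink the others
        result = {"label": label, "ok": False, "error": str(exc)}
    result["seconds"] = time.perf_counter() - start
    return result


def fan_out(prompt, callers, blind=False):
    """callers: dict of label -> callable(prompt) -> str (insertion-ordered)."""
    with ThreadPoolExecutor(max_workers=max(1, len(callers))) as pool:
        futures = [
            pool.submit(_query_one, label, call, prompt)
            for label, call in callers.items()
        ]
        # Collect in submission order so output order is deterministic.
        results = [future.result() for future in futures]

    mapping = {}
    if blind:
        for index, result in enumerate(results):
            # Labels past 26 endpoints would run beyond "Z"; fine for a sketch.
            anon = f"Model {chr(ord('A') + index)}"
            mapping[anon] = result["label"]  # anon -> real, returned separately
            result["label"] = anon
    return results, mapping
```

Because `_query_one` never raises, `future.result()` cannot propagate an endpoint error, which is what keeps a license or model failure local to its own entry. This sketch does not shuffle the order in blind mode; whether the real implementation does is not stated in the PR.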

Flag for you — the "three-tier" interpretation

You said "the three-tier endpoints" without pinning them. I grounded it in the real config: local + two AVA cloud tiers (flash/pro), matching choose_model's fast/balanced/pro map. It's config-driven — compare.cloud_tiers in ~/.codec/config.json overrides the cloud list (e.g. to add Claude/GPT or change the tiers) with zero code change. If your "three tiers" meant something specific (e.g. flash-lite/flash/pro, or local/Claude/GPT), it's a one-line config edit, not a rebuild — tell me and I'll set the default accordingly.
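
To make the override concrete, here is a hedged sketch of how `compare.cloud_tiers` might be resolved against the defaults. The PR does not show the schema of that key, so the list-of-`{"tier", "model"}` shape, the helper names `resolve_cloud_tiers` and `load_cloud_tiers`, and the non-Gemini model names in the usage note are assumptions.

```python
import json
from pathlib import Path

# Defaults named in the PR: the two AVA cloud tiers.
DEFAULT_CLOUD_TIERS = [
    {"tier": "cloud-balanced", "model": "gemini-2.5-flash"},
    {"tier": "cloud-pro", "model": "gemini-2.5-pro"},
]


def resolve_cloud_tiers(config):
    """Return compare.cloud_tiers from a parsed config, else the defaults."""
    compare = config.get("compare") if isinstance(config, dict) else None
    tiers = compare.get("cloud_tiers") if isinstance(compare, dict) else None
    return list(tiers) if tiers else list(DEFAULT_CLOUD_TIERS)


def load_cloud_tiers(config_path="~/.codec/config.json"):
    """Read the config file; a missing or invalid file falls back to defaults."""
    try:
        config = json.loads(Path(config_path).expanduser().read_text())
    except (OSError, json.JSONDecodeError):
        return list(DEFAULT_CLOUD_TIERS)
    return resolve_cloud_tiers(config)
```

Under this assumed schema, switching to different cloud models would be an edit such as `{"compare": {"cloud_tiers": [{"tier": "cloud-claude", "model": "<model name>"}]}}` in the config file, with no code change.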

Test plan

tests/test_compare.py — 19 tests, all model callers + Cookbook registry mocked (offline): query paths (openai/ava/error/system-prompt), fan-out order + blind mapping + one-failure-isolation, endpoint discovery (local always present, cloud gated on AVA-ready, config override, cookbook skips unhealthy), skill parse/format/blind-key/empty/failure
python3.13 -m pytest --ignore=tests/test_skills.py -q → 2,159 passed, 77 skipped
AST gate clean; registry discovers compare with SKILL_MCP_EXPOSE=True; manifest → 83 skills
ruff check: 0 issues

Branches off main; independent of #174 (the Cookbook port-fix re-land) — different files, clean merge in either order.

🤖 Generated with Claude Code

Phase 2 of the locked sequence (Cookbook → Compare → …). Compare sits directly on the rest of the stack and reuses the canonical callers instead of re-implementing HTTP:

* OpenAI-compatible endpoints (local Qwen @ 8083, every Cookbook-served model on its 811x port) → codec_llm.call
* cloud tiers (Gemini/Claude/GPT via the AVA proxy) → codec_ava_client

Endpoint set = three canonical tiers + whatever Cookbook is serving:

* local: config llm_base_url / llm_model
* cloud-balanced: gemini-2.5-flash via AVA (only if license/proxy ready)
* cloud-pro: gemini-2.5-pro via AVA
* cookbook-<id>: each healthy model from codec_cookbook.serve.list_served()

Reuse of Cookbook's registry is the whole point: `cookbook_list`'s served.json feeds Compare's fan-out for free. The two cloud tiers are overridable in ~/.codec/config.json:compare.cloud_tiers (defaults grounded in codec_ava_client.choose_model's fast/balanced/pro map; see PR note).

Fan-out is concurrent (ThreadPoolExecutor), per-endpoint timed, and never lets one endpoint's failure sink the others (each query is try/except'd → ok/error). Blind mode anonymizes display labels to Model A/B/… and returns the anon→real mapping separately so the caller reveals it only after judging.

skills/compare.py is thin (imports codec_compare + re → passes the AST gate), parses the prompt + 'blind' flag, formats labeled or blind output (+ a "judge first, then peek" key in blind mode). SKILL_MCP_EXPOSE=True: it's a query skill (no process/file mutation; same cost profile as chat). Manifest → 83.

Tests: tests/test_compare.py, 19 tests, all callers + Cookbook registry mocked (offline, no network): _query_one openai/ava/error/system paths, fan-out order + blind mapping + one-failure-isolation, endpoint discovery (local always, cloud only when AVA ready, config override, cookbook skips unhealthy), skill parse/format/blind-key/empty/failure. Full suite: 2,159 passed / 77 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

AVADSA25 merged commit 0f00aa3 into main Jun 1, 2026
1 check passed

AVADSA25 merged 1 commit into main from compare-phase2
