
feat(compare): fan one prompt across model tiers (local + cloud + Cookbook), labeled or blind#175

Merged: AVADSA25 merged 1 commit into main from compare-phase2 on Jun 1, 2026

@AVADSA25 commented Jun 1, 2026

Summary

Phase 2 of the locked sequence (Cookbook → Compare → Email triage → Image). Compare is the thin fan-out you described: one prompt → the canonical tiers + anything Cookbook is serving → collect → return labeled or blind. It reuses Cookbook's registry (serve.list_served()) and the canonical callers (codec_llm.call, codec_ava_client) — no new HTTP.

Endpoint set

| tier | source |
| --- | --- |
| local | config llm_base_url / llm_model (Qwen @ 8083) → codec_llm.call |
| cloud-balanced | gemini-2.5-flash via AVA (only when license/proxy ready) |
| cloud-pro | gemini-2.5-pro via AVA |
| cookbook-<id> | each healthy model from serve.list_served() on its 811x port |
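
As a rough illustration of how the endpoint set above could be assembled, here is a minimal sketch. It is not the code in this PR: the `Endpoint` dataclass, the `discover_endpoints` signature, the shape of the served entries (`id`, `model`, `port`, `healthy`), and the base-URL format are all assumptions made for the example. Only the tier names and the rules (local always present, cloud gated on AVA readiness, unhealthy Cookbook models skipped) come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    label: str          # "local", "cloud-pro", "cookbook-<id>", ...
    kind: str           # "openai" (OpenAI-compatible HTTP) or "ava" (cloud proxy)
    model: str
    base_url: str = ""  # unused for AVA-routed tiers in this sketch


def discover_endpoints(config, ava_ready, cloud_tiers, served):
    """Build the fan-out list: local first, then cloud tiers, then Cookbook."""
    # The local tier is always present.
    endpoints = [
        Endpoint("local", "openai", config["llm_model"], config["llm_base_url"])
    ]
    # Cloud tiers are included only when the AVA license/proxy is usable.
    if ava_ready:
        for tier in cloud_tiers:
            endpoints.append(Endpoint(tier["label"], "ava", tier["model"]))
    # Cookbook models: skip anything that is not reporting healthy.
    # The entry fields and URL layout here are assumed, not taken from the PR.
    for entry in served:
        if entry.get("healthy"):
            url = f"http://127.0.0.1:{entry['port']}/v1"
            endpoints.append(
                Endpoint(f"cookbook-{entry['id']}", "openai", entry["model"], url)
            )
    return endpoints
```

Passing the readiness flag, the cloud tier list, and the served list in as arguments keeps the function pure, which is one way to make the "all callers + registry mocked" style of test straightforward.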

Behavior

  • Concurrent fan-out (ThreadPoolExecutor), per-endpoint timed, and failure-isolated — one endpoint erroring (license expired, model down) is captured as {ok: False, error} and never sinks the others.
  • Blind mode anonymizes display labels to Model A/B/… and returns the anon→real mapping separately; the skill prints answers first, then a "judge first, then peek" key.
  • Skill is thin (codec_compare + re → passes the AST gate); SKILL_MCP_EXPOSE=True (query skill, no mutation).
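
The fan-out behavior described above can be sketched as follows. This is a minimal illustration, not the implementation in codec_compare: `query_one`, `fan_out`, and the `(label, caller)` pair shape are hypothetical names chosen for the example, and each `caller` stands in for whatever wraps codec_llm.call or codec_ava_client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_one(label, caller, prompt):
    """Call one endpoint; always return a result dict, never raise."""
    start = time.perf_counter()
    try:
        result = {"label": label, "ok": True, "text": caller(prompt)}
    except Exception as exc:  # isolate any failure to this one endpoint
        result = {"label": label, "ok": False, "error": str(exc)}
    # Per-endpoint wall-clock time, recorded for successes and failures alike.
    result["seconds"] = time.perf_counter() - start
    return result


def fan_out(prompt, endpoints):
    """Query every (label, caller) pair concurrently; results keep input order."""
    if not endpoints:
        return []  # ThreadPoolExecutor rejects max_workers=0
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = [
            pool.submit(query_one, label, caller, prompt)
            for label, caller in endpoints
        ]
        # Collect in submission order rather than completion order.
        return [future.result() for future in futures]
```

Because the try/except lives inside the worker function, `future.result()` never re-raises, so an expired license or a model that is down shows up as `{ok: False, error}` alongside the successful answers.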

Flag for you — the "three-tier" interpretation

You said "the three-tier endpoints" without pinning them. I grounded it in the real config: local + two AVA cloud tiers (flash/pro), matching choose_model's fast/balanced/pro map. It's config-driven: compare.cloud_tiers in ~/.codec/config.json overrides the cloud list (e.g. to add Claude/GPT or change the tiers) with zero code change. If your "three tiers" meant something specific (e.g. flash-lite/flash/pro, or local/Claude/GPT), it's a one-line config edit, not a rebuild; tell me and I'll set the default accordingly.
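
To make the override concrete, here is a hedged sketch of how a `compare.cloud_tiers` entry could take precedence over the defaults. The entry schema (`label` and `model` keys), the `resolve_cloud_tiers` helper, and the example model id are assumptions for illustration; the PR only states that the key exists and overrides the cloud list.

```python
import json

# Defaults as listed in this PR's tier table; the dict schema is assumed.
DEFAULT_CLOUD_TIERS = [
    {"label": "cloud-balanced", "model": "gemini-2.5-flash"},
    {"label": "cloud-pro", "model": "gemini-2.5-pro"},
]


def resolve_cloud_tiers(config):
    """Return the configured cloud tiers, falling back to the defaults."""
    override = config.get("compare", {}).get("cloud_tiers")
    # A missing or empty override means "use the defaults".
    return override if override else list(DEFAULT_CLOUD_TIERS)


# Hypothetical contents of ~/.codec/config.json; the label and model id
# below are placeholders, not real identifiers.
example_config = json.loads(
    '{"compare": {"cloud_tiers": '
    '[{"label": "cloud-custom", "model": "example-model-id"}]}}'
)
print(resolve_cloud_tiers(example_config))
```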

Test plan

  • tests/test_compare.py: 19 tests, all model callers + Cookbook registry mocked (offline): query paths (openai/ava/error/system-prompt), fan-out order + blind mapping + one-failure-isolation, endpoint discovery (local always present, cloud gated on AVA-ready, config override, cookbook skips unhealthy), skill parse/format/blind-key/empty/failure
  • python3.13 -m pytest --ignore=tests/test_skills.py -q: 2,159 passed, 77 skipped
  • AST gate clean; registry discovers compare with SKILL_MCP_EXPOSE=True; manifest → 83 skills
  • ruff check: 0 issues

Branches off main; independent of #174 (the Cookbook port-fix re-land) — different files, clean merge in either order.

🤖 Generated with Claude Code

Phase 2 of the locked sequence (Cookbook → Compare → …). Compare sits directly
on the rest of the stack and reuses the canonical callers instead of
re-implementing HTTP:
  * OpenAI-compatible endpoints (local Qwen @ 8083, every Cookbook-served model
    on its 811x port) → codec_llm.call
  * cloud tiers (Gemini/Claude/GPT via the AVA proxy) → codec_ava_client

Endpoint set = three canonical tiers + whatever Cookbook is serving:
  local            — config llm_base_url / llm_model
  cloud-balanced    — gemini-2.5-flash via AVA   (only if license/proxy ready)
  cloud-pro         — gemini-2.5-pro via AVA
  cookbook-<id>     — each healthy model from codec_cookbook.serve.list_served()

Reuse of Cookbook's registry is the whole point: `cookbook_list`'s served.json
feeds Compare's fan-out for free. The two cloud tiers are overridable in
~/.codec/config.json:compare.cloud_tiers (defaults grounded in
codec_ava_client.choose_model's fast/balanced/pro map — see PR note).

Fan-out is concurrent (ThreadPoolExecutor), per-endpoint timed, and never lets
one endpoint's failure sink the others (each query is try/except'd → ok/error).
Blind mode anonymizes display labels to Model A/B/… and returns the anon→real
mapping separately so the caller reveals it only after judging.
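
The blind-mode relabeling described above can be illustrated with a short sketch. The `anonymize` helper and the result-dict shape are hypothetical and not taken from codec_compare. The sketch keeps results in fan-out order and does not shuffle them, since the PR does not say whether the real implementation shuffles.

```python
import string


def anonymize(results):
    """Return (blinded results, anon -> real label mapping)."""
    if len(results) > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 endpoints")
    blinded, mapping = [], {}
    for letter, result in zip(string.ascii_uppercase, results):
        anon = f"Model {letter}"
        mapping[anon] = result["label"]
        # Copy the result so the caller's original dicts are left untouched.
        blinded.append({**result, "label": anon})
    return blinded, mapping
```

Returning the mapping as a separate value lets the caller print the answers first and reveal the key only after judging.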

skills/compare.py is thin (imports codec_compare + re → passes the AST gate),
parses the prompt + 'blind' flag, formats labeled or blind output (+ a "judge
first, then peek" key in blind mode). SKILL_MCP_EXPOSE=True — it's a query
skill (no process/file mutation; same cost profile as chat). Manifest → 83.

Tests: tests/test_compare.py — 19, all callers + Cookbook registry mocked
(offline, no network): _query_one openai/ava/error/system paths, fan-out order
+ blind mapping + one-failure-isolation, endpoint discovery (local always, cloud
only when AVA ready, config override, cookbook skips unhealthy), skill parse/
format/blind-key/empty/failure. Full suite: 2,159 passed / 77 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@AVADSA25 merged commit 0f00aa3 into main on Jun 1, 2026
1 check passed