feat(compare): fan one prompt across model tiers (local + cloud + Cookbook), labeled or blind #175
Merged
Phase 2 of the locked sequence (Cookbook → Compare → …). Compare sits directly
on the rest of the stack and reuses the canonical callers instead of
re-implementing HTTP:
* OpenAI-compatible endpoints (local Qwen @ 8083, every Cookbook-served model
on its 811x port) → codec_llm.call
* cloud tiers (Gemini/Claude/GPT via the AVA proxy) → codec_ava_client
Endpoint set = three canonical tiers + whatever Cookbook is serving:
local — config llm_base_url / llm_model
cloud-balanced — gemini-2.5-flash via AVA (only if license/proxy ready)
cloud-pro — gemini-2.5-pro via AVA
cookbook-<id> — each healthy model from codec_cookbook.serve.list_served()
Reuse of Cookbook's registry is the whole point: `cookbook_list`'s served.json
feeds Compare's fan-out for free. The two cloud tiers are overridable in
~/.codec/config.json:compare.cloud_tiers (defaults grounded in
codec_ava_client.choose_model's fast/balanced/pro map — see PR note).
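
A minimal sketch of the discovery rules above, not the code in this PR. `discover_endpoints`, the endpoint dict shape, the `{"label", "model"}` shape of `compare.cloud_tiers`, the `id`/`port`/`healthy` fields on served entries, and the `/v1` URL suffix are all assumptions made for illustration; only the tier names, model names, and config key come from the description.

```python
from typing import Any

# Default cloud tiers named in this PR; the dict shape is an assumption.
DEFAULT_CLOUD_TIERS = [
    {"label": "cloud-balanced", "model": "gemini-2.5-flash"},
    {"label": "cloud-pro", "model": "gemini-2.5-pro"},
]


def discover_endpoints(config: dict[str, Any], ava_ready: bool,
                       served: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: build the endpoint list for one Compare run."""
    # The local tier is always present.
    endpoints = [{
        "label": "local",
        "kind": "openai",
        "base_url": config.get("llm_base_url"),
        "model": config.get("llm_model"),
    }]
    # Cloud tiers appear only when the AVA license/proxy is ready;
    # compare.cloud_tiers in the config replaces the default list.
    if ava_ready:
        tiers = config.get("compare", {}).get("cloud_tiers", DEFAULT_CLOUD_TIERS)
        for tier in tiers:
            endpoints.append({"label": tier["label"], "kind": "ava",
                              "model": tier["model"]})
    # One endpoint per healthy Cookbook-served model; unhealthy ones are skipped.
    for entry in served:
        if entry.get("healthy"):
            endpoints.append({
                "label": f"cookbook-{entry['id']}",
                "kind": "openai",
                "base_url": f"http://localhost:{entry['port']}/v1",  # assumed URL form
                "model": entry["id"],
            })
    return endpoints
```

In this sketch, `served` would be whatever codec_cookbook.serve.list_served() returns, adapted to the assumed fields.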
Fan-out is concurrent (ThreadPoolExecutor), per-endpoint timed, and never lets
one endpoint's failure sink the others (each query is try/except'd → ok/error).
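
The fan-out pattern described above can be sketched as follows. This is an illustration under assumptions, not the code in codec_compare: `query_one` and `fan_out` here are hypothetical stand-ins, each endpoint is reduced to a `(label, callable)` pair, and the result keys other than `ok`/`error` are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def query_one(label: str, call: Callable[[str], str], prompt: str) -> dict:
    """Run one endpoint, time it, and turn any exception into an error result."""
    start = time.perf_counter()
    try:
        result = {"label": label, "ok": True, "text": call(prompt)}
    except Exception as exc:  # isolation: a failing endpoint must not sink the others
        result = {"label": label, "ok": False, "error": str(exc)}
    result["seconds"] = time.perf_counter() - start
    return result


def fan_out(prompt: str, endpoints: list[tuple[str, Callable[[str], str]]]) -> list[dict]:
    """Query all endpoints concurrently; results keep the input order."""
    if not endpoints:
        return []  # ThreadPoolExecutor rejects max_workers=0
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = [pool.submit(query_one, label, call, prompt)
                   for label, call in endpoints]
        # Collect in submission order so output order matches endpoint order.
        return [future.result() for future in futures]
```

Because the try/except lives inside the worker function, `future.result()` never raises for an endpoint error, so one bad tier only affects its own entry.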
Blind mode anonymizes display labels to Model A/B/… and returns the anon→real
mapping separately so the caller reveals it only after judging.
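
As a rough sketch of the blind-mode contract (anonymized results plus a separate mapping), assuming results are dicts with a `label` key. `anonymize` is a hypothetical name, and whether the real implementation shuffles the order before labeling is not stated in this PR, so this version keeps the input order.

```python
import string


def anonymize(results: list[dict]) -> tuple[list[dict], dict[str, str]]:
    """Swap real labels for Model A/B/... and return the anon->real mapping separately."""
    if len(results) > len(string.ascii_uppercase):
        raise ValueError("this sketch only labels up to 26 results")
    blinded: list[dict] = []
    mapping: dict[str, str] = {}
    for letter, result in zip(string.ascii_uppercase, results):
        anon = f"Model {letter}"
        mapping[anon] = result["label"]
        # Copy the result so the caller's original (labeled) data is untouched.
        blinded.append({**result, "label": anon})
    return blinded, mapping
```

Keeping the mapping out of the blinded results is what lets the caller show answers first and reveal the key only after judging.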
skills/compare.py is thin (imports codec_compare + re → passes the AST gate):
it parses the prompt + 'blind' flag and formats labeled or blind output (+ a
"judge first, then peek" key in blind mode). SKILL_MCP_EXPOSE=True: it's a
query skill (no process/file mutation; same cost profile as chat). Manifest → 83.
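
One way the prompt + 'blind' flag parsing could look with only `re`. The flag syntax is an assumption (the bare word "blind" at the start of the request, optionally followed by a colon); the PR does not spell out the real syntax, and `parse_request` is a hypothetical name.

```python
import re

# Assumption: the flag is the bare word "blind" at the start of the request,
# optionally followed by ":"; the real skill's syntax may differ.
_BLIND_RE = re.compile(r"^\s*blind\b[:\s]*", re.IGNORECASE)


def parse_request(text: str) -> tuple[str, bool]:
    """Split a raw request into (prompt, blind_flag)."""
    match = _BLIND_RE.match(text)
    if match:
        return text[match.end():].strip(), True
    return text.strip(), False
```

The `\b` word boundary keeps a prompt that merely starts with a word like "blindly" from being treated as the flag.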
Tests: tests/test_compare.py — 19, all callers + Cookbook registry mocked
(offline, no network): _query_one openai/ava/error/system paths, fan-out order
+ blind mapping + one-failure-isolation, endpoint discovery (local always, cloud
only when AVA ready, config override, cookbook skips unhealthy), skill parse/
format/blind-key/empty/failure. Full suite: 2,159 passed / 77 skipped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Phase 2 of the locked sequence (Cookbook → Compare → Email triage → Image). Compare is the thin fan-out you described: one prompt → the canonical tiers + anything Cookbook is serving → collect → return labeled or blind. It reuses Cookbook's registry (serve.list_served()) and the canonical callers (codec_llm.call, codec_ava_client), so there is no new HTTP.

Endpoint set
* local: llm_base_url / llm_model (Qwen @ 8083) → codec_llm.call
* cloud-balanced: gemini-2.5-flash via AVA (only when license/proxy ready)
* cloud-pro: gemini-2.5-pro via AVA
* cookbook-<id>: each healthy model from serve.list_served(), on its 811x port

Behavior
* Fan-out is concurrent (ThreadPoolExecutor), per-endpoint timed, and failure-isolated: one endpoint erroring (license expired, model down) is captured as {ok: False, error} and never sinks the others.
* Blind mode anonymizes display labels to Model A/B/… and returns the anon→real mapping separately; the skill prints answers first, then a "judge first, then peek" key.
* skills/compare.py is thin (imports codec_compare + re → passes the AST gate); SKILL_MCP_EXPOSE=True (query skill, no mutation).

Flag for you: the "three-tier" interpretation
You said "the three-tier endpoints" without pinning them. I grounded it in the real config: local + two AVA cloud tiers (flash/pro), matching choose_model's fast/balanced/pro map. It's config-driven: compare.cloud_tiers in ~/.codec/config.json overrides the cloud list (e.g. to add Claude/GPT or change the tiers) with zero code change. If your "three tiers" meant something specific (e.g. flash-lite/flash/pro, or local/Claude/GPT), it's a one-line config edit, not a rebuild. Tell me and I'll set the default accordingly.

Test plan
* tests/test_compare.py: 19 tests, all model callers + Cookbook registry mocked (offline): query paths (openai/ava/error/system-prompt), fan-out order + blind mapping + one-failure-isolation, endpoint discovery (local always present, cloud gated on AVA-ready, config override, cookbook skips unhealthy), skill parse/format/blind-key/empty/failure
* python3.13 -m pytest --ignore=tests/test_skills.py -q → 2,159 passed, 77 skipped
* compare with SKILL_MCP_EXPOSE=True; manifest → 83 skills
* ruff check: 0 issues

🤖 Generated with Claude Code