
feat: add Claude Code CLI as VLM provider #115

Merged
dippatel1994 merged 11 commits into llmsresearch:main from biecho:feat/claude-code-vlm-provider
Apr 7, 2026

Conversation

@biecho
Contributor

@biecho biecho commented Mar 24, 2026

Summary

  • Adds a claude_code VLM provider that uses the locally installed claude CLI as the backend for planner, stylist, and critic agents
  • No API key needed, uses the user's existing Claude Code subscription
  • Maintains conversation context across pipeline steps via --resume, so the critic knows what the planner intended

Usage

paperbanana generate \
  --vlm-provider claude_code \
  --vlm-model sonnet \
  --image-model gemini-2.5-flash-image \
  -i input.txt -o output.png

Changes since review

Addressed all feedback from @dippatel1994:

  1. Secure temp files — replaced tempfile.mktemp() with mkstemp(); temp images cleaned up in try/finally (even on subprocess or OSError)
  2. Fixed image ordering — preamble built in-order via list, prepended once
  3. Concurrency safety — asyncio.Lock serialises generate() calls to prevent _session_id races
  4. CLI validation — is_available() check in ProviderRegistry.create_vlm() with a clear error message
  5. Tests — 24 tests covering registry, JSON parsing, session chaining, prompt construction, image ordering, temp cleanup, error handling, unsupported-param warning, and concurrency
  6. temperature/max_tokens — warning logged when non-default values are passed (CLI has no flags for these)
  7. Lint — all checks pass
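The temp-file handling from point 1 above can be sketched standalone: mkstemp creates the file securely, and a try/finally guarantees cleanup. The write_temp_image helper and byte payload are illustrative, not the provider's code.

```python
import os
import tempfile
from pathlib import Path

def write_temp_image(data: bytes, index: int) -> Path:
    # mkstemp creates the file securely (no mktemp TOCTOU window)
    fd, name = tempfile.mkstemp(suffix=f"_pb_img_{index}.png")
    os.close(fd)                   # close the fd; the image library reopens by path
    path = Path(name)
    path.write_bytes(data)         # stand-in for img.save(path, format="PNG")
    return path

temp_files: list[Path] = []
try:
    temp_files.append(write_temp_image(b"\x89PNG\r\n", 0))
    # ... the subprocess call would run here and may raise ...
finally:
    for tmp in temp_files:         # runs even if the subprocess step raises
        tmp.unlink(missing_ok=True)

leaked = [t for t in temp_files if t.exists()]
print(leaked)  # []
```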

Test plan

  • Verified claude -p --output-format json returns structured output with session_id
  • Verified --resume <session_id> maintains conversation context
  • End-to-end test: full paperbanana pipeline (planner + stylist + 2 critic iterations) with claude_code VLM + Gemini image gen
  • 24 unit tests pass (pytest tests/test_providers/test_claude_code_vlm.py)

Copilot AI review requested due to automatic review settings March 24, 2026 08:39
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new VLM provider that routes PaperBanana agent calls through the locally installed claude CLI (“Claude Code”), wiring it into the provider registry so it can be selected via --vlm-provider claude_code.

Changes:

  • Introduces ClaudeCodeVLM, a VLMProvider implementation that shells out to claude -p --output-format json and tracks session_id for --resume.
  • Updates ProviderRegistry.create_vlm() to support vlm_provider == "claude_code".

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
paperbanana/providers/vlm/claude_code.py New Claude Code CLI-backed VLM provider with session resumption and image handling via temp files.
paperbanana/providers/registry.py Registers the new claude_code provider option in the VLM factory.

Comment on lines +130 to +133
elif provider == "claude_code":
    from paperbanana.providers.vlm.claude_code import ClaudeCodeVLM

    return ClaudeCodeVLM(model=settings.vlm_model)

Copilot AI Mar 24, 2026


ProviderRegistry.create_vlm() creates ClaudeCodeVLM without validating that the claude executable is installed. If it’s missing, the first call will fail with a low-level FileNotFoundError. Consider checking shutil.which("claude") (or provider.is_available()) here and raising a helpful ValueError with installation instructions.
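The suggested guard can be sketched as follows; the error-message wording here is an assumption, not the text that landed in ProviderRegistry.create_vlm().

```python
import shutil
from unittest import mock

def ensure_claude_available() -> None:
    # Fail fast with a clear error instead of a later FileNotFoundError
    if shutil.which("claude") is None:
        raise ValueError(
            "claude CLI not found on PATH; install Claude Code to use "
            "--vlm-provider claude_code"
        )

# Simulate a machine without the CLI installed
with mock.patch("shutil.which", return_value=None):
    try:
        ensure_claude_available()
        raised = False
    except ValueError:
        raised = True

print(raised)  # True
```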

Copilot uses AI. Check for mistakes.
Comment on lines +130 to +133
elif provider == "claude_code":
    from paperbanana.providers.vlm.claude_code import ClaudeCodeVLM

    return ClaudeCodeVLM(model=settings.vlm_model)

Copilot AI Mar 24, 2026


There are existing registry/provider creation tests (e.g., tests/test_providers/test_registry.py) but no coverage for the new claude_code branch. Add tests that (a) assert ProviderRegistry.create_vlm() returns ClaudeCodeVLM when shutil.which('claude') is patched to a path, and (b) asserts a helpful error when it’s missing (if you add the validation).

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +83 to +84
tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
img.save(tmp, format="PNG")

Copilot AI Mar 24, 2026


tempfile.mktemp() is insecure and can be raced (TOCTOU/symlink) before img.save() writes the file. Use tempfile.NamedTemporaryFile(delete=False, suffix=...) or tempfile.mkstemp() to create the file securely, and write via the returned handle/path.

Suggested change
-tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
-img.save(tmp, format="PNG")
+# Use a securely created temporary file instead of mktemp
+with tempfile.NamedTemporaryFile(
+    delete=False, suffix=f"_pb_img_{i}.png"
+) as tmp_file:
+    img.save(tmp_file, format="PNG")
+tmp = Path(tmp_file.name)

Comment on lines +82 to +90
for i, img in enumerate(images):
    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
    img.save(tmp, format="PNG")
    temp_files.append(tmp)
    full_prompt = (
        f"[Image {i + 1}: see file {tmp}]\n"
        f"Please read the image at {tmp} before responding.\n\n"
        + full_prompt
    )

Copilot AI Mar 24, 2026


The image preamble is prepended to full_prompt inside the loop, which reverses the image ordering (Image 2 header ends up before Image 1, etc.). Build the image preamble separately (or append in-order) so the numbering and prompt order stay consistent.

Suggested change
-for i, img in enumerate(images):
-    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
-    img.save(tmp, format="PNG")
-    temp_files.append(tmp)
-    full_prompt = (
-        f"[Image {i + 1}: see file {tmp}]\n"
-        f"Please read the image at {tmp} before responding.\n\n"
-        + full_prompt
-    )
+image_preamble_parts: list[str] = []
+for i, img in enumerate(images):
+    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
+    img.save(tmp, format="PNG")
+    temp_files.append(tmp)
+    image_preamble_parts.append(
+        f"[Image {i + 1}: see file {tmp}]\n"
+        f"Please read the image at {tmp} before responding.\n\n"
+    )
+image_preamble = "".join(image_preamble_parts)
+full_prompt = image_preamble + full_prompt

Comment on lines +31 to +66
def __init__(self, model: str = "sonnet"):
    self._model = model
    self._session_id: Optional[str] = None

@property
def name(self) -> str:
    return "claude_code"

@property
def model_name(self) -> str:
    return f"claude-code ({self._model})"

def is_available(self) -> bool:
    import shutil

    return shutil.which("claude") is not None

async def generate(
    self,
    prompt: str,
    images: Optional[list[Image.Image]] = None,
    system_prompt: Optional[str] = None,
    temperature: float = 1.0,
    max_tokens: int = 4096,
    response_format: Optional[str] = None,
) -> str:
    cmd = [
        "claude",
        "-p",
        "--output-format", "json",
        "--model", self._model,
    ]

    if self._session_id:
        cmd += ["--resume", self._session_id]


Copilot AI Mar 24, 2026


ClaudeCodeVLM maintains a mutable _session_id across calls, but the pipeline can invoke vlm.generate() concurrently (e.g., InputOptimizerAgent uses asyncio.gather). Concurrent calls can race on _session_id updates and mix conversation state. Consider serializing calls with an asyncio.Lock (or using separate sessions per task / disabling session resume for parallel calls).
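The locking option the comment asks for can be modeled in a few lines. This is a toy stand-in, not the merged provider: asyncio.sleep(0) stands in for the subprocess await, and SessionedStub is an invented name.

```python
import asyncio

class SessionedStub:
    """Toy model of serialising generate() with an asyncio.Lock."""

    def __init__(self) -> None:
        self._session_id: str | None = None
        self._lock = asyncio.Lock()
        self.observed: list[str | None] = []

    async def generate(self, prompt: str) -> str:
        async with self._lock:            # one call at a time: no _session_id races
            self.observed.append(self._session_id)
            await asyncio.sleep(0)        # stand-in for awaiting the claude subprocess
            self._session_id = f"session-after-{prompt}"
            return prompt

async def main() -> SessionedStub:
    stub = SessionedStub()
    # Concurrent callers (as in asyncio.gather pipelines) are serialised by the lock
    await asyncio.gather(stub.generate("a"), stub.generate("b"))
    return stub

stub = asyncio.run(main())
print(stub.observed)  # each call saw a session value read consistently under the lock
```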

Comment on lines +102 to +112
proc = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await proc.communicate()

# Clean up temp image files
for tmp in temp_files:
    tmp.unlink(missing_ok=True)


Copilot AI Mar 24, 2026


Temporary image files are only cleaned up after proc.communicate(). If create_subprocess_exec fails (e.g., claude not found) or an exception is raised before cleanup, the temp files will leak. Wrap the subprocess section in try/finally so cleanup always runs.

Suggested change
-proc = await asyncio.create_subprocess_exec(
-    *cmd,
-    stdout=asyncio.subprocess.PIPE,
-    stderr=asyncio.subprocess.PIPE,
-)
-stdout, stderr = await proc.communicate()
-# Clean up temp image files
-for tmp in temp_files:
-    tmp.unlink(missing_ok=True)
+proc = None
+stdout = None
+stderr = None
+try:
+    proc = await asyncio.create_subprocess_exec(
+        *cmd,
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+    )
+    stdout, stderr = await proc.communicate()
+finally:
+    # Clean up temp image files
+    for tmp in temp_files:
+        tmp.unlink(missing_ok=True)

    temperature: float = 1.0,
    max_tokens: int = 4096,
    response_format: Optional[str] = None,
) -> str:

Copilot AI Mar 24, 2026


temperature and max_tokens parameters are accepted by the interface but currently unused in this provider, which can surprise callers (agents set these expecting effect). If the claude CLI supports these controls, pass them through; otherwise consider documenting the limitation and/or raising when non-default values are provided.

Suggested change
-) -> str:
+) -> str:
+    # The local `claude` CLI backend does not currently support explicit
+    # temperature or max_tokens controls. To avoid surprising callers,
+    # reject non-default values instead of silently ignoring them.
+    if temperature != 1.0 or max_tokens != 4096:
+        logger.warning(
+            "ClaudeCodeVLM does not support temperature/max_tokens controls; "
+            "received non-default values that will be rejected",
+            temperature=temperature,
+            max_tokens=max_tokens,
+        )
+        raise ValueError(
+            "ClaudeCodeVLM (claude CLI backend) does not currently support "
+            "non-default 'temperature' or 'max_tokens' values."
+        )

Comment on lines +24 to +29
class ClaudeCodeVLM(VLMProvider):
    """VLM provider that shells out to the `claude` CLI.

    Maintains a single conversation session across planner, stylist,
    and critic calls so each step has full context of prior steps.
    """

Copilot AI Mar 24, 2026


The class/docstring (and PR description) says session continuity is maintained across planner/stylist/critic, but the pipeline shares the same VLM provider instance across all agents (optimizer/retriever/visualizer too). If you only want continuity for specific steps, add a way to reset/disable --resume for other agents or clarify the behavior in docs/description.
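One way to scope continuity to specific agents, as the comment hints, is a small reset hook. This is a hypothetical sketch: SessionHolder, reset_session, and resume_args are invented names, not anything in the PR.

```python
from typing import Optional

class SessionHolder:
    """Hypothetical holder for the --resume state discussed above."""

    def __init__(self) -> None:
        self._session_id: Optional[str] = None

    def remember(self, session_id: str) -> None:
        self._session_id = session_id

    def reset_session(self) -> None:
        # Agents that must start a fresh conversation call this first
        self._session_id = None

    def resume_args(self) -> list[str]:
        return ["--resume", self._session_id] if self._session_id else []

h = SessionHolder()
h.remember("abc-123")
print(h.resume_args())  # ['--resume', 'abc-123']
h.reset_session()
print(h.resume_args())  # []
```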

@dippatel1994
Member

Thanks for this @biecho, a local claude CLI backend with JSON parsing and --resume is a nice option for users on Claude Code.

Before merge I’d like to see:

  1. Replace tempfile.mktemp() with a safe temp file API (NamedTemporaryFile(delete=False) / mkstemp) + finally cleanup so temp images are always removed on errors.
  2. Fix multi-image prompt construction — the current loop prepends each image block to full_prompt, which reverses the logical order vs image index; build a preamble then append the task prompt once.
  3. Concurrency / session — _session_id is shared mutable state; if the pipeline can call generate() concurrently on the same provider, we need a lock or explicit behavior. Please confirm sequential use or serialize.
  4. Optional: validate claude in PATH in create_vlm (or __init__) for a clearer error than FileNotFoundError.
  5. Tests — at least registry + a mocked subprocess happy path for JSON parsing / session_id.

Nit: temperature / max_tokens are ignored; document the limitation or wire them through if the CLI supports them. Also, please check the failed lint task in the workflow run.

Happy to re-review after the temp-file + prompt-order + concurrency story is addressed.

@biecho
Contributor Author

biecho commented Mar 25, 2026

Thanks for the review @dippatel1994! All points addressed in the latest push. Let me know if anything else needs attention.

Member

@dippatel1994 dippatel1994 left a comment


Good structure — secure temp files, concurrency lock, clean test suite. Three must-fixes:

  1. Missing @retry decorator — Every other VLM provider uses tenacity with exponential backoff. This one has none. A transient CLI failure will abort the pipeline immediately.

  2. No cost_tracker integration — The JSON response includes total_cost_usd and usage fields. Every other provider records cost via self.cost_tracker.record_vlm_call(...). This one ignores it entirely.

  3. Use --system-prompt CLI flag — Currently system instructions are embedded in the user prompt text. The claude CLI has a --system-prompt flag — use it so the model actually treats it as a system prompt.

Non-blocking: The capsys-based warning test may be fragile since structlog doesn't necessarily write to capsys-captured stdout. Consider mocking the logger instead.

biecho added 6 commits April 3, 2026 11:27
- Replace insecure tempfile.mktemp() with mkstemp(); close fd
  immediately before writing to avoid resource leak
- Build image preamble in-order via list then prepend once, fixing
  reversed Image 2 / Image 1 ordering
- Wrap subprocess call in try/finally so temp image files are always
  cleaned up, even when create_subprocess_exec raises
- Add asyncio.Lock so concurrent callers don't race on _session_id
- Log a warning when non-default temperature or max_tokens are passed,
  since the CLI has no flags for these
- Move shutil import to module level
Check is_available() in ProviderRegistry.create_vlm() and raise a
clear ValueError instead of letting the first subprocess call fail
with a cryptic FileNotFoundError.
24 tests covering: registry creation and missing-CLI error, basic text
generation, session resumption and chaining, non-JSON fallback, prompt
construction (system prompt + JSON mode + images), image ordering,
temp file cleanup on success / failure / OSError, model passthrough,
empty images list, session preservation, missing result-key fallback,
error stderr/stdout precedence, error truncation, unsupported-param
warning, and concurrent-call serialisation.
- Add @retry with exponential backoff (3 attempts, 2–30s) matching all
  other VLM providers so transient CLI failures don't abort the pipeline
- Pass system_prompt via --system-prompt CLI flag instead of embedding
  it as text in the user prompt
- Add usage= to debug log for consistency with other providers
- Replace fragile capsys-based warning test with a logger mock
@biecho biecho force-pushed the feat/claude-code-vlm-provider branch from c29682a to a09ad94 on April 3, 2026 03:41
@biecho
Contributor Author

biecho commented Apr 3, 2026

Thanks for the review! Addressed in a09ad94:

  1. @retry — Added @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30), reraise=True) on generate(), matching the pattern used by all other providers.

  2. cost_tracker — I looked into this but there's no cost_tracker class or record_vlm_call method anywhere in the codebase. All existing providers log usage via logger.debug(... usage=...) — none of them call self.cost_tracker.record_vlm_call(...). I've added usage=data.get("usage") to the debug log for consistency. Happy to integrate with a cost tracking system if one gets added, but there's nothing to wire into today.

  3. --system-prompt — switched from embedding in prompt text to the --system-prompt CLI flag.

Non-blocking: Replaced the capsys warning test with a logger mock.
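The PR uses tenacity's @retry for point 1; the same stop-after-3 / exponential-wait / reraise behavior can be sketched with only the standard library. retry_async and its _sleep hook are illustrative stand-ins, not the merged code.

```python
import asyncio

def retry_async(attempts: int = 3, min_wait: float = 2.0, max_wait: float = 30.0,
                _sleep=asyncio.sleep):
    """Stdlib stand-in for @retry(stop=stop_after_attempt(3),
    wait=wait_exponential(min=2, max=30), reraise=True)."""
    def decorator(fn):
        async def wrapper(*args, **kwargs):
            delay = min_wait
            for attempt in range(1, attempts + 1):
                try:
                    return await fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise              # reraise=True: surface the final error
                    await _sleep(delay)
                    delay = min(delay * 2, max_wait)  # exponential backoff, capped
        return wrapper
    return decorator

calls = {"n": 0}

@retry_async(_sleep=lambda d: asyncio.sleep(0))  # skip real waits for this demo
async def flaky_generate() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient CLI failure")
    return "ok"

print(asyncio.run(flaky_generate()))  # survives two transient failures
```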

Member

@dippatel1994 dippatel1994 left a comment


Retry and --system-prompt are fixed, nice work. Two remaining things:

  1. Lint fails - import sorting error at tests/test_providers/test_claude_code_vlm.py:3. Run ruff check --fix tests/test_providers/test_claude_code_vlm.py and push.

  2. cost_tracker - still missing. If you plan to add it after PR #111 merges (like #120 is doing), that's fine, just mention it so we can track it.

Fix import sorting (ruff I001) and apply ruff format to both provider
and test files so CI passes all three checks.
@biecho
Contributor Author

biecho commented Apr 3, 2026

Lint — Fixed import sorting and also ran ruff format on both files (there were formatting issues too that would've failed CI). Verified all three CI checks pass locally: ruff check, ruff format --check, and pytest (24/24).

cost_tracker — Confirmed there's no cost_tracker / record_vlm_call in the codebase today — all existing providers just log usage= via logger.debug, which this provider already does (including cost_usd). Happy to wire it in once the cost tracking system lands (tracking with #120).

Member

@dippatel1994 dippatel1994 left a comment


Lint fixed, CI fully green. Retry and --system-prompt both addressed. Cost tracker can be added in a follow-up after #111 merges. LGTM.

@dippatel1994 dippatel1994 merged commit ae01fc9 into llmsresearch:main Apr 7, 2026
11 checks passed


3 participants