feat: add Claude Code CLI as VLM provider#115
dippatel1994 merged 11 commits into llmsresearch:main
Conversation
Pull request overview
Adds a new VLM provider that routes PaperBanana agent calls through the locally installed claude CLI (“Claude Code”), wiring it into the provider registry so it can be selected via --vlm-provider claude_code.
Changes:
- Introduces `ClaudeCodeVLM`, a `VLMProvider` implementation that shells out to `claude -p --output-format json` and tracks `session_id` for `--resume`.
- Updates `ProviderRegistry.create_vlm()` to support `vlm_provider == "claude_code"`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `paperbanana/providers/vlm/claude_code.py` | New Claude Code CLI-backed VLM provider with session resumption and image handling via temp files. |
| `paperbanana/providers/registry.py` | Registers the new `claude_code` provider option in the VLM factory. |
paperbanana/providers/registry.py (Outdated)

```python
elif provider == "claude_code":
    from paperbanana.providers.vlm.claude_code import ClaudeCodeVLM

    return ClaudeCodeVLM(model=settings.vlm_model)
```
ProviderRegistry.create_vlm() creates ClaudeCodeVLM without validating that the claude executable is installed. If it’s missing, the first call will fail with a low-level FileNotFoundError. Consider checking shutil.which("claude") (or provider.is_available()) here and raising a helpful ValueError with installation instructions.
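A minimal sketch of the suggested guard. The `factory` parameter stands in for the `ClaudeCodeVLM` constructor so the snippet is self-contained; in the PR this logic would live inside `ProviderRegistry.create_vlm()`, and the install hint text is illustrative.

```python
import shutil


def create_vlm(provider: str, model: str, factory):
    """Guarded provider creation: fail fast if the `claude` CLI is missing."""
    if provider == "claude_code":
        if shutil.which("claude") is None:
            # Raise a helpful error up front instead of a later FileNotFoundError
            raise ValueError(
                "claude_code provider selected but the `claude` CLI was not "
                "found on PATH. Install Claude Code and try again."
            )
        return factory(model=model)
    raise ValueError(f"unknown provider: {provider}")
```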
paperbanana/providers/registry.py (Outdated)

```python
elif provider == "claude_code":
    from paperbanana.providers.vlm.claude_code import ClaudeCodeVLM

    return ClaudeCodeVLM(model=settings.vlm_model)
```
There are existing registry/provider creation tests (e.g., tests/test_providers/test_registry.py) but no coverage for the new claude_code branch. Add tests that (a) assert ProviderRegistry.create_vlm() returns ClaudeCodeVLM when shutil.which('claude') is patched to a path, and (b) assert a helpful error is raised when it's missing (if you add the validation).
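A sketch of those two tests. `ClaudeCodeVLM` and `ProviderRegistry` are stubbed here so the example is self-contained and runnable; the real tests would import them from `paperbanana` instead.

```python
import shutil
from unittest import mock


class ClaudeCodeVLM:
    # Stub standing in for the real provider class.
    def __init__(self, model: str = "sonnet"):
        self.model = model


class ProviderRegistry:
    # Stub with the proposed fail-fast validation.
    @staticmethod
    def create_vlm(provider: str, model: str = "sonnet"):
        if provider == "claude_code":
            if shutil.which("claude") is None:
                raise ValueError("`claude` CLI not found on PATH")
            return ClaudeCodeVLM(model=model)
        raise ValueError(f"unknown provider: {provider}")


def test_create_vlm_returns_claude_code_provider():
    with mock.patch("shutil.which", return_value="/usr/local/bin/claude"):
        vlm = ProviderRegistry.create_vlm("claude_code")
    assert isinstance(vlm, ClaudeCodeVLM)


def test_create_vlm_errors_when_cli_missing():
    with mock.patch("shutil.which", return_value=None):
        try:
            ProviderRegistry.create_vlm("claude_code")
        except ValueError as err:
            assert "claude" in str(err)
        else:
            raise AssertionError("expected ValueError")
```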
```python
tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
img.save(tmp, format="PNG")
```
tempfile.mktemp() is insecure and can be raced (TOCTOU/symlink) before img.save() writes the file. Use tempfile.NamedTemporaryFile(delete=False, suffix=...) or tempfile.mkstemp() to create the file securely, and write via the returned handle/path.
Suggested change:

```diff
-tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
-img.save(tmp, format="PNG")
+# Use a securely created temporary file instead of mktemp
+with tempfile.NamedTemporaryFile(
+    delete=False, suffix=f"_pb_img_{i}.png"
+) as tmp_file:
+    img.save(tmp_file, format="PNG")
+tmp = Path(tmp_file.name)
```
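Another shape for the same fix, using `tempfile.mkstemp()` (the route the PR's later commits describe): create the file atomically, close the OS-level descriptor immediately, then let PIL write by path. The helper name is illustrative.

```python
import os
import tempfile
from pathlib import Path


def secure_temp_png(index: int) -> Path:
    """Atomically create an empty temp PNG path with no mktemp race window."""
    fd, name = tempfile.mkstemp(suffix=f"_pb_img_{index}.png")
    os.close(fd)  # close right away so the fd does not leak; PIL writes by path
    return Path(name)
```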
```python
for i, img in enumerate(images):
    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
    img.save(tmp, format="PNG")
    temp_files.append(tmp)
    full_prompt = (
        f"[Image {i + 1}: see file {tmp}]\n"
        f"Please read the image at {tmp} before responding.\n\n"
        + full_prompt
    )
```
The image preamble is prepended to full_prompt inside the loop, which reverses the image ordering (Image 2 header ends up before Image 1, etc.). Build the image preamble separately (or append in-order) so the numbering and prompt order stay consistent.
Suggested change:

```diff
-for i, img in enumerate(images):
-    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
-    img.save(tmp, format="PNG")
-    temp_files.append(tmp)
-    full_prompt = (
-        f"[Image {i + 1}: see file {tmp}]\n"
-        f"Please read the image at {tmp} before responding.\n\n"
-        + full_prompt
-    )
+image_preamble_parts: list[str] = []
+for i, img in enumerate(images):
+    tmp = Path(tempfile.mktemp(suffix=f"_pb_img_{i}.png"))
+    img.save(tmp, format="PNG")
+    temp_files.append(tmp)
+    image_preamble_parts.append(
+        f"[Image {i + 1}: see file {tmp}]\n"
+        f"Please read the image at {tmp} before responding.\n\n"
+    )
+image_preamble = "".join(image_preamble_parts)
+full_prompt = image_preamble + full_prompt
```
```python
def __init__(self, model: str = "sonnet"):
    self._model = model
    self._session_id: Optional[str] = None

@property
def name(self) -> str:
    return "claude_code"

@property
def model_name(self) -> str:
    return f"claude-code ({self._model})"

def is_available(self) -> bool:
    import shutil

    return shutil.which("claude") is not None

async def generate(
    self,
    prompt: str,
    images: Optional[list[Image.Image]] = None,
    system_prompt: Optional[str] = None,
    temperature: float = 1.0,
    max_tokens: int = 4096,
    response_format: Optional[str] = None,
) -> str:
    cmd = [
        "claude",
        "-p",
        "--output-format", "json",
        "--model", self._model,
    ]

    if self._session_id:
        cmd += ["--resume", self._session_id]
```
ClaudeCodeVLM maintains a mutable _session_id across calls, but the pipeline can invoke vlm.generate() concurrently (e.g., InputOptimizerAgent uses asyncio.gather). Concurrent calls can race on _session_id updates and mix conversation state. Consider serializing calls with an asyncio.Lock (or using separate sessions per task / disabling session resume for parallel calls).
```python
proc = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await proc.communicate()

# Clean up temp image files
for tmp in temp_files:
    tmp.unlink(missing_ok=True)
```
Temporary image files are only cleaned up after proc.communicate(). If create_subprocess_exec fails (e.g., claude not found) or an exception is raised before cleanup, the temp files will leak. Wrap the subprocess section in try/finally so cleanup always runs.
Suggested change:

```diff
-proc = await asyncio.create_subprocess_exec(
-    *cmd,
-    stdout=asyncio.subprocess.PIPE,
-    stderr=asyncio.subprocess.PIPE,
-)
-stdout, stderr = await proc.communicate()
-# Clean up temp image files
-for tmp in temp_files:
-    tmp.unlink(missing_ok=True)
+proc = None
+stdout = None
+stderr = None
+try:
+    proc = await asyncio.create_subprocess_exec(
+        *cmd,
+        stdout=asyncio.subprocess.PIPE,
+        stderr=asyncio.subprocess.PIPE,
+    )
+    stdout, stderr = await proc.communicate()
+finally:
+    # Clean up temp image files
+    for tmp in temp_files:
+        tmp.unlink(missing_ok=True)
```
```python
    temperature: float = 1.0,
    max_tokens: int = 4096,
    response_format: Optional[str] = None,
) -> str:
```
temperature and max_tokens parameters are accepted by the interface but currently unused in this provider, which can surprise callers (agents set these expecting effect). If the claude CLI supports these controls, pass them through; otherwise consider documenting the limitation and/or raising when non-default values are provided.
Suggested change:

```diff
-) -> str:
+) -> str:
+    # The local `claude` CLI backend does not currently support explicit
+    # temperature or max_tokens controls. To avoid surprising callers,
+    # reject non-default values instead of silently ignoring them.
+    if temperature != 1.0 or max_tokens != 4096:
+        logger.warning(
+            "ClaudeCodeVLM does not support temperature/max_tokens controls; "
+            "received non-default values that will be rejected",
+            temperature=temperature,
+            max_tokens=max_tokens,
+        )
+        raise ValueError(
+            "ClaudeCodeVLM (claude CLI backend) does not currently support "
+            "non-default 'temperature' or 'max_tokens' values."
+        )
```
```python
class ClaudeCodeVLM(VLMProvider):
    """VLM provider that shells out to the `claude` CLI.

    Maintains a single conversation session across planner, stylist,
    and critic calls so each step has full context of prior steps.
    """
```
The class/docstring (and PR description) says session continuity is maintained across planner/stylist/critic, but the pipeline shares the same VLM provider instance across all agents (optimizer/retriever/visualizer too). If you only want continuity for specific steps, add a way to reset/disable --resume for other agents or clarify the behavior in docs/description.
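One way to make that continuity opt-in, sketched with a hypothetical `reset_session()` hook (the method name is an assumption, not part of the PR):

```python
from typing import Optional


class SessionState:
    """Holds --resume state so agents that need a fresh context can reset it."""

    def __init__(self) -> None:
        self._session_id: Optional[str] = None

    def reset_session(self) -> None:
        # Drop the stored id; the next CLI call starts a new conversation.
        self._session_id = None

    def resume_args(self) -> list[str]:
        # Extra argv for the `claude` command when a session exists.
        return ["--resume", self._session_id] if self._session_id else []
```

The pipeline could call `reset_session()` before handing the provider to agents (optimizer, retriever, visualizer) that should not inherit planner context.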
|
Thanks for this @biecho, a local
Before merge I'd like to see:
Nit:
Happy to re-review after the temp-file + prompt-order + concurrency story is addressed.
Thanks for the review @dippatel1994! All points addressed in the latest push. Let me know if anything else needs attention.
dippatel1994
left a comment
Good structure: secure temp files, concurrency lock, clean test suite. Three must-fixes:
- Missing `@retry` decorator: every other VLM provider uses tenacity with exponential backoff; this one has none, so a transient CLI failure will abort the pipeline immediately.
- No `cost_tracker` integration: the JSON response includes `total_cost_usd` and `usage` fields, and every other provider records cost via `self.cost_tracker.record_vlm_call(...)`; this one ignores it entirely.
- Use the `--system-prompt` CLI flag: system instructions are currently embedded in the user prompt text, but the `claude` CLI has a `--system-prompt` flag; use it so the model actually treats it as a system prompt.
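On the retry point above: the review asks for tenacity's `@retry(stop=stop_after_attempt(3), wait=wait_exponential(...))`. This stdlib-only sketch shows the equivalent shape (3 attempts, exponential wait with a cap) without the dependency; the exception type and helper name are illustrative.

```python
import asyncio


async def with_retries(call, attempts: int = 3, base: float = 2.0, cap: float = 30.0):
    """Retry an async call with exponential backoff, like the other providers."""
    delay = base
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except RuntimeError:  # stand-in for the transient CLI failure
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            await asyncio.sleep(min(delay, cap))
            delay *= 2
```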
Non-blocking: The capsys-based warning test may be fragile since structlog doesn't necessarily write to capsys-captured stdout. Consider mocking the logger instead.
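A sketch of the logger-mock alternative. The warning path is reduced to a stand-in function here; the real test would patch the logger object inside `claude_code.py` (exact name assumed) and assert on the mock rather than on captured stdout.

```python
from unittest import mock


def warn_on_unsupported(temperature: float, max_tokens: int, logger) -> None:
    # Stand-in for the provider's unsupported-parameter warning path.
    if temperature != 1.0 or max_tokens != 4096:
        logger.warning(
            "claude CLI has no temperature/max_tokens flags; values ignored"
        )


def test_warns_on_nondefault_params():
    fake = mock.Mock()
    warn_on_unsupported(0.2, 4096, fake)
    fake.warning.assert_called_once()  # asserted on the mock, not on stdout


def test_silent_on_defaults():
    fake = mock.Mock()
    warn_on_unsupported(1.0, 4096, fake)
    fake.warning.assert_not_called()
```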
- Replace insecure tempfile.mktemp() with mkstemp(); close fd immediately before writing to avoid resource leak
- Build image preamble in-order via list then prepend once, fixing reversed Image 2 / Image 1 ordering
- Wrap subprocess call in try/finally so temp image files are always cleaned up, even when create_subprocess_exec raises
- Add asyncio.Lock so concurrent callers don't race on _session_id
- Log a warning when non-default temperature or max_tokens are passed, since the CLI has no flags for these
- Move shutil import to module level
Check is_available() in ProviderRegistry.create_vlm() and raise a clear ValueError instead of letting the first subprocess call fail with a cryptic FileNotFoundError.
24 tests covering: registry creation and missing-CLI error, basic text generation, session resumption and chaining, non-JSON fallback, prompt construction (system prompt + JSON mode + images), image ordering, temp file cleanup on success / failure / OSError, model passthrough, empty images list, session preservation, missing result-key fallback, error stderr/stdout precedence, error truncation, unsupported-param warning, and concurrent-call serialisation.
- Add @retry with exponential backoff (3 attempts, 2–30s) matching all other VLM providers so transient CLI failures don't abort the pipeline
- Pass system_prompt via --system-prompt CLI flag instead of embedding it as text in the user prompt
- Add usage= to debug log for consistency with other providers
- Replace fragile capsys-based warning test with a logger mock
Force-pushed from c29682a to a09ad94.
Thanks for the review! Addressed in a09ad94.
Non-blocking: Replaced the capsys-based warning test with a logger mock.
dippatel1994
left a comment
Retry and --system-prompt are fixed, nice work. Two remaining things:
- Lint fails: import sorting error at `tests/test_providers/test_claude_code_vlm.py:3`. Run `ruff check --fix tests/test_providers/test_claude_code_vlm.py` and push.
- `cost_tracker` is still missing. If you plan to add it after PR #111 merges (like #120 is doing), that's fine; just mention it so we can track it.
Fix import sorting (ruff I001) and apply ruff format to both provider and test files so CI passes all three checks.
Lint: fixed import sorting and also ran ruff format on both files.
cost_tracker: confirmed there's no cost_tracker on main yet; will add it in a follow-up once #111 merges.
dippatel1994
left a comment
Lint fixed, CI fully green. Retry and --system-prompt both addressed. Cost tracker can be added in a follow-up after #111 merges. LGTM.
Summary
- `claude_code` VLM provider that uses the locally installed `claude` CLI as the backend for planner, stylist, and critic agents
- Session continuity via `--resume`, so the critic knows what the planner intended

Usage
Changes since review
Addressed all feedback from @dippatel1994:
- Replaced `tempfile.mktemp()` with `mkstemp()`; temp images cleaned up in `try/finally` (even on subprocess or OSError)
- `asyncio.Lock` serialises `generate()` calls to prevent `_session_id` races
- `is_available()` check in `ProviderRegistry.create_vlm()` with a clear error message

Test plan
- `claude -p --output-format json` returns structured output with `session_id`
- `--resume <session_id>` maintains conversation context
- End-to-end run with `claude_code` VLM + Gemini image gen
- Unit tests pass (`pytest tests/test_providers/test_claude_code_vlm.py`)
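For reference, a hedged sketch of parsing the CLI's JSON envelope, using only fields this thread mentions (`result`, `session_id`) plus the non-JSON and missing-`result`-key fallbacks the test list describes:

```python
import json
from typing import Optional, Tuple


def parse_claude_output(stdout: str) -> Tuple[str, Optional[str]]:
    """Return (text, session_id) from `claude -p --output-format json` stdout."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return stdout, None  # non-JSON fallback: treat raw stdout as the answer
    if not isinstance(payload, dict):
        return stdout, None
    text = payload.get("result", stdout)  # missing-result-key fallback
    return text, payload.get("session_id")
```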