Draft

Changes from all commits (33 commits)
- 519b8a0: CLAUDE.md - generated with Sonnet and Haiku (teremterem, Nov 8, 2025)
- fbe5e16: prepare to try generating CLAUDE.md again (teremterem, Nov 8, 2025)
- 04820b7: use more digits in timestamps to avoid collisions (teremterem, Nov 8, 2025)
- 1154f51: CLAUDE.md - generated with Sonnet and Haiku (teremterem, Nov 8, 2025)
- b56c882: prepare to regenerate to_generic_streaming_chunk (teremterem, Nov 9, 2025)
- 47fe3d2: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 372b722: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 8dd716f: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 9ded9cc: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 93aa92c: remove file coverage log (teremterem, Nov 9, 2025)
- 894daaa: changes to ChatCompletions_API_to_GenericStreamingChunk.md I am not s… (teremterem, Nov 9, 2025)
- f4b74cf: fix suspicious guide entries (teremterem, Nov 9, 2025)
- 49086e5: in progress: reimplement to_generic_streaming_chunk - start with conv… (teremterem, Nov 9, 2025)
- d2ab0d5: in progress: reimplement to_generic_streaming_chunk - start with conv… (teremterem, Nov 9, 2025)
- aa6dcc5: separate the milliseconds from the microseconds with a hyphen in trac… (teremterem, Nov 10, 2025)
- fffe454: separate the milliseconds from the microseconds with an underscore in… (teremterem, Nov 10, 2025)
- 891cca6: common/utils_code_smell_but_stable.py (teremterem, Nov 11, 2025)
- 393d550: to_generic_streaming_chunk fixed? (teremterem, Nov 11, 2025)
- 498782c: temporarily disable responses api implementation (teremterem, Nov 11, 2025)
- c7f3da7: get rid of unnecessary dict conversion logic (teremterem, Nov 13, 2025)
- a31a1df: common/utils_better_code_but_broken.py (teremterem, Nov 13, 2025)
- dec7b6b: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- 35658c2: restore tracing of generic chunks in markdown (teremterem, Nov 13, 2025)
- 49e12e6: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- c522e79: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- f6b18c4: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- b70619a: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- e123f90: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- d63e7a3: Convert ModelResponseStream to potentially multiple GenericStreamChun… (teremterem, Nov 13, 2025)
- 9d9a095: Merge remote-tracking branch 'origin/release-1.0.0' into gpt-5-codex (teremterem, Nov 30, 2025)
- 0f76a68: drop vibe-coded Responses API converstion completely (teremterem, Nov 30, 2025)
- 9c2b6ec: Merge branch 'release-1.0.0' into gpt-5-codex (teremterem, Dec 3, 2025)
- 2aa05a1: get rid of all the prompt engineering hacks that are meant to be fixe… (teremterem, Dec 3, 2025)
10 changes: 0 additions & 10 deletions .env.template
@@ -37,16 +37,6 @@ OPENAI_API_KEY=
#REMAP_CLAUDE_SONNET_TO=gpt-5-codex-reason-medium
#REMAP_CLAUDE_OPUS_TO=gpt-5.1-reason-high

# OPTIONAL: You can turn off the prompt injection that forces non-Claude models
# to use only one tool at a time.
#
# ATTENTION: Turning it off is NOT recommended. GPT-5, when used with medium or
# high reasoning effort and without such injection, attempts to make multiple
# tool calls at once quite often, and that causes Claude Code CLI to silently
# stop processing the request (the CLI does not support multiple tool calls in
# a single response).
#ENFORCE_ONE_TOOL_CALL_PER_RESPONSE=false

# OPTIONAL: Whether to convert ChatCompletions API requests to Responses API
# format for ALL non-Claude models (true), or only for the OpenAI models that
# don't support ChatCompletions API (false or unset, RECOMMENDED).
48 changes: 48 additions & 0 deletions ChatCompletions_API_to_GenericStreamingChunk.md
@@ -0,0 +1,48 @@
## ChatCompletions API → GenericStreamingChunk Mapping Guide

### How to use this guide
- Review the field-by-field mapping rules below when converting `litellm.ModelResponseStream` payloads into `litellm.GenericStreamingChunk`.
- Each rule cites at least one concrete example chunk from the attached traces so you can quickly reopen the original stream capture if you need to double-check the raw data.
- Preserve nulls/omitted keys as-is unless a rule explicitly calls for a default.

### Top-level chunk fields
- `id` → copy verbatim to `GenericStreamingChunk.id`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `created` → copy to `GenericStreamingChunk.created` without transformation. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `model` → populate `GenericStreamingChunk.model` with the same string. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `object` → pass through to `GenericStreamingChunk.object`. (The examples use `"chat.completion.chunk"`; keep whatever value arrives.) Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `system_fingerprint` → copy directly to `GenericStreamingChunk.system_fingerprint`, preserving `null`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `provider_specific_fields` (top-level) → forward untouched into the corresponding `GenericStreamingChunk` field. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `citations` → expose on `GenericStreamingChunk.citations`; keep nulls if present. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `usage` → when the `ModelResponseStream` chunk includes a `usage` block, attach it to `GenericStreamingChunk.usage` without altering the numeric counters or nested detail dictionaries. Reference: `Response Chunk #106` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
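The top-level rules above amount to a straight field copy. A minimal sketch, assuming plain dict payloads for illustration (litellm's actual `ModelResponseStream`/`GenericStreamingChunk` types are richer than this):

```python
# Illustrative only: field names follow the mapping rules above, but the
# dict-based shapes are assumptions, not litellm's real classes.
TOP_LEVEL_FIELDS = (
    "id", "created", "model", "object",
    "system_fingerprint", "provider_specific_fields", "citations",
)

def map_top_level(chunk: dict) -> dict:
    """Copy top-level chunk fields verbatim, preserving nulls and omitted keys."""
    generic = {field: chunk[field] for field in TOP_LEVEL_FIELDS if field in chunk}
    # `usage` only appears on closing chunks; attach it untouched when present.
    if "usage" in chunk:
        generic["usage"] = chunk["usage"]
    return generic
```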

### Choices array
- Always emit a `GenericStreamingChunk.choices` list whose length matches the incoming `choices` array. Preserve the order so indexes remain aligned with the upstream stream. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- For each element, set `GenericStreamingChoice.index` equal to the incoming `index`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Forward the `finish_reason` (including `null`) to `GenericStreamingChoice.finish_reason`. Reference: `Response Chunk #105` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Accept non-`stop` finish signals (e.g., `"tool_calls"`) and propagate them unchanged so downstream logic can detect tool switchovers. Reference: `Response Chunk #63` in `ChatCompletions_API_streaming_examples/20251108_222758_22270_RESPONSE_STREAM.md`.
- Map any `logprobs` field—currently `null` in the traces—to `GenericStreamingChoice.logprobs` verbatim. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
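The per-choice rules can be sketched as follows, again with hypothetical dict shapes standing in for the real types:

```python
# Sketch of the choices-array rules above; shapes are assumptions for illustration.
def map_choice(choice: dict) -> dict:
    """Mirror one choice: index verbatim, finish_reason (including null and
    "tool_calls") and logprobs forwarded unchanged."""
    return {
        "index": choice["index"],
        "finish_reason": choice.get("finish_reason"),
        "logprobs": choice.get("logprobs"),  # currently null in the traces
    }

def map_choices(choices: list) -> list:
    # Length and order must match the incoming array so indexes stay aligned.
    return [map_choice(c) for c in choices]
```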

### Delta payload
- Copy the entire `delta` object into a fresh `GenericStreamingDelta` structure, mirroring the keys present in the stream.
- `delta.content` → assign to `GenericStreamingDelta.content`, concatenating downstream as needed. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- When a chunk only carries tool-call metadata, providers often emit `""` for `delta.content`; keep the empty string instead of normalizing it away so chunk ordering stays aligned. Reference: `Response Chunk #64` in `ChatCompletions_API_streaming_examples/20251108_222808_70283_RESPONSE_STREAM.md`.
- `delta.role` → populate `GenericStreamingDelta.role`, noting that later chunks often send `null`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Subsequent deltas regularly omit the role (`null`); mirror the streamed value inside each chunk instead of injecting the previously observed role. Reference: `Response Chunk #0` vs `Response Chunk #1` in `ChatCompletions_API_streaming_examples/20251109_125816_01437_RESPONSE_STREAM.md`.
- `delta.provider_specific_fields` → carry forward unchanged onto the delta. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.function_call` → forward as-is (the current capture shows `null`, but preserve the object structure if present). Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.tool_calls` → preserve the list (even when `null`) for later combination with tool streaming logic. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.audio` → forward the value (currently `null`) to the delta’s audio slot so audio-capable providers remain compatible. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
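A minimal sketch of the delta mapping, mirroring each streamed key without normalizing (dict shapes assumed for illustration):

```python
# Keys follow the delta rules above; this deliberately keeps "" content and
# null roles/lists as-is instead of cleaning them up.
DELTA_FIELDS = ("content", "role", "provider_specific_fields",
                "function_call", "tool_calls", "audio")

def map_delta(delta: dict) -> dict:
    """Mirror the streamed delta field by field."""
    return {field: delta.get(field) for field in DELTA_FIELDS}
```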

#### Tool-call specific handling
- When `delta.tool_calls` is a list of call deltas, map each entry to a `GenericStreamingToolCallDelta` while preserving the incoming ordering.
- `tool_call.id` → copy the identifier (which may be `null` in a given chunk). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.type` → transfer directly (the capture shows `"function"`; preserve any other provider values). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.index` → mirror the numeric slot so downstream tooling can correlate deltas. Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.function.name` → forward the value (including `null` when the provider omits it in a fragment). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.function.arguments` → forward the streamed arguments substring exactly as received. Reference: `Response Chunk #9` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
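The tool-call fragment rules above can be sketched like this; the key names mirror the streamed payloads in the traces, while the function name is hypothetical:

```python
# Hedged sketch of the tool-call delta mapping; plain dicts stand in for
# litellm's classes.
def map_tool_call_delta(tool_call: dict) -> dict:
    """Mirror one streamed tool-call fragment, preserving order-sensitive fields."""
    function = tool_call.get("function") or {}
    return {
        "id": tool_call.get("id"),        # may be null in a given chunk
        "type": tool_call.get("type"),    # "function" in the traces; pass others through
        "index": tool_call.get("index"),  # numeric slot for correlating deltas
        "function": {
            "name": function.get("name"),            # null when omitted in a fragment
            "arguments": function.get("arguments"),  # streamed substring, verbatim
        },
    }
```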

### Usage payload details
- `usage` only appears on the closing chunks; keep `GenericStreamingChunk.usage` unset for intermediate emissions and populate it once the payload arrives. Reference: `Response Chunk #28` in `ChatCompletions_API_streaming_examples/20251109_125816_01437_RESPONSE_STREAM.md`.
- Copy the numeric counters (`prompt_tokens`, `completion_tokens`, `total_tokens`) directly; they already reflect request-level totals. Reference: `Response Chunk #35` in `ChatCompletions_API_streaming_examples/20251109_125816_01973_RESPONSE_STREAM.md`.
- Preserve every nested `*_tokens_details` block and cache counter exactly as provided (including zeros and `null` values) so downstream consumers retain provider-specific accounting. Reference: the `usage` block in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Cached-token metrics can shift between the `cache_creation_*` and `cache_read_*` counters across calls; never normalize these values. Reference: `Response Chunk #41` in `ChatCompletions_API_streaming_examples/20251109_131644_45210_RESPONSE_STREAM.md` (`cache_creation_tokens` populated) versus `Response Chunk #19` in `ChatCompletions_API_streaming_examples/20251109_131704_44443_RESPONSE_STREAM.md` (`cache_read_input_tokens` populated).
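An illustrative consumer loop for the usage rules above, assuming dict chunks: intermediate chunks leave `usage` unset, and the closing chunk carries it whole.

```python
# Sketch only; the chunk shape is assumed, and counters are copied verbatim,
# never normalized, per the rules above.
def collect_stream(chunks):
    """Concatenate content deltas and pick up `usage` from the closing chunk."""
    text_parts, usage = [], None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:  # "" padding chunks contribute nothing to the text
                text_parts.append(content)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return "".join(text_parts), usage
```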
94 changes: 14 additions & 80 deletions claude_code_proxy/claude_code_router.py
@@ -16,22 +16,20 @@
ResponsesAPIStreamingResponse,
)

from claude_code_proxy.proxy_config import ENFORCE_ONE_TOOL_CALL_PER_RESPONSE
from claude_code_proxy.route_model import ModelRoute
from common.config import WRITE_TRACES_TO_FILES
from common.tracing_in_markdown import (
write_request_trace,
write_response_trace,
write_streaming_chunk_trace,
write_streaming_chunks_trace,
)
from common.utils import (
ProxyError,
convert_chat_messages_to_respapi,
convert_chat_params_to_respapi,
convert_respapi_to_model_response,
# convert_chat_messages_to_respapi,
# convert_chat_params_to_respapi,
# convert_respapi_to_model_response,
generate_timestamp_utc,
to_generic_streaming_chunk,
responses_eof_finalize_chunk,
model_response_stream_to_generic_streaming_chunks,
)


@@ -114,45 +112,6 @@ def _adapt_complapi_for_non_anthropic_models(self) -> None:
self.messages_complapi[0][
"content"
] = "The intention of this request is to test connectivity. Please respond with a single word: OK"
return

system_prompt_items = []

# Only add the instruction if at least two tools and/or functions are present in the request (in total)
num_tools = len(self.params_complapi.get("tools") or []) + len(self.params_complapi.get("functions") or [])
if ENFORCE_ONE_TOOL_CALL_PER_RESPONSE and num_tools > 1:
# Add the single tool call instruction as the last message
# TODO Get rid of this hack after the token conversion code in
# `common/utils.py` is reimplemented. (Seems that it's not the
# Claude Code CLI that doesn't support multiple tool calls in a
# single response, it's our token conversion code that doesn't.)
system_prompt_items.append(
"* When using tools, call AT MOST one tool per response. Never attempt multiple tool calls in a "
"single response. The client does not support multiple tool calls in a single response. If multiple "
"tools are needed, choose the next best single tool, return exactly one tool call, and wait for the "
"next turn."
)

if self.model_route.use_responses_api:
# TODO A temporary measure until the token conversion code is
# reimplemented. (Right now, whenever the model tries to
# communicate that it needs to correct its course of action, it
# just stops doing the task, which I suspect is a token conversion
# issue.)
system_prompt_items.append(
"* Until you're COMPLETELY done with your task, DO NOT EXPLAIN TO THE USER ANYTHING AT ALL, even if "
"you need to correct your course of action (just use REASONING for that, which the user cannot see). "
"A summary of your work at the very end is enough."
)

if system_prompt_items:
# append the system prompt as the last message in the context
self.messages_complapi.append(
{
"role": "system",
"content": "IMPORTANT:\n" + "\n".join(system_prompt_items),
}
)


class ClaudeCodeRouter(CustomLLM):
@@ -348,37 +307,24 @@ def streaming(
)

for chunk_idx, chunk in enumerate[ModelResponseStream | ResponsesAPIStreamingResponse](resp_stream):
generic_chunk = to_generic_streaming_chunk(chunk)
generic_chunks = list[GenericStreamingChunk](model_response_stream_to_generic_streaming_chunks(chunk))

if WRITE_TRACES_TO_FILES:
if routed_request.model_route.use_responses_api:
respapi_chunk, complapi_chunk = chunk, None
else:
respapi_chunk, complapi_chunk = None, chunk

write_streaming_chunk_trace(
write_streaming_chunks_trace(
timestamp=routed_request.timestamp,
calling_method=routed_request.calling_method,
chunk_idx=chunk_idx,
respapi_chunk=respapi_chunk,
complapi_chunk=complapi_chunk,
generic_chunk=generic_chunk,
generic_chunks=generic_chunks,
)

yield generic_chunk

# EOF fallback: if provider ended stream without a terminal event and
# we have a pending tool with buffered args, emit once.
# TODO Refactor or get rid of the try/except block below after the
# code in `common/utils.py` is owned (after the vibe-code there is
# replaced with proper code)
try:
eof_chunk = responses_eof_finalize_chunk()
if eof_chunk is not None:
yield eof_chunk
except Exception: # pylint: disable=broad-exception-caught
# Ignore; best-effort fallback
pass
yield from generic_chunks

except Exception as e:
raise ProxyError(e) from e
@@ -438,39 +384,27 @@ async def astreaming(

chunk_idx = 0
async for chunk in resp_stream:
generic_chunk = to_generic_streaming_chunk(chunk)
generic_chunks = list[GenericStreamingChunk](model_response_stream_to_generic_streaming_chunks(chunk))

if WRITE_TRACES_TO_FILES:
if routed_request.model_route.use_responses_api:
respapi_chunk, complapi_chunk = chunk, None
else:
respapi_chunk, complapi_chunk = None, chunk

write_streaming_chunk_trace(
write_streaming_chunks_trace(
timestamp=routed_request.timestamp,
calling_method=routed_request.calling_method,
chunk_idx=chunk_idx,
respapi_chunk=respapi_chunk,
complapi_chunk=complapi_chunk,
generic_chunk=generic_chunk,
generic_chunks=generic_chunks,
)

yield generic_chunk
for generic_chunk in generic_chunks:
yield generic_chunk
chunk_idx += 1

# EOF fallback: if provider ended stream without a terminal event and
# we have a pending tool with buffered args, emit once.
# TODO Refactor or get rid of the try/except block below after the
# code in `common/utils.py` is owned (after the vibe-code there is
# replaced with proper code)
try:
eof_chunk = responses_eof_finalize_chunk()
if eof_chunk is not None:
yield eof_chunk
except Exception: # pylint: disable=broad-exception-caught
# Ignore; best-effort fallback
pass

except Exception as e:
raise ProxyError(e) from e

2 changes: 0 additions & 2 deletions claude_code_proxy/proxy_config.py
@@ -12,8 +12,6 @@
REMAP_CLAUDE_SONNET_TO = os.getenv("REMAP_CLAUDE_SONNET_TO", "gpt-5-codex-reason-medium")
REMAP_CLAUDE_OPUS_TO = os.getenv("REMAP_CLAUDE_OPUS_TO", "gpt-5.1-reason-high")

ENFORCE_ONE_TOOL_CALL_PER_RESPONSE = env_var_to_bool(os.getenv("ENFORCE_ONE_TOOL_CALL_PER_RESPONSE"), "true")

# TODO Move these two constants to common/config.py ?
ALWAYS_USE_RESPONSES_API = env_var_to_bool(os.getenv("ALWAYS_USE_RESPONSES_API"), "false")
RESPAPI_ONLY_MODELS = (
21 changes: 11 additions & 10 deletions common/tracing_in_markdown.py
@@ -1,5 +1,5 @@
import json
from typing import Optional
from typing import Collection, Optional

from litellm import ModelResponse, ResponsesAPIResponse

@@ -86,14 +86,14 @@ def write_response_trace(
f.write(f"```json\n{response_complapi.model_dump_json(indent=2)}\n```\n")


def write_streaming_chunk_trace(
def write_streaming_chunks_trace(
*,
timestamp: str,
calling_method: str,
chunk_idx: int,
respapi_chunk: Optional[ResponsesAPIResponse] = None,
complapi_chunk: Optional[ModelResponse] = None,
generic_chunk: Optional[dict] = None,
generic_chunks: Optional[Collection[dict]] = None,
) -> None:
TRACES_DIR.mkdir(parents=True, exist_ok=True)
file = TRACES_DIR / f"{timestamp}_RESPONSE_STREAM.md"
@@ -114,11 +114,12 @@ def write_streaming_chunk_trace(
if complapi_chunk is not None:
f.write(f"### ChatCompletions API:\n```json\n{complapi_chunk.model_dump_json(indent=2)}\n```\n\n")

if generic_chunk is not None:
# TODO Do `gen_chunk.model_dump_json(indent=2)` once it's not
# just a dict
f.write(f"### GenericStreamingChunk:\n```json\n{json.dumps(generic_chunk, indent=2)}\n```\n\n")
if generic_chunks:
f.write("### GenericStreamingChunk(s):\n")
for generic_chunk in generic_chunks:
f.write(f"```json\n{json.dumps(generic_chunk, indent=2)}\n```\n\n")

# Append text only to the text file
with text_file.open("a", encoding="utf-8") as text_f:
text_f.write(generic_chunk["text"])
if generic_chunk["text"]:
# Append the text of the chunk to the text file
with text_file.open("a", encoding="utf-8") as text_f:
text_f.write(generic_chunk["text"])