Draft

Changes from all commits (33 commits)
- 519b8a0: CLAUDE.md - generated with Sonnet and Haiku (teremterem, Nov 8, 2025)
- fbe5e16: prepare to try generating CLAUDE.md again (teremterem, Nov 8, 2025)
- 04820b7: use more digits in timestamps to avoid collisions (teremterem, Nov 8, 2025)
- 1154f51: CLAUDE.md - generated with Sonnet and Haiku (teremterem, Nov 8, 2025)
- b56c882: prepare to regenerate to_generic_streaming_chunk (teremterem, Nov 9, 2025)
- 47fe3d2: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 372b722: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 8dd716f: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 9ded9cc: ChatCompletions_API_to_GenericStreamingChunk.md (teremterem, Nov 9, 2025)
- 93aa92c: remove file coverage log (teremterem, Nov 9, 2025)
- 894daaa: changes to ChatCompletions_API_to_GenericStreamingChunk.md I am not s… (teremterem, Nov 9, 2025)
- f4b74cf: fix suspicious guide entries (teremterem, Nov 9, 2025)
- 49086e5: in progress: reimplement to_generic_streaming_chunk - start with conv… (teremterem, Nov 9, 2025)
- d2ab0d5: in progress: reimplement to_generic_streaming_chunk - start with conv… (teremterem, Nov 9, 2025)
- aa6dcc5: separate the milliseconds from the microseconds with a hyphen in trac… (teremterem, Nov 10, 2025)
- fffe454: separate the milliseconds from the microseconds with an underscore in… (teremterem, Nov 10, 2025)
- 891cca6: common/utils_code_smell_but_stable.py (teremterem, Nov 11, 2025)
- 393d550: to_generic_streaming_chunk fixed? (teremterem, Nov 11, 2025)
- 498782c: temporarily disable responses api implementation (teremterem, Nov 11, 2025)
- c7f3da7: get rid of unnecessary dict conversion logic (teremterem, Nov 13, 2025)
- a31a1df: common/utils_better_code_but_broken.py (teremterem, Nov 13, 2025)
- dec7b6b: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- 35658c2: restore tracing of generic chunks in markdown (teremterem, Nov 13, 2025)
- 49e12e6: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- c522e79: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- f6b18c4: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- b70619a: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- e123f90: in progress: model_response_stream_to_generic_streaming_chunk (teremterem, Nov 13, 2025)
- d63e7a3: Convert ModelResponseStream to potentially multiple GenericStreamChun… (teremterem, Nov 13, 2025)
- 9d9a095: Merge remote-tracking branch 'origin/release-1.0.0' into gpt-5-codex (teremterem, Nov 30, 2025)
- 0f76a68: drop vibe-coded Responses API converstion completely (teremterem, Nov 30, 2025)
- 9c2b6ec: Merge branch 'release-1.0.0' into gpt-5-codex (teremterem, Dec 3, 2025)
- 2aa05a1: get rid of all the prompt engineering hacks that are meant to be fixe… (teremterem, Dec 3, 2025)
10 changes: 0 additions & 10 deletions .env.template
@@ -37,16 +37,6 @@ OPENAI_API_KEY=
#REMAP_CLAUDE_SONNET_TO=gpt-5-codex-reason-medium
#REMAP_CLAUDE_OPUS_TO=gpt-5.1-reason-high

# OPTIONAL: You can turn off the prompt injection that forces non-Claude models
# to use only one tool at a time.
#
# ATTENTION: Turning it off is NOT recommended. GPT-5, when used with medium or
# high reasoning effort and without such injection, attempts to make multiple
# tool calls at once quite often, and that causes Claude Code CLI to silently
# stop processing the request (the CLI does not support multiple tool calls in
# a single response).
#ENFORCE_ONE_TOOL_CALL_PER_RESPONSE=false

# OPTIONAL: Whether to convert ChatCompletions API requests to Responses API
# format for ALL non-Claude models (true), or only for the OpenAI models that
# don't support ChatCompletions API (false or unset, RECOMMENDED).
48 changes: 48 additions & 0 deletions ChatCompletions_API_to_GenericStreamingChunk.md
@@ -0,0 +1,48 @@
## ChatCompletions API → GenericStreamingChunk Mapping Guide

### How to use this guide
- Review the field-by-field mapping rules below when converting `litellm.ModelResponseStream` payloads into `litellm.GenericStreamingChunk`.
- Each rule cites at least one concrete example chunk from the attached traces so you can quickly reopen the original stream capture if you need to double-check the raw data.
- Preserve nulls/omitted keys as-is unless a rule explicitly calls for a default.

### Top-level chunk fields
- `id` → copy verbatim to `GenericStreamingChunk.id`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `created` → copy to `GenericStreamingChunk.created` without transformation. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `model` → populate `GenericStreamingChunk.model` with the same string. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `object` → pass through to `GenericStreamingChunk.object`. (The examples use `"chat.completion.chunk"`; keep whatever value arrives.) Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `system_fingerprint` → copy directly to `GenericStreamingChunk.system_fingerprint`, preserving `null`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `provider_specific_fields` (top-level) → forward untouched into the corresponding `GenericStreamingChunk` field. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `citations` → expose on `GenericStreamingChunk.citations`; keep nulls if present. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `usage` → when the `ModelResponseStream` chunk includes a `usage` block, attach it to `GenericStreamingChunk.usage` without altering the numeric counters or nested detail dictionaries. Reference: `Response Chunk #106` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
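The top-level rules above amount to a straight field copy. A minimal sketch, assuming plain dict payloads for illustration (litellm's actual `ModelResponseStream`/`GenericStreamingChunk` types are richer than this):

```python
# Illustrative only: field names follow the mapping rules above, but the
# dict-based shapes are assumptions, not litellm's real classes.
TOP_LEVEL_FIELDS = (
    "id", "created", "model", "object",
    "system_fingerprint", "provider_specific_fields", "citations",
)

def map_top_level(chunk: dict) -> dict:
    """Copy top-level chunk fields verbatim, preserving nulls and omitted keys."""
    generic = {field: chunk[field] for field in TOP_LEVEL_FIELDS if field in chunk}
    # `usage` only appears on closing chunks; attach it untouched when present.
    if "usage" in chunk:
        generic["usage"] = chunk["usage"]
    return generic
```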

### Choices array
- Always emit a `GenericStreamingChunk.choices` list whose length matches the incoming `choices` array. Preserve the order so indexes remain aligned with the upstream stream. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- For each element, set `GenericStreamingChoice.index` equal to the incoming `index`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Forward the `finish_reason` (including `null`) to `GenericStreamingChoice.finish_reason`. Reference: `Response Chunk #105` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Accept non-`stop` finish signals (e.g., `"tool_calls"`) and propagate them unchanged so downstream logic can detect tool switchovers. Reference: `Response Chunk #63` in `ChatCompletions_API_streaming_examples/20251108_222758_22270_RESPONSE_STREAM.md`.
- Map any `logprobs` field—currently `null` in the traces—to `GenericStreamingChoice.logprobs` verbatim. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
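The per-choice rules can be sketched as follows, again with hypothetical dict shapes standing in for the real types:

```python
# Sketch of the choices-array rules above; shapes are assumptions for illustration.
def map_choice(choice: dict) -> dict:
    """Mirror one choice: index verbatim, finish_reason (including null and
    "tool_calls") and logprobs forwarded unchanged."""
    return {
        "index": choice["index"],
        "finish_reason": choice.get("finish_reason"),
        "logprobs": choice.get("logprobs"),  # currently null in the traces
    }

def map_choices(choices: list) -> list:
    # Length and order must match the incoming array so indexes stay aligned.
    return [map_choice(c) for c in choices]
```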

### Delta payload
- Copy the entire `delta` object into a fresh `GenericStreamingDelta` structure, mirroring the keys present in the stream.
- `delta.content` → assign to `GenericStreamingDelta.content`, concatenating downstream as needed. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- When a chunk only carries tool-call metadata, providers often emit `""` for `delta.content`; keep the empty string instead of normalizing it away so chunk ordering stays aligned. Reference: `Response Chunk #64` in `ChatCompletions_API_streaming_examples/20251108_222808_70283_RESPONSE_STREAM.md`.
- `delta.role` → populate `GenericStreamingDelta.role`, noting that later chunks often send `null`. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Subsequent deltas regularly omit the role (`null`); mirror the streamed value inside each chunk instead of injecting the previously observed role. Reference: `Response Chunk #0` vs `Response Chunk #1` in `ChatCompletions_API_streaming_examples/20251109_125816_01437_RESPONSE_STREAM.md`.
- `delta.provider_specific_fields` → carry forward unchanged onto the delta. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.function_call` → forward as-is (the current capture shows `null`, but preserve the object structure if present). Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.tool_calls` → preserve the list (even when `null`) for later combination with tool streaming logic. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- `delta.audio` → forward the value (currently `null`) to the delta’s audio slot so audio-capable providers remain compatible. Reference: `Response Chunk #0` in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
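A minimal sketch of the delta mapping, mirroring each streamed key without normalizing (dict shapes assumed for illustration):

```python
# Keys follow the delta rules above; this deliberately keeps "" content and
# null roles/lists as-is instead of cleaning them up.
DELTA_FIELDS = ("content", "role", "provider_specific_fields",
                "function_call", "tool_calls", "audio")

def map_delta(delta: dict) -> dict:
    """Mirror the streamed delta field by field."""
    return {field: delta.get(field) for field in DELTA_FIELDS}
```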

#### Tool-call specific handling
- When `delta.tool_calls` is a list of call deltas, map each entry to a `GenericStreamingToolCallDelta` while preserving the incoming ordering.
- `tool_call.id` → copy the identifier (which may be `null` in a given chunk). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.type` → transfer directly (the capture shows `"function"`; preserve any other provider values). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.index` → mirror the numeric slot so downstream tooling can correlate deltas. Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.function.name` → forward the value (including `null` when the provider omits it in a fragment). Reference: `Response Chunk #8` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
- `tool_call.function.arguments` → forward the streamed arguments substring exactly as received. Reference: `Response Chunk #9` in `ChatCompletions_API_streaming_examples/20251108_222824_10592_RESPONSE_STREAM.md`.
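The tool-call fragment rules above can be sketched like this; the key names mirror the streamed payloads in the traces, while the function name is hypothetical:

```python
# Hedged sketch of the tool-call delta mapping; plain dicts stand in for
# litellm's classes.
def map_tool_call_delta(tool_call: dict) -> dict:
    """Mirror one streamed tool-call fragment, preserving order-sensitive fields."""
    function = tool_call.get("function") or {}
    return {
        "id": tool_call.get("id"),        # may be null in a given chunk
        "type": tool_call.get("type"),    # "function" in the traces; pass others through
        "index": tool_call.get("index"),  # numeric slot for correlating deltas
        "function": {
            "name": function.get("name"),            # null when omitted in a fragment
            "arguments": function.get("arguments"),  # streamed substring, verbatim
        },
    }
```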

### Usage payload details
- `usage` only appears on the closing chunks; keep `GenericStreamingChunk.usage` unset for intermediate emissions and populate it once the payload arrives. Reference: `Response Chunk #28` in `ChatCompletions_API_streaming_examples/20251109_125816_01437_RESPONSE_STREAM.md`.
- Copy the numeric counters (`prompt_tokens`, `completion_tokens`, `total_tokens`) directly; they already reflect request-level totals. Reference: `Response Chunk #35` in `ChatCompletions_API_streaming_examples/20251109_125816_01973_RESPONSE_STREAM.md`.
- Preserve every nested `*_tokens_details` block and cache counter exactly as provided (including zeros and `null` values) so downstream consumers retain provider-specific accounting. Reference: the `usage` block in `ChatCompletions_API_streaming_examples/20251108_222915_51732_RESPONSE_STREAM.md`.
- Cached-token metrics can shift between the `cache_creation_*` and `cache_read_*` counters across calls; never normalize these values. Reference: `Response Chunk #41` in `ChatCompletions_API_streaming_examples/20251109_131644_45210_RESPONSE_STREAM.md` (`cache_creation_tokens` populated) versus `Response Chunk #19` in `ChatCompletions_API_streaming_examples/20251109_131704_44443_RESPONSE_STREAM.md` (`cache_read_input_tokens` populated).
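An illustrative consumer loop for the usage rules above, assuming dict chunks: intermediate chunks leave `usage` unset, and the closing chunk carries it whole.

```python
# Sketch only; the chunk shape is assumed, and counters are copied verbatim,
# never normalized, per the rules above.
def collect_stream(chunks):
    """Concatenate content deltas and pick up `usage` from the closing chunk."""
    text_parts, usage = [], None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:  # "" padding chunks contribute nothing to the text
                text_parts.append(content)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return "".join(text_parts), usage
```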
94 changes: 14 additions & 80 deletions claude_code_proxy/claude_code_router.py
@@ -16,22 +16,20 @@
ResponsesAPIStreamingResponse,
)

from claude_code_proxy.proxy_config import ENFORCE_ONE_TOOL_CALL_PER_RESPONSE
from claude_code_proxy.route_model import ModelRoute
from common.config import WRITE_TRACES_TO_FILES
from common.tracing_in_markdown import (
write_request_trace,
write_response_trace,
write_streaming_chunk_trace,
write_streaming_chunks_trace,
)
from common.utils import (
ProxyError,
convert_chat_messages_to_respapi,
convert_chat_params_to_respapi,
convert_respapi_to_model_response,
# convert_chat_messages_to_respapi,
# convert_chat_params_to_respapi,
# convert_respapi_to_model_response,
generate_timestamp_utc,
to_generic_streaming_chunk,
responses_eof_finalize_chunk,
model_response_stream_to_generic_streaming_chunks,
)


@@ -114,45 +112,6 @@ def _adapt_complapi_for_non_anthropic_models(self) -> None:
self.messages_complapi[0][
"content"
] = "The intention of this request is to test connectivity. Please respond with a single word: OK"
return

system_prompt_items = []

# Only add the instruction if at least two tools and/or functions are present in the request (in total)
num_tools = len(self.params_complapi.get("tools") or []) + len(self.params_complapi.get("functions") or [])
if ENFORCE_ONE_TOOL_CALL_PER_RESPONSE and num_tools > 1:
# Add the single tool call instruction as the last message
# TODO Get rid of this hack after the token conversion code in
# `common/utils.py` is reimplemented. (Seems that it's not the
# Claude Code CLI that doesn't support multiple tool calls in a
# single response, it's our token conversion code that doesn't.)
system_prompt_items.append(
"* When using tools, call AT MOST one tool per response. Never attempt multiple tool calls in a "
"single response. The client does not support multiple tool calls in a single response. If multiple "
"tools are needed, choose the next best single tool, return exactly one tool call, and wait for the "
"next turn."
)

if self.model_route.use_responses_api:
# TODO A temporary measure until the token conversion code is
# reimplemented. (Right now, whenever the model tries to
# communicate that it needs to correct its course of action, it
# just stops doing the task, which I suspect is a token conversion
# issue.)
system_prompt_items.append(
"* Until you're COMPLETELY done with your task, DO NOT EXPLAIN TO THE USER ANYTHING AT ALL, even if "
"you need to correct your course of action (just use REASONING for that, which the user cannot see). "
"A summary of your work at the very end is enough."
)

if system_prompt_items:
# append the system prompt as the last message in the context
self.messages_complapi.append(
{
"role": "system",
"content": "IMPORTANT:\n" + "\n".join(system_prompt_items),
}
)


class ClaudeCodeRouter(CustomLLM):
@@ -348,37 +307,24 @@ def streaming(
)

for chunk_idx, chunk in enumerate[ModelResponseStream | ResponsesAPIStreamingResponse](resp_stream):
generic_chunk = to_generic_streaming_chunk(chunk)
generic_chunks = list[GenericStreamingChunk](model_response_stream_to_generic_streaming_chunks(chunk))

if WRITE_TRACES_TO_FILES:
if routed_request.model_route.use_responses_api:
respapi_chunk, complapi_chunk = chunk, None
else:
respapi_chunk, complapi_chunk = None, chunk

write_streaming_chunk_trace(
write_streaming_chunks_trace(
timestamp=routed_request.timestamp,
calling_method=routed_request.calling_method,
chunk_idx=chunk_idx,
respapi_chunk=respapi_chunk,
complapi_chunk=complapi_chunk,
generic_chunk=generic_chunk,
generic_chunks=generic_chunks,
)

yield generic_chunk

# EOF fallback: if provider ended stream without a terminal event and
# we have a pending tool with buffered args, emit once.
# TODO Refactor or get rid of the try/except block below after the
# code in `common/utils.py` is owned (after the vibe-code there is
# replaced with proper code)
try:
eof_chunk = responses_eof_finalize_chunk()
if eof_chunk is not None:
yield eof_chunk
except Exception: # pylint: disable=broad-exception-caught
# Ignore; best-effort fallback
pass
yield from generic_chunks

except Exception as e:
raise ProxyError(e) from e
@@ -438,39 +384,27 @@ async def astreaming(

chunk_idx = 0
async for chunk in resp_stream:
generic_chunk = to_generic_streaming_chunk(chunk)
generic_chunks = list[GenericStreamingChunk](model_response_stream_to_generic_streaming_chunks(chunk))

if WRITE_TRACES_TO_FILES:
if routed_request.model_route.use_responses_api:
respapi_chunk, complapi_chunk = chunk, None
else:
respapi_chunk, complapi_chunk = None, chunk

write_streaming_chunk_trace(
write_streaming_chunks_trace(
timestamp=routed_request.timestamp,
calling_method=routed_request.calling_method,
chunk_idx=chunk_idx,
respapi_chunk=respapi_chunk,
complapi_chunk=complapi_chunk,
generic_chunk=generic_chunk,
generic_chunks=generic_chunks,
)

yield generic_chunk
for generic_chunk in generic_chunks:
yield generic_chunk
chunk_idx += 1

# EOF fallback: if provider ended stream without a terminal event and
# we have a pending tool with buffered args, emit once.
# TODO Refactor or get rid of the try/except block below after the
# code in `common/utils.py` is owned (after the vibe-code there is
# replaced with proper code)
try:
eof_chunk = responses_eof_finalize_chunk()
if eof_chunk is not None:
yield eof_chunk
except Exception: # pylint: disable=broad-exception-caught
# Ignore; best-effort fallback
pass

except Exception as e:
raise ProxyError(e) from e

2 changes: 0 additions & 2 deletions claude_code_proxy/proxy_config.py
@@ -12,8 +12,6 @@
REMAP_CLAUDE_SONNET_TO = os.getenv("REMAP_CLAUDE_SONNET_TO", "gpt-5-codex-reason-medium")
REMAP_CLAUDE_OPUS_TO = os.getenv("REMAP_CLAUDE_OPUS_TO", "gpt-5.1-reason-high")

ENFORCE_ONE_TOOL_CALL_PER_RESPONSE = env_var_to_bool(os.getenv("ENFORCE_ONE_TOOL_CALL_PER_RESPONSE"), "true")

# TODO Move these two constants to common/config.py ?
ALWAYS_USE_RESPONSES_API = env_var_to_bool(os.getenv("ALWAYS_USE_RESPONSES_API"), "false")
RESPAPI_ONLY_MODELS = (
21 changes: 11 additions & 10 deletions common/tracing_in_markdown.py
@@ -1,5 +1,5 @@
import json
from typing import Optional
from typing import Collection, Optional

from litellm import ModelResponse, ResponsesAPIResponse

@@ -86,14 +86,14 @@ def write_response_trace(
f.write(f"```json\n{response_complapi.model_dump_json(indent=2)}\n```\n")


def write_streaming_chunk_trace(
def write_streaming_chunks_trace(
*,
timestamp: str,
calling_method: str,
chunk_idx: int,
respapi_chunk: Optional[ResponsesAPIResponse] = None,
complapi_chunk: Optional[ModelResponse] = None,
generic_chunk: Optional[dict] = None,
generic_chunks: Optional[Collection[dict]] = None,
) -> None:
TRACES_DIR.mkdir(parents=True, exist_ok=True)
file = TRACES_DIR / f"{timestamp}_RESPONSE_STREAM.md"
@@ -114,11 +114,12 @@ def write_streaming_chunk_trace(
if complapi_chunk is not None:
f.write(f"### ChatCompletions API:\n```json\n{complapi_chunk.model_dump_json(indent=2)}\n```\n\n")

if generic_chunk is not None:
# TODO Do `gen_chunk.model_dump_json(indent=2)` once it's not
# just a dict
f.write(f"### GenericStreamingChunk:\n```json\n{json.dumps(generic_chunk, indent=2)}\n```\n\n")
if generic_chunks:
f.write("### GenericStreamingChunk(s):\n")
for generic_chunk in generic_chunks:
f.write(f"```json\n{json.dumps(generic_chunk, indent=2)}\n```\n\n")

# Append text only to the text file
with text_file.open("a", encoding="utf-8") as text_f:
text_f.write(generic_chunk["text"])
if generic_chunk["text"]:
# Append the text of the chunk to the text file
with text_file.open("a", encoding="utf-8") as text_f:
text_f.write(generic_chunk["text"])