Hello,
I am looking at implementing conversational memory in proxy hooks. My two persistent questions have been how to detect which thread I am in, and how to keep the actual conversation history as the model saw it (since memory injections make it differ from what the client sends).
If I use the Responses API, both questions vanish. Thread identification is trivial, and the provider (or, for non-OpenAI models, LiteLLM) keeps the correct history as the model saw it.
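For context, the thread chaining I mean looks roughly like this (a minimal sketch; I am assuming litellm.responses() mirrors the OpenAI Responses API parameters, and the model name is just a placeholder):

```python
import litellm

# First turn: no previous_response_id, the provider starts a new thread.
first = litellm.responses(
    model="gpt-4.1-mini",  # placeholder model name
    input="Remember that my dog is called Rex.",
)

# Follow-up turn: the response id is the thread identity, and the
# server-side stored history is the history as the model actually saw it.
follow_up = litellm.responses(
    model="gpt-4.1-mini",
    input="What is my dog called?",
    previous_response_id=first.id,
)
```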
However: I also would like to let the model call a tool for additional memory retrieval. This would be quite doable with a ChatCompletions async_post_call_hook implementation, by intercepting the tool call and, after providing the result, calling the model again. But with Responses, how do I do the "calling the model again" part so that the response chain is maintained correctly?
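For concreteness, this is roughly the ChatCompletions-style flow I have in mind (hook name and signature as I read the LiteLLM custom-logger docs; run_memory_tool() and the retrieve_memory tool are placeholders, and I am not certain the returned value actually replaces the response sent to the client):

```python
import litellm
from litellm.integrations.custom_logger import CustomLogger
from litellm.proxy._types import UserAPIKeyAuth


def run_memory_tool(arguments: str) -> str:
    """Placeholder memory lookup; a real retrieval backend would go here."""
    return "(no additional memories found)"


class MemoryHook(CustomLogger):
    async def async_post_call_success_hook(
        self, data: dict, user_api_key_dict: UserAPIKeyAuth, response
    ):
        choice = response.choices[0]
        tool_calls = getattr(choice.message, "tool_calls", None) or []
        memory_calls = [t for t in tool_calls if t.function.name == "retrieve_memory"]
        if not memory_calls:
            return response

        # Append the assistant turn and the tool results to the request
        # messages, then call the model again with the extended history.
        messages = list(data["messages"])
        messages.append(choice.message.model_dump())
        for t in memory_calls:
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": t.id,
                    "content": run_memory_tool(t.function.arguments),
                }
            )

        # Assumption: returning a new ModelResponse here replaces what the
        # client receives.
        return await litellm.acompletion(model=data["model"], messages=messages)
```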
The only idea I have is to send a new request to the very same endpoint, but insert data (probably in "metadata") so that the resulting calls to async_pre_call_hook and async_post_call_hook detect it and become no-ops. But this would mean the hook now needs to know the exact endpoint and the full model name with prefixes?
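Something like this is what I mean by the metadata flag (assuming request metadata is visible to the hooks under data["metadata"]; the key name is made up):

```python
from litellm.integrations.custom_logger import CustomLogger

INTERNAL_FLAG = "memory_followup"  # made-up key to mark my own follow-up calls


def is_internal_followup(data: dict) -> bool:
    return bool((data.get("metadata") or {}).get(INTERNAL_FLAG))


class MemoryHook(CustomLogger):
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        if is_internal_followup(data):
            return data  # no-op: this request was issued by the hook itself
        # ... normal memory injection would happen here ...
        return data
```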
EDIT: Or can I just call litellm.responses() from the hook, and will it use the same router as the proxy itself?
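I.e. roughly this, continuing the chain via previous_response_id and submitting the tool output as a function_call_output input item (untested; whether litellm.aresponses() goes through the proxy's router or only the module-level config is exactly what I am unsure about):

```python
import litellm


async def continue_response_chain(
    model: str, previous_response_id: str, call_id: str, tool_output: str
):
    # call_id would come from the intercepted tool call in the response.
    return await litellm.aresponses(
        model=model,
        previous_response_id=previous_response_id,  # keep the server-side chain intact
        input=[
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": tool_output,
            }
        ],
    )
```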