Hello,
I am looking at implementing conversational memory in proxy hooks. My two persistent questions have been how to detect which thread I am in, and how to keep the actual conversation history as the model saw it (since memory injections make it differ from what the client sends).
If I use the Responses API, both questions vanish. Thread identification is trivial, and the provider (or, for non-OpenAI models, LiteLLM) keeps the correct history as the model saw it.
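For context, the thread chaining I mean looks roughly like this (a minimal sketch; I am assuming litellm.responses() mirrors the OpenAI Responses API parameters, and the model name is just a placeholder):

```python
import litellm

# First turn: no previous_response_id, the provider starts a new thread.
first = litellm.responses(
    model="gpt-4.1-mini",  # placeholder model name
    input="Remember that my dog is called Rex.",
)

# Follow-up turn: the response id is the thread identity, and the
# server-side stored history is the history as the model actually saw it.
follow_up = litellm.responses(
    model="gpt-4.1-mini",
    input="What is my dog called?",
    previous_response_id=first.id,
)
```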
However: I also would like to let the model call a tool for additional memory retrieval. This would be quite doable with a ChatCompletions async_post_call_hook implementation, by intercepting the tool call and, after providing the result, calling the model again. But with Responses, how do I do the "calling the model again" part so that the response chain is maintained correctly?
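For concreteness, this is roughly the ChatCompletions-style flow I have in mind (hook name and signature as I read the LiteLLM custom-logger docs; run_memory_tool() and the retrieve_memory tool are placeholders, and I am not certain the returned value actually replaces the response sent to the client):

```python
import litellm
from litellm.integrations.custom_logger import CustomLogger
from litellm.proxy._types import UserAPIKeyAuth


def run_memory_tool(arguments: str) -> str:
    """Placeholder memory lookup; a real retrieval backend would go here."""
    return "(no additional memories found)"


class MemoryHook(CustomLogger):
    async def async_post_call_success_hook(
        self, data: dict, user_api_key_dict: UserAPIKeyAuth, response
    ):
        choice = response.choices[0]
        tool_calls = getattr(choice.message, "tool_calls", None) or []
        memory_calls = [t for t in tool_calls if t.function.name == "retrieve_memory"]
        if not memory_calls:
            return response

        # Append the assistant turn and the tool results to the request
        # messages, then call the model again with the extended history.
        messages = list(data["messages"])
        messages.append(choice.message.model_dump())
        for t in memory_calls:
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": t.id,
                    "content": run_memory_tool(t.function.arguments),
                }
            )

        # Assumption: returning a new ModelResponse here replaces what the
        # client receives.
        return await litellm.acompletion(model=data["model"], messages=messages)
```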
The only idea I have is to send a new request to the very same endpoint, but insert data (probably in "metadata") so that the resulting calls to async_pre_call_hook and async_post_call_hook detect it and become no-ops. But this would mean the hook now needs to know the exact endpoint and the full model name with prefixes?
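Something like this is what I mean by the metadata flag (assuming request metadata is visible to the hooks under data["metadata"]; the key name is made up):

```python
from litellm.integrations.custom_logger import CustomLogger

INTERNAL_FLAG = "memory_followup"  # made-up key to mark my own follow-up calls


def is_internal_followup(data: dict) -> bool:
    return bool((data.get("metadata") or {}).get(INTERNAL_FLAG))


class MemoryHook(CustomLogger):
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        if is_internal_followup(data):
            return data  # no-op: this request was issued by the hook itself
        # ... normal memory injection would happen here ...
        return data
```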
EDIT: Or can I just call litellm.responses() from the hook, and will it use the same router as the proxy itself?
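I.e. roughly this, continuing the chain via previous_response_id and submitting the tool output as a function_call_output input item (untested; whether litellm.aresponses() goes through the proxy's router or only the module-level config is exactly what I am unsure about):

```python
import litellm


async def continue_response_chain(
    model: str, previous_response_id: str, call_id: str, tool_output: str
):
    # call_id would come from the intercepted tool call in the response.
    return await litellm.aresponses(
        model=model,
        previous_response_id=previous_response_id,  # keep the server-side chain intact
        input=[
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": tool_output,
            }
        ],
    )
```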