feat: enable function calling support for streaming responses #102
base: main
Conversation
- Removed the restriction in LitellmProcessor that disabled streaming when tools are present
- Implemented `_handle_streaming_tool_calls` in LLMProcessor to aggregate chunks, reconstruct tool calls, and handle recursion
- Updated `_completion_with_tools` to delegate to the streaming handler when `stream=True` (a sketch of this delegation follows the list)
- Added unit tests covering streaming tool calls and recursive execution
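As a rough illustration of that delegation point: only the names `_completion_with_tools` and `_handle_streaming_tool_calls` come from this PR; the class layout, attributes, and return handling below are assumptions for the sketch, not the repository's actual code.

```python
# Hypothetical sketch of the delegation described above. The surrounding
# class structure and attribute names are illustrative assumptions.
import litellm


class LLMProcessor:
    def __init__(self, model, tools=None):
        self.model = model
        self.tools = tools or []

    def _completion_with_tools(self, messages, stream=False):
        """Run a completion; hand streaming requests to the tool-call handler."""
        if stream:
            # Streaming and tools are no longer mutually exclusive: the
            # handler aggregates chunks and executes tools mid-stream.
            return self._handle_streaming_tool_calls(messages)

        # Non-streaming path: a single blocking completion call.
        return litellm.completion(
            model=self.model,
            messages=messages,
            tools=self.tools,
        )
```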
Thanks for the pull request, @Pavilion4ik! Once you've gone through the following steps, feel free to tag the repository's maintainers in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval
If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 Provide context
To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

🔘 Get a green build
If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?
If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?
Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #102 +/- ##
==========================================
+ Coverage 90.74% 90.86% +0.11%
==========================================
Files 48 48
Lines 4389 4487 +98
Branches 271 284 +13
==========================================
+ Hits 3983 4077 +94
- Misses 317 319 +2
- Partials 89 91 +2
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Hi @Pavilion4ik, I did not notice this was open. I assigned myself to take a look soon.
Hi @Pavilion4ik, I haven't checked the static code yet, but I tried to run it and got an error:
My profile's config:


This PR refactors the LiteLLM-based processors to support streaming responses even when OpenAI function calling (tools) is enabled. Specifically, it includes:
- Chunk aggregation: Added logic to buffer streaming chunks in LLMProcessor, reconstruct fragmented tool call arguments, and execute the tools once the stream for that specific call is complete.
- Recursive streaming: Implemented `yield from` recursion in `_handle_streaming_tool_calls` to allow the LLM to call a function, receive the output, and continue streaming the final text response to the user (see the sketch after this list).
- Educator processor update: Enabled streaming in EducatorAssistantProcessor for general chat, while explicitly forcing non-streaming mode for `generate_quiz_questions`, since it requires full-response JSON validation and retry logic.
- Unit tests: Added comprehensive tests to verify that streaming works correctly with single and multiple tool calls.
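For reference, here is a minimal sketch of how the chunk aggregation and `yield from` recursion described above could fit together, assuming the OpenAI-style streaming format that LiteLLM follows (`delta.content` plus `delta.tool_calls` carrying per-index argument fragments). The `tool_registry` attribute (tool name → callable) and the message-history handling are illustrative assumptions, not the repository's actual code.

```python
# A minimal sketch of a streaming tool-call handler, assuming OpenAI-style
# chunk deltas. Attribute and helper names are illustrative assumptions.
import json
import litellm


class LLMProcessor:
    def __init__(self, model, tools=None, tool_registry=None):
        self.model = model
        self.tools = tools or []
        self.tool_registry = tool_registry or {}

    def _handle_streaming_tool_calls(self, messages):
        """Stream text as it arrives; buffer tool-call fragments, execute, recurse."""
        response = litellm.completion(
            model=self.model,
            messages=messages,
            tools=self.tools,
            stream=True,
        )

        pending = {}  # tool-call index -> partially reconstructed call
        for chunk in response:
            delta = chunk.choices[0].delta
            # Plain text deltas are forwarded to the caller immediately.
            if delta.content:
                yield delta.content
            # Tool-call fragments are merged by index until the stream ends.
            for call in delta.tool_calls or []:
                slot = pending.setdefault(
                    call.index, {"id": None, "name": "", "arguments": ""}
                )
                slot["id"] = call.id or slot["id"]
                if call.function.name:
                    slot["name"] += call.function.name
                if call.function.arguments:
                    slot["arguments"] += call.function.arguments

        if not pending:
            return  # pure text response; nothing left to do

        # Record the assistant's tool calls, then execute each one and append
        # its result so the model can see the outputs on the next turn.
        tool_calls = [
            {
                "id": slot["id"],
                "type": "function",
                "function": {"name": slot["name"], "arguments": slot["arguments"]},
            }
            for slot in pending.values()
        ]
        messages = messages + [
            {"role": "assistant", "content": None, "tool_calls": tool_calls}
        ]
        for slot in pending.values():
            result = self.tool_registry[slot["name"]](**json.loads(slot["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": slot["id"], "content": str(result)}
            )

        # Recurse so the model can stream its final answer (or call more tools).
        yield from self._handle_streaming_tool_calls(messages)
```

In this shape the caller simply iterates the generator and renders tokens as they arrive; tool execution and recursion stay invisible to it, which is the "best of both worlds" behavior described below.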
Why?
Previously, LitellmProcessor explicitly disabled streaming if any tools were configured. This resulted in a poor user experience (UX): users had to wait for the entire generation to finish before seeing any text, simply because a tool might have been used. This change allows for a "best of both worlds" scenario: immediate feedback via streaming for text responses, and correct execution of background functions when the model decides to use them.