feat: add tool calling to the OpenAI frontend (#8134)
Add tool-call parser implementations to the OpenAI frontend; the available parsers are `llama3` and `mistral`. Most of the implementation comes from vLLM. Users can specify a parser with the `--tool-call-parser` argument.
Add the `--chat-template {chat template file path}` argument to let users supply a customized template to better tune the prompt for tool calling.
Integrate the guided decoding backend with tool calling to enable the named and required tool-calling functionalities.
Please see the changes to README.md for more detail.
All changes in `python/openai/openai_frontend/engine/utils/tool_call_parsers` are taken from vLLM with some minor compatibility changes.
The OpenAI frontend supports `tools` and `tool_choice` in the `v1/chat/completions` API. Please refer to the OpenAI API reference for more details about these parameters.
To enable the tool-calling feature, add the `--tool-call-parser {parser_name}` flag when starting the server. The two available parsers are `llama3` and `mistral`.
The `llama3` parser supports tool-calling features for LLaMA 3.1, 3.2, and 3.3 models, while the `mistral` parser supports tool-calling features for the Mistral Instruct model.
Example of launching the OpenAI frontend with a tool-call parser:
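A minimal launch sketch, assuming the `openai_frontend/main.py` entry point and a vLLM model repository (the repository path and tokenizer name are placeholders; only `--tool-call-parser` and `--chat-template` are flags introduced by this change):

```bash
# Model repository and tokenizer are placeholders; substitute your own.
python3 openai_frontend/main.py \
    --model-repository tests/vllm_models \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --tool-call-parser llama3
```

A customized chat template can also be supplied via `--chat-template {chat template file path}` to better tune the prompt for tool calling.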
With a weather tool defined on the client side, a tool-calling request followed by the tool's result produces output along these lines:

```
function arguments: {"city": "Dallas", "state": "TX", "unit": "fahrenheit"}
tool calling result: The weather in Dallas, Texas is 85 degrees fahrenheit. It is partly cloudy, with highs in the 90s.
```
<!-- TODO: Remove this warning when the openai api supports the max_completion_tokens instead of max_tokens -->

> [!WARNING]
> When using LangChain to call the `v1/chat/completions` endpoint, you might encounter an exception related to `max_completion_tokens` if you have specified `max_tokens` in the request.
> This issue is due to an incompatibility between Triton's OpenAI API frontend and the latest OpenAI API. We are actively working to address this gap. As a workaround, add `max_tokens` to the `model_kwargs` of the LangChain OpenAI request.
>
> Example:
```python
from langchain.llms import OpenAI

llm = OpenAI(
    model_name="llama-3.1-8b-instruct",
    temperature=0.0,
    # Passing max_tokens through model_kwargs sends it to the server as-is,
    # avoiding the client-side max_completion_tokens incompatibility.
    model_kwargs={
        "max_tokens": 4096
    }
)

response = llm("Write a short poem about a sunset.")
print(response)
```
#### Named Tool Calling
The OpenAI frontend supports named function calling, utilizing guided decoding in the vLLM and TensorRT-LLM backends. Users can specify one of the tools in `tool_choice` to force the model to select a specific tool for function calling.
> [!NOTE]
> The latest release of TensorRT-LLM (v0.18.0) does not yet support guided decoding. To enable this feature, use a build from the main branch of TensorRT-LLM.
> For instructions on enabling guided decoding in the TensorRT-LLM backend, please refer to [this guide](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/guided_decoding.md).
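A hedged sketch of named tool calling with the standard OpenAI Python client follows; the port, model name, and `get_current_weather` tool are illustrative assumptions, not part of the frontend:

```python
from openai import OpenAI

# The endpoint and model name are placeholders for a locally running frontend.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

# A hypothetical weather tool; any JSON-schema function definition works here.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a US city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "state": {"type": "string"},
            },
            "required": ["city", "state"],
        },
    },
}

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Dallas, TX?"}],
    tools=[weather_tool],
    # Naming a specific function in tool_choice forces the model to call it;
    # the backend's guided decoding constrains the output to that tool.
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```

Because `tool_choice` names a single function, the response should be a call to that function rather than free-form text.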