tool calling result: The weather in Dallas, Texas is 85 degrees Fahrenheit. It is partly cloudy, with highs in the 90s.
 ```

-<!-- TODO: Remove this warning when the openai api supports the max_completion_tokens instead of max_tokens -->
-> [!WARNING]
-> When using LangChain to call the `v1/chat/completions` endpoint, you might encounter an exception related to `max_completion_tokens` if you have specified `max_tokens` in the request.
-> This issue is due to an incompatibility between Triton's OpenAI API frontend and the latest OpenAI API. We are actively working to address this gap. A workaround is adding the `max_tokens` into the `model_kwargs` of the LangChain OpenAI request.
->
-> Example:
-```python
-from langchain.llms import OpenAI
-
-llm = OpenAI(
-    model_name="llama-3.1-8b-instruct",
-    temperature=0.0,
-    model_kwargs={
-        "max_tokens": 4096
-    }
-)
-
-response = llm("Write a short poem about a sunset.")
-print(response)
-```
-
 #### Named Tool Calling

 The OpenAI frontend supports named function calling, utilizing guided decoding in the vLLM and TensorRT-LLM backends. Users can specify one of the tools in `tool_choice` to force the model to select a specific tool for function calling.
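Below is a minimal sketch of a named tool call against this frontend using the `openai` Python client. The `base_url`/port, model name, and `get_current_weather` tool definition are illustrative assumptions rather than values taken from this change; only the `tool_choice` shape follows the OpenAI API.

```python
# Sketch of named tool calling: tool_choice names a specific tool, which the
# frontend enforces via guided decoding in the vLLM/TensorRT-LLM backends.
# base_url, model name, and the tool schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Dallas, Texas?"}],
    tools=tools,
    # Naming a specific tool forces the model to call it.
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)
print(response.choices[0].message.tool_calls)
```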
python/openai/openai_frontend/main.py (+8 lines: 8 additions & 0 deletions)
@@ -143,6 +143,13 @@ def parse_args():
         help="The path to the custom Jinja chat template file. This is useful if you'd like to use a different chat template than the one provided by the model.",
     )

+    triton_group.add_argument(
+        "--default-max-tokens",
+        type=int,
+        default=16,
+        help="The default maximum number of tokens to generate if not specified in the request. The default is 16.",
+    )
python/openai/openai_frontend/schemas/openai.py (+9 -13 lines: 9 additions & 13 deletions)
@@ -103,7 +103,7 @@ class CreateCompletionRequest(BaseModel):
         description="Include the log probabilities on the `logprobs` most likely output tokens, as well the chosen tokens. For example, if `logprobs` is 5, the API will return a list of the 5 most likely tokens. The API will always return the `logprob` of the sampled token, so there may be up to `logprobs+1` elements in the response.\n\nThe maximum value for `logprobs` is 5.\n",
     )
     max_tokens: Optional[conint(ge=0)] = Field(
-        16,
+        None,
         description="The maximum number of [tokens](/tokenizer) that can be generated in the completion.\n\nThe token count of your prompt plus `max_tokens` cannot exceed the model's context length. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens.\n",
@@ -850,11 +842,15 @@ class CreateChatCompletionRequest(BaseModel):
         None,
         description="An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.",
     )
-    # TODO: Consider new max_completion_tokens field in the future: https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_completion_tokens
-    max_tokens: Optional[conint(ge=0)] = Field(
+    max_completion_tokens: Optional[conint(ge=0)] = Field(
         None,
         description="The maximum number of [tokens](/tokenizer) that can be generated in the chat completion.\n\nThe total length of input tokens and generated tokens is limited by the model's context length. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens.\n",
     )
+    # TODO: Remove support for max_tokens field in the future: https://platform.openai.com/docs/api-reference/chat/create#chat-create-max_completion_tokens
+    max_tokens: Optional[conint(ge=0)] = Field(
+        None,
+        description="DEPRECATED: Use `max_completion_tokens` instead. The maximum number of [tokens](/tokenizer) that can be generated in the chat completion.\n\nThe total length of input tokens and generated tokens is limited by the model's context length. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens.\n",
+    )
     # TODO: Extension, flesh out description and defaults
     min_tokens: Optional[conint(ge=0)] = Field(
         None,
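With this schema change, chat requests can bound generation with `max_completion_tokens`, while `max_tokens` is still accepted but deprecated. A minimal sketch using a recent `openai` client (which exposes the `max_completion_tokens` parameter) follows; the `base_url` and model name are assumptions.

```python
# Sketch: prefer max_completion_tokens; max_tokens remains accepted but is deprecated.
# base_url and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a short poem about a sunset."}],
    max_completion_tokens=128,
)
print(chat.choices[0].message.content)
```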
@@ -871,7 +867,7 @@ class CreateChatCompletionRequest(BaseModel):
     )
     response_format: Optional[ResponseFormat] = Field(
         None,
-        description='An object specifying the format that the model must output. Compatible with [GPT-4 Turbo](/docs/models/gpt-4-and-gpt-4-turbo) and all GPT-3.5 Turbo models newer than `gpt-3.5-turbo-1106`.\n\nSetting to `{ "type": "json_object" }` enables JSON mode, which guarantees the message the model generates is valid JSON.\n\n**Important:** when using JSON mode, you **must** also instruct the model to produce JSON yourself via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly "stuck" request. Also note that the message content may be partially cut off if `finish_reason="length"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length.\n',
+        description='An object specifying the format that the model must output. Compatible with [GPT-4 Turbo](/docs/models/gpt-4-and-gpt-4-turbo) and all GPT-3.5 Turbo models newer than `gpt-3.5-turbo-1106`.\n\nSetting to `{ "type": "json_object" }` enables JSON mode, which guarantees the message the model generates is valid JSON.\n\n**Important:** when using JSON mode, you **must** also instruct the model to produce JSON yourself via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly "stuck" request. Also note that the message content may be partially cut off if `finish_reason="length"`, which indicates the generation exceeded `max_completion_tokens` or the conversation exceeded the max context length.\n',
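For completeness, the JSON-mode behavior described in that field can be exercised as in the sketch below, assuming the backing model and backend honor `response_format`; note that the messages themselves must still ask for JSON. The `base_url` and model name are assumptions.

```python
# Sketch of JSON mode: response_format alone is not enough; the messages must
# also instruct the model to produce JSON, per the description above.
# base_url and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "You respond only with a JSON object."},
        {"role": "user", "content": "Report the weather in Dallas as JSON."},
    ],
    response_format={"type": "json_object"},
)
print(chat.choices[0].message.content)
```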