diff --git a/docs/reference/inference/chat-completion-inference.asciidoc b/docs/reference/inference/chat-completion-inference.asciidoc new file mode 100644 index 0000000000000..83a8f94634f2f --- /dev/null +++ b/docs/reference/inference/chat-completion-inference.asciidoc @@ -0,0 +1,417 @@ +[role="xpack"] +[[chat-completion-inference-api]] +=== Chat completion inference API + +Streams a chat completion response. + +IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. +For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models. +However, if you do not plan to use the {infer} APIs to use these models or if you want to use non-NLP models, use the <>. + + +[discrete] +[[chat-completion-inference-api-request]] +==== {api-request-title} + +`POST /_inference//_unified` + +`POST /_inference/chat_completion//_unified` + + +[discrete] +[[chat-completion-inference-api-prereqs]] +==== {api-prereq-title} + +* Requires the `monitor_inference` <> +(the built-in `inference_admin` and `inference_user` roles grant this privilege) +* You must use a client that supports streaming. + + +[discrete] +[[chat-completion-inference-api-desc]] +==== {api-description-title} + +The chat completion {infer} API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation. +It only works with the `chat_completion` task type for `openai` and `elastic` {infer} services. + +[NOTE] +==== +The `chat_completion` task type is only available within the _unified API and only supports streaming. +==== + +[discrete] +[[chat-completion-inference-api-path-params]] +==== {api-path-parms-title} + +``:: +(Required, string) +The unique identifier of the {infer} endpoint. 
+ + +``:: +(Optional, string) +The type of {infer} task that the model performs. If included, this must be set to the value `chat_completion`. + + +[discrete] +[[chat-completion-inference-api-request-body]] +==== {api-request-body-title} + +`messages`:: +(Required, array of objects) A list of objects representing the conversation. +Requests should generally only add new messages from the user (role `user`). The other message roles (`assistant`, `system`, or `tool`) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation. ++ +.Assistant message +[%collapsible%closed] +===== +`content`:: +(Required unless `tool_calls` is specified, string or array of objects) +The contents of the message. ++ +include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples] ++ +`role`:: +(Required, string) +The role of the message author. This should be set to `assistant` for this type of message. ++ +`tool_calls`:: +(Optional, array of objects) +The tool calls generated by the model. ++ +.Examples +[%collapsible%closed] +====== +[source,js] +------------------------------------------------------------ +{ + "tool_calls": [ + { + "id": "call_KcAjWtAww20AihPHphUh46Gd", + "type": "function", + "function": { + "name": "get_current_weather", + "arguments": "{\"location\":\"Boston, MA\"}" + } + } + ] +} +------------------------------------------------------------ +// NOTCONSOLE +====== ++ +`id`::: +(Required, string) +The identifier of the tool call. ++ +`type`::: +(Required, string) +The type of tool call. This must be set to the value `function`. ++ +`function`::: +(Required, object) +The function that the model called. ++ +`name`:::: +(Required, string) +The name of the function to call. ++ +`arguments`:::: +(Required, string) +The arguments to call the function with in JSON format. 
+=====
++
+.System message
+[%collapsible%closed]
+=====
+`content`::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `system` for this type of message.
+=====
++
+.Tool message
+[%collapsible%closed]
+=====
+`content`::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `tool` for this type of message.
++
+`tool_call_id`::
+(Required, string)
+The identifier of the tool call that this message is responding to.
+=====
++
+.User message
+[%collapsible%closed]
+=====
+`content`::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `user` for this type of message.
+=====
+
+`model`::
+(Optional, string)
+The ID of the model to use. By default, the model ID is set to the value included when creating the inference endpoint.
+
+`max_completion_tokens`::
+(Optional, integer)
+The upper bound limit for the number of tokens that can be generated for a completion request.
+
+`stop`::
+(Optional, array of strings)
+A sequence of strings to control when the model should stop generating additional tokens.
+
+`temperature`::
+(Optional, float)
+The sampling temperature to use.
+
+`tools`::
+(Optional, array of objects)
+A list of tools that the model can call.
++
+.Structure
+[%collapsible%closed]
+=====
+`type`::
+(Required, string)
+The type of tool. This must be set to the value `function`.
++
+`function`::
+(Required, object)
+The function definition.
++
+`description`:::
+(Optional, string)
+A description of what the function does. This is used by the model to choose when and how to call the function.
++
+`name`:::
+(Required, string)
+The name of the function.
++
+`parameters`:::
+(Optional, object)
+The parameters the function accepts. This should be formatted as a JSON object.
++
+`strict`:::
+(Optional, boolean)
+Whether to enable schema adherence when generating the function call.
+=====
++
+.Examples
+[%collapsible%closed]
+======
+[source,js]
+------------------------------------------------------------
+{
+    "tools": [
+        {
+            "type": "function",
+            "function": {
+                "name": "get_price_of_item",
+                "description": "Get the current price of an item",
+                "parameters": {
+                    "type": "object",
+                    "properties": {
+                        "item": {
+                            "id": "12345"
+                        },
+                        "unit": {
+                            "type": "currency"
+                        }
+                    }
+                }
+            }
+        }
+    ]
+}
+------------------------------------------------------------
+// NOTCONSOLE
+======
+
+`tool_choice`::
+(Optional, string or object)
+Controls which tool is called by the model.
++
+String representation:::
+One of `auto`, `none`, or `required`. `auto` allows the model to choose between calling tools and generating a message. `none` causes the model to not call any tools. `required` forces the model to call one or more tools.
++
+Object representation:::
++
+.Structure
+[%collapsible%closed]
+=====
+`type`::
+(Required, string)
+The type of the tool. This must be set to the value `function`.
++
+`function`::
+(Required, object)
++
+`name`:::
+(Required, string)
+The name of the function to call.
+=====
++
+.Examples
+[%collapsible%closed]
+=====
+[source,js]
+------------------------------------------------------------
+{
+    "tool_choice": {
+        "type": "function",
+        "function": {
+            "name": "get_current_weather"
+        }
+    }
+}
+------------------------------------------------------------
+// NOTCONSOLE
+=====
+
+`top_p`::
+(Optional, float)
+Nucleus sampling, an alternative to sampling with temperature.
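The request-body rules above (new `user` messages, with `assistant` and `tool` messages copied back from earlier responses, and `tool` messages matched by `tool_call_id`) can be sketched client-side as follows. This is an illustrative sketch only, not part of the API; the helper name, IDs, and values are made up.

```python
import json


def append_tool_round_trip(messages, assistant_tool_calls, tool_outputs):
    """Copy the assistant's tool_calls message into the history, then add
    one `tool` message per call, matched by `tool_call_id`."""
    messages.append({
        "role": "assistant",
        "tool_calls": assistant_tool_calls,  # content may be omitted when tool_calls is set
    })
    for call in assistant_tool_calls:
        messages.append({
            "role": "tool",
            "content": tool_outputs[call["id"]],
            "tool_call_id": call["id"],  # must match the tool call it answers
        })
    return messages


messages = [{"role": "user", "content": "What is the weather in Boston?"}]
calls = [{
    "id": "call_abc123",  # illustrative identifier
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "arguments": json.dumps({"location": "Boston, MA"}),
    },
}]
messages = append_tool_round_trip(
    messages, calls, {"call_abc123": "The weather is cold"}
)
```

The resulting `messages` array (user turn, assistant tool call, tool result) is what the next completion request would send, mirroring the worked example in the next section.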
+ +[discrete] +[[chat-completion-inference-api-example]] +==== {api-examples-title} + +The following example performs a chat completion on the example question with streaming. + + +[source,console] +------------------------------------------------------------ +POST _inference/chat_completion/openai-completion/_stream +{ + "model": "gpt-4o", + "messages": [ + { + "role": "user", + "content": "What is Elastic?" + } + ] +} +------------------------------------------------------------ +// TEST[skip:TBD] + +The following example performs a chat completion using an Assistant message with `tool_calls`. + +[source,console] +------------------------------------------------------------ +POST _inference/chat_completion/openai-completion/_stream +{ + "messages": [ + { + "role": "assistant", + "content": "Let's find out what the weather is", + "tool_calls": [ <1> + { + "id": "call_KcAjWtAww20AihPHphUh46Gd", + "type": "function", + "function": { + "name": "get_current_weather", + "arguments": "{\"location\":\"Boston, MA\"}" + } + } + ] + }, + { <2> + "role": "tool", + "content": "The weather is cold", + "tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd" + } + ] +} +------------------------------------------------------------ +// TEST[skip:TBD] + +<1> Each tool call needs a corresponding Tool message. +<2> The corresponding Tool message. + +The following example performs a chat completion using a User message with `tools` and `tool_choice`. + +[source,console] +------------------------------------------------------------ +POST _inference/chat_completion/openai-completion/_stream +{ + "messages": [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "What's the price of a scarf?" 
+ } + ] + } + ], + "tools": [ + { + "type": "function", + "function": { + "name": "get_current_price", + "description": "Get the current price of a item", + "parameters": { + "type": "object", + "properties": { + "item": { + "id": "123" + } + } + } + } + } + ], + "tool_choice": { + "type": "function", + "function": { + "name": "get_current_price" + } + } +} +------------------------------------------------------------ +// TEST[skip:TBD] + +The API returns the following response when a request is made to the OpenAI service: + + +[source,txt] +------------------------------------------------------------ +event: message +data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}} + +event: message +data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}} + +event: message +data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}} + +(...) + +event: message +data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}} <1> + +event: message +data: [DONE] +------------------------------------------------------------ +// NOTCONSOLE + +<1> The last object message of the stream contains the token usage information. 
diff --git a/docs/reference/inference/inference-apis.asciidoc b/docs/reference/inference/inference-apis.asciidoc index ca273afc478ea..4f27409973ca2 100644 --- a/docs/reference/inference/inference-apis.asciidoc +++ b/docs/reference/inference/inference-apis.asciidoc @@ -26,6 +26,7 @@ the following APIs to manage {infer} models and perform {infer}: * <> * <> * <> +* <> * <> [[inference-landscape]] @@ -34,9 +35,9 @@ image::images/inference-landscape.jpg[A representation of the Elastic inference An {infer} endpoint enables you to use the corresponding {ml} model without manual deployment and apply it to your data at ingestion time through -<>. +<>. -Choose a model from your provider or use ELSER – a retrieval model trained by +Choose a model from your provider or use ELSER – a retrieval model trained by Elastic –, then create an {infer} endpoint by the <>. Now use <> to perform <> on your data. @@ -67,7 +68,7 @@ The following list contains the default {infer} endpoints listed by `inference_i Use the `inference_id` of the endpoint in a <> field definition or when creating an <>. The API call will automatically download and deploy the model which might take a couple of minutes. Default {infer} enpoints have {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[adaptive allocations] enabled. -For these models, the minimum number of allocations is `0`. +For these models, the minimum number of allocations is `0`. If there is no {infer} activity that uses the endpoint, the number of allocations will scale down to `0` automatically after 15 minutes. @@ -84,7 +85,7 @@ Returning a long document in search results is less useful than providing the mo Each chunk will include the text subpassage and the corresponding embedding generated from it. By default, documents are split into sentences and grouped in sections up to 250 words with 1 sentence overlap so that each chunk shares a sentence with the previous chunk. 
-Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break. +Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break. {es} uses the https://unicode-org.github.io/icu-docs/[ICU4J] library to detect word and sentence boundaries for chunking. https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary[Word boundaries] are identified by following a series of rules, not just the presence of a whitespace character. @@ -135,6 +136,7 @@ PUT _inference/sparse_embedding/small_chunk_size include::delete-inference.asciidoc[] include::get-inference.asciidoc[] include::post-inference.asciidoc[] +include::chat-completion-inference.asciidoc[] include::put-inference.asciidoc[] include::stream-inference.asciidoc[] include::update-inference.asciidoc[] diff --git a/docs/reference/inference/inference-shared.asciidoc b/docs/reference/inference/inference-shared.asciidoc index da497c6581e5d..b133c54082810 100644 --- a/docs/reference/inference/inference-shared.asciidoc +++ b/docs/reference/inference/inference-shared.asciidoc @@ -41,7 +41,7 @@ end::chunking-settings[] tag::chunking-settings-max-chunking-size[] Specifies the maximum size of a chunk in words. Defaults to `250`. -This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy). +This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy). end::chunking-settings-max-chunking-size[] tag::chunking-settings-overlap[] @@ -63,4 +63,48 @@ Specifies the chunking strategy. It could be either `sentence` or `word`. 
end::chunking-settings-strategy[] +tag::chat-completion-schema-content-with-examples[] +.Examples +[%collapsible%closed] +====== +String example +[source,js] +------------------------------------------------------------ +{ + "content": "Some string" +} +------------------------------------------------------------ +// NOTCONSOLE + +Object example +[source,js] +------------------------------------------------------------ +{ + "content": [ + { + "text": "Some text", + "type": "text" + } + ] +} +------------------------------------------------------------ +// NOTCONSOLE +====== + +String representation::: +(Required, string) +The text content. ++ +Object representation::: +`text`:::: +(Required, string) +The text content. ++ +`type`:::: +(Required, string) +This must be set to the value `text`. +end::chat-completion-schema-content-with-examples[] +tag::chat-completion-docs[] +For more information on how to use the `chat_completion` task type, please refer to the <>. +end::chat-completion-docs[] diff --git a/docs/reference/inference/put-inference.asciidoc b/docs/reference/inference/put-inference.asciidoc index c203b610169e6..fb73c70a54658 100644 --- a/docs/reference/inference/put-inference.asciidoc +++ b/docs/reference/inference/put-inference.asciidoc @@ -42,7 +42,7 @@ include::inference-shared.asciidoc[tag=inference-id] include::inference-shared.asciidoc[tag=task-type] + -- -Refer to the service list in the <> for the available task types. +Refer to the service list in the <> for the available task types. -- @@ -61,7 +61,7 @@ The create {infer} API enables you to create an {infer} endpoint and configure a The following services are available through the {infer} API. -You can find the available task types next to the service name. +You can find the available task types next to the service name. 
Click the links to review the configuration details of the services: * <> (`completion`, `rerank`, `sparse_embedding`, `text_embedding`) @@ -73,10 +73,10 @@ Click the links to review the configuration details of the services: * <> (`rerank`, `sparse_embedding`, `text_embedding` - this service is for built-in models and models uploaded through Eland) * <> (`sparse_embedding`) * <> (`completion`, `text_embedding`) -* <> (`rerank`, `text_embedding`) +* <> (`rerank`, `text_embedding`) * <> (`text_embedding`) * <> (`text_embedding`) -* <> (`completion`, `text_embedding`) +* <> (`chat_completion`, `completion`, `text_embedding`) * <> (`text_embedding`) * <> (`text_embedding`, `rerank`) diff --git a/docs/reference/inference/service-openai.asciidoc b/docs/reference/inference/service-openai.asciidoc index e4be7f18e09dd..590f280b1c494 100644 --- a/docs/reference/inference/service-openai.asciidoc +++ b/docs/reference/inference/service-openai.asciidoc @@ -31,10 +31,18 @@ include::inference-shared.asciidoc[tag=task-type] -- Available task types: +* `chat_completion`, * `completion`, * `text_embedding`. -- +[NOTE] +==== +The `chat_completion` task type only supports streaming and only through the `_unified` API. + +include::inference-shared.asciidoc[tag=chat-completion-docs] +==== + [discrete] [[infer-service-openai-api-request-body]] ==== {api-request-body-title} @@ -61,7 +69,7 @@ include::inference-shared.asciidoc[tag=chunking-settings-strategy] `service`:: (Required, string) -The type of service supported for the specified task type. In this case, +The type of service supported for the specified task type. In this case, `openai`. 
`service_settings`:: @@ -176,4 +184,4 @@ PUT _inference/completion/openai-completion } } ------------------------------------------------------------ -// TEST[skip:TBD] \ No newline at end of file +// TEST[skip:TBD] diff --git a/docs/reference/inference/stream-inference.asciidoc b/docs/reference/inference/stream-inference.asciidoc index 42abb589f9afd..4a3ce31909712 100644 --- a/docs/reference/inference/stream-inference.asciidoc +++ b/docs/reference/inference/stream-inference.asciidoc @@ -38,8 +38,12 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo ==== {api-description-title} The stream {infer} API enables real-time responses for completion tasks by delivering answers incrementally, reducing response times during computation. -It only works with the `completion` task type. +It only works with the `completion` and `chat_completion` task types. +[NOTE] +==== +include::inference-shared.asciidoc[tag=chat-completion-docs] +==== [discrete] [[stream-inference-api-path-params]]