
Commit 02f36c5

No usage in streaming mode for legacy (openvinotoolkit#3428) (openvinotoolkit#3449)
1 parent e090ff1 commit 02f36c5


8 files changed: +71, -32 lines


demos/README.md

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ OpenVINO Model Server demos have been created to showcase the usage of the model
 ### Check Out New Generative AI Demos
 | Demo | Description |
 |---|---|
-|[AI Agents with MCP servers and serving language models](./continuous_batching/agentic_ai/README.md)|OpenAI agents with MPC servers and serving LLM models|
+|[AI Agents with MCP servers and serving language models](./continuous_batching/agentic_ai/README.md)|OpenAI agents with MCP servers and serving LLM models|
 |[LLM Text Generation with continuous batching](continuous_batching/README.md)|Generate text with LLM models and continuous batching pipeline|
 |[VLM Text Generation with continuous batching](continuous_batching/vlm/README.md)|Generate text with VLM models and continuous batching pipeline|
 |[OpenAI API text embeddings ](embeddings/README.md)|Get text embeddings via endpoint compatible with OpenAI API|

demos/continuous_batching/agentic_ai/README.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ docker build . -t mcp_weather_server
 docker run -d -v $(pwd)/src/mcp_weather_server:/mcp_weather_server -p 8080:8080 mcp_weather_server bash -c ". .venv/bin/activate ; python /mcp_weather_server/server-see.py"
 ```

-> **Note:** On Windows the MCP server will be demonstrated as an instance with stdip interface inside the agent application
+> **Note:** On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application

 ## Start the agent

docs/llm/reference.md

Lines changed: 1 addition & 0 deletions
@@ -271,6 +271,7 @@ Some servable types introduce additional limitations:
 - sequential request processing (only one request is handled at a time),
 - only a single response can be returned. Parameter `n` is not supported.
 - prompt lookup decoding is not supported
+- `usage` is not supported in streaming mode
 - **[NPU only]** beam_search algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
 - **[NPU only]** models must be exported with INT4 precision and `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device in NPU.
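
In client terms, the new limitation means a legacy (non-continuous-batching) servable now rejects a streaming request that asks for usage reporting. Below is a minimal sketch using the `openai` Python client; the port, model name, and the point at which the error surfaces are illustrative assumptions, not taken from this commit.

```python
# Hypothetical sketch: streaming with usage reporting against a legacy servable.
# Assumes OVMS listens on localhost:8000 and serves a model named
# "meta-llama-legacy"; both names are made up for illustration.
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

try:
    stream = client.chat.completions.create(
        model="meta-llama-legacy",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
        stream_options={"include_usage": True},  # not supported by legacy servables
    )
    for chunk in stream:
        pass  # depending on timing, the error may also surface while iterating
except openai.APIError as err:
    # Expected to fail: the server reports that usage is not supported
    # in streaming mode for legacy servables.
    print("request rejected:", err)
```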

docs/model_server_rest_api_chat.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
 | stop |||| string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If `stream` is set to `false` matched stop string **is not** included in the output by default. If `stream` is set to `true` matched stop string **is** included in the output by default. It can be changed with `include_stop_str_in_output` parameter, but for `stream=true` setting `include_stop_str_in_output=false` is invalid. |
 | stream |||| bool (optional, default: `false`) | If set to true, partial message deltas will be sent to the client. The generation chunks will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format) as they become available, with the stream terminated by a `data: [DONE]` message. [Example Python code](clients_genai.md) |
 | stream_options |||| object (optional) | Options for streaming response. Only set this when you set stream: true |
-| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. |
+| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. **Supported only in Continuous Batching servables.** |
 | messages |||| array (required) | A list of messages comprising the conversation so far. Each object in the list should contain `role` and either `content` or `tool_call` when using tools. [Example Python code](clients_genai.md) |
 | max_tokens |||| integer | The maximum number of tokens that can be generated. If not set, the generation will stop once `EOS` token is generated. If max_tokens_limit is set in graph.pbtxt it will be default value of max_tokens. |
 | ignore_eos |||| bool (default: `false`) | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. |
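
For the behavior described in the updated `stream_options.include_usage` row (available only with Continuous Batching servables), a hedged sketch with the `openai` Python client follows; the host, port, and model name are assumptions.

```python
# Sketch only: stream chat completions from a Continuous Batching servable and
# read token usage from the final chunk. Address and model name are assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama-cb",  # hypothetical Continuous Batching servable
    messages=[{"role": "user", "content": "Write a haiku about batching."}],
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    # Per the table above, the last chunk carries usage and an empty choices array.
    if chunk.usage is not None:
        usage = chunk.usage

if usage:
    print("\nprompt:", usage.prompt_tokens,
          "completion:", usage.completion_tokens,
          "total:", usage.total_tokens)
```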

docs/model_server_rest_api_completions.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ curl http://localhost/v3/completions \
 | stop |||| string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If `stream` is set to `false` matched stop string **is not** included in the output by default. If `stream` is set to `true` matched stop string **is** included in the output by default. It can be changed with `include_stop_str_in_output` parameter, but for `stream=true` setting `include_stop_str_in_output=false` is invalid. |
 | stream |||| bool (optional, default: `false`) | If set to true, partial message deltas will be sent to the client. The generation chunks will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format) as they become available, with the stream terminated by a `data: [DONE]` message. [Example Python code](clients_genai.md) |
 | stream_options |||| object (optional) | Options for streaming response. Only set this when you set stream: true |
-| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. |
+| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. **Supported only in Continuous Batching servables.** |
 | prompt | ⚠️ ||| string or array (required) | The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays. **_Limitations: only single string prompt is currently supported._** |
 | max_tokens |||| integer | The maximum number of tokens that can be generated. If not set, the generation will stop once `EOS` token is generated. If max_tokens_limit is set in graph.pbtxt it will be default value of max_tokens. |
 | ignore_eos |||| bool (default: `false`) | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. If set to `true`. |
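
The same restriction applies to streaming completions. A short sketch against the `/v3/completions` endpoint, again with an assumed address and a hypothetical model name:

```python
# Sketch: streaming /v3/completions with usage reporting (Continuous Batching
# servables only, per the table above). Model name and address are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.completions.create(
    model="meta-llama-cb",          # hypothetical CB servable name
    prompt="OpenVINO Model Server is",
    max_tokens=30,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="")
    if chunk.usage is not None:     # final chunk carries the usage statistics
        print("\n", chunk.usage)
```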

src/llm/language_model/legacy/servable.cpp

Lines changed: 3 additions & 1 deletion
@@ -179,8 +179,10 @@ absl::Status LegacyServable::preparePartialResponse(std::shared_ptr<GenAiServabl
     if (!executionContext->lastStreamerCallbackOutput.empty())
         lastTextChunk = lastTextChunk + executionContext->lastStreamerCallbackOutput;
     executionContext->response = wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingChunk(lastTextChunk, ov::genai::GenerationFinishReason::STOP));
+    // Disabling usage in streaming mode in legacy servable due to the issue with token counting.
     if (executionContext->apiHandler->getStreamOptions().includeUsage)
-        executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());
+        return absl::InvalidArgumentError("Usage is not supported in legacy servable in streaming mode.");
+    // executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());

     executionContext->response += wrapTextInServerSideEventMessage("[DONE]");

src/llm/visual_language_model/legacy/servable.cpp

Lines changed: 3 additions & 1 deletion
@@ -182,8 +182,10 @@ absl::Status VisualLanguageModelLegacyServable::preparePartialResponse(std::shar
     if (!executionContext->lastStreamerCallbackOutput.empty())
         lastTextChunk = lastTextChunk + executionContext->lastStreamerCallbackOutput;
     executionContext->response = wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingChunk(lastTextChunk, ov::genai::GenerationFinishReason::STOP));
+    // Disabling usage in streaming mode in legacy servable due to the issue with token counting.
     if (executionContext->apiHandler->getStreamOptions().includeUsage)
-        executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());
+        return absl::InvalidArgumentError("Usage is not supported in legacy servable in streaming mode.");
+    // executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());

     executionContext->response += wrapTextInServerSideEventMessage("[DONE]");

src/test/llm/llmnode_test.cpp

Lines changed: 60 additions & 26 deletions
@@ -1834,19 +1834,36 @@ TEST_P(LLMFlowHttpTestParameterized, streamChatCompletionsUsage) {

     std::vector<std::string> responses;

-    EXPECT_CALL(*writer, PartialReply(::testing::_))
-        .WillRepeatedly([this, &responses](std::string response) {
-            responses.push_back(response);
-        });
-    EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
-    ASSERT_EQ(
-        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
-        ovms::StatusCode::PARTIAL_END);
-    ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
-    if (params.checkFinishReason) {
-        ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+    if (params.modelName.find("cb") != std::string::npos) {
+        EXPECT_CALL(*writer, PartialReply(::testing::_))
+            .WillRepeatedly([this, &responses](std::string response) {
+                responses.push_back(response);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
+
+        ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
+        if (params.checkFinishReason) {
+            ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+        }
+        // For non-continuous batching servables usage is not supported
+    } else {
+        EXPECT_CALL(*writer, PartialReplyWithStatus(::testing::_, ::testing::_))
+            .WillOnce([this](std::string response, ovms::HTTPStatusCode code) {
+                ASSERT_EQ(response, "{\"error\":\"Mediapipe execution failed. MP status - INVALID_ARGUMENT: CalculatorGraph::Run() failed: \\nCalculator::Process() for node \\\"llmNode1\\\" failed: Usage is not supported in legacy servable in streaming mode.\"}");
+                rapidjson::Document d;
+                rapidjson::ParseResult ok = d.Parse(response.c_str());
+                ASSERT_EQ(ok.Code(), 0);
+                ASSERT_EQ(code, ovms::HTTPStatusCode::BAD_REQUEST);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
     }
 }

@@ -1871,19 +1888,36 @@ TEST_P(LLMFlowHttpTestParameterized, streamCompletionsUsage) {

     std::vector<std::string> responses;

-    EXPECT_CALL(*writer, PartialReply(::testing::_))
-        .WillRepeatedly([this, &responses](std::string response) {
-            responses.push_back(response);
-        });
-    EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
-    ASSERT_EQ(
-        handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
-        ovms::StatusCode::PARTIAL_END);
-    ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
-    if (params.checkFinishReason) {
-        ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+    if (params.modelName.find("cb") != std::string::npos) {
+        EXPECT_CALL(*writer, PartialReply(::testing::_))
+            .WillRepeatedly([this, &responses](std::string response) {
+                responses.push_back(response);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
+
+        ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
+        if (params.checkFinishReason) {
+            ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+        }
+        // For non-continuous batching servables usage is not supported
+    } else {
+        EXPECT_CALL(*writer, PartialReplyWithStatus(::testing::_, ::testing::_))
+            .WillOnce([this](std::string response, ovms::HTTPStatusCode code) {
+                ASSERT_EQ(response, "{\"error\":\"Mediapipe execution failed. MP status - INVALID_ARGUMENT: CalculatorGraph::Run() failed: \\nCalculator::Process() for node \\\"llmNode1\\\" failed: Usage is not supported in legacy servable in streaming mode.\"}");
+                rapidjson::Document d;
+                rapidjson::ParseResult ok = d.Parse(response.c_str());
+                ASSERT_EQ(ok.Code(), 0);
+                ASSERT_EQ(code, ovms::HTTPStatusCode::BAD_REQUEST);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
     }
 }
