
Commit 02f36c5

No usage in streaming mode for legacy (openvinotoolkit#3428) (openvinotoolkit#3449)
1 parent e090ff1 commit 02f36c5


8 files changed: +71, -32 lines


demos/README.md

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ OpenVINO Model Server demos have been created to showcase the usage of the model
 ### Check Out New Generative AI Demos
 | Demo | Description |
 |---|---|
-|[AI Agents with MCP servers and serving language models](./continuous_batching/agentic_ai/README.md)|OpenAI agents with MPC servers and serving LLM models|
+|[AI Agents with MCP servers and serving language models](./continuous_batching/agentic_ai/README.md)|OpenAI agents with MCP servers and serving LLM models|
 |[LLM Text Generation with continuous batching](continuous_batching/README.md)|Generate text with LLM models and continuous batching pipeline|
 |[VLM Text Generation with continuous batching](continuous_batching/vlm/README.md)|Generate text with VLM models and continuous batching pipeline|
 |[OpenAI API text embeddings ](embeddings/README.md)|Get text embeddings via endpoint compatible with OpenAI API|

demos/continuous_batching/agentic_ai/README.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ docker build . -t mcp_weather_server
 docker run -d -v $(pwd)/src/mcp_weather_server:/mcp_weather_server -p 8080:8080 mcp_weather_server bash -c ". .venv/bin/activate ; python /mcp_weather_server/server-see.py"
 ```

-> **Note:** On Windows the MCP server will be demonstrated as an instance with stdip interface inside the agent application
+> **Note:** On Windows the MCP server will be demonstrated as an instance with stdio interface inside the agent application

 ## Start the agent

docs/llm/reference.md

Lines changed: 1 addition & 0 deletions
@@ -271,6 +271,7 @@ Some servable types introduce additional limitations:
 - sequential request processing (only one request is handled at a time),
 - only a single response can be returned. Parameter `n` is not supported.
 - prompt lookup decoding is not supported
+- `usage` is not supported in streaming mode
 - **[NPU only]** beam_search algorithm is not supported with NPU. Greedy search and multinomial algorithms are supported.
 - **[NPU only]** models must be exported with INT4 precision and `--sym --ratio 1.0 --group-size -1` params. This is enforced in the export_model.py script when the target_device in NPU.
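
In client terms, the new limitation means a legacy (non-continuous-batching) servable now rejects a streaming request that asks for usage reporting. Below is a minimal sketch using the `openai` Python client; the port, model name, and the point at which the error surfaces are illustrative assumptions, not taken from this commit.

```python
# Hypothetical sketch: streaming with usage reporting against a legacy servable.
# Assumes OVMS listens on localhost:8000 and serves a model named
# "meta-llama-legacy"; both names are made up for illustration.
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

try:
    stream = client.chat.completions.create(
        model="meta-llama-legacy",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
        stream_options={"include_usage": True},  # not supported by legacy servables
    )
    for chunk in stream:
        pass  # depending on timing, the error may also surface while iterating
except openai.APIError as err:
    # Expected to fail: the server reports that usage is not supported
    # in streaming mode for legacy servables.
    print("request rejected:", err)
```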

docs/model_server_rest_api_chat.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
 | stop |||| string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If `stream` is set to `false` matched stop string **is not** included in the output by default. If `stream` is set to `true` matched stop string **is** included in the output by default. It can be changed with `include_stop_str_in_output` parameter, but for `stream=true` setting `include_stop_str_in_output=false` is invalid. |
 | stream |||| bool (optional, default: `false`) | If set to true, partial message deltas will be sent to the client. The generation chunks will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format) as they become available, with the stream terminated by a `data: [DONE]` message. [Example Python code](clients_genai.md) |
 | stream_options |||| object (optional) | Options for streaming response. Only set this when you set stream: true |
-| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. |
+| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. **Supported only in Continuous Batching servables.** |
 | messages |||| array (required) | A list of messages comprising the conversation so far. Each object in the list should contain `role` and either `content` or `tool_call` when using tools. [Example Python code](clients_genai.md) |
 | max_tokens |||| integer | The maximum number of tokens that can be generated. If not set, the generation will stop once `EOS` token is generated. If max_tokens_limit is set in graph.pbtxt it will be default value of max_tokens. |
 | ignore_eos |||| bool (default: `false`) | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. |
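
For the behavior described in the updated `stream_options.include_usage` row (available only with Continuous Batching servables), a hedged sketch with the `openai` Python client follows; the host, port, and model name are assumptions.

```python
# Sketch only: stream chat completions from a Continuous Batching servable and
# read token usage from the final chunk. Address and model name are assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama-cb",  # hypothetical Continuous Batching servable
    messages=[{"role": "user", "content": "Write a haiku about batching."}],
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    # Per the table above, the last chunk carries usage and an empty choices array.
    if chunk.usage is not None:
        usage = chunk.usage

if usage:
    print("\nprompt:", usage.prompt_tokens,
          "completion:", usage.completion_tokens,
          "total:", usage.total_tokens)
```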

docs/model_server_rest_api_completions.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ curl http://localhost/v3/completions \
 | stop |||| string/array of strings (optional) | Up to 4 sequences where the API will stop generating further tokens. If `stream` is set to `false` matched stop string **is not** included in the output by default. If `stream` is set to `true` matched stop string **is** included in the output by default. It can be changed with `include_stop_str_in_output` parameter, but for `stream=true` setting `include_stop_str_in_output=false` is invalid. |
 | stream |||| bool (optional, default: `false`) | If set to true, partial message deltas will be sent to the client. The generation chunks will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#event_stream_format) as they become available, with the stream terminated by a `data: [DONE]` message. [Example Python code](clients_genai.md) |
 | stream_options |||| object (optional) | Options for streaming response. Only set this when you set stream: true |
-| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. |
+| stream_options.include_usage |||| bool (optional) | Streaming option. If set, an additional chunk will be streamed before the data: [DONE] message. The usage field in this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value. **Supported only in Continuous Batching servables.** |
 | prompt | ⚠️ ||| string or array (required) | The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays. **_Limitations: only single string prompt is currently supported._** |
 | max_tokens |||| integer | The maximum number of tokens that can be generated. If not set, the generation will stop once `EOS` token is generated. If max_tokens_limit is set in graph.pbtxt it will be default value of max_tokens. |
 | ignore_eos |||| bool (default: `false`) | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. If set to `true`. |
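
The same restriction applies to streaming completions. A short sketch against the `/v3/completions` endpoint, again with an assumed address and a hypothetical model name:

```python
# Sketch: streaming /v3/completions with usage reporting (Continuous Batching
# servables only, per the table above). Model name and address are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.completions.create(
    model="meta-llama-cb",          # hypothetical CB servable name
    prompt="OpenVINO Model Server is",
    max_tokens=30,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="")
    if chunk.usage is not None:     # final chunk carries the usage statistics
        print("\n", chunk.usage)
```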

src/llm/language_model/legacy/servable.cpp

Lines changed: 3 additions & 1 deletion
@@ -179,8 +179,10 @@ absl::Status LegacyServable::preparePartialResponse(std::shared_ptr<GenAiServabl
     if (!executionContext->lastStreamerCallbackOutput.empty())
         lastTextChunk = lastTextChunk + executionContext->lastStreamerCallbackOutput;
     executionContext->response = wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingChunk(lastTextChunk, ov::genai::GenerationFinishReason::STOP));
+    // Disabling usage in streaming mode in legacy servable due to the issue with token counting.
     if (executionContext->apiHandler->getStreamOptions().includeUsage)
-        executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());
+        return absl::InvalidArgumentError("Usage is not supported in legacy servable in streaming mode.");
+    // executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());

     executionContext->response += wrapTextInServerSideEventMessage("[DONE]");

src/llm/visual_language_model/legacy/servable.cpp

Lines changed: 3 additions & 1 deletion
@@ -182,8 +182,10 @@ absl::Status VisualLanguageModelLegacyServable::preparePartialResponse(std::shar
     if (!executionContext->lastStreamerCallbackOutput.empty())
         lastTextChunk = lastTextChunk + executionContext->lastStreamerCallbackOutput;
     executionContext->response = wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingChunk(lastTextChunk, ov::genai::GenerationFinishReason::STOP));
+    // Disabling usage in streaming mode in legacy servable due to the issue with token counting.
     if (executionContext->apiHandler->getStreamOptions().includeUsage)
-        executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());
+        return absl::InvalidArgumentError("Usage is not supported in legacy servable in streaming mode.");
+    // executionContext->response += wrapTextInServerSideEventMessage(executionContext->apiHandler->serializeStreamingUsageChunk());

     executionContext->response += wrapTextInServerSideEventMessage("[DONE]");

src/test/llm/llmnode_test.cpp

Lines changed: 60 additions & 26 deletions
@@ -1834,19 +1834,36 @@ TEST_P(LLMFlowHttpTestParameterized, streamChatCompletionsUsage) {

     std::vector<std::string> responses;

-    EXPECT_CALL(*writer, PartialReply(::testing::_))
-        .WillRepeatedly([this, &responses](std::string response) {
-            responses.push_back(response);
-        });
-    EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
-    ASSERT_EQ(
-        handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
-        ovms::StatusCode::PARTIAL_END);
-    ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
-    if (params.checkFinishReason) {
-        ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+    if (params.modelName.find("cb") != std::string::npos) {
+        EXPECT_CALL(*writer, PartialReply(::testing::_))
+            .WillRepeatedly([this, &responses](std::string response) {
+                responses.push_back(response);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
+
+        ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
+        if (params.checkFinishReason) {
+            ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+        }
+        // For non-continuous batching servables usage is not supported
+    } else {
+        EXPECT_CALL(*writer, PartialReplyWithStatus(::testing::_, ::testing::_))
+            .WillOnce([this](std::string response, ovms::HTTPStatusCode code) {
+                ASSERT_EQ(response, "{\"error\":\"Mediapipe execution failed. MP status - INVALID_ARGUMENT: CalculatorGraph::Run() failed: \\nCalculator::Process() for node \\\"llmNode1\\\" failed: Usage is not supported in legacy servable in streaming mode.\"}");
+                rapidjson::Document d;
+                rapidjson::ParseResult ok = d.Parse(response.c_str());
+                ASSERT_EQ(ok.Code(), 0);
+                ASSERT_EQ(code, ovms::HTTPStatusCode::BAD_REQUEST);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointChatCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
     }
 }

@@ -1871,19 +1888,36 @@ TEST_P(LLMFlowHttpTestParameterized, streamCompletionsUsage) {

     std::vector<std::string> responses;

-    EXPECT_CALL(*writer, PartialReply(::testing::_))
-        .WillRepeatedly([this, &responses](std::string response) {
-            responses.push_back(response);
-        });
-    EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
-    ASSERT_EQ(
-        handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
-        ovms::StatusCode::PARTIAL_END);
-    ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
-    ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
-    if (params.checkFinishReason) {
-        ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+    if (params.modelName.find("cb") != std::string::npos) {
+        EXPECT_CALL(*writer, PartialReply(::testing::_))
+            .WillRepeatedly([this, &responses](std::string response) {
+                responses.push_back(response);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
+
+        ASSERT_TRUE(responses.back().find("\"completion_tokens\":5") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"prompt_tokens\"") != std::string::npos);
+        ASSERT_TRUE(responses.back().find("\"total_tokens\"") != std::string::npos);
+        if (params.checkFinishReason) {
+            ASSERT_TRUE(responses.back().find("\"finish_reason\":\"length\"") != std::string::npos);
+        }
+        // For non-continuous batching servables usage is not supported
+    } else {
+        EXPECT_CALL(*writer, PartialReplyWithStatus(::testing::_, ::testing::_))
+            .WillOnce([this](std::string response, ovms::HTTPStatusCode code) {
+                ASSERT_EQ(response, "{\"error\":\"Mediapipe execution failed. MP status - INVALID_ARGUMENT: CalculatorGraph::Run() failed: \\nCalculator::Process() for node \\\"llmNode1\\\" failed: Usage is not supported in legacy servable in streaming mode.\"}");
+                rapidjson::Document d;
+                rapidjson::ParseResult ok = d.Parse(response.c_str());
+                ASSERT_EQ(ok.Code(), 0);
+                ASSERT_EQ(code, ovms::HTTPStatusCode::BAD_REQUEST);
+            });
+        EXPECT_CALL(*writer, PartialReplyEnd()).Times(1);
+        ASSERT_EQ(
+            handler->dispatchToProcessor(endpointCompletions, requestBody, &response, comp, responseComponents, writer, multiPartParser),
+            ovms::StatusCode::PARTIAL_END);
     }
 }
