
Commit 98842a0

docs: Cleanup KServe GRPC docs (#4081)
1 parent ed0849b commit 98842a0

4 files changed: +30 / -14 lines

docs/frontends/kserve.md

Lines changed: 23 additions & 10 deletions
@@ -1,32 +1,40 @@
 # KServe gRPC frontend
+
 ## Motivation
+
 [KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has gained broad adoption. To let Triton users quickly explore the benefits of Dynamo, Dynamo provides a KServe gRPC frontend.
 
 This documentation assumes readers are familiar with the KServe v2 API. It focuses on explaining the Dynamo parts that work together to support the KServe API and how users may migrate an existing KServe deployment to Dynamo.
 
-## Supporting endpoint
+## Supported Endpoints
+
 * `ModelInfer` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
 * `ModelStreamInfer` endpoint: Triton extension endpoint that provides a bi-directional streaming version of the inference RPC, allowing a sequence of inference requests/responses to be sent over a gRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
 * `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
 * `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
-## Starting the frontend
+
+## Starting the Frontend
+
 To start the KServe frontend, run the command below:
 ```
 python -m dynamo.frontend --kserve-grpc-server
 ```
 
-## Register a backend
+## Registering a Backend
+
 As with the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, the same `register_llm()` API is used. Currently the frontend supports the following combinations of model type and model input:
-* `ModelType:Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
-* `ModelType:Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
-* `ModelType:TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference
+* `ModelType::Completions` and `ModelInput::Text`: Combination for an LLM backend that uses a custom preprocessor
+* `ModelType::Completions` and `ModelInput::Token`: Combination for an LLM backend that uses the Dynamo preprocessor (i.e. the Dynamo vLLM / SGLang / TRTLLM backends)
+* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for a backend that performs generic tensor-based inference
 
 The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for more detail. The last combination is most closely aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backends implement an adaptor for `NvCreateTensorRequest`/`NvCreateTensorResponse`; see the [Tensor section](#tensor) for more detail.
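
For orientation, the sketch below shows what registering the tensor-based combination might look like from a Python worker. The `register_llm()` argument order, the enum spellings, and the worker scaffolding are assumptions made for illustration only; the backend examples referenced in this repository are the authoritative source.

```python
# Hypothetical registration sketch -- names, argument order, and scaffolding are
# assumptions for illustration; see the repo's backend examples for real usage.
import asyncio

from dynamo.llm import ModelInput, ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker


async def generate(request, context=None):
    # Placeholder handler; a real tensor backend would consume NvCreateTensorRequest
    # and yield NvCreateTensorResponse (see echo_tensor_worker.py).
    yield request


@dynamo_worker()
async def worker(runtime: DistributedRuntime):
    component = runtime.namespace("example").component("tensor_worker")
    await component.create_service()
    endpoint = component.endpoint("generate")

    # Advertise this endpoint to the frontend as the
    # ModelType::TensorBased + ModelInput::Tensor combination from the list above.
    await register_llm(ModelInput.Tensor, ModelType.TensorBased, endpoint, "my-tensor-model")

    await endpoint.serve_endpoint(generate)


if __name__ == "__main__":
    asyncio.run(worker())
```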

 ### OpenAI Completions
+
 Most Dynamo features are tailored for LLM inference, so the combinations backed by the OpenAI API can enable those features and are best suited for exploring them. However, this implies a specific conversion between generic tensor-based messages and OpenAI messages, and imposes a specific structure on the KServe request message.
 
-#### Model metadata / config
+#### Model Metadata / Config
+
 The metadata and config endpoints will report the registered backend as shown below; note that this is not the exact response.
 ```
 {
@@ -64,6 +72,7 @@ The metadata and config endpoint will report the registered backend to have the
 ```
 
 #### Inference
+
 On receiving an inference request, the following conversions are performed:
 * `text_input`: the element is expected to contain the user prompt string and will be converted to the `prompt` field in the OpenAI Completion request
 * `streaming`: the element will be converted to the `stream` field in the OpenAI Completion request
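
As a client-side illustration of the request shape described above, the sketch below sends a `ModelInfer` request carrying the `text_input` and `streaming` elements using the standard `tritonclient` gRPC client. The frontend address/port, the model name, and the `text_output` output name are assumptions; adjust them to your deployment.

```python
# Hedged client-side sketch: endpoint, model name, and output name are assumptions;
# "text_input" and "streaming" are the request elements described above.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8787")  # port is an assumption

text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["What is the capital of France?"], dtype=np.object_))

streaming = grpcclient.InferInput("streaming", [1], "BOOL")
streaming.set_data_from_numpy(np.array([False]))

result = client.infer(model_name="my-llm-model", inputs=[text_input, streaming])
print(result.as_numpy("text_output"))  # output element name is an assumption
```

For streamed responses, the `ModelStreamInfer` RPC would be used instead, which `tritonclient` exposes through its streaming client APIs (`start_stream` / `async_stream_infer`).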
@@ -72,15 +81,19 @@ On receiving model response, the following conversion will be performed:
 * `finish_reason`: each element corresponds to one choice in the OpenAI Completion response, and the content will be set to the `finish_reason` of that choice.
 
 ### Tensor
+
 This combination is used when the user is migrating an existing KServe-based backend into the Dynamo ecosystem.
 
-#### Model metadata / config
+#### Model Metadata / Config
+
 When registering the backend, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend cannot make assumptions the way it can for an OpenAI Completions model. There are two methods to provide model metadata:
 * [TensorModelConfig](../../lib/llm/src/protocols/tensor.rs): This is a Dynamo-defined structure for model metadata; the backend can provide the model metadata as shown in this [example](../../lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields will be set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the rest of the fields will be set to their default values.
 * [triton_model_config](../../lib/llm/src/protocols/tensor.rs): Users that already have a Triton model config and require the full config to be returned for client-side logic can set it in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](../../tests/frontend/grpc/echo_tensor_worker.py) for an example.
 
 #### Inference
+
 When receiving an inference request, the backend will receive [NvCreateTensorRequest](../../lib/llm/src/protocols/tensor.rs) and is expected to return [NvCreateTensorResponse](../../lib/llm/src/protocols/tensor.rs), which are Dynamo's mappings of the ModelInferRequest / ModelInferResponse protobuf messages.
 
-## Python binding
-The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](../../lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
+## Python Bindings
+
+The frontend may be started via the Python bindings; this is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](../../lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.
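
For the `triton_model_config` path described above, the sketch below shows one way to build and serialize a Triton `ModelConfig` protobuf message using the definitions bundled with `tritonclient`. The field values are placeholders, and whether Dynamo expects the text or the binary encoding is not stated here, so confirm against echo_tensor_worker.py.

```python
# Sketch of building a Triton ModelConfig for TensorModelConfig::triton_model_config.
# Field values are placeholders; confirm the expected serialization (text vs. binary)
# against echo_tensor_worker.py.
from google.protobuf import text_format
from tritonclient.grpc import model_config_pb2

config = model_config_pb2.ModelConfig(
    name="echo",
    backend="python",
    max_batch_size=0,
    input=[
        model_config_pb2.ModelInput(name="INPUT0", data_type=model_config_pb2.TYPE_FP32, dims=[-1])
    ],
    output=[
        model_config_pb2.ModelOutput(name="OUTPUT0", data_type=model_config_pb2.TYPE_FP32, dims=[-1])
    ],
)

as_text = text_format.MessageToString(config)   # human-readable protobuf text format
as_bytes = config.SerializeToString()           # binary wire format
```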

docs/hidden_toctree.rst

Lines changed: 4 additions & 0 deletions
@@ -76,6 +76,10 @@
 
    benchmarks/kv-router-ab-testing.md
 
+   frontends/kserve.md
+   _sections/frontends.rst
 
 .. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
    have some outdated names/references and need a refresh.
+.. TODO: Add an OpenAI frontend doc and then add top-level Frontends section
+   to index.rst pointing to both OpenAI HTTP and KServe GRPC docs.

docs/index.rst

Lines changed: 0 additions & 1 deletion
@@ -70,7 +70,6 @@ Quickstart
    :hidden:
    :caption: Components
 
-   Frontends <_sections/frontends>
    Backends <_sections/backends>
    Router <router/README>
    Planner <planner/planner_intro>

docs/observability/logging.md

Lines changed: 3 additions & 3 deletions
@@ -77,7 +77,7 @@ export DYN_LOGGING_JSONL="true"
 
 Resulting Log format:
 
-```json
+```
 {"time":"2025-09-02T15:53:31.943377Z","level":"INFO","target":"log","message":"VllmWorker for Qwen/Qwen3-0.6B has been initialized","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":191,"log.target":"main.init"}
 {"time":"2025-09-02T15:53:31.943550Z","level":"INFO","target":"log","message":"Reading Events from tcp://127.0.0.1:26771","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":212,"log.target":"main.init"}
 {"time":"2025-09-02T15:53:31.943636Z","level":"INFO","target":"log","message":"Getting engine runtime configuration metadata from vLLM engine...","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":220,"log.target":"main.init"}
@@ -212,7 +212,7 @@ When viewing the corresponding trace in Grafana, you should be able to see somet
 
 The following shows the JSONL logs from the frontend service for the same request. Note the `trace_id` field (`b672ccf48683b392891c5cb4163d4b51`) that correlates all logs for this request, and the `span_id` field that identifies individual operations:
 
-```json
+```
 {"time":"2025-10-31T20:52:07.707164Z","level":"INFO","file":"/opt/dynamo/lib/runtime/src/logging.rs","line":806,"target":"dynamo_runtime::logging","message":"OpenTelemetry OTLP export enabled","endpoint":"http://tempo.tm.svc.cluster.local:4317","service":"frontend"}
 ...
 {"time":"2025-10-31T20:52:10.707164Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
@@ -249,7 +249,7 @@ All spans and logs for this request will include the `x_request_id` attribute wi
 
 Notice how the `x_request_id` field appears in all log entries, alongside the `trace_id` (`80196f3e3a6fdf06d23bb9ada3788518`) and `span_id`:
 
-```json
+```
 {"time":"2025-10-31T21:06:45.397194Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
 {"time":"2025-10-31T21:06:45.418584Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
 {"time":"2025-10-31T21:06:45.418854Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
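
Since each of these log lines is a standalone JSON object, a single request can be followed end to end by filtering on `trace_id`. A minimal, self-contained sketch is shown below; reading the logs from stdin is an assumption about how you collect them.

```python
# Follow a single request through JSONL logs by filtering on trace_id.
import json
import sys

TRACE_ID = "80196f3e3a6fdf06d23bb9ada3788518"  # trace_id taken from the example above

for raw in sys.stdin:
    raw = raw.strip()
    if not raw:
        continue
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        continue  # skip elided or non-JSON lines
    if record.get("trace_id") == TRACE_ID:
        print(record.get("time"), record.get("level"), record.get("message"))
```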
