Commit 8d70753

feat: add VLM/image endpoints, harden KServe V2 transport, and expand test coverage

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Parent: 9826ff5

31 files changed: +748 / -133 lines

docs/dev/adding-grpc-endpoints.md

Lines changed: 5 additions & 5 deletions

@@ -11,7 +11,7 @@ This guide explains how to add support for new gRPC-based inference protocols in
 
 AIPerf separates **endpoints** (payload formatting and response parsing), **transports** (wire protocol and connection management), and **serializers** (proto-specific byte conversion):
 
-```
+```text
 InferenceClient
 |
 |-- Endpoint (format_payload / parse_response)
@@ -38,7 +38,7 @@ InferenceClient
 |---|---|---|
 | Proto definitions (`.proto`) | Yes | - |
 | Serializer class (dict <-> protobuf bytes) | Yes | - |
-| `payload_converter.py` (dict <-> protobuf objects) | Yes | - |
+| Payload converter (dict <-> protobuf objects) | Yes | - |
 | Endpoint class (`format_payload` / `parse_response`) | Yes | - |
 | `GrpcTransport` (timing, tracing, cancellation) | - | Yes |
 | `GenericGrpcClient` (raw bytes over gRPC) | - | Yes |
@@ -159,9 +159,9 @@ Usage: `aiperf profile --endpoint-type my_v2_endpoint --url grpc://triton:8001 .
 
 Create your `.proto` file and generate stubs:
 
-```
+```text
 src/aiperf/transports/grpc/proto/my_service.proto
-src/aiperf/transports/grpc/proto/my_service_pb2.py (generated)
+src/aiperf/transports/grpc/proto/my_service_pb2.py (generated)
 src/aiperf/transports/grpc/proto/my_service_pb2_grpc.py (generated)
 ```
 
@@ -460,5 +460,5 @@ When adding a new endpoint reusing an existing protocol (Strategy A):
 - [Source: GrpcTransport](../../src/aiperf/transports/grpc/grpc_transport.py) -- Generic transport implementation
 - [Source: GenericGrpcClient](../../src/aiperf/transports/grpc/grpc_client.py) -- Proto-free gRPC client
 - [Source: KServeV2GrpcSerializer](../../src/aiperf/transports/grpc/kserve_v2_serializers.py) -- Reference serializer implementation
-- [Source: payload_converter](../../src/aiperf/transports/grpc/payload_converter.py) -- V2 dict/protobuf conversion
+- [Source: KServeV2GrpcSerializer](../../src/aiperf/transports/grpc/kserve_v2_serializers.py) -- V2 dict/protobuf conversion
 - [Source: InferenceClient](../../src/aiperf/workers/inference_client.py) -- Transport/endpoint wiring
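The layering this doc change describes — the endpoint builds a plain dict, the serializer owns the dict/bytes conversion, and the transport only moves bytes — can be sketched as follows. This is a hypothetical stand-in using JSON bytes where a real AIPerf serializer would use protobuf; the `Serializer` protocol and class names here are illustrative, not the project's actual API:

```python
import json
from typing import Any, Protocol


class Serializer(Protocol):
    """Illustrative dict <-> wire-bytes contract (AIPerf's real one uses protobuf)."""

    def serialize(self, payload: dict[str, Any]) -> bytes: ...
    def deserialize(self, raw: bytes) -> dict[str, Any]: ...


class JsonStandInSerializer:
    """Stand-in serializer: JSON bytes in place of protobuf bytes."""

    def serialize(self, payload: dict[str, Any]) -> bytes:
        return json.dumps(payload).encode("utf-8")

    def deserialize(self, raw: bytes) -> dict[str, Any]:
        return json.loads(raw.decode("utf-8"))


def round_trip(serializer: JsonStandInSerializer, payload: dict[str, Any]) -> dict[str, Any]:
    # The endpoint never sees the wire format: it hands the serializer a dict
    # and receives a dict back, regardless of what travels on the wire.
    wire = serializer.serialize(payload)
    return serializer.deserialize(wire)
```

Because the endpoint only ever touches dicts, swapping this stand-in for a protobuf-backed serializer requires no endpoint changes — which is the point of the table above.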

docs/tutorials/grpc-transport.md

Lines changed: 2 additions & 2 deletions

@@ -79,7 +79,7 @@ GenericGrpcClient -- proto-free, sends/receives raw bytes via
 Triton / TRT-LLM Server
 ```
 
-The endpoint never knows it's running over gRPC. The serializer (e.g., `KServeV2GrpcSerializer`) converts the endpoint's dict payload to protobuf bytes on the way out, and converts response bytes back to a JSON-serialized `TextResponse` on the way in. This means all existing `--extra-inputs` options (like `v2_input_name`, `v2_output_name`) work identically over gRPC.
+The endpoint never knows it's running over gRPC. The serializer (e.g., `KServeV2GrpcSerializer`) converts the endpoint's dict payload to protobuf bytes on the way out, and converts response bytes back to a V2 JSON-format dict on the way in. The transport layer then wraps this dict as a `TextResponse`. This means all existing `--extra-inputs` options (like `v2_input_name`, `v2_output_name`) work identically over gRPC.
 
 The serializer class and gRPC method paths are declared in `plugins.yaml` endpoint metadata, so adding support for a new gRPC protocol requires only a new serializer — no transport changes.
 
@@ -326,5 +326,5 @@ When choosing between HTTP and gRPC for V2 inference:
 - [Source: grpc_transport.py](../../src/aiperf/transports/grpc/grpc_transport.py) - Generic transport implementation
 - [Source: grpc_client.py](../../src/aiperf/transports/grpc/grpc_client.py) - Proto-free gRPC client
 - [Source: kserve_v2_serializers.py](../../src/aiperf/transports/grpc/kserve_v2_serializers.py) - KServe V2 serializer
-- [Source: payload_converter.py](../../src/aiperf/transports/grpc/payload_converter.py) - Dict/protobuf conversion
+- [Source: kserve_v2_serializers.py](../../src/aiperf/transports/grpc/kserve_v2_serializers.py) - V2 dict/protobuf conversion
 - [Source: status_mapping.py](../../src/aiperf/transports/grpc/status_mapping.py) - gRPC to HTTP status mapping

docs/tutorials/kserve.md

Lines changed: 7 additions & 1 deletion

@@ -9,14 +9,16 @@ AIPerf provides first-class support for benchmarking [KServe](https://kserve.git
 
 ## Endpoint Types
 
-AIPerf provides five KServe-specific endpoint types:
+AIPerf provides seven KServe-specific endpoint types:
 
 | Endpoint Type | Protocol | URL Path | Streaming | Token Metrics | Use Case |
 |---|---|---|---|---|---|
 | `kserve_chat` | OpenAI-compatible | `/openai/v1/chat/completions` | Yes | Yes | LLMs via vLLM/TRT-LLM on KServe |
 | `kserve_completions` | OpenAI-compatible | `/openai/v1/completions` | Yes | Yes | Text completions via vLLM/TRT-LLM on KServe |
 | `kserve_embeddings` | OpenAI-compatible | `/openai/v1/embeddings` | No | No | Embedding models on KServe |
 | `kserve_v2_infer` | V2 Open Inference Protocol | `/v2/models/{model_name}/infer` | Yes (gRPC) | Yes | Triton/TRT-LLM tensor inference |
+| `kserve_v2_embeddings` | V2 Open Inference Protocol | `/v2/models/{model_name}/infer` | No | No | Triton/TRT-LLM embedding models |
+| `kserve_v2_rankings` | V2 Open Inference Protocol | `/v2/models/{model_name}/infer` | No | No | Triton/TRT-LLM reranking models |
 | `kserve_v1_predict` | V1 TensorFlow Serving | `/v1/models/{model_name}:predict` | No | No | Legacy TF Serving-style models |
 
 **Token Metrics**: When "Yes", AIPerf computes token-based metrics (input/output token counts, tokens per second). When "No", only request-level metrics (latency, throughput) are available.
@@ -353,6 +355,8 @@ Each KServe endpoint type includes a `health_path` in its metadata for pre-fligh
 | `kserve_completions` | `/openai/v1/models` | Lists available OpenAI-compatible models |
 | `kserve_embeddings` | `/openai/v1/models` | Lists available OpenAI-compatible models |
 | `kserve_v2_infer` | `/v2/models/{model_name}/ready` | V2 model readiness check |
+| `kserve_v2_embeddings` | `/v2/models/{model_name}/ready` | V2 model readiness check |
+| `kserve_v2_rankings` | `/v2/models/{model_name}/ready` | V2 model readiness check |
 | `kserve_v1_predict` | `/v1/models/{model_name}` | V1 model metadata/status |
 
 Health paths that contain `{model_name}` are resolved using the same template substitution as endpoint paths.
@@ -368,6 +372,8 @@ Health paths that contain `{model_name}` are resolved using the same template su
 | KServe + vLLM (embeddings) | `kserve_embeddings` | Vector embeddings |
 | KServe + Triton (text) | `kserve_v2_infer` | Wraps text as BYTES tensors |
 | KServe + TRT-LLM via Triton | `kserve_v2_infer` | Standard Triton text pipeline |
+| KServe + Triton (embeddings) | `kserve_v2_embeddings` | V2 BYTES tensor embedding models |
+| KServe + Triton (reranking) | `kserve_v2_rankings` | V2 BYTES tensor reranking models |
 | KServe + TF Serving model | `kserve_v1_predict` | Legacy instance-based format |
 | KServe + custom model server | `kserve_v1_predict` or `template` | Depends on API format |
 | Non-KServe vLLM/TRT-LLM | `chat` or `completions` | Use standard endpoints for direct deployments |
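For reference, "wraps text as BYTES tensors" means the V2 `/infer` request body carries the prompt inside a tensor descriptor. A rough sketch of such a body — the tensor name `text_input` is an example only; real Triton models declare their own input names, which is why AIPerf exposes options like `v2_input_name`:

```python
import json

# Sketch of a V2 Open Inference Protocol request body that wraps a text
# prompt as a single-element BYTES tensor.
payload = {
    "inputs": [
        {
            "name": "text_input",   # example name; model-specific in practice
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is deep learning?"],
        }
    ]
}

body = json.dumps(payload)
```

The response mirrors this shape, with an `outputs` list of tensors, which is what the endpoint classes below parse.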

src/aiperf/endpoints/base_rankings_endpoint.py

Lines changed: 2 additions & 1 deletion

@@ -1,6 +1,7 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+from __future__ import annotations
 
 from abc import abstractmethod
 from typing import Any
@@ -98,7 +99,7 @@ def format_payload(self, request_info: RequestInfo) -> dict[str, Any]:
         turn = request_info.turns[0]
         model_endpoint = request_info.model_endpoint
 
-        if turn.max_tokens:
+        if turn.max_tokens is not None:
             self.warning("Max_tokens is provided but is not supported for rankings.")
 
         query_text, passage_texts = self._extract_query_and_passages(turn)

src/aiperf/endpoints/kserve_v2_embeddings.py

Lines changed: 4 additions & 8 deletions

@@ -93,14 +93,10 @@ def parse_response(
         if not outputs:
             return None
 
-        # Find output tensor by name, fallback to first
-        output = None
-        for o in outputs:
-            if o.get("name") == self._output_name:
-                output = o
-                break
-        if output is None:
-            output = outputs[0]
+        # Find the output tensor with the matching name
+        output = next(
+            (o for o in outputs if o.get("name") == self._output_name), outputs[0]
+        )
 
         data = output.get("data")
         if not data:
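The `next()` call replaces the explicit search loop with a generator expression plus a default: it yields the first tensor whose name matches, or falls back to `outputs[0]`. The behavior is identical; extracted here as a standalone helper (our name, for illustration):

```python
def pick_output(outputs: list[dict], output_name: str) -> dict:
    # First tensor with a matching name, else the first tensor overall.
    return next(
        (o for o in outputs if o.get("name") == output_name), outputs[0]
    )


outputs = [
    {"name": "logits", "data": [1]},
    {"name": "embedding", "data": [2]},
]
```

The second argument to `next()` is its default, returned when the generator is exhausted without a match — which is what makes the fallback a one-liner.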

src/aiperf/endpoints/kserve_v2_images.py

Lines changed: 4 additions & 8 deletions

@@ -124,14 +124,10 @@ def parse_response(
         if not outputs:
             return None
 
-        # Find output tensor by name, fallback to first
-        output = None
-        for o in outputs:
-            if o.get("name") == self._output_name:
-                output = o
-                break
-        if output is None:
-            output = outputs[0]
+        # Find the output tensor with the matching name
+        output = next(
+            (o for o in outputs if o.get("name") == self._output_name), outputs[0]
+        )
 
         data = output.get("data")
         if not isinstance(data, list) or not data:

src/aiperf/endpoints/kserve_v2_infer.py

Lines changed: 62 additions & 29 deletions

@@ -13,6 +13,67 @@
 from aiperf.endpoints.base_endpoint import BaseEndpoint
 
 
+def _extract_v2_text(output: dict[str, Any]) -> str | None:
+    """Extract text from a V2 BYTES output tensor.
+
+    Args:
+        output: V2 output tensor dict with ``data`` key.
+
+    Returns:
+        First data element as string, or None if empty.
+    """
+    data = output.get("data")
+    if isinstance(data, list) and len(data) > 0 and data[0] is not None:
+        return str(data[0])
+    return None
+
+
+def parse_v2_text_response(
+    endpoint: BaseEndpoint,
+    response: InferenceServerResponse,
+    output_name: str,
+) -> ParsedResponse | None:
+    """Parse V2 inference response, extracting text from BYTES output tensor.
+
+    Shared by KServeV2InferEndpoint and KServeV2VLMEndpoint since both
+    produce text output in the same tensor format.
+
+    Args:
+        endpoint: Endpoint instance (for make_text_response_data).
+        response: Raw response from inference server.
+        output_name: Expected output tensor name.
+
+    Returns:
+        Parsed response with extracted text content, or None if no content.
+    """
+    json_obj = response.get_json()
+    if not json_obj:
+        return None
+
+    outputs = json_obj.get("outputs")
+    if not outputs:
+        return None
+
+    for output in outputs:
+        if output.get("name") == output_name:
+            text = _extract_v2_text(output)
+            if text is not None:
+                return ParsedResponse(
+                    perf_ns=response.perf_ns,
+                    data=endpoint.make_text_response_data(text),
+                )
+
+    for output in outputs:
+        text = _extract_v2_text(output)
+        if text is not None:
+            return ParsedResponse(
+                perf_ns=response.perf_ns,
+                data=endpoint.make_text_response_data(text),
+            )
+
+    return None
+
+
 class KServeV2InferEndpoint(BaseEndpoint):
     """KServe V2 Open Inference Protocol endpoint for Triton/TRT-LLM.
 
@@ -91,32 +152,4 @@ def parse_response(
         Returns:
             Parsed response with extracted text content, or None if no content
         """
-        json_obj = response.get_json()
-        if not json_obj:
-            return None
-
-        outputs = json_obj.get("outputs")
-        if not outputs:
-            return None
-
-        for output in outputs:
-            if output.get("name") == self._output_name:
-                data = output.get("data")
-                if isinstance(data, list) and len(data) > 0 and data[0] is not None:
-                    text = str(data[0])
-                    return ParsedResponse(
-                        perf_ns=response.perf_ns,
-                        data=self.make_text_response_data(text),
-                    )
-
-        # Fallback: try first output with data
-        for output in outputs:
-            data = output.get("data")
-            if isinstance(data, list) and len(data) > 0 and data[0] is not None:
-                text = str(data[0])
-                return ParsedResponse(
-                    perf_ns=response.perf_ns,
-                    data=self.make_text_response_data(text),
-                )
-
-        return None
+        return parse_v2_text_response(self, response, self._output_name)

src/aiperf/endpoints/kserve_v2_rankings.py

Lines changed: 11 additions & 8 deletions

@@ -114,16 +114,19 @@ def extract_rankings(self, json_obj: dict[str, Any]) -> list[dict[str, Any]]:
         if not outputs:
             return []
 
-        output = None
-        for o in outputs:
-            if o.get("name") == self._output_name:
-                output = o
-                break
-        if output is None:
-            output = outputs[0]
+        # Find the output tensor with the matching name
+        output = next(
+            (o for o in outputs if o.get("name") == self._output_name), outputs[0]
+        )
 
         data = output.get("data")
         if not data:
             return []
 
-        return [{"index": i, "score": float(s)} for i, s in enumerate(data)]
+        results = []
+        for i, s in enumerate(data):
+            try:
+                results.append({"index": i, "score": float(s)})
+            except (ValueError, TypeError):
+                self.warning(f"Skipping non-numeric score at index {i}: {s!r}")
+        return results
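The new loop trades the comprehension for per-element error handling: one malformed score no longer raises out of `extract_rankings` and discards the whole response. A standalone sketch of the same logic, with the warning call swapped for a returned list of skipped indices so it runs without the endpoint class:

```python
def extract_scores(data: list) -> tuple[list[dict], list[int]]:
    results: list[dict] = []
    skipped: list[int] = []
    for i, s in enumerate(data):
        try:
            # float() raises ValueError for bad strings, TypeError for
            # non-numeric types like None or dicts.
            results.append({"index": i, "score": float(s)})
        except (ValueError, TypeError):
            skipped.append(i)  # the real code logs a warning here instead
    return results, skipped
```

With the old comprehension, `float("oops")` anywhere in `data` would have aborted the entire parse; now only that element is dropped.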

src/aiperf/endpoints/kserve_v2_vlm.py

Lines changed: 2 additions & 29 deletions

@@ -11,6 +11,7 @@
     RequestInfo,
 )
 from aiperf.endpoints.base_endpoint import BaseEndpoint
+from aiperf.endpoints.kserve_v2_infer import parse_v2_text_response
 
 
 class KServeV2VLMEndpoint(BaseEndpoint):
@@ -108,32 +109,4 @@ def parse_response(
         Returns:
             Parsed response with extracted text content, or None if no content
         """
-        json_obj = response.get_json()
-        if not json_obj:
-            return None
-
-        outputs = json_obj.get("outputs")
-        if not outputs:
-            return None
-
-        for output in outputs:
-            if output.get("name") == self._output_name:
-                data = output.get("data")
-                if isinstance(data, list) and len(data) > 0 and data[0] is not None:
-                    text = str(data[0])
-                    return ParsedResponse(
-                        perf_ns=response.perf_ns,
-                        data=self.make_text_response_data(text),
-                    )
-
-        # Fallback: try first output with data
-        for output in outputs:
-            data = output.get("data")
-            if isinstance(data, list) and len(data) > 0 and data[0] is not None:
-                text = str(data[0])
-                return ParsedResponse(
-                    perf_ns=response.perf_ns,
-                    data=self.make_text_response_data(text),
-                )
-
-        return None
+        return parse_v2_text_response(self, response, self._output_name)

src/aiperf/transports/base_transports.py

Lines changed: 4 additions & 4 deletions

@@ -6,7 +6,7 @@
 import importlib.metadata as importlib_metadata
 from abc import ABC, abstractmethod
 from collections.abc import Awaitable, Callable
-from typing import Protocol, runtime_checkable
+from typing import Any, Protocol, runtime_checkable
 from urllib.parse import parse_qs, urlencode, urlparse, urlunparse
 
 from aiperf.common.mixins import AIPerfLifecycleMixin
@@ -39,7 +39,7 @@
 class TransportProtocol(AIPerfLifecycleProtocol, Protocol):
     """Protocol for a transport that sends requests to an inference server."""
 
-    def __init__(self, **kwargs) -> None: ...
+    def __init__(self, **kwargs: Any) -> None: ...
 
     def get_transport_headers(self, request_info: RequestInfo) -> dict[str, str]: ...
 
@@ -60,7 +60,7 @@ class BaseTransport(AIPerfLifecycleMixin, ABC):
     Transports handle the protocol layer (HTTP, gRPC, etc.).
     """
 
-    def __init__(self, model_endpoint: ModelEndpointInfo, **kwargs) -> None:
+    def __init__(self, model_endpoint: ModelEndpointInfo, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.model_endpoint: ModelEndpointInfo = model_endpoint
         self.user_agent: str = f"aiperf/{importlib_metadata.version('aiperf')}"
@@ -162,7 +162,7 @@ async def send_request(
         Args:
             request_info: Request context and metadata
             payload: Request payload (format depends on transport)
-            first_token_callback: Optional callback fired on first SSE message with ttft_ns
+            first_token_callback: Optional callback fired on first response with ttft_ns
 
         Returns:
             Record containing responses, timing, and any errors
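Annotating `**kwargs` as `Any` is what strict type checkers such as mypy in `--strict` mode expect — a bare `**kwargs` is implicitly untyped and flagged as an error. A minimal sketch of the pattern with an illustrative stand-in class (`DummyTransport` is ours, not part of AIPerf):

```python
from typing import Any, Protocol


class TransportLike(Protocol):
    """Illustrative stand-in for the constructor shape TransportProtocol declares."""

    def __init__(self, **kwargs: Any) -> None: ...


class DummyTransport:
    def __init__(self, **kwargs: Any) -> None:
        # Each value in kwargs is typed Any; kwargs itself is dict[str, Any].
        self.options = kwargs


t = DummyTransport(timeout_s=30)
```

The change is purely for the type checker: runtime behavior of `**kwargs` is identical with or without the annotation.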
