Conversation
Add 5 new endpoint types for benchmarking KServe deployments:
- kserve_chat, kserve_completions, kserve_embeddings (reuse OpenAI
classes with /openai prefix)
- kserve_v2_infer (V2 Open Inference Protocol for Triton/TRT-LLM)
- kserve_v1_predict (V1 TensorFlow Serving style)
Includes {model_name} path template substitution in transport layer,
health_path metadata field, and comprehensive documentation.
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Add a gRPC transport that enables benchmarking KServe V2 inference servers over gRPC with HTTP/2 multiplexing and server-side streaming. Users switch from HTTP to gRPC by using --url grpc://host:port with the same kserve_v2_infer endpoint.

Transport implementation:
- GrpcTransport (BaseTransport) with @on_init/@on_stop lifecycle
- GrpcClient wrapping grpc.aio for ModelInfer and ModelStreamInfer RPCs
- Payload converter: dict <-> protobuf for transparent endpoint compat
- GrpcTraceData with gRPC status code/message fields
- Status code mapping (gRPC -> HTTP) for consistent metrics
- TLS support via grpcs:// scheme
- Request cancellation via asyncio.wait_for

Framework changes:
- Broaden FirstTokenCallback from SSEMessage to InferenceServerResponse
- Enable kserve_v2_infer supports_streaming for gRPC ModelStreamInfer
- Register grpc transport in plugins.yaml
- Add grpcio, protobuf deps; grpcio-tools dev dep
- Exclude generated proto stubs from ruff

Documentation:
- New grpc-transport.md tutorial (architecture, usage, trace data, troubleshooting)
- Updated kserve.md with Section 8 gRPC transport and streaming clarifications
- Updated patterns.md with Transport and Trace Data developer patterns

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Decouple GrpcTransport and GenericGrpcClient from KServe V2 protobuf stubs. All proto knowledge is now isolated in pluggable serializer classes loaded dynamically from endpoint metadata in plugins.yaml.

- Add GrpcSerializerProtocol (runtime_checkable) for type-safe interface
- Add KServeV2GrpcSerializer wrapping existing payload_converter
- Add StreamChunk dataclass as protocol-agnostic streaming container
- Add GrpcEndpointConfig schema for plugins.yaml grpc metadata block
- Rewrite GenericGrpcClient to operate on raw bytes (identity passthrough)
- GrpcTransport loads serializer/method paths from endpoint metadata
- Update and rewrite gRPC documentation for new architecture

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…gRPC transport

Replace the single GenericGrpcClient with per-target channel pooling and full connection reuse strategy support (POOLED, NEVER, STICKY_USER_SESSIONS), mirroring the aiohttp transport pattern.

Key changes:
- GrpcChannelLeaseManager for sticky-user-sessions strategy
- Per-target channel pool with lazy creation for POOLED
- Per-request channel create/close for NEVER
- GrpcUnaryResult/GrpcStreamCall for trailing metadata capture
- Two-stage cancellation (channel-ready + cancel timer)
- Mixed-scheme validation at init
- Trace data enrichment (request/response headers, status codes)
- Absorb payload_converter.py into kserve_v2_serializers.py
- raw_output_contents support for Triton responses
- Triton integration test marker and Makefile target

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Source fixes:
- add type hint to _consume_stream closure
- wrap debug calls in lambdas
- fix timeout=0 treated as falsy in grpc_client
- capture lease_manager ref to prevent race
- cancel stream_call on timeout
- remove metadata() classmethod from both transports (belongs in plugins.yaml only)

Adds 75 new unit tests covering GenericGrpcClient, metadata/target parsing, error paths, streaming edge cases, additional payload datatypes, raw_output_contents, and trace data export registration.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ation

Add kserve_v2_embeddings and kserve_v2_rankings endpoints with plugin registration.

Fix review issues:
- falsy data[0] check in V2 infer parse_response
- inconsistent max_tokens None checks
- DRY violation in rankings format_payload (extract _extract_query_and_passages helper)
- incomplete gRPC status code mapping
- dead guard in gRPC client close

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Add two new KServe V2 endpoints for vision-language models and diffusion image generation, both using the same V2 /infer tensor protocol with different tensor layouts. Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Try out this PR

Quick install:

    pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f9f922235e856dc3e46ad92e0197c6e16008dabe

Recommended with virtual environment (using uv):

    uv venv --python 3.12 && source .venv/bin/activate
    uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f9f922235e856dc3e46ad92e0197c6e16008dabe
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

Adds end-to-end gRPC support for KServe (V1/V2): new endpoint implementations, a pluggable gRPC transport and client, KServe V2 protobuf/stubs and serializers, schema/plugin registry updates, docs/tutorials, Makefile/pytest wiring, many tests, mock-server and test-harness extensions, and tooling to generate stubs.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)

No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/dev/adding-grpc-endpoints.md`:
- Around line 455-464: The Related section contains a duplicate link to the same
source file (kserve_v2_serializers.py) with identical anchor text "Source:
KServeV2GrpcSerializer"; update docs/dev/adding-grpc-endpoints.md to either
remove the redundant entry or merge them into one entry and, if both references
are needed, give distinct anchor text (e.g., "Source: KServeV2GrpcSerializer —
Reference serializer implementation" and "Source: KServeV2GrpcSerializer — V2
dict/protobuf conversion") so the two links to kserve_v2_serializers.py are no
longer identical.
In `@docs/tutorials/grpc-transport.md`:
- Around line 328-329: Remove the duplicate link to kserve_v2_serializers.py in
the docs: keep one entry for the KServe V2 serializer and either delete the
second line or update it to point to the intended different source file
(whichever was meant to document "V2 dict/protobuf conversion"); locate the
duplicated entries referencing kserve_v2_serializers.py and ensure only the
correct link and description remain.
In `@docs/tutorials/kserve.md`:
- Around line 12-23: Update the KServe endpoint types table to include the two
missing V2 entries: add a row for kserve_v2_vlm with Protocol "V2 Open Inference
Protocol", URL Path "/v2/models/{model_name}/infer", Streaming "Yes (gRPC)",
Token Metrics "Yes", and Use Case "Vision-language models on Triton", and add a
row for kserve_v2_images with Protocol "V2 Open Inference Protocol", URL Path
"/v2/models/{model_name}/infer", Streaming "No", Token Metrics "No", and Use
Case "Image generation models (Stable Diffusion, SDXL, Flux) on Triton" so the
table includes kserve_v2_vlm and kserve_v2_images alongside the existing
endpoint rows.
In `@pyproject.toml`:
- Line 32: Update the protobuf dependency constraint in pyproject.toml to avoid
the yanked 5.29.0 release: locate the dependency entry "protobuf>=5.29.0,<6" and
change it to a non-yanked minimum (for example "protobuf>=5.30.0,<6"); keep the
existing "grpcio~=1.71.0" entry untouched but ensure the new protobuf range is
compatible with it.
In `@src/aiperf/transports/grpc/kserve_v2_serializers.py`:
- Around line 206-235: The code currently skips zero dimensions when computing
num_elements, causing shapes like [0] to yield a non-zero count and trigger
struct.error; change the computation in the block that builds num_elements
(which currently iterates "for dim in shape") to multiply all dimensions
directly (do not skip zeros) so zero in shape yields num_elements == 0, and then
add a guard: if num_elements == 0 return an empty list immediately before
calling struct.unpack_from; keep the existing fmt/elem_size logic and use the
same symbols (shape, num_elements, datatype, fmt_map, raw_bytes, elem_size,
struct.unpack_from) so the rest of the function remains unchanged.
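A standalone sketch of the fixed computation (the function name and FP32-only format are simplified assumptions, not the project's actual serializer):

```python
import math
import struct

def deserialize_fp32(raw_bytes: bytes, shape: list[int]) -> list[float]:
    # math.prod multiplies every dimension, so a 0 anywhere yields 0;
    # skipping zero dims would wrongly report a non-empty tensor.
    num_elements = math.prod(shape)
    if num_elements == 0:
        # Guard before unpacking: struct.unpack_from raises struct.error
        # when the buffer is shorter than the requested format.
        return []
    return list(struct.unpack_from(f"<{num_elements}f", raw_bytes, 0))

print(deserialize_fp32(b"", [0]))     # []
print(deserialize_fp32(b"", [4, 0]))  # []
print(deserialize_fp32(struct.pack("<3f", 1.0, 2.0, 3.0), [3]))  # [1.0, 2.0, 3.0]
```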
In `@tests/aiperf_mock_server/app.py`:
- Around line 812-814: The code can raise TypeError when JSON includes a "model"
key because KServeV2InferRequest is called with both model=model_name and **raw;
before constructing the request (around the body = await request.body(); raw =
orjson.loads(body) code), defensively remove any "model" key from raw (e.g.,
raw.pop("model", None)) or otherwise ensure raw doesn't contain "model" so that
KServeV2InferRequest(model=model_name, **raw) cannot receive duplicate values;
update the call site to use the cleaned raw when creating KServeV2InferRequest.
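The duplicate-kwarg failure and fix can be shown with a stand-in model class (the dataclass below is illustrative, not the real KServeV2InferRequest):

```python
from dataclasses import dataclass, field

@dataclass
class InferRequest:
    """Stand-in for KServeV2InferRequest; fields are illustrative."""
    model: str
    inputs: list = field(default_factory=list)

def build_request(model_name: str, raw: dict) -> InferRequest:
    raw = dict(raw)         # copy so the caller's payload is untouched
    raw.pop("model", None)  # the path parameter wins over a body-level "model"
    return InferRequest(model=model_name, **raw)

# Without the pop, a body containing "model" would raise TypeError
# ("got multiple values for argument 'model'").
req = build_request("resnet", {"model": "ignored", "inputs": [{"name": "x"}]})
print(req.model)  # resnet
```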
In `@tests/harness/fake_transport.py`:
- Around line 488-499: The current _do_v2_embedding reads inp.req.inputs[0]
unguarded which can raise IndexError if inputs is empty; update _do_v2_embedding
to defensively handle missing/empty inputs by checking inp.req.inputs (or using
getattr) and defaulting to an empty data list when no input tensor exists, e.g.
compute data = inp.req.inputs[0].get("data", []) if inp.req.inputs else []; then
build texts = [str(d) for d in data] and continue to call
_build_v2_embedding_response as before so the harness returns an empty-embedding
response instead of crashing.
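The defensive pattern in isolation (function name is hypothetical; the guard mirrors the suggested fix):

```python
def extract_embedding_texts(inputs: list[dict]) -> list[str]:
    # Default to an empty data list when no input tensor exists,
    # instead of letting inputs[0] raise IndexError.
    data = inputs[0].get("data", []) if inputs else []
    return [str(d) for d in data]

print(extract_embedding_texts([]))                                    # []
print(extract_embedding_texts([{"name": "text", "data": ["a", 1]}]))  # ['a', '1']
```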
🧹 Nitpick comments (16)
src/aiperf/transports/aiohttp_transport.py (1)
175-178: `str.format()` can raise on unexpected placeholders in user-supplied paths.

If a custom endpoint or metadata path contains brace expressions other than `{model_name}` (e.g., `{version}`, `{0}`, or even a stray `{`), `str.format(model_name=...)` will raise `KeyError` or `IndexError` at runtime. Consider using a safer substitution approach.

🛡️ Suggested safer substitution:

    - if "{" in path:
    -     path = path.format(
    -         model_name=request_info.model_endpoint.primary_model_name
    -     )
    + path = path.replace(
    +     "{model_name}",
    +     request_info.model_endpoint.primary_model_name,
    + )

Apply the same change in both branches (custom endpoint at line 175 and metadata path at line 197).

Also applies to: 197-200
tests/unit/transports/test_path_template.py (1)
41-51: Missing `endpoint_headers` and `endpoint_params` fields in helper.

`_create_request_info` omits `endpoint_headers` and `endpoint_params` that other test fixtures (e.g., in `test_base_transport.py`) include. If `RequestInfo` requires these fields or if future test extensions need headers, this will need updating. Not blocking since `get_url()` doesn't use them, but worth aligning with the existing fixture pattern for consistency.

src/aiperf/transports/grpc/status_mapping.py (1)
9-27: Consider using the enum member directly as a dictionary key instead of extracting `.value[0]`.

Accessing `.value[0]` relies on the internal tuple representation of `grpc.StatusCode`. If the grpc library ever changes the enum value format, this breaks silently. Since the function receives an `int`, you could simply use integer literal keys (which are stable per the gRPC spec) or document why `.value[0]` is used.

That said, this approach does self-document which gRPC code each entry corresponds to, which is a readability benefit. Acceptable as-is.
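A sketch of the integer-key alternative. The HTTP values below follow the common gRPC-to-HTTP convention and may differ from this project's exact table:

```python
# gRPC integer status codes are fixed by the spec, so plain int keys
# avoid depending on grpc.StatusCode's internal (code, message) tuples.
GRPC_TO_HTTP = {
    0: 200,   # OK
    3: 400,   # INVALID_ARGUMENT
    4: 504,   # DEADLINE_EXCEEDED
    5: 404,   # NOT_FOUND
    7: 403,   # PERMISSION_DENIED
    8: 429,   # RESOURCE_EXHAUSTED
    12: 501,  # UNIMPLEMENTED
    13: 500,  # INTERNAL
    14: 503,  # UNAVAILABLE
    16: 401,  # UNAUTHENTICATED
}

def grpc_to_http_status(code: int) -> int:
    # Unknown or unmapped codes fall back to a generic server error.
    return GRPC_TO_HTTP.get(code, 500)

print(grpc_to_http_status(5))   # 404
print(grpc_to_http_status(99))  # 500
```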
tests/aiperf_mock_server/app.py (1)
756-791: `_build_v2_embedding_response` and `_build_v2_ranking_response` are defined but unused.

These builders are not called from any endpoint in this file. Since the PR is WIP, I assume these will be wired up to additional V2 endpoints later. Just flagging for visibility.
Would you like me to open a tracking issue for wiring up the V2 embedding and ranking endpoints?
tests/unit/transports/grpc/test_grpc_client.py (1)
199-228: Unused `call_count` variable in state transition test.

`call_count` (line 211) is incremented but never asserted on. It appears to be leftover debug tracking. Consider removing it or adding an assertion to validate the expected number of state queries.

Remove unused counter:

      # First call with try_to_connect=True, then without
    - call_count = 0
      def get_state(try_to_connect: bool = False) -> grpc.ChannelConnectivity:
    -     nonlocal call_count
    -     call_count += 1
          return next(states)

src/aiperf/endpoints/kserve_v2_images.py (1)
84-98: Consider noting that `cast_fn(value)` can raise on bad user input.

If a user provides a non-numeric value for a numeric typed tensor (e.g., `--extra guidance_scale:abc`), `float("abc")` will raise a `ValueError` with a somewhat opaque traceback originating from the cast call. This is an edge case, and the current behavior (letting it raise) is acceptable, but a more user-friendly error message could be considered in a follow-up.

src/aiperf/endpoints/kserve_v2_vlm.py (2)
49-57: Empty prompt silently submitted when no text content is found.

If `turn.texts` is empty or all `text.contents` are falsy, `prompt` becomes `""` and is still sent as the text tensor. This could make debugging tricky — consider logging a warning or raising if the VLM endpoint requires non-empty text input.

💡 Optional: warn on empty prompt:

      prompt = " ".join(prompts) if prompts else ""
    + if not prompt:
    +     self.warning(lambda: "No text content found in turn; sending empty prompt")
68-70: Nit: comment describes "what" rather than "why".

As per coding guidelines, comments should explain why rather than restate the code. The code already makes it clear that image data is being extracted.

As per coding guidelines: "Comments should explain 'why?' not 'what'."

src/aiperf/transports/grpc/proto/grpc_predict_v2.proto (1)
15-19: Vendored proto — Buf PACKAGE_DIRECTORY_MATCH warning is expected and can be suppressed.

The static analysis flags that package `inference` doesn't match the file path `src/aiperf/transports/grpc/proto`. This is a standard consequence of vendoring upstream protos into a project-specific directory. Consider adding a `buf.yaml` lint exception or a comment noting this is intentionally vendored if the Buf linter runs in CI.

tests/component_integration/endpoints/test_kserve_v2_vlm_endpoint.py (1)
17-31: Test name could include expected outcome per naming convention.

The coding guidelines recommend `test_<function>_<scenario>_<expected>` naming. Consider renaming to something like `test_kserve_v2_vlm_synthetic_returns_expected_request_count` to capture the expected outcome.

That said, this is a minor nit — the test logic itself looks correct.
tests/unit/transports/grpc/test_grpc_trace_data.py (1)
82-90: Registration test accesses private `_model_lookup_table`.

This is fine for verifying discriminator registration, but it couples the test to the internal lookup mechanism. If the registration API changes, this test will break. Consider adding a comment noting this is intentionally testing internal registration behavior.
tests/unit/endpoints/test_kserve_v2_rankings_endpoint.py (1)
207-322: Consider adding a test for non-numeric score handling.

`extract_rankings` has a `try/except` path that warns on non-numeric scores (lines 131-134 in `kserve_v2_rankings.py`). There's no test exercising this branch. A quick parametrized test with a mixed `[0.9, "bad", 0.3]` data array would cover the warning + skip logic.

tests/unit/endpoints/test_kserve_v2_vlm_endpoint.py (1)
152-245: Consider adding an empty-data parse test for consistency.

The images endpoint tests include a `test_parse_response_empty_data` case (empty `data: []`). The VLM `parse_response` has similar logic (`isinstance(data, list) and len(data) > 0`) but no equivalent test here. Adding one would ensure parity and guard the empty-data branch.

src/aiperf/endpoints/kserve_v2_infer.py (1)
83-128: `_extract_text_from_output` returns `None` for empty-string data, causing `make_text_response_data` to also return `None`.

If a V2 output tensor contains `data: [""]`, `_extract_text_from_output` returns `""`, and `make_text_response_data("")` returns `None` (since empty string is falsy). The fallback loop then skips this output and tries the next one. This seems intentional for "no meaningful content," but it's worth calling out: a legitimate empty-string response (e.g., a model that returned `""`) would be silently dropped. If that's intended, no action needed.

src/aiperf/transports/grpc/grpc_transport.py (2)
114-119: `close_all` doesn't handle individual close failures.

If one `lease.close()` raises, remaining leases won't be closed. Consider suppressing per-lease errors during shutdown.

Proposed fix:

      async def close_all(self) -> None:
          """Close all active channel leases."""
          leases = list(self._leases.values())
          self._leases.clear()
          for lease in leases:
    -         await lease.close()
    +         with contextlib.suppress(Exception):
    +             await lease.close()

(Requires `import contextlib` at the top of the file.)
254-264: Same `close_all` concern applies to `_close_grpc_client` pool cleanup.

Similar to `GrpcChannelLeaseManager.close_all`, if one `client.close()` raises on line 264, remaining pool clients won't be closed.

Proposed fix:

      @on_stop
      async def _close_grpc_client(self) -> None:
          """Close all gRPC channels and the lease manager."""
          if self._lease_manager:
              lease_manager = self._lease_manager
              self._lease_manager = None
              await lease_manager.close_all()
          clients = list(self._channel_pool.values())
          self._channel_pool.clear()
          for client in clients:
    -         await client.close()
    +         with contextlib.suppress(Exception):
    +             await client.close()
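The suppress-during-shutdown pattern behind both comments can be demonstrated with a stand-in client (the class and names below are illustrative, not the project's API):

```python
import asyncio
import contextlib

class FakeClient:
    """Illustrative stand-in for a pooled gRPC client."""
    def __init__(self, name: str, fail: bool = False) -> None:
        self.name, self.fail, self.closed = name, fail, False

    async def close(self) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name}: close failed")
        self.closed = True

async def close_all(clients: list[FakeClient]) -> None:
    for client in clients:
        # Suppress per-client errors so one bad close cannot
        # leave the remaining clients open during shutdown.
        with contextlib.suppress(Exception):
            await client.close()

pool = [FakeClient("a"), FakeClient("b", fail=True), FakeClient("c")]
asyncio.run(close_all(pool))
print([c.closed for c in pool])  # [True, False, True]
```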
… test coverage

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
47f02af to 8d70753 (Compare)
- Merge duplicate doc links to kserve_v2_serializers.py
- Add missing kserve_v2_vlm and kserve_v2_images to endpoint table
- Bump protobuf minimum to 5.29.1 to avoid yanked 5.29.0
- Fix zero-dimension shape producing wrong element count in serializer
- Prevent duplicate "model" kwarg TypeError in mock server
- Guard against empty inputs IndexError in fake transport
- Use safer .replace() instead of str.format() for path templates
- Suppress individual close() errors during gRPC shutdown

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

# Conflicts:
#   docs/cli_options.md
#   src/aiperf/plugin/schema/plugins.schema.json
#   src/aiperf/plugin/schema/schemas.py
#   src/aiperf/transports/aiohttp_transport.py
- Remove unused call_count variable in test_grpc_client
- Remove "what" comment in kserve_v2_vlm (guidelines: comments explain "why")
- Add test for non-numeric score handling in rankings endpoint
- Add empty-data parse test for VLM endpoint

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Nitpick Review Responses

Responding to the 16 nitpick comments from the automated review:

Addressed:
Not addressing (intentional):
9-16. Remaining nitpicks (proto file structure, proto vendoring approach, etc.): These were informational observations, not actionable suggestions.

@coderabbitai please review
Addressed items — Good fixes:
Intentionally not addressed — All reasonable.

I'll trigger a fresh review now.

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/aiperf/transports/grpc/grpc_transport.py`:
- Around line 506-511: The current calculation of send_timeout uses a truthy
check (self.model_endpoint.endpoint.timeout or _CHANNEL_READY_TIMEOUT_S) which
treats 0/0.0 as falsy; change it to only fall back to _CHANNEL_READY_TIMEOUT_S
when timeout is None (e.g., use the endpoint.timeout if endpoint.timeout is not
None, otherwise the default), and keep the min(...) wrapping intact so
send_timeout = min((self.model_endpoint.endpoint.timeout if
self.model_endpoint.endpoint.timeout is not None else _CHANNEL_READY_TIMEOUT_S),
_CHANNEL_READY_TIMEOUT_S); reference the send_timeout variable, the
self.model_endpoint.endpoint.timeout attribute, and the _CHANNEL_READY_TIMEOUT_S
constant when making this change.
🧹 Nitpick comments (5)
tests/aiperf_mock_server/app.py (1)
723-725: Unparameterized `dict` in type hint.

`list[dict]` should be `list[dict[str, Any]]` for consistency with the rest of the file.

Proposed fix:

    -def _has_input_tensor(inputs: list[dict], name: str) -> bool:
    +def _has_input_tensor(inputs: list[dict[str, Any]], name: str) -> bool:

src/aiperf/transports/grpc/grpc_client.py (1)
108-134: `wait_for_ready` — good separation of connection setup from request timing.

The SHUTDOWN state check prevents infinite loops, and `asyncio.wait_for` wrapping ensures the caller can bound the connection wait. One subtle note: if the channel transitions to `TRANSIENT_FAILURE`, the loop will keep waiting (which is the correct behavior for retryable connection failures), but consider whether a max-retry or logging on transient failures would be useful for debugging slow startups.

src/aiperf/endpoints/kserve_v2_embeddings.py (1)
113-130: Consider validating shape consistency in `_reshape_embeddings`.

If `n * d != len(flat_data)`, the reshape silently produces truncated or padded embeddings. This could mask server-side issues. A debug/warning log when the product doesn't match the data length would help troubleshoot.

💡 Optional: add a length mismatch warning:

      @staticmethod
      def _reshape_embeddings(
          flat_data: list[float], shape: list[int]
      ) -> list[list[float]]:
          if len(shape) == 2:
              n, d = shape
    +         if n * d != len(flat_data):
    +             # Log would require instance method; consider validating in parse_response instead
    +             pass
              return [flat_data[i * d : (i + 1) * d] for i in range(n)]
          return [flat_data]

Since `_reshape_embeddings` is a `@staticmethod` and can't log, the validation could be done in `parse_response` before calling it.

src/aiperf/transports/base_transports.py (1)
21-34: Minor docstring inconsistency in `FirstTokenCallback`.

Line 27 documents the second parameter as `message` while the type now accepts `InferenceServerResponse`. Consider renaming it to `response` for consistency with the updated terminology used elsewhere in this file (e.g., line 166).

📝 Suggested docstring update:

      Args:
          ttft_ns: duration from request start
    -     message: the first inference server response (SSEMessage, TextResponse, etc.)
    +     response: the first inference server response (SSEMessage, TextResponse, etc.)

tests/unit/transports/grpc/test_grpc_transport.py (1)
150-151: Inaccurate return type annotation on `initial_metadata`.

`tuple[()]` means "a tuple containing one element which is an empty tuple." The actual return value `()` is an empty tuple, so the annotation should be `tuple[tuple[str, str], ...]` to match the real gRPC API, or simply `tuple[()]` could be replaced.

This is a minor mock inconsistency and won't break tests, but could confuse readers.

Suggested fix:

    - async def initial_metadata(self) -> tuple[()]:
    -     return ()
    + async def initial_metadata(self) -> tuple[tuple[str, str], ...]:
    +     return ()
Avoid treating timeout=0 as falsy when falling back to the default channel-ready timeout.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>