Welcome to Triton Inference Server Discussions! #5398
Replies: 2 comments
-
I’m using the Merlin multi-stage recommender system example, and both notebooks ran successfully with the default dataset. However, when I try using my own dataset, I encounter an issue during inference from the Triton server. It results in a NoneType error related to the Feast feature repository.

InferenceServerException Traceback (most recent call last)
File D:\build-rec\building-rec.venv\lib\site-packages\merlin\systems\triton\utils.py:230, in send_triton_request(schema, inputs, outputs_list, client, endpoint, request_id, triton_model)
File D:\build-rec\building-rec.venv\lib\site-packages\tritonclient\grpc\_client.py:1572, in InferenceServerClient.infer(self, model_name, inputs, model_version, outputs, request_id, sequence_id, sequence_start, sequence_end, priority, timeout, client_timeout, headers, compression_algorithm, parameters)
File D:\build-rec\building-rec.venv\lib\site-packages\tritonclient\grpc\_utils.py:77, in raise_error_grpc(rpc_error)
InferenceServerException: [StatusCode.INTERNAL] Traceback (most recent call last):
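For reference, here is roughly how the notebook sends the request. The model name, tensor names, and dtypes below are placeholders rather than my exact schema, but the call path (tritonclient.grpc → InferenceServerClient.infer) matches the traceback above:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the local Triton gRPC endpoint (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder request: a single user id, as the Merlin ensemble expects.
# "user_id", INT32, "executor_model", and "ordered_ids" are illustrative names.
user_ids = np.array([[42]], dtype=np.int32)
infer_input = grpcclient.InferInput("user_id", list(user_ids.shape), "INT32")
infer_input.set_data_from_numpy(user_ids)

requested_output = grpcclient.InferRequestedOutput("ordered_ids")

# client.infer() is where the InferenceServerException above is raised;
# [StatusCode.INTERNAL] indicates the failure happens server-side,
# apparently inside the Feast feature-lookup step.
result = client.infer(
    model_name="executor_model",
    inputs=[infer_input],
    outputs=[requested_output],
)
print(result.as_numpy("ordered_ids"))
```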
-
Subject: Triton + TensorRT-LLM (Llama 3.1 8B) – Feasibility of Stateful Serving + KV Cache Reuse + Priority Caching

Hello everyone, I’m working with Triton Inference Server and the TensorRT-LLM backend to serve the Llama-3.1-8B model. Based on my current setup (attached), my goals for this deployment are:
  - Stateful serving
  - KV cache reuse
  - Priority caching
My question to the community: with this configuration, is this combination feasible?
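To make the question concrete, this is the shape of the client call I have in mind. The model name ("ensemble"), the tensor names ("text_input", "max_tokens", "text_output"), and the use of Triton's generic per-request priority field are assumptions based on the stock tensorrtllm_backend ensemble, not a confirmed recipe for KV cache reuse or priority caching:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Prompt for the Llama-3.1-8B engine. With KV cache reuse enabled on the
# server side, repeated shared prefixes across requests should be able to
# hit previously computed KV blocks.
prompt = np.array([["You are a helpful assistant.\nUser: hello"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

text_input = grpcclient.InferInput("text_input", list(prompt.shape), "BYTES")
text_input.set_data_from_numpy(prompt)
max_tokens_input = grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32")
max_tokens_input.set_data_from_numpy(max_tokens)

output = grpcclient.InferRequestedOutput("text_output")

# `priority` is Triton's request-scheduling priority; whether and how it maps
# onto TensorRT-LLM's KV-cache retention behaviour is exactly what I'm asking.
result = client.infer(
    model_name="ensemble",
    inputs=[text_input, max_tokens_input],
    outputs=[output],
    priority=1,
)
print(result.as_numpy("text_output"))
```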
-
👋 Welcome!
We’re using Discussions as a place to connect with other members of our community. We hope that you: