Description
I am working with TensorRT v10 to run inference with dynamic batch sizes.
My model is ViT-Base, obtained by following https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq.
The model was exported to ONNX with a dynamic axis on the batch dimension, and an optimization profile was set up during the build step (roughly as sketched below).
To run inference, I first allocate memory using the max shapes: (32, 3, 224, 224) for the input and (32, 1000) for the output.
Then I copy the input data to device memory through the device pointer.
Next I set the input shape on the execution context.
Finally I call the do_inference function, but the output always has shape (32000,) no matter which batch size I set.
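The profile was set up along these lines (a sketch only; the min/opt shapes are placeholders, while the input name "input" and the max shape (32, 3, 224, 224) match my setup):

import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# Dynamic batch dimension: min/opt below are illustrative, max matches my engine.
profile.set_shape("input", (1, 3, 224, 224), (16, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)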
Environment
TensorRT Version: 10.0.1
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 530
CUDA Version: 12.2
CUDNN Version: 9.2.0
Operating System:
Python Version (if applicable): 3.10
PyTorch Version (if applicable): 2.3.1
Relevant Files
Model link: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq
Steps To Reproduce
Commands or scripts:
import tensorrt as trt
import numpy as np
from cuda import cuda, cudart
# https://github.com/NVIDIA/TensorRT/blob/release/10.1/samples/python/common_runtime.py
from common_runtime import *
# Load the ViT engine built with a dynamic batch dimension (max batch = 32)
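# A minimal sketch of the loading step elided above, assuming the engine was
# serialized to a file during the build (the file name is a placeholder):
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("vit_b32_dynamic.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())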
stream = cuda_call(cudart.cudaStreamCreate())
batch_size = None
inputs = []
outputs = []
bindings = []
for i in range(engine.num_io_tensors):
    tensor_name = engine.get_tensor_name(i)
    # If the binding is dynamic, some dimensions can be -1.
    # get_tensor_shape returns the shape with the dynamic dims, same as the ONNX model.
    # get_tensor_profile_shape returns (min_shape, optimal_shape, max_shape).
    # Pick the max shape to allocate enough memory for the binding.
    if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
        shape = engine.get_tensor_profile_shape(tensor_name, 0)[-1]
        batch_size = shape[0]
    else:
        shape = engine.get_tensor_shape(tensor_name)
        # Replace the dynamic batch dim with the max input batch.
        shape[0] = batch_size
    # Number of elements (not bytes) in the tensor.
    size = trt.volume(shape)
    trt_type = engine.get_tensor_dtype(tensor_name)
    # Allocate host and device buffers.
    if trt.nptype(trt_type):
        dtype = np.dtype(trt.nptype(trt_type))
        bindingMemory = HostDeviceMem(size, dtype)
    else:  # no numpy support: create a byte array instead (BF16, FP8, INT4)
        size = int(size * trt_type.itemsize)
        bindingMemory = HostDeviceMem(size)
    # Append the device buffer address to the device bindings.
    bindings.append(int(bindingMemory.device))
    # Append to the appropriate list.
    if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
        inputs.append(bindingMemory)
    else:
        outputs.append(bindingMemory)
context = engine.create_execution_context()
batch = 4
shape = (batch, 3, 224, 224)
input_data = np.random.rand(*shape).astype("float32")
context.set_input_shape("input", shape)
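# Sanity check (the output tensor name "output" is an assumption): once the
# input shape is set, the context should report the resolved shapes.
print(context.get_tensor_shape("input"))   # expected: (4, 3, 224, 224)
print(context.get_tensor_shape("output"))  # expected: (4, 1000)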
memcpy_host_to_device(inputs[0].device, input_data)
results = do_inference(context, engine, bindings, inputs, outputs, stream)
results[0].shape
> (32000,)
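For comparison, the element counts implied by the shapes above (a plain-Python check, nothing engine-specific):

num_classes = 1000   # output features of the classifier head
batch = 4            # batch size set on the execution context
max_batch = 32       # max batch from the optimization profile

print(batch * num_classes)      # 4000  -> the size I expect, i.e. shape (4, 1000)
print(max_batch * num_classes)  # 32000 -> the size results[0] actually has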