Description
After putting a comment on #1488 , I decided to open my own issue.
Environment
TensorRT Version: 8.5.2.2
NVIDIA GPU: NVIDIA RTX 2000 Ada Generation Laptop GPU
NVIDIA Driver Version: 570.169
CUDA Version: 11.5
CUDNN Version: 8.9.2.26
Operating System: Linux 6.8.0-65-generic
Python Version: 3.10.12
PyTorch Version: 2.3.0+cu121
Baremetal or Container: Both
I am having the same problem, but only in a specific situation. I created a TensorRT inference engine class which runs properly on its own, both on baremetal and in the container, when used like this:
import time

import numpy as np
import torch

if __name__ == "__main__":
    import pycuda.autoinit  # creates the CUDA context

    engine_path = "model_engine_fp16.trt"
    calibration_images_path = "calib_images.pt"
    calibration_tensors = torch.load(calibration_images_path, map_location='cpu')
    calibration_images = [img.numpy().astype(np.float32) for img in calibration_tensors]

    print("Testing inference with variable batch sizes...")
    inference_engine = TensorRTInference(engine_path, max_batch_size=6)

    # Test with random batches
    for i in range(100):
        batch_size = np.random.randint(1, 6)  # Random batch size 1-5
        print(f"Batch size for run #{i}: {batch_size}")

        # Select random images
        indices = np.random.choice(len(calibration_images), batch_size, replace=False)
        batch_images = [calibration_images[idx] for idx in indices]

        # Stack to batch
        batch_input = np.stack(batch_images, axis=0)

        t_start = time.time()
        results = inference_engine.infer(batch_input)
        t_total = time.time() - t_start
        time.sleep(1. / 6.)

        print(f"Batch {i}: size={batch_size}, time={t_total:.4f}s")
        print(f"  Input shape: {batch_input.shape}")
        print(f"  Output shapes: {[r.shape for r in results]}")

But when I import the same engine in a class which mixes the TensorRT model with a YOLO model, the behaviour changes: on baremetal it still runs well, but in the container self.engine.create_execution_context() returns a NoneType.
Code for the constructor of the TensorRTInference class:
import tensorrt as trt
import pycuda.driver as cuda


class TensorRTInference:
    """Handle TensorRT inference with variable batch sizes"""

    def __init__(self, engine_path, max_batch_size=32):
        try:
            current_context = cuda.Context.get_current()
            print(f"Current CUDA context: {current_context}")
            print(f"Context device: {current_context.get_device()}")
            print(f"Context API version: {current_context.get_api_version()}")
        except cuda.LogicError as e:
            print(f"No CUDA context: {e}")
            import pycuda.autoinit
            current_context = cuda.Context.get_current()
            print(f"After autoinit - Context: {current_context}")

        # Test that the context is working
        try:
            cuda.Context.synchronize()
            print("CUDA context is active and synchronized")
        except Exception as e:
            print(f"CUDA context sync failed: {e}")

        # Load TensorRT engine
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f:
            runtime = trt.Runtime(self.logger)
            self.engine = runtime.deserialize_cuda_engine(f.read())
        print(f"Engine deserialized successfully: {self.engine is not None}")

        if self.engine is None:
            raise Exception("Failed to deserialize TensorRT engine")

        self.context = self.engine.create_execution_context()
        if self.context is None:
            raise Exception("Execution context is None!")

        self.max_batch_size = max_batch_size

        # Get tensor info
        self.input_names = []
        self.output_names = []
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.input_names.append(name)
            else:
                self.output_names.append(name)

        # Pre-allocate buffers for maximum batch size
        self.buffers = {}
        self.stream = cuda.Stream()
        self._allocate_max_buffers()

Code for the constructor of the class which loads the TensorRTInference class and the YOLO model:
import torch
from ultralytics import YOLO


class UseTwoModels:
    def __init__(self,
                 detection_model_path: str,
                 other_model_path: str,
                 device: str = 'cuda:0',
                 min_conf: float = 0.5) -> None:
        self.device = torch.device(device if torch.cuda.is_available() else 'cpu')
        torch.cuda.set_device(device)
        print(device)

        # Warm up CUDA context
        dummy_tensor = torch.zeros(1).to(device)
        del dummy_tensor
        print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")

        try:
            self.detector_model = YOLO(detection_model_path, task='detect')
        except Exception as e:
            print(f"Error loading YOLO model: {e}")
        else:
            print("YOLO detector properly loaded.")

        self.keypoint_model = TensorRTInference(other_model_path, max_batch_size=6)
        self.min_conf = min_conf
        self.person_cls = 0

Just to clarify my problem: everything works well on baremetal, but when using the UseTwoModels class in the container I get the Exception: Execution context is None! that I raise in the TensorRTInference constructor.
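For completeness, here is a stripped-down sketch of the sequence that fails in the container. It is not my exact code, the paths are placeholders, and it simply loads YOLO first and then creates the TensorRT execution context the same way my class does:

# Hypothetical minimal repro, mirroring the order of operations in UseTwoModels.
import torch
import tensorrt as trt
import pycuda.autoinit  # creates the CUDA context up front
from ultralytics import YOLO

torch.cuda.set_device('cuda:0')
detector = YOLO('detector.pt', task='detect')  # placeholder path; this step succeeds

logger = trt.Logger(trt.Logger.WARNING)
with open('model_engine_fp16.trt', 'rb') as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
print(f"Engine deserialized: {engine is not None}")  # True in my trace

context = engine.create_execution_context()
print(f"Execution context: {context}")  # None, but only inside the container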
Here is the trace when trying to load the model from UseTwoModels:
# Here we are inside UseTwoModels
cuda:0
Using CUDA device: NVIDIA RTX 2000 Ada Generation Laptop GPU
Current CUDA context Det: <pycuda._driver.Context object at 0x7c80388b5540>
YOLO detector properly loaded.
# Here we are inside TensorRTInference
Current CUDA context: <pycuda._driver.Context object at 0x7c8038360890>
Context device: <pycuda._driver.Device object at 0x7c803889ece0>
Context API version: 3020
CUDA context is active and synchronized
Engine deserialized successfully: True
# This is the error
Traceback (most recent call last):
...
raise Exception("Execution context is None!")
Exception: Execution context is None!
I have also tried rebuilding the engine directly inside the container and using that, but I keep getting the same problem.
My intuition tells me it is not a TensorRT version problem, given that the first snippet runs well both on baremetal and in the container.
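If it helps narrow this down, I can rerun inside the container with more verbose logging around the context creation. A rough sketch of what I would add (assuming a VERBOSE logger and a free-memory check are enough to surface the underlying error):

# Hypothetical diagnostic variant, not my production code.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context

logger = trt.Logger(trt.Logger.VERBOSE)  # much noisier than WARNING
runtime = trt.Runtime(logger)
with open("model_engine_fp16.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Check free device memory right before creating the execution context
free_mem, total_mem = cuda.mem_get_info()
print(f"Free GPU memory: {free_mem / 1e6:.1f} MB of {total_mem / 1e6:.1f} MB")

context = engine.create_execution_context()
print(f"Execution context: {context}")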
Please let me know if I should provide other information. Thanks!