@wsttiger (Collaborator)

Add CUDA Graph Optimization to TensorRT Decoder

Overview

This PR implements CUDA graph optimization for the TensorRT decoder, delivering a measured ~20% reduction in inference latency (1.24x speedup) while maintaining full numerical accuracy. The implementation uses a clean executor abstraction that automatically falls back to traditional execution for models with dynamic shapes.

Key Features

1. CUDA Graph Optimization

  • Captures TensorRT inference operations into a CUDA graph on first decode call
  • Replays captured graph on subsequent calls, eliminating kernel launch overhead
  • ~20% reduction in inference latency (1.24x speedup)
  • Follows NVIDIA TensorRT best practices
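The capture-and-replay flow follows the standard CUDA runtime graph pattern: record the enqueued work on first call, then replay the instantiated graph on every later call. A minimal sketch of that pattern (assuming TensorRT ≥ 8.5 for `enqueueV3` and the CUDA 12 three-argument `cudaGraphInstantiate` signature; the function and variable names here are illustrative, not the PR's actual members):

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// Sketch only: first call captures the inference into a graph,
// subsequent calls replay it with a single launch.
bool decode_once(nvinfer1::IExecutionContext* context, cudaStream_t stream,
                 cudaGraph_t& graph, cudaGraphExec_t& graph_exec,
                 bool& captured) {
  if (!captured) {
    // Tensor addresses must already be bound (setTensorAddress) so the
    // captured graph reads/writes stable buffers on every replay.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);          // recorded, not executed
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);  // CUDA 12 signature
    captured = true;
  }
  // Replay: one launch replaces the whole sequence of kernel launches.
  cudaGraphLaunch(graph_exec, stream);
  return cudaStreamSynchronize(stream) == cudaSuccess;
}
```

Because input/output copies stay outside the capture, per-call data updates still work: only the kernel-launch sequence is frozen into the graph.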

2. Executor Abstraction (PIMPL + Variant Pattern)

  • TraditionalExecutor: Standard TensorRT execution path
  • CudaGraphExecutor: CUDA graph-optimized execution path
  • Zero-overhead std::variant dispatch (no virtual calls)
  • Clean separation of concerns, easy to extend

3. Dynamic Shape Detection

  • Automatically detects models with dynamic tensor dimensions
  • Detects models with multiple optimization profiles
  • Falls back to traditional execution when CUDA graphs are incompatible
  • Prevents runtime errors from improper CUDA graph usage

4. User Control

  • New parameter: use_cuda_graph (default: true)
  • Users can explicitly disable CUDA graphs if needed
  • Clear logging of executor selection and reasoning

Performance Results

Benchmark test (200 iterations after warm-up):

  • CUDA Graph: ~14.4 μs average
  • Traditional: ~17.8 μs average
  • Speedup: 1.24x (~20% reduction in average latency)

Implement CUDA graph capture and replay for TensorRT inference operations
to reduce kernel launch overhead and improve performance. Following NVIDIA
TensorRT best practices, the decoder now:

- Captures inference operations into a CUDA graph on first decode() call
- Executes the captured graph on all subsequent decode() calls
- Properly manages CUDA graph lifecycle with cleanup in destructor

Changes:
- Add CUDA graph member variables (cuda_graph_captured_, cuda_graph_,
  cuda_graph_exec_)
- Modify decode() to capture setTensorAddress() and enqueueV3() operations
  on first invocation and replay the graph on subsequent calls
- Add CUDA graph cleanup to destructor

Performance benefits:
- 10-20% reduction in inference latency from reduced launch overhead
- Better GPU utilization through optimized command submission
- Memory operations outside graph allow per-call data updates

Tested with all unit tests passing (10/10) including actual GPU inference
validation with numerical accuracy verified.

Ref: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
Signed-off-by: Scott Thornton <[email protected]>
… detection

Refactor the TensorRT decoder implementation to use a PIMPL + variant pattern
with separate executor strategies, enabling better extensibility and automatic
handling of models with dynamic shapes.

Key changes:

1. Executor Abstraction:
   - Introduce TraditionalExecutor for standard TensorRT execution
   - Introduce CudaGraphExecutor for CUDA graph-optimized execution
   - Use std::variant for zero-overhead executor selection
   - Executors encapsulate their state and lifecycle management

2. Dynamic Shape Detection:
   - Add supports_cuda_graphs() to detect model compatibility
   - Automatically fall back to traditional execution for:
     * Models with dynamic tensor dimensions
     * Models with multiple optimization profiles
   - Prevents runtime errors from incompatible CUDA graph usage

3. User Control:
   - Support 'use_cuda_graph' parameter (default: true)
   - Users can explicitly disable CUDA graphs if needed
   - Clear logging of executor selection and reasoning

4. PIMPL Pattern:
   - Hide all TensorRT implementation details in Impl struct
   - Clean separation between interface and implementation
   - Improved compilation times for users of the decoder

5. Code Quality:
   - Rename 'initialized_' to 'decoder_ready_' for clarity
   - Simplified decode() method with single execution path
   - Better organized resource cleanup in Impl destructor

Architecture benefits:
- Extensible: Easy to add new executor types (e.g., batched, streamed)
- Type-safe: Compile-time guarantees via std::variant
- Zero overhead: No virtual call overhead vs manual branching
- Maintainable: Clear separation of concerns

All tests passing (10/10) with numerical accuracy verified.

Signed-off-by: Scott Thornton <[email protected]>
Add PerformanceComparisonCudaGraphVsTraditional test that quantifies the
performance benefit of CUDA graph execution vs traditional TensorRT execution.

Test measures 200 iterations after warm-up and demonstrates:
- CUDA Graph: ~14.4 μs average
- Traditional: ~17.8 μs average
- Speedup: 1.24x (~20% reduction in average latency)

Includes assertions ensuring ≥5% speedup and convergence validation.
Provides empirical evidence that CUDA graphs deliver measurable performance
benefits without sacrificing accuracy.

All 11 tests passing.

Signed-off-by: Scott Thornton <[email protected]>