@wsttiger (Collaborator)

Add CUDA Graph Optimization to TensorRT Decoder

Overview

This PR implements CUDA graph optimization for the TensorRT decoder, delivering a measured ~20% reduction in inference latency (1.24x speedup) while maintaining full numerical accuracy. The implementation uses a clean executor abstraction that automatically falls back to traditional execution for models with dynamic shapes.

Key Features

1. CUDA Graph Optimization

  • Captures TensorRT inference operations into a CUDA graph on first decode call
  • Replays captured graph on subsequent calls, eliminating kernel launch overhead
  • ~20% reduction in inference latency (1.24x speedup)
  • Follows NVIDIA TensorRT best practices
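The capture-and-replay flow follows the standard CUDA runtime graph pattern: record the enqueued work on first call, then replay the instantiated graph on every later call. A minimal sketch of that pattern (assuming TensorRT ≥ 8.5 for `enqueueV3` and the CUDA 12 three-argument `cudaGraphInstantiate` signature; the function and variable names here are illustrative, not the PR's actual members):

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// Sketch only: first call captures the inference into a graph,
// subsequent calls replay it with a single launch.
bool decode_once(nvinfer1::IExecutionContext* context, cudaStream_t stream,
                 cudaGraph_t& graph, cudaGraphExec_t& graph_exec,
                 bool& captured) {
  if (!captured) {
    // Tensor addresses must already be bound (setTensorAddress) so the
    // captured graph reads/writes stable buffers on every replay.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);          // recorded, not executed
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0);  // CUDA 12 signature
    captured = true;
  }
  // Replay: one launch replaces the whole sequence of kernel launches.
  cudaGraphLaunch(graph_exec, stream);
  return cudaStreamSynchronize(stream) == cudaSuccess;
}
```

Because input/output copies stay outside the capture, per-call data updates still work: only the kernel-launch sequence is frozen into the graph.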

2. Executor Abstraction (PIMPL + Variant Pattern)

  • TraditionalExecutor: Standard TensorRT execution path
  • CudaGraphExecutor: CUDA graph-optimized execution path
  • Zero-overhead std::variant dispatch (no virtual calls)
  • Clean separation of concerns, easy to extend

3. Dynamic Shape Detection

  • Automatically detects models with dynamic tensor dimensions
  • Detects models with multiple optimization profiles
  • Falls back to traditional execution when CUDA graphs are incompatible
  • Prevents runtime errors from improper CUDA graph usage

4. User Control

  • New parameter: use_cuda_graph (default: true)
  • Users can explicitly disable CUDA graphs if needed
  • Clear logging of executor selection and reasoning

Performance Results

Benchmark test (200 iterations after warm-up):

  • CUDA Graph: ~14.4 μs average
  • Traditional: ~17.8 μs average
  • Speedup: 1.24x (~20% reduction in average latency)

Implement CUDA graph capture and replay for TensorRT inference operations
to reduce kernel launch overhead and improve performance. Following NVIDIA
TensorRT best practices, the decoder now:

- Captures inference operations into a CUDA graph on first decode() call
- Executes the captured graph on all subsequent decode() calls
- Properly manages CUDA graph lifecycle with cleanup in destructor

Changes:
- Add CUDA graph member variables (cuda_graph_captured_, cuda_graph_,
  cuda_graph_exec_)
- Modify decode() to capture setTensorAddress() and enqueueV3() operations
  on first invocation and replay the graph on subsequent calls
- Add CUDA graph cleanup to destructor

Performance benefits:
- 10-20% reduction in inference latency from reduced launch overhead
- Better GPU utilization through optimized command submission
- Memory operations outside graph allow per-call data updates

Tested with all unit tests passing (10/10) including actual GPU inference
validation with numerical accuracy verified.

Ref: https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html
Signed-off-by: Scott Thornton <[email protected]>
… detection

Refactor the TensorRT decoder implementation to use a PIMPL + variant pattern
with separate executor strategies, enabling better extensibility and automatic
handling of models with dynamic shapes.

Key changes:

1. Executor Abstraction:
   - Introduce TraditionalExecutor for standard TensorRT execution
   - Introduce CudaGraphExecutor for CUDA graph-optimized execution
   - Use std::variant for zero-overhead executor selection
   - Executors encapsulate their state and lifecycle management

2. Dynamic Shape Detection:
   - Add supports_cuda_graphs() to detect model compatibility
   - Automatically fall back to traditional execution for:
     * Models with dynamic tensor dimensions
     * Models with multiple optimization profiles
   - Prevents runtime errors from incompatible CUDA graph usage

3. User Control:
   - Support 'use_cuda_graph' parameter (default: true)
   - Users can explicitly disable CUDA graphs if needed
   - Clear logging of executor selection and reasoning

4. PIMPL Pattern:
   - Hide all TensorRT implementation details in Impl struct
   - Clean separation between interface and implementation
   - Improved compilation times for users of the decoder

5. Code Quality:
   - Rename 'initialized_' to 'decoder_ready_' for clarity
   - Simplified decode() method with single execution path
   - Better organized resource cleanup in Impl destructor

Architecture benefits:
- Extensible: Easy to add new executor types (e.g., batched, streamed)
- Type-safe: Compile-time guarantees via std::variant
- Zero overhead: No virtual call overhead vs manual branching
- Maintainable: Clear separation of concerns

All tests passing (10/10) with numerical accuracy verified.

Signed-off-by: Scott Thornton <[email protected]>
Add PerformanceComparisonCudaGraphVsTraditional test that quantifies the
performance benefit of CUDA graph execution vs traditional TensorRT execution.

Test measures 200 iterations after warm-up and demonstrates:
- CUDA Graph: ~14.4 μs average
- Traditional: ~17.8 μs average
- Speedup: 1.24x (~20% reduction in average latency)

Includes assertions ensuring ≥5% speedup and convergence validation.
Provides empirical evidence that CUDA graphs deliver measurable performance
benefits without sacrificing accuracy.

All 11 tests passing.

Signed-off-by: Scott Thornton <[email protected]>