@soumyadbanik
  • Add O(num_dets) optimized drawing kernel (100x speedup: 5ms -> 0.05ms)
  • Add gather_kept_bboxes_kernel for dense bbox extraction
  • Add process_mask_kernel with bilinear interpolation and strict bbox clipping
  • Add cuda_blur_masks for mask smoothing
  • Increase kMaxNumOutputBbox to 8500 (fixes crash with standard YOLOv8 models)
  • Update yolov8_seg.cpp for TensorRT 10 compatibility (enqueueV3)
  • Add comprehensive documentation (GPU_POSTPROCESSING.md)
  • Add result images demonstrating correct mask output

Tested on RTX 3080 Ti with CUDA 12.6 and TensorRT 10.x
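
For reviewers who want to sanity-check the mask path, here is a minimal CPU reference of what "bilinear interpolation and strict bbox clipping" could look like. It is a sketch, not the code in `process_mask_kernel`: the `Box` struct, the function names, the pixel-center coordinate mapping, and the 0.5 threshold are all assumptions made for illustration. A GPU version would presumably compute one output pixel per thread instead of looping.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical box type in output-image pixel coordinates (not from the PR).
struct Box { float x1, y1, x2, y2; };

// Bilinearly sample a row-major low-resolution mask (mw x mh) at the
// continuous position (fx, fy). Coordinates are clamped to the mask border.
float sample_bilinear(const std::vector<float>& m, int mw, int mh,
                      float fx, float fy) {
    fx = std::max(0.0f, std::min(fx, static_cast<float>(mw - 1)));
    fy = std::max(0.0f, std::min(fy, static_cast<float>(mh - 1)));
    const int x0 = static_cast<int>(std::floor(fx));
    const int y0 = static_cast<int>(std::floor(fy));
    const int x1 = std::min(x0 + 1, mw - 1);
    const int y1 = std::min(y0 + 1, mh - 1);
    const float ax = fx - static_cast<float>(x0);  // horizontal weight
    const float ay = fy - static_cast<float>(y0);  // vertical weight
    const float top = m[y0 * mw + x0] * (1.0f - ax) + m[y0 * mw + x1] * ax;
    const float bot = m[y1 * mw + x0] * (1.0f - ax) + m[y1 * mw + x1] * ax;
    return top * (1.0f - ay) + bot * ay;
}

// Upsample the mask to out_w x out_h, binarize at 0.5 (assumed threshold),
// and leave every pixel outside the detection box at zero.
std::vector<unsigned char> upsample_and_clip(const std::vector<float>& m,
                                             int mw, int mh,
                                             int out_w, int out_h, Box box) {
    assert(m.size() == static_cast<std::size_t>(mw) * mh);
    std::vector<unsigned char> out(static_cast<std::size_t>(out_w) * out_h, 0);
    for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; ++x) {
            // Strict clipping: pixels outside the box are never written.
            if (x < box.x1 || x > box.x2 || y < box.y1 || y > box.y2) continue;
            // Map the output pixel center into mask coordinates.
            const float fx = (x + 0.5f) * mw / out_w - 0.5f;
            const float fy = (y + 0.5f) * mh / out_h - 0.5f;
            out[y * out_w + x] =
                sample_bilinear(m, mw, mh, fx, fy) > 0.5f ? 1 : 0;
        }
    }
    return out;
}
```

Comparing the GPU output against a reference like this on a handful of detections is one way to confirm that no mask pixels leak outside their boxes.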
