Your code notes in the cuda preprocessing kernel that pinned memory is used for faster memcpy performance.
Yolo-V11-cpp-TensorRT/src/preprocess.cu
Lines 130 to 160 in 988adf1
```cpp
// Host function to perform CUDA-based preprocessing
void cuda_preprocess(
    uint8_t* src,        // Source image data on host
    int src_width,       // Source image width
    int src_height,      // Source image height
    float* dst,          // Destination buffer on device
    int dst_width,       // Destination image width
    int dst_height,      // Destination image height
    cudaStream_t stream  // CUDA stream for asynchronous execution
) {
    // Calculate the size of the image in bytes (3 channels: BGR)
    int img_size = src_width * src_height * 3;
    // Copy source image data to pinned host memory for faster transfer
    memcpy(img_buffer_host, src, img_size);
    // Asynchronously copy image data from host to device memory
    CUDA_CHECK(cudaMemcpyAsync(
        img_buffer_device,
        img_buffer_host,
        img_size,
        cudaMemcpyHostToDevice,
        stream
    ));
    // Define affine transformation matrices
    AffineMatrix s2d, d2s;  // Source to destination and vice versa
    // Calculate the scaling factor to maintain aspect ratio
    float scale = std::min(
```
Pinned (page-locked) memory is what the GPU's DMA engine actually transfers from. When you call a host-to-device copy on ordinary pageable memory, the CUDA driver transparently stages the data: it first copies it from pageable memory into an internal pinned buffer, then DMAs it from that pinned buffer to the device.

This code performs that staging step manually: the `memcpy` from `src` into `img_buffer_host` reproduces the pageable-to-pinned copy the driver would otherwise do itself, so it does not by itself make the transfer faster. The real benefit of pinned memory comes from writing the data directly into the pinned buffer (or keeping it pinned end-to-end), which eliminates the extra copy entirely. I wanted to point this out, since the comment suggests a general misunderstanding of how H2D transfers work in CUDA.
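As a sketch of the alternative: if the stage that produces the frame writes its output straight into the pinned buffer, the staging `memcpy` disappears and `cudaMemcpyAsync` can genuinely overlap with other work on the stream. This is a minimal, hypothetical example (buffer names mirror the repo's, but `init_buffers`/`upload` and the sizes are illustrative, not the project's actual API):

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

static uint8_t* img_buffer_host = nullptr;    // pinned (page-locked) host buffer
static uint8_t* img_buffer_device = nullptr;  // device buffer

// Allocate once at startup, sized for the largest expected frame.
void init_buffers(size_t max_img_size) {
    // cudaMallocHost returns page-locked memory the DMA engine can read directly.
    cudaMallocHost(reinterpret_cast<void**>(&img_buffer_host), max_img_size);
    cudaMalloc(reinterpret_cast<void**>(&img_buffer_device), max_img_size);
}

// If the decode/capture stage fills img_buffer_host directly -- e.g. wrapping it
// as cv::Mat frame(h, w, CV_8UC3, img_buffer_host) and decoding into that --
// there is no pageable-to-pinned memcpy left to do before the async copy.
void upload(size_t img_size, cudaStream_t stream) {
    cudaMemcpyAsync(img_buffer_device, img_buffer_host, img_size,
                    cudaMemcpyHostToDevice, stream);
}
```

The trade-off is that pinned memory is a limited resource (it cannot be paged out), so it should be allocated once and reused rather than allocated per frame.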