[WIP][CUDA backend]: Async copy between host<->device #16053
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16053
Note: Links to docs will display an error until the docs builds have been completed.
❌ 12 New Failures, 12 Pending as of commit 743dec9 with merge base 33ec615.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This work-in-progress PR introduces asynchronous memory copying between host and device in the CUDA backend to improve performance. The implementation adds a new aoti_torch_copy_async function that uses CUDA streams for non-blocking memory transfers, replacing synchronous aoti_torch_copy_ calls in the execution pipeline.
Key changes:
- Added the aoti_torch_copy_async API with stream-based async memory transfers
- Integrated async copies into the CUDA backend execution flow with proper stream synchronization
- Added comprehensive documentation for the new async copy function
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backends/cuda/runtime/shims/memory.h | Added function declaration and documentation for aoti_torch_copy_async |
| backends/cuda/runtime/shims/memory.cpp | Implemented aoti_torch_copy_async with validation, device detection, and async CUDA memory operations |
| backends/cuda/runtime/cuda_backend.cpp | Integrated async copy for H2D and D2H transfers in the execution pipeline with stream synchronization |
AOTITorchError
aoti_torch_copy_async(Tensor* self, Tensor* src, cudaStream_t stream) {
  // Check for null pointers first
  ET_CHECK_OR_RETURN_ERROR(
      self != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: self tensor is null");

  ET_CHECK_OR_RETURN_ERROR(
      src != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: src tensor is null");

  // Get dtype information and validate compatibility
  int32_t self_dtype, src_dtype;
  aoti_torch_get_dtype(self, &self_dtype);
  aoti_torch_get_dtype(src, &src_dtype);

  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(self_dtype));
  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(src_dtype));

  // Check dtype compatibility - both tensors must have the same dtype
  ET_CHECK_OR_RETURN_ERROR(
      self_dtype == src_dtype,
      InvalidArgument,
      "dtype mismatch. self.dtype=%d, src.dtype=%d. aoti_torch_copy_async requires same dtypes",
      self_dtype,
      src_dtype);

  // Check total number of elements compatibility
  int64_t self_numel = self->numel();
  int64_t src_numel = src->numel();

  ET_CHECK_OR_RETURN_ERROR(
      self_numel == src_numel,
      InvalidArgument,
      "numel mismatch. self.numel()=%ld, src.numel()=%ld",
      self_numel,
      src_numel);

  // Get tensor metadata
  int64_t* self_strides;
  int64_t* src_strides;
  aoti_torch_get_strides(self, &self_strides);
  aoti_torch_get_strides(src, &src_strides);

  // Check if tensors have the same strides (required for async copy)
  bool same_strides = true;
  for (int i = 0; i < self->dim(); i++) {
    if (self_strides[i] != src_strides[i]) {
      same_strides = false;
      break;
    }
  }

  ET_CHECK_OR_RETURN_ERROR(
      same_strides,
      InvalidArgument,
      "aoti_torch_copy_async requires tensors with same strides. Use aoti_torch_copy_ for non-contiguous tensors");

  // Determine device locations
  cudaPointerAttributes srcAttributes{};
  cudaPointerAttributes dstAttributes{};

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&srcAttributes, src->data_ptr()));

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&dstAttributes, self->data_ptr()));

  bool srcIsDevice = srcAttributes.type == cudaMemoryTypeDevice;
  bool dstIsDevice = dstAttributes.type == cudaMemoryTypeDevice;

  size_t total_bytes = src->nbytes();

  // Determine copy direction and perform async copy
  if (srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToDevice,
        stream));
  } else if (srcIsDevice && !dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToHost,
        stream));
  } else if (!srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyHostToDevice,
        stream));
  } else {
    // Host to host - use regular memcpy (no async benefit)
    std::memcpy(self->mutable_data_ptr(), src->data_ptr(), total_bytes);
  }

  return Error::Ok;
}
Copilot AI · Dec 2, 2025
The new aoti_torch_copy_async function lacks test coverage. Given that aoti_torch_copy_ has comprehensive test coverage in test_aoti_torch_copy_.cpp, the async variant should have similar tests covering:
- Basic async copy functionality with stream synchronization
- Dimension mismatch validation
- Stride mismatch validation
- Different device location combinations (H2D, D2H, D2D, H2H)
- Error cases (null pointers, dtype mismatch, etc.)
Consider adding a new test file test_aoti_torch_copy_async.cpp following the pattern of existing tests.
Compare: ae46c63 to 743dec9
No description provided.