Conversation

@mergennachin
Contributor

No description provided.

Copilot AI review requested due to automatic review settings December 2, 2025 20:13
@pytorch-bot

pytorch-bot bot commented Dec 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16053

Note: Links to docs will display an error until the docs builds have been completed.

❌ 12 New Failures, 12 Pending

As of commit 743dec9 with merge base 33ec615:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot finished reviewing on behalf of mergennachin December 2, 2025 20:16
Contributor

Copilot AI left a comment

Pull request overview

This work-in-progress PR introduces asynchronous memory copying between host and device in the CUDA backend to improve performance. The implementation adds a new aoti_torch_copy_async function that uses CUDA streams for non-blocking memory transfers, replacing synchronous aoti_torch_copy_ calls in the execution pipeline.

Key changes:

  • Added aoti_torch_copy_async API with stream-based async memory transfers
  • Integrated async copies in the CUDA backend execution flow with proper stream synchronization
  • Added comprehensive documentation for the new async copy function

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
backends/cuda/runtime/shims/memory.h Added function declaration and documentation for aoti_torch_copy_async
backends/cuda/runtime/shims/memory.cpp Implemented aoti_torch_copy_async with validation, device detection, and async CUDA memory operations
backends/cuda/runtime/cuda_backend.cpp Integrated async copy for H2D and D2H transfers in the execution pipeline with stream synchronization


Comment on lines 585 to 764
AOTITorchError
aoti_torch_copy_async(Tensor* self, Tensor* src, cudaStream_t stream) {
  // Check for null pointers first
  ET_CHECK_OR_RETURN_ERROR(
      self != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: self tensor is null");

  ET_CHECK_OR_RETURN_ERROR(
      src != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: src tensor is null");

  // Get dtype information and validate compatibility
  int32_t self_dtype, src_dtype;
  ET_CHECK_OK_OR_RETURN_ERROR(aoti_torch_get_dtype(self, &self_dtype));
  ET_CHECK_OK_OR_RETURN_ERROR(aoti_torch_get_dtype(src, &src_dtype));

  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(self_dtype));
  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(src_dtype));

  // Check dtype compatibility - both tensors must have the same dtype
  ET_CHECK_OR_RETURN_ERROR(
      self_dtype == src_dtype,
      InvalidArgument,
      "dtype mismatch. self.dtype=%d, src.dtype=%d. aoti_torch_copy_async requires same dtypes",
      self_dtype,
      src_dtype);

  // Check total number of elements compatibility
  int64_t self_numel = self->numel();
  int64_t src_numel = src->numel();

  ET_CHECK_OR_RETURN_ERROR(
      self_numel == src_numel,
      InvalidArgument,
      "numel mismatch. self.numel()=%ld, src.numel()=%ld",
      self_numel,
      src_numel);

  // The flat async copy below requires identical layouts, so the tensors
  // must agree on rank before strides can be compared elementwise.
  ET_CHECK_OR_RETURN_ERROR(
      self->dim() == src->dim(),
      InvalidArgument,
      "aoti_torch_copy_async requires tensors with the same number of dimensions");

  // Get tensor metadata
  int64_t* self_strides;
  int64_t* src_strides;
  ET_CHECK_OK_OR_RETURN_ERROR(aoti_torch_get_strides(self, &self_strides));
  ET_CHECK_OK_OR_RETURN_ERROR(aoti_torch_get_strides(src, &src_strides));

  // Check if tensors have the same strides (required for async copy)
  bool same_strides = true;
  for (int i = 0; i < self->dim(); i++) {
    if (self_strides[i] != src_strides[i]) {
      same_strides = false;
      break;
    }
  }

  ET_CHECK_OR_RETURN_ERROR(
      same_strides,
      InvalidArgument,
      "aoti_torch_copy_async requires tensors with same strides. Use aoti_torch_copy_ for non-contiguous tensors");

  // Determine device locations
  cudaPointerAttributes srcAttributes{};
  cudaPointerAttributes dstAttributes{};

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&srcAttributes, src->data_ptr()));

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&dstAttributes, self->data_ptr()));

  bool srcIsDevice = srcAttributes.type == cudaMemoryTypeDevice;
  bool dstIsDevice = dstAttributes.type == cudaMemoryTypeDevice;

  size_t total_bytes = src->nbytes();

  // Determine copy direction and perform async copy
  if (srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToDevice,
        stream));
  } else if (srcIsDevice && !dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToHost,
        stream));
  } else if (!srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyHostToDevice,
        stream));
  } else {
    // Host to host - use regular memcpy (no async benefit)
    std::memcpy(self->mutable_data_ptr(), src->data_ptr(), total_bytes);
  }

  return Error::Ok;
}

Copilot AI Dec 2, 2025

The new aoti_torch_copy_async function lacks test coverage. Given that aoti_torch_copy_ has comprehensive test coverage in test_aoti_torch_copy_.cpp, the async variant should have similar tests covering:

  • Basic async copy functionality with stream synchronization
  • Dimension mismatch validation
  • Stride mismatch validation
  • Different device location combinations (H2D, D2H, D2D, H2H)
  • Error cases (null pointers, dtype mismatch, etc.)

Consider adding a new test file test_aoti_torch_copy_async.cpp following the pattern of existing tests.

