[WIP][CUDA backend]: Async copy between host<->device #16053
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16053
Note: Links to docs will display an error until the docs builds have been completed.
❌ 12 New Failures, 12 Pending as of commit 743dec9 with merge base 33ec615.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This work-in-progress PR introduces asynchronous memory copying between host and device in the CUDA backend to improve performance. The implementation adds a new aoti_torch_copy_async function that uses CUDA streams for non-blocking memory transfers, replacing synchronous aoti_torch_copy_ calls in the execution pipeline.
Key changes:
- Added the aoti_torch_copy_async API with stream-based async memory transfers
- Integrated async copies into the CUDA backend execution flow with proper stream synchronization
- Added comprehensive documentation for the new async copy function
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backends/cuda/runtime/shims/memory.h | Added function declaration and documentation for aoti_torch_copy_async |
| backends/cuda/runtime/shims/memory.cpp | Implemented aoti_torch_copy_async with validation, device detection, and async CUDA memory operations |
| backends/cuda/runtime/cuda_backend.cpp | Integrated async copy for H2D and D2H transfers in the execution pipeline with stream synchronization |
AOTITorchError
aoti_torch_copy_async(Tensor* self, Tensor* src, cudaStream_t stream) {
  // Check for null pointers first
  ET_CHECK_OR_RETURN_ERROR(
      self != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: self tensor is null");

  ET_CHECK_OR_RETURN_ERROR(
      src != nullptr,
      InvalidArgument,
      "aoti_torch_copy_async failed: src tensor is null");

  // Get dtype information and validate compatibility
  int32_t self_dtype, src_dtype;
  aoti_torch_get_dtype(self, &self_dtype);
  aoti_torch_get_dtype(src, &src_dtype);

  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(self_dtype));
  ET_CHECK_OK_OR_RETURN_ERROR(validate_dtype(src_dtype));

  // Check dtype compatibility - both tensors must have the same dtype
  ET_CHECK_OR_RETURN_ERROR(
      self_dtype == src_dtype,
      InvalidArgument,
      "dtype mismatch. self.dtype=%d, src.dtype=%d. aoti_torch_copy_async requires same dtypes",
      self_dtype,
      src_dtype);

  // Check total number of elements compatibility
  int64_t self_numel = self->numel();
  int64_t src_numel = src->numel();

  ET_CHECK_OR_RETURN_ERROR(
      self_numel == src_numel,
      InvalidArgument,
      "numel mismatch. self.numel()=%ld, src.numel()=%ld",
      self_numel,
      src_numel);

  // Get tensor metadata
  int64_t* self_strides;
  int64_t* src_strides;
  aoti_torch_get_strides(self, &self_strides);
  aoti_torch_get_strides(src, &src_strides);

  // Check if tensors have the same strides (required for async copy)
  bool same_strides = true;
  for (int i = 0; i < self->dim(); i++) {
    if (self_strides[i] != src_strides[i]) {
      same_strides = false;
      break;
    }
  }

  ET_CHECK_OR_RETURN_ERROR(
      same_strides,
      InvalidArgument,
      "aoti_torch_copy_async requires tensors with same strides. Use aoti_torch_copy_ for non-contiguous tensors");

  // Determine device locations
  cudaPointerAttributes srcAttributes{};
  cudaPointerAttributes dstAttributes{};

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&srcAttributes, src->data_ptr()));

  ET_CUDA_CHECK_OR_RETURN_ERROR(
      cudaPointerGetAttributes(&dstAttributes, self->data_ptr()));

  bool srcIsDevice = srcAttributes.type == cudaMemoryTypeDevice;
  bool dstIsDevice = dstAttributes.type == cudaMemoryTypeDevice;

  size_t total_bytes = src->nbytes();

  // Determine copy direction and perform async copy
  if (srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToDevice,
        stream));
  } else if (srcIsDevice && !dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyDeviceToHost,
        stream));
  } else if (!srcIsDevice && dstIsDevice) {
    ET_CUDA_CHECK_OR_RETURN_ERROR(cudaMemcpyAsync(
        self->mutable_data_ptr(),
        src->data_ptr(),
        total_bytes,
        cudaMemcpyHostToDevice,
        stream));
  } else {
    // Host to host - use regular memcpy (no async benefit)
    std::memcpy(self->mutable_data_ptr(), src->data_ptr(), total_bytes);
  }

  return Error::Ok;
}
Copilot AI · Dec 2, 2025
The new aoti_torch_copy_async function lacks test coverage. Given that aoti_torch_copy_ has comprehensive test coverage in test_aoti_torch_copy_.cpp, the async variant should have similar tests covering:
- Basic async copy functionality with stream synchronization
- Dimension mismatch validation
- Stride mismatch validation
- Different device location combinations (H2D, D2H, D2D, H2H)
- Error cases (null pointers, dtype mismatch, etc.)
Consider adding a new test file test_aoti_torch_copy_async.cpp following the pattern of existing tests.
Compare: ae46c63 to 743dec9
No description provided.