Issue/900: CUDA and CUDA-like embedding #902
base: main
Conversation
…graph recording
- Ensure embedding tensors are on the same device; change format.
- Optimize embedding kernel with vectorized memory access and `__ldg`
- Add vectorized memory access using `float4`/`float2`, `half2`, and `bfloat162` (sketched below)
- Use the `__ldg` instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add `__restrict__` keywords for better compiler optimization
- Implement dynamic block size selection based on `embedding_dim`
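As a rough illustration of the vectorized-copy idea in the commit message above, here is a minimal sketch of a `float4` row copy with `__ldg`. The helper name `copyVectorizedFloat4` and its exact signature are assumptions for illustration; only the techniques (`float4` loads, `__ldg`, `__restrict__`, alignment-gated paths) come from the commit message and the excerpts quoted later in this review.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: copy one embedding row using 16-byte float4 loads.
// Assumes dst and src are 16-byte aligned; the PR gates such paths behind
// explicit alignment checks.
template <typename IndexType>
__device__ void copyVectorizedFloat4(float *__restrict__ dst,
                                     const float *__restrict__ src,
                                     IndexType embedding_dim) {
    const IndexType vec_len = embedding_dim / 4; // number of float4 chunks
    const float4 *src4 = reinterpret_cast<const float4 *>(src);
    float4 *dst4 = reinterpret_cast<float4 *>(dst);
    for (IndexType i = threadIdx.x; i < vec_len; i += blockDim.x) {
        dst4[i] = __ldg(&src4[i]); // read-only cached 16-byte load
    }
    // Scalar tail for embedding_dim values not divisible by 4.
    for (IndexType i = vec_len * 4 + threadIdx.x; i < embedding_dim; i += blockDim.x) {
        dst[i] = __ldg(&src[i]);
    }
}
```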
Pull request overview
This pull request adds device-side embedding support for CUDA and CUDA-like platforms (NVIDIA, Metax, Moore) to enable CUDA Graph recording. The key improvement is removing synchronous CPU transfers that previously prevented graph recording, replacing them with fully asynchronous device-side kernel implementations.
Key changes:
- Implements device-side embedding kernels for NVIDIA, Metax, and Moore platforms with optimized vectorized memory access (see the sketch after this list)
- Removes synchronous `to(cpu_device)` operations from test and production code
- Adds a comprehensive test suite for validating CUDA Graph recording support
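To make the change concrete: the graph-capturable path comes down to performing the index lookup on the device, so no `to(cpu_device)` round trip ever synchronizes the stream. A minimal sketch, assuming one thread block per output row; the kernel name and launch layout here are illustrative, not the PR's exact code.

```cpp
#include <cuda_runtime.h>

// Hypothetical device-side embedding lookup: indices never leave the GPU,
// so the launch is fully asynchronous and safe to capture in a CUDA Graph.
template <typename T, typename IndexType>
__global__ void embeddingLookupKernel(T *__restrict__ out,
                                      const T *__restrict__ weight,
                                      const IndexType *__restrict__ indices,
                                      size_t embedding_dim) {
    const IndexType row = indices[blockIdx.x]; // one block per token
    const T *src = weight + static_cast<size_t>(row) * embedding_dim;
    T *dst = out + static_cast<size_t>(blockIdx.x) * embedding_dim;
    for (size_t i = threadIdx.x; i < embedding_dim; i += blockDim.x) {
        dst[i] = src[i];
    }
}
```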
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| test/infinicore/ops/embedding.py | Removed CPU conversion logic, now supports device-side input directly |
| test/infinicore/nn/embedding.py | Removed CPU conversion logic for testing |
| test/infinicore/nn/test_embedding_graph_recording.py | New comprehensive test suite for CUDA Graph recording validation |
| test/infinicore/nn/HOW_TO_USE_GRAPH_RECORDING_TEST.md | Documentation on how to use and interpret the graph recording tests |
| test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md | Technical comparison of before/after implementation details |
| src/infiniop/ops/embedding/operator.cc | Multi-platform dispatcher for embedding operations |
| src/infiniop/ops/embedding/embedding.h | Common descriptor macro definition |
| src/infiniop/ops/embedding/cpu/embedding_cpu.h | CPU implementation header |
| src/infiniop/ops/embedding/cpu/embedding_cpu.cc | CPU implementation with memcpy-based embedding lookup |
| src/infiniop/ops/embedding/cuda/embedding_kernel.cuh | Shared CUDA kernel helper functions with vectorized memory access |
| src/infiniop/ops/embedding/nvidia/embedding_nvidia.cuh | NVIDIA platform header |
| src/infiniop/ops/embedding/nvidia/embedding_nvidia.cu | NVIDIA CUDA kernel implementation with float4/half2/bfloat162 optimization |
| src/infiniop/ops/embedding/metax/embedding_metax.cuh | Metax platform header |
| src/infiniop/ops/embedding/metax/embedding_metax.maca | Metax MACA kernel implementation |
| src/infiniop/ops/embedding/moore/embedding_moore.h | Moore platform header |
| src/infiniop/ops/embedding/moore/embedding_moore_kernel.h | Moore-specific kernel helpers |
| src/infiniop/ops/embedding/moore/embedding_moore.mu | Moore MUSA kernel implementation |
| src/infinicore/ops/embedding/embedding_infiniop.cc | InfiniOP wrapper with descriptor caching |
| src/infinicore/ops/embedding/embedding.cc | Op dispatcher and device synchronization logic |
| src/infinicore/nn/embedding.cc | Simplified forward method using new device-side ops |
| python/infinicore/nn/functional/embedding.py | Removed CPU-only assertion to support device inputs |
| include/infiniop/ops/embedding.h | Public API for embedding operations |
| include/infiniop.h | Added embedding.h include |
| include/infinicore/ops/embedding.hpp | Added Embedding class with dispatcher pattern |
| include/infinicore/ops.hpp | Added embedding.hpp include |
```cpp
} else if constexpr (std::is_same_v<T, cuda_bfloat16>) {
    // Use bfloat162 for vectorized access
    if (embedding_dim >= 2 && embedding_dim % 2 == 0) {
        copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim);
    } else {
        copyScalar<T, IndexType>(dst, src, embedding_dim);
    }
```
Copilot AI commented on Jan 9, 2026:
The `embeddingKernel` template checks for the `cuda_bfloat16` type on line 48, but this is Metax platform code, which should use `__hpcc_bfloat16` instead. This type mismatch means the vectorized bfloat16 path (lines 48-54) will never execute on the Metax platform. The kernel should check for the Metax-specific type `__hpcc_bfloat16` to enable proper vectorization.
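A minimal sketch of the fix this comment asks for, assuming the Metax toolchain exposes `__hpcc_bfloat16` as its native bfloat16 type (the helper names mirror the excerpt above; the exact guard in the PR may differ):

```cpp
// Branch on the Metax-native bfloat16 type so the vectorized path is
// actually reachable on this platform (cuda_bfloat16 never matches here).
} else if constexpr (std::is_same_v<T, __hpcc_bfloat16>) {
    if (embedding_dim >= 2 && embedding_dim % 2 == 0) {
        copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim);
    } else {
        copyScalar<T, IndexType>(dst, src, embedding_dim);
    }
}
```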
```cpp
} else if (_embedding_dim >= 1024) {
    block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
} else if (_embedding_dim <= 256) {
    block_size = 384; // Medium embedding_dim: balanced configuration
```
Copilot AI commented on Jan 9, 2026:
The block size adjustment logic has an inconsistency. When `_embedding_dim <= 256`, the code sets `block_size = 384` (line 91), but this case only applies if `_embedding_dim` is greater than 64, because line 87 handles `<= 64`. This means `embedding_dim` values between 65 and 256 will use `block_size = 384`, which contradicts the default of 256 that was already set. The logic should be restructured to avoid overlapping conditions and ensure proper block size selection.
Suggested change:

```diff
-} else if (_embedding_dim >= 1024) {
-    block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
-} else if (_embedding_dim <= 256) {
-    block_size = 384; // Medium embedding_dim: balanced configuration
+} else if (_embedding_dim <= 256) {
+    block_size = 384; // Medium embedding_dim: balanced configuration
+} else if (_embedding_dim >= 1024) {
+    block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
```
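Beyond reordering the branches, one way to remove the overlap entirely is a single ascending `if`/`else` chain, so each `_embedding_dim` range selects exactly one block size. A sketch with thresholds taken from the excerpt; the value for the `<= 64` branch is assumed, since that branch is not shown here:

```cpp
int block_size;
if (_embedding_dim <= 64) {
    block_size = 64;   // small rows (value assumed; not shown in the excerpt)
} else if (_embedding_dim <= 256) {
    block_size = 384;  // medium rows: balanced configuration
} else if (_embedding_dim < 1024) {
    block_size = 256;  // the former default for mid-sized rows
} else {
    block_size = 128;  // large rows: reduce register pressure
}
```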
| print("⚠ CUDA 不可用,跳过图录制测试") | ||
| return False | ||
|
|
||
| device = infinicore.device("cuda", 0) |
Copilot AI commented on Jan 9, 2026:
Call to a non-callable of builtin-class module.
| print("⚠ CUDA 不可用,跳过测试") | ||
| return False | ||
|
|
||
| device = infinicore.device("cuda", 0) |
Copilot AI commented on Jan 9, 2026:
Call to a non-callable of builtin-class module.
Force-pushed from 98a0be1 to 7997d89.
Force-pushed from 7997d89 to 2c02d73.
resolves #900
resolves #846
includes #859
Iluvatar (天数):



Metax (沐曦):


Moore Threads (摩尔):
