[Zero-size] fix embedding_with_scaled_gradient_grad CUDA error(9) on 0-size input #78243
Open
DanielSun11 wants to merge 2 commits into PaddlePaddle:develop
…ize input When the input tensor has 0 elements (e.g. shape [0, 3]), `input_.numel()` returns K=0. The CountFreqKernel is then launched with GET_BLOCKS(0)=0 grid blocks, which violates the CUDA requirement that gridDim.x >= 1, causing CUDA error(9) (invalid configuration). Fix: add an early return after the cudaMemsetAsync initialization when K==0. The weight grad tensor is already correctly zeroed by the memset, so no further kernel launches are needed.
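The launch arithmetic in this commit message can be modeled on the host without a GPU. The sketch below assumes GET_BLOCKS is a plain ceiling division by a fixed thread count (the message only states that GET_BLOCKS(0) is 0). The names `NUM_THREADS`, `get_blocks`, `is_valid_grid`, and `backward_launches` are hypothetical and exist only for illustration; they are not Paddle APIs.

```python
# Host-side model of the launch configuration; no GPU is required.
# ASSUMPTION: threads per block is 512; the real constant may differ.
NUM_THREADS = 512


def get_blocks(n: int, threads: int = NUM_THREADS) -> int:
    """Ceiling division: number of blocks needed to cover n elements.

    ASSUMPTION: this is only a stand-in for GET_BLOCKS.
    """
    return (n + threads - 1) // threads


def is_valid_grid(grid_x: int) -> bool:
    """A launch needs at least one block in the x dimension."""
    return grid_x >= 1


def backward_launches(k: int) -> list:
    """Return the grid sizes that would be launched for k input elements.

    Models the guard from the commit message: when k == 0 the function
    returns before any kernel launch, because the gradient buffer has
    already been zeroed.
    """
    if k == 0:
        return []  # early return: nothing to launch
    return [get_blocks(k)]
```

With this model, `get_blocks(0)` is 0 and fails `is_valid_grid`, while `backward_launches(0)` launches nothing and every non-empty input gets a grid of at least one block.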
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
…ze input Add TestEmbeddingScaleGradByFreqZeroSize to cover the bug fix where CountFreqKernel was launched with 0 grid blocks (CUDA error(9)) when input has 0 elements. Tests verify:
- dygraph: shapes [0], [0,3] int32/int64, [2,0], [6,0] with various padding_idx
- static: shape [0,3] int64
- output shape is correct (0-size preserved)
- weight grad is all-zeros with correct shape
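The expectation that the weight gradient is all zeros for an empty input can be checked against a small reference model. The sketch below is not the test added in this PR. It is a pure-Python illustration that assumes the usual meaning of scale_grad_by_freq: each row's accumulated gradient is divided by the number of times its index occurs in the input, and rows equal to padding_idx receive no gradient. The function name and its parameters are hypothetical.

```python
def embedding_grad_scaled_by_freq(indices, grad_out, num_rows, dim,
                                  padding_idx=None):
    """Reference weight gradient for a flattened index list.

    indices:  flat list of K row ids (K may be 0)
    grad_out: list of K rows, each a list of `dim` floats
    ASSUMPTION: gradients are scaled by 1 / (frequency of the index).
    """
    # Count how often each row id appears in the input.
    counts = [0] * num_rows
    for idx in indices:
        counts[idx] += 1

    # Start from zeros; this plays the role of the memset on the GPU.
    grad_w = [[0.0] * dim for _ in range(num_rows)]

    # Accumulate scaled gradients; with K == 0 this loop never runs.
    for idx, row in zip(indices, grad_out):
        if idx == padding_idx:
            continue
        for j in range(dim):
            grad_w[idx][j] += row[j] / counts[idx]
    return grad_w
```

For any of the tested shapes the element count is 0, so the index list is empty and the result is a `num_rows` by `dim` table of zeros, which matches the "all-zeros with correct shape" check.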
PR Category
Operator Mechanism
PR Types
Bug fixes
Description
Background
When paddle.nn.functional.embedding is called with an input x that has a 0-size dimension (e.g. shape [0, 3]) and scale_grad_by_freq=True, the backward pass triggers CUDA error(9): invalid configuration argument, and the program crashes.
Root cause
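In paddle/phi/kernels/gpu/embedding_with_scaled_gradient_grad_kernel.cu, EmbeddingWithScaledGradientGradCUDAFunctor::apply() uses GET_BLOCKS(K) to compute the grid size of the CUDA kernel, where K = input_.numel().
When the input is a 0-size tensor, K = 0, so GET_BLOCKS(K) is also 0. Launching CountFreqKernel with 0 blocks violates CUDA's requirement that gridDim.x >= 1 and triggers error(9).
The forward kernel (EmbeddingKernel) uses a fixed gridx = 2 * GetSMCount(), which does not depend on K, so the forward pass works. Only CountFreqKernel in the backward pass has this problem.
Fix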
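In apply(), immediately after cudaMemsetAsync initializes d_table to 0, check whether K == 0 and return early.
When K == 0, d_table (the weight gradient) has already been zeroed by cudaMemsetAsync. That is exactly the correct gradient for an empty input, so no kernel needs to be launched.
Test verification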
All 289 historical 0-size embedding cases in monitoring_configs: all PASS, no regressions.
Does this change numerical precision?
No