
[Zero-size] fix embedding_with_scaled_gradient_grad CUDA error(9) on 0-size input #78243

Open
DanielSun11 wants to merge 2 commits into PaddlePaddle:develop from DanielSun11:fix/embedding-0size-scale-grad-by-freq

Conversation

@DanielSun11
Contributor

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

Problem Background

When paddle.nn.functional.embedding is called with an input x that contains a 0-size dimension (e.g. shape [0, 3]) and scale_grad_by_freq=True, the backward pass triggers CUDA error(9): invalid configuration argument, crashing the program.

import paddle

x = paddle.zeros([0, 3], dtype='int32')
weight = paddle.rand([10, 4], dtype='float32')
weight.stop_gradient = False
out = paddle.nn.functional.embedding(x, weight, padding_idx=5, scale_grad_by_freq=True)
out.sum().backward()  # CUDA error(9)!

Root Cause

In paddle/phi/kernels/gpu/embedding_with_scaled_gradient_grad_kernel.cu, EmbeddingWithScaledGradientGradCUDAFunctor::apply() computes the CUDA kernel grid size with GET_BLOCKS(K), where K = input_.numel().

When the input is a 0-size tensor, K = 0, and:

GET_BLOCKS(0) = (0 + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS
             = (0 + 511) / 512 = 0

Launching CountFreqKernel with 0 blocks violates CUDA's requirement that gridDim.x >= 1, triggering error(9).
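The integer arithmetic above is easy to check in isolation. A minimal Python mirror of the GET_BLOCKS ceiling-division macro (assuming PADDLE_CUDA_NUM_THREADS = 512, as the 511 in the expansion above implies):

```python
# Python mirror of Paddle's GET_BLOCKS macro; PADDLE_CUDA_NUM_THREADS = 512
# is inferred from the "(0 + 511) / 512" expansion in the description above.
PADDLE_CUDA_NUM_THREADS = 512

def get_blocks(n: int) -> int:
    # Ceiling division: how many blocks of 512 threads cover n elements.
    return (n + PADDLE_CUDA_NUM_THREADS - 1) // PADDLE_CUDA_NUM_THREADS

print(get_blocks(0))    # 0  -> an invalid grid size for a kernel launch
print(get_blocks(1))    # 1
print(get_blocks(512))  # 1
print(get_blocks(513))  # 2
```

For every positive n the macro behaves as intended; n = 0 is the only input that yields a zero grid, which is exactly the 0-size-tensor case.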

The forward kernel (EmbeddingKernel) uses a fixed gridx = 2 * GetSMCount(), which does not depend on K, so the forward pass works; only the backward CountFreqKernel has this problem.
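The contrast between the two launch strategies can be sketched as follows; sm_count is a hypothetical stand-in for GetSMCount(), which is always at least 1 on a real GPU:

```python
# Sketch of the two grid-size strategies described above (not Paddle's
# actual code). sm_count stands in for GetSMCount() and is >= 1 on any GPU.
def forward_gridx(sm_count: int) -> int:
    # Forward EmbeddingKernel: grid size is hardware-derived, never zero.
    return 2 * sm_count

def backward_gridx(k: int) -> int:
    # Backward CountFreqKernel: grid size is GET_BLOCKS(K), zero when K == 0.
    return (k + 511) // 512

assert forward_gridx(80) >= 1   # valid launch regardless of input size
assert backward_gridx(0) == 0   # invalid: CUDA requires gridDim.x >= 1
```

This is why the crash only surfaces in backward: the forward launch configuration is independent of the input element count.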

Fix

In apply(), immediately after the cudaMemsetAsync that zeroes d_table, check for K == 0 and return early:

// When input has 0 elements, d_table is already correctly zeroed.
// Skip all kernel launches to avoid CUDA error(9) from GET_BLOCKS(0)==0.
if (K == 0) return;

When K == 0, d_table (the weight gradient) has already been correctly zeroed by cudaMemsetAsync, and all-zeros is exactly the correct gradient for an empty input, so no kernel launches are needed.
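The control flow of the fixed apply() can be modeled in pure Python; this is an illustrative sketch (apply_backward is a made-up name, and the real kernel launches are elided), not Paddle's implementation:

```python
# Pure-Python model of the fixed apply() control flow. The grad buffer is
# zeroed first (mirroring cudaMemsetAsync), then all kernel launches are
# skipped when the input has zero elements.
def apply_backward(num_input_elems: int, weight_shape):
    rows, cols = weight_shape
    d_table = [[0.0] * cols for _ in range(rows)]  # "memset" to zero
    if num_input_elems == 0:
        # Early return: zeros are already the correct gradient for an
        # empty input, and launching kernels here would be error(9).
        return d_table
    raise NotImplementedError("CountFreqKernel etc. elided in this sketch")

grad = apply_backward(0, (10, 4))  # 10x4 all-zeros, matching the weight shape
```

The key property is that the early return happens after the zero-fill, so the returned gradient is well-defined without touching any launch path.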

Test Verification

  • The original cases that directly trigger the bug (9 different parameter combinations): all PASS
  • All 289 historical 0-size embedding cases in monitoring_configs: all PASS, no regressions

Does This Change Numerical Results

…ize input

When the input tensor has 0 elements (e.g. shape [0, 3]),
`input_.numel()` returns K=0. The CountFreqKernel is then launched
with GET_BLOCKS(0)=0 grid blocks, which violates the CUDA requirement
that gridDim.x >= 1, causing CUDA error(9) (invalid configuration).

Fix: add an early return after the cudaMemsetAsync initialization when
K==0. The weight grad tensor is already correctly zeroed by the memset,
so no further kernel launches are needed.
@paddle-bot

paddle-bot bot commented Mar 10, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

…ze input

Add TestEmbeddingScaleGradByFreqZeroSize to cover the bug fix where
CountFreqKernel was launched with 0 grid blocks (CUDA error(9)) when
input has 0 elements. Tests verify:
- dygraph: shapes [0], [0,3] int32/int64, [2,0], [6,0] with various padding_idx
- static: shape [0,3] int64
- output shape is correct (0-size preserved)
- weight grad is all-zeros with correct shape
