
[Zero-size] fix embedding_with_scaled_gradient_grad CUDA error(9) on 0-size input #78243

Open
DanielSun11 wants to merge 2 commits into PaddlePaddle:develop from DanielSun11:fix/embedding-0size-scale-grad-by-freq

Conversation

@DanielSun11
Contributor

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

Problem Background

When paddle.nn.functional.embedding is called with an input x that contains a 0-size dimension (e.g. shape [0, 3]) and scale_grad_by_freq=True, the backward pass triggers CUDA error(9): invalid configuration argument, crashing the program.

import paddle

x = paddle.zeros([0, 3], dtype='int32')
weight = paddle.rand([10, 4], dtype='float32')
weight.stop_gradient = False
out = paddle.nn.functional.embedding(x, weight, padding_idx=5, scale_grad_by_freq=True)
out.sum().backward()  # CUDA error(9)!

Root Cause

In paddle/phi/kernels/gpu/embedding_with_scaled_gradient_grad_kernel.cu, EmbeddingWithScaledGradientGradCUDAFunctor::apply() computes the CUDA kernel grid size with GET_BLOCKS(K), where K = input_.numel().

When the input is a 0-size tensor, K = 0, and:

GET_BLOCKS(0) = (0 + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS
             = (0 + 511) / 512 = 0

Launching CountFreqKernel with 0 blocks violates CUDA's requirement that gridDim.x >= 1, triggering error(9).
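The integer arithmetic above is easy to check in isolation. A minimal Python mirror of the GET_BLOCKS ceiling-division macro (assuming PADDLE_CUDA_NUM_THREADS = 512, as the 511 in the expansion above implies):

```python
# Python mirror of Paddle's GET_BLOCKS macro; PADDLE_CUDA_NUM_THREADS = 512
# is inferred from the "(0 + 511) / 512" expansion in the description above.
PADDLE_CUDA_NUM_THREADS = 512

def get_blocks(n: int) -> int:
    # Ceiling division: how many blocks of 512 threads cover n elements.
    return (n + PADDLE_CUDA_NUM_THREADS - 1) // PADDLE_CUDA_NUM_THREADS

print(get_blocks(0))    # 0  -> an invalid grid size for a kernel launch
print(get_blocks(1))    # 1
print(get_blocks(512))  # 1
print(get_blocks(513))  # 2
```

For every positive n the macro behaves as intended; n = 0 is the only input that yields a zero grid, which is exactly the 0-size-tensor case.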

The forward kernel (EmbeddingKernel) uses a fixed gridx = 2 * GetSMCount(), which does not depend on K, so the forward pass works; only the backward CountFreqKernel has this problem.
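The contrast between the two launch strategies can be sketched as follows; sm_count is a hypothetical stand-in for GetSMCount(), which is always at least 1 on a real GPU:

```python
# Sketch of the two grid-size strategies described above (not Paddle's
# actual code). sm_count stands in for GetSMCount() and is >= 1 on any GPU.
def forward_gridx(sm_count: int) -> int:
    # Forward EmbeddingKernel: grid size is hardware-derived, never zero.
    return 2 * sm_count

def backward_gridx(k: int) -> int:
    # Backward CountFreqKernel: grid size is GET_BLOCKS(K), zero when K == 0.
    return (k + 511) // 512

assert forward_gridx(80) >= 1   # valid launch regardless of input size
assert backward_gridx(0) == 0   # invalid: CUDA requires gridDim.x >= 1
```

This is why the crash only surfaces in backward: the forward launch configuration is independent of the input element count.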

Fix

In apply(), immediately after the cudaMemsetAsync that zeroes d_table, check for K == 0 and return early:

// When input has 0 elements, d_table is already correctly zeroed.
// Skip all kernel launches to avoid CUDA error(9) from GET_BLOCKS(0)==0.
if (K == 0) return;

When K == 0, d_table (the weight gradient) has already been correctly zeroed by cudaMemsetAsync, and all-zeros is exactly the correct gradient for an empty input, so no kernel launches are needed.
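The control flow of the fixed apply() can be modeled in pure Python; this is an illustrative sketch (apply_backward is a made-up name, and the real kernel launches are elided), not Paddle's implementation:

```python
# Pure-Python model of the fixed apply() control flow. The grad buffer is
# zeroed first (mirroring cudaMemsetAsync), then all kernel launches are
# skipped when the input has zero elements.
def apply_backward(num_input_elems: int, weight_shape):
    rows, cols = weight_shape
    d_table = [[0.0] * cols for _ in range(rows)]  # "memset" to zero
    if num_input_elems == 0:
        # Early return: zeros are already the correct gradient for an
        # empty input, and launching kernels here would be error(9).
        return d_table
    raise NotImplementedError("CountFreqKernel etc. elided in this sketch")

grad = apply_backward(0, (10, 4))  # 10x4 all-zeros, matching the weight shape
```

The key property is that the early return happens after the zero-fill, so the returned gradient is well-defined without touching any launch path.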

Test Verification

  • The original cases that directly trigger the bug (9 different parameter combinations): all PASS
  • All 289 historical 0-size embedding cases in monitoring_configs: all PASS, no regressions

Does This Change Numerical Results

…ize input

When the input tensor has 0 elements (e.g. shape [0, 3]),
`input_.numel()` returns K=0. The CountFreqKernel is then launched
with GET_BLOCKS(0)=0 grid blocks, which violates the CUDA requirement
that gridDim.x >= 1, causing CUDA error(9) (invalid configuration).

Fix: add an early return after the cudaMemsetAsync initialization when
K==0. The weight grad tensor is already correctly zeroed by the memset,
so no further kernel launches are needed.
@paddle-bot

paddle-bot bot commented Mar 10, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

…ze input

Add TestEmbeddingScaleGradByFreqZeroSize to cover the bug fix where
CountFreqKernel was launched with 0 grid blocks (CUDA error(9)) when
input has 0 elements. Tests verify:
- dygraph: shapes [0], [0,3] int32/int64, [2,0], [6,0] with various padding_idx
- static: shape [0,3] int64
- output shape is correct (0-size preserved)
- weight grad is all-zeros with correct shape
