Skip to content

Conversation

@gn64
Copy link
Contributor

@gn64 gn64 commented Dec 2, 2024

Fix division by zero error in soft_max vulkan shader

This PR fixes #2596 by adding a check for p.KY being zero in the soft_max compute shader.

Changes:

  • Modified the calculation of rowy in soft_max.comp to handle the case where p.KY is 0
  • Changed const uint rowy = rowx % p.KY; to const uint rowy = (p.KY > 0) ? (rowx % p.KY) : 0;

The original code would cause a division by zero error when p.KY is 0. This fix ensures that rowy is set to 0 in such cases, preventing the crash while maintaining the expected behavior in normal scenarios.

Testing:

  • Confirmed the fix resolves the crash in my environment
  • Existing functionality remains unchanged when p.KY > 0

gn64 added 3 commits December 3, 2024 06:17
This change prevents a division by zero error when p.KY is 0.
@slaren
Copy link
Member

slaren commented Dec 15, 2024

The Vulkan fix looks correct, this can happen when using ggml_soft_max. @0cc4m please take a look.

@0cc4m
Copy link
Contributor

0cc4m commented Dec 15, 2024

Looks good, but what are the other two changes?

@ggerganov
Copy link
Member

@gn64 Please update the PR to include just the Vulkan fix.

@0cc4m In which cases p.KY == 0? We should add a test to test-backend-ops to cover these.

@gn64
Copy link
Contributor Author

gn64 commented Dec 16, 2024

Closing this PR in favor of a more focused fix in #2633.
The new PR contains only the Vulkan shader division by zero fix, making the changes easier to review and maintain.

@gn64 gn64 closed this Dec 16, 2024
@ggerganov
Copy link
Member

Thank you. I'm still interested in which cases p.KY == 0 - I'm not able to deduce by looking at the code.

@slaren
Copy link
Member

slaren commented Dec 16, 2024

I believe it happens when using using a NULL mask, which happens when using the plain ggml_soft_max.

    ggml_vk_op_f32<vk_op_soft_max_push_constants>(ctx, subctx, src0, src1, nullptr, dst, GGML_OP_SOFT_MAX, {
        ncols,
        src1 != nullptr ? nrows_y : (uint32_t)0, // this is KY
        scale, max_bias,
        m0, m1,
        n_head_log2,
        nrows_x,
    }, dryrun);

@ggerganov
Copy link
Member

That's what I thought as well, but we do have tests in test-backend-ops without a mask, so I am not sure why they are passing.

@slaren
Copy link
Member

slaren commented Dec 16, 2024

Division by zero is likely just undefined behavior, and in some GPUs it may just return zero. Still, can't rely on that behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid probability vector error with AMD iGPU on Vulkan backend Environment

4 participants