Conversation

@catan2001

Description

Testing llama.cpp with the Llama 3.2 model on an RX 6700XT caused a floating point exception (SIGFPE) when launching the FlashAttention kernel (fattn_kernel):

Thread 1 "llama-cli" received signal SIGFPE, Arithmetic exception. 
0x000079c19ca40223 in void launch_fattn<64, 16, 1>(ggml_backend_cuda_context&, ggml_tensor*, void (*)(char const*, char const*, char const*, char const*, char const*, int const*, float*, HIP_vector_type<float, 2u>*, float, float, float, float, unsigned int, float, int, int, int, int, int, int, int, int, int, int, int, int, int, long, int, int, long, int, int, int, int, int, long), int, unsigned long, int, bool, bool, bool, int) () 
from /therock/src/external-builds/llama.cpp/llama.cpp/build/bin/libggml-hip.so

Technical Explanation

The issue occurs because cudaOccupancyMaxActiveBlocksPerMultiprocessor can return 0 for max_blocks_per_sm when the kernel's shared memory or register usage is too high to fit even one resident block. This value is later used as a divisor:

const int max_blocks = max_blocks_per_sm * nsm;
const int tiles_nwaves = (ntiles_total + max_blocks - 1) / max_blocks;

When max_blocks_per_sm is 0, max_blocks is also 0, so the integer division by max_blocks faults and raises the FPE.

Fix:

Add a safeguard to ensure at least one block is launched:

max_blocks_per_sm = std::max(max_blocks_per_sm, 1);

Some kernel configurations can produce zero occupancy on certain
GPUs (example: RX 6700XT). This adds a safeguard to ensure at least
one block is launched, preventing floating point exception.
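To make the guard concrete, here is a minimal, self-contained sketch of the affected computation with the safeguard applied. The function name, parameter values, and surrounding structure are illustrative stand-ins, not the exact llama.cpp code; in the real kernel launcher, max_blocks_per_sm comes from cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative stand-in for the grid-size computation in launch_fattn.
// max_blocks_per_sm would come from cudaOccupancyMaxActiveBlocksPerMultiprocessor,
// which can return 0 when the kernel's shared-memory or register demand
// leaves no room for even a single resident block on an SM.
int compute_tiles_nwaves(int max_blocks_per_sm, int nsm, int ntiles_total) {
    // Safeguard: ensure the divisor below can never be 0.
    max_blocks_per_sm = std::max(max_blocks_per_sm, 1);

    const int max_blocks   = max_blocks_per_sm * nsm;
    // Ceiling division: number of waves needed to cover all tiles.
    const int tiles_nwaves = (ntiles_total + max_blocks - 1) / max_blocks;
    return tiles_nwaves;
}
```

For example, with a reported occupancy of 0, 40 SMs (the RX 6700XT has 40 CUs), and 100 tiles, the clamp turns the divisor into 40 and the function returns 3 waves instead of faulting; with an occupancy of 2 it returns 2 waves, unchanged from the unguarded code.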

Co-authored-by: Attila Dusnoki <[email protected]>
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 17, 2025
@JohannesGaessler
Collaborator

This seems like the wrong fix. Under which circumstances is an occupancy of 0 returned?

@catan2001
Author

catan2001 commented Oct 17, 2025

This seems like the wrong fix. Under which circumstances is an occupancy of 0 returned?

@JohannesGaessler Hi, sorry if my initial description wasn't clear enough. This happens when I run llama-cli with the Llama 3.2 3B model. Specifically, the error is caused by max_blocks_per_sm being set to 0. This was on an AMD RX 6700XT.

Here is a small debug log:

FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 2
FATTN: max_blocks_per_sm = 0    <- causes the FPE

@JohannesGaessler
Collaborator

I don't understand why this is happening. According to the documentation GFX 1030 (which I use for testing) and GFX 1031 have the same amount of SRAM and registers so I would expect them to be able to achieve the same occupancy. I am not seeing any warnings about failing to meet occupancy targets in the compilation log, both for GFX 1030 and for GFX 1031. Please provide me with the exact commands you used to compile and run llama.cpp. It would also be helpful if you could provide me with the values for nwarps and nbytes_shared for the failing case.

@catan2001
Author

I don't understand why this is happening. According to the documentation GFX 1030 (which I use for testing) and GFX 1031 have the same amount of SRAM and registers so I would expect them to be able to achieve the same occupancy. I am not seeing any warnings about failing to meet occupancy targets in the compilation log, both for GFX 1030 and for GFX 1031. Please provide me with the exact commands you used to compile and run llama.cpp. It would also be helpful if you could provide me with the values for nwarps and nbytes_shared for the failing case.

Sorry for the delay. I'll go ahead and close the PR, as I've identified the cause: it was related to using TheRock as the build environment for ROCm. After testing with prebuilt ROCm versions 6.4.1 and 7.0.2, everything works correctly without triggering the floating point exception.
