diff --git a/_collections/_authors/nicolas-miller.markdown b/_collections/_authors/nicolas-miller.markdown new file mode 100644 index 0000000..586737e --- /dev/null +++ b/_collections/_authors/nicolas-miller.markdown @@ -0,0 +1,7 @@ +--- +user_id: 5011213131 +disabled: 0 +title: "Nicolas Miller" +position: Senior Software Engineer +avatar: /assets/images/portal/authors/missing.png +--- diff --git a/_collections/_portal_posts/2025-08-20-gromacs.md b/_collections/_portal_posts/2025-08-20-gromacs.md new file mode 100644 index 0000000..883e6f5 --- /dev/null +++ b/_collections/_portal_posts/2025-08-20-gromacs.md @@ -0,0 +1,995 @@ +--- +category: blogs +date: '2025-08-20T02:00:00.0' +hidden: false +layout: portal/portal-article-view +thumbnail: /assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp +title: 'Tracking down a register allocator bug' +user_id: 5011213131 +--- + +This article describes the investigation of a bug found when running +[GROMACS](https://www.gromacs.org/) testing using SYCL with the [DPC++ +compiler](https://github.com/intel/llvm/) while targeting AMD GPUs. + +This was initially reported in the `intel/llvm` GitHub repository issue-tracker +[intel/llvm#6209](https://github.com/intel/llvm/issues/6209/) and was using a +[specific +version](https://gitlab.com/gromacs/gromacs/-/tree/aa-hwe-release-2022-dpcpp-hip) +of GROMACS with SYCL support. + +This bug will take us from the GROMACS source code all the way down to the +register allocation in the compiler backend. + +## Description of the issue + +When running one of the GROMACS tests built with DPC++ targeting AMD, the test +crashes with the following output: + +```console +Memory access fault by GPU node-4 (Agent handle: 0x215c1f0) on address 0x7ff5f0d6d000. Reason: Page not present or supervisor privilege. +Aborted (core dumped) +``` + +This error indicates that the kernel is accessing invalid memory, for example +this can happen in simple out-of-bounds accesses. + +It is reported that this issue appears on MI50 ([gfx906 +ISA](https://developer.amd.com/wp-content/resources/Vega_7nm_Shader_ISA.pdf)), +and MI200 ([gfx90a +ISA](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf)), +but that it doesn't fail on MI100 ([gfx908 +ISA](https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf)). + +The specific test is being run as follows: + +```console +SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test +``` + +## Finding the kernel + +The GROMACS application contains a lot of kernels so the first step is to figure +out exactly which kernel is causing the issue. + +One easy way to do that is to use the ROCm debug environment variable +`AMD_LOG_LEVEL=4`. We can then find the last `ShaderName` debug output before +the crash: + +``` +:3:rocvirtual.cpp :2738: 6536108370963 us: 24086: [tid:0x7f5e78ab4740] ShaderName : _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE +``` + +Which gives us the name of the last kernel to run on the GPU, which is almost +certainly the kernel that caused the access fault. 
This debug output gives us +the kernel name in its mangled form, but using the `c++filt` tool we can easily +turn it into a more readable format, and so we get: + +``` +c++filt _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE +typeinfo name for NbnxmKernel +``` + +Alternatively we can use the ROCm debugger `rocgdb`, simply running the program +under `rocgdb` after the segfault the debugger will show which kernel it +happened in and also allow us to disassemble the kernel. Using the debugger can +unfortunately only go so far because debug information is not supported in DPC++ +for AMD targets. + +With the debugger we get the following: + +``` +Thread 3 "mdrun-pull-test" received signal SIGBUS, Bus error.↩ +[Switching to thread 3, lane 0 (AMDGPU Lane 4:13:1:1/0 (0,0,0)[0,0,0])]↩ +0x00007fffe5297078 in typeinfo name for NbnxmKernel () from file:///path/to/gromacs/build/lib/libgromacs.so.7 #offset=75955456&size=17664↩ +``` + +Using the `gdb` command `disas` we can then do our first proper analysis step +and look at the assembly dump showing on which instruction the memory access +fault is happening. + +``` +Dump of assembler code for function _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE: +[...] + 0x00007fffe5297064 <+1124>:| global_load_dwordx2 v[14:15], v[5:6], off offset:16 + 0x00007fffe529706c <+1132>:| s_waitcnt lgkmcnt(0) + 0x00007fffe5297070 <+1136>:| v_mov_b32_e32 v10, s41 + 0x00007fffe5297074 <+1140>:| s_mov_b32 s64, s80 +=> 0x00007fffe5297078 <+1144>:| s_waitcnt vmcnt(0) +[...] +``` + +The arrow indicates that the error is happening on the `s_waitcnt vmcnt(0)`. +This instruction waits for all the memory operations using vector registers to +be completed. Therefore the likely culprit in this case is actually the +`global_load_dwordx2` instruction above. + +At this point we know that the issue happens in the `NbnxmKernel` and from the +disassembly it seems likely to be caused by a load instruction, this is +consistent with the memory access fault issue we're seeing, the address used for +this load instruction must be incorrect in some way. + +## Comparing the assembly + +We know that the application works on `gfx908` but not on `gfx906` so a first +step can be to compare the assembly generated by both and see if any significant +differences can be identified. + +To obtain the assembly we can use the environment variable `SYCL_DUMP_IMAGES=1`, +this is a DPC++ environment variable which will dump all the available kernels +images when running an application. GROMACS is a fairly large application so we +end up with 309 `.bin` images. + +We can then use `grep` to search for the mangled kernel name we got out of +the debugger: + +``` +% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *.bin +Binary file sycl_amdgcn142.bin matches +``` + +The `.bin` files are HIP fat binaries, we can then extract the actual device +object file from them with: + +``` +clang-offload-bundler --unbundle --type=o --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --input=sycl_amdgcn142.bin --output=sycl_amdgcn142.bin.o +``` + +And then disassemble the device objects with: + +``` +llvm-objdump -d sycl_amdgcn142.bin.o &> sycl_amdgcn142.s +``` + +Doing that for both architectures we end up with the assembly for both and we +can then use a diff tool to compare them. 
+
+Unfortunately at this point the differences between the two are very
+significant, and it's very difficult to track down exactly where the offending
+instruction is in the `gfx908` assembly, so more work will be required to make
+use of this and we'll come back to it later.
+
+## Tracking down the issue in the source
+
+Knowing the kernel name we can find the source for it in GROMACS:
+
+```
+src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h
+```
+
+* https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L663
+
+Unfortunately for us this kernel is quite large and complicated: roughly 400
+lines of code with loops and many conditionals. Our debugging options are also
+very limited; as mentioned above, source-level debugging is not available, and
+neither is `printf`! Both of these are available in the regular AMD toolchain
+but were not yet enabled in DPC++ for AMD at the time of this investigation.
+
+So as a way to get source-level information on where this bug is happening we
+used the horrible snippet of code below:
+
+```cpp
+volatile int* crash = nullptr;
+*crash++;
+```
+
+Adding this to the body of a kernel causes it to crash with a memory access
+fault nearly identical to the one caused by our bug, but since the error
+message prints the address of the faulting access, it will always show `0` if
+this piece of code caused it. This tells us whether `*crash++` ran before or
+after the offending line of code, which means that by moving `*crash++` around
+in the kernel and looking at the resulting error we can eventually track down
+which line of code is causing the issue.
+
+After a lot of trial and error the [offending
+line](https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L793/)
+in the kernel source was identified to be:
+
+```
+unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;
+```
+
+This matches what we were expecting, since this line loads from both
+`a_plistCJ4` and then from `imei`, so one of these two loads must be triggering
+the error.
+
+In addition, we also observed that the error never happens on the first
+iteration of the loop this line is in; this information will become important
+later on.
+
+At this stage the obvious next step is to check whether `j4` or `imeiIdx` are
+out of bounds for what they're indexing. A cursory look through the code didn't
+spot anything obvious that would suggest these indices may be wrong. By that
+point, even though we still didn't have access to a proper `printf`, we had
+figured out that it was possible to use the lower level `printf` primitives
+provided by the HIP toolchain from within the kernel, so we were able to try
+printing these indices:
+
+```
+long msg = __ockl_fprintf_stdout_begin();
+msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0);
+msg = __ockl_fprintf_append_args(msg, 2, j4, imeiIdx, 0, 0, 0, 0, 1);
+```
+
+Now, surprisingly, after adding this bit of code before the offending line the
+program started working correctly! This is usually indicative that whatever we
+did to the code caused the compiler to take a different path and avoided
+triggering the issue.
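+
+As an aside, these primitives are awkward to call inline, so it can be
+convenient to wrap them in a small helper. The declarations below are
+assumptions inferred purely from how the calls are used above (the real
+prototypes live in the ROCm device library and may differ), so treat this as
+an illustrative sketch rather than the official API:
+
+```cpp
+#include <cstdint>
+
+// Assumed prototypes for the ROCm device-library printf primitives, written
+// to match the way they are called in the snippet above.
+extern "C" uint64_t __ockl_fprintf_stdout_begin();
+extern "C" uint64_t __ockl_fprintf_append_string_n(uint64_t msg, const char *s,
+                                                   uint64_t len,
+                                                   uint32_t isLast);
+extern "C" uint64_t __ockl_fprintf_append_args(uint64_t msg, uint32_t numArgs,
+                                               uint64_t a0, uint64_t a1,
+                                               uint64_t a2, uint64_t a3,
+                                               uint64_t a4, uint64_t a5,
+                                               uint32_t isLast);
+
+// Print two unsigned values from device code, mirroring the snippet above.
+inline void debugPrintPair(uint64_t a, uint64_t b) {
+  uint64_t msg = __ockl_fprintf_stdout_begin();
+  msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0);
+  __ockl_fprintf_append_args(msg, 2, a, b, 0, 0, 0, 0, 1);
+}
+```
+
+In our case, simply adding the printing calls was enough to change the
+generated code and make the crash disappear.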
+ +Digging a little further around that idea, we found out that printing just +`imeiIdx` fixed the issue, and furthermore marking `imeiIdx` as `volatile` also +fixed the problem, and finally since we also know that the first iteration is +always correct, we also ended up figuring out that disabling loop unrolling with +`-fno-unroll-loops` also fixes the issue. + +So we now have a lot more information about what's going on and multiple ways of +avoiding the issue, so we're almost ready to dive back in the assembly to see +what we can figure out. But before we do, one of the helpful thing we can do is +to surround the offending line with barriers: + +``` +itemIdx.barrier(fence_space::local_space); +unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask; +itemIdx.barrier(fence_space::local_space); +``` + +Barriers are helpful because they're lowered to recognizable `s_barrier` +instructions and they limit how the compiler can re-organize the assembly around +them which makes it a lot easier to identify in the assembly where this line of +source code is. And luckily for us adding them doesn't fix the issue we're +seeing, so we can use them as a handy marker. + +## Diving back into the assembly + +Now we can look again at comparing assembly between all the different cases +we've identified, and see if we can spot what's incorrect: + +Assembly for `gfx906` with the extra barriers and no other modification, the +offending instruction is the `global_load_dword` at the bottom. The generated +assembly looks different than what we originally got in `rocgdb` but it still +triggers the error: + +``` +s_cbranch_scc1 2000 // 0000000020AC: BF8507D0 <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x23f0> +s_ashr_i32 s43, s42, 31 // 0000000020B0: 902B9F2A +s_lshl_b64 s[48:49], s[42:43], 5 // 0000000020B4: 8EB0852A +s_add_u32 s16, s40, s48 // 0000000020B8: 80103028 +v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083 +s_addc_u32 s17, s41, s49 // 0000000020C4: 82113129 +s_waitcnt lgkmcnt(0) // 0000000020C8: BF8CC07F +v_mov_b32_e32 v6, s17 // 0000000020CC: 7E0C0211 +v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10 +v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000 +s_barrier // 0000000020E4: BF8A0000 +global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000 +s_barrier +``` + +Assembly for `gfx906` with `imeiIdx` marked as `volatile` (test passing): + +``` +s_cbranch_scc1 2012 // 000000002114: BF8507DC <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x2488> +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +s_barrier // 00000000211C: BF8A0000 +buffer_load_dword v8, off, s[0:3], s33 offset:16 // 000000002120: E0500010 21000800 +s_nop 0 // 000000002128: BF800000 +buffer_load_dword v9, off, s[0:3], s33 offset:20 // 00000000212C: E0500014 21000900 +s_ashr_i32 s47, s46, 31 // 000000002134: 902F9F2E +s_lshl_b64 s[52:53], s[46:47], 5 // 000000002138: 8EB4852E +s_add_u32 s20, s44, s52 // 00000000213C: 8014342C +s_addc_u32 s21, s45, s53 // 000000002140: 8215352D +s_add_u32 s24, s20, 16 // 000000002144: 80189014 +s_addc_u32 s25, s21, 0 // 000000002148: 82198015 +v_mov_b32_e32 v7, s25 // 00000000214C: 7E0E0219 +s_mov_b32 s58, s81 // 000000002150: BEBA0051 +s_waitcnt vmcnt(0) // 000000002154: BF8C0F70 +flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008 
+s_waitcnt vmcnt(0) lgkmcnt(0) // 000000002160: BF8C0070 +v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683 +v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18 +v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07 +global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 000000002184: BF8C0000 +s_barrier // 000000002188: BF8A0000 +``` + +With `volatile` the code is changed a fair bit, in particular a lot of the +address calculation code is moved inside the two barriers. But looking at the +assembly closely we can see that it is doing the same operations and the +`global_load_dword` at the bottom of that assembly snippet corresponds to the +offending `global_load_dword` in the previous snippet. + +The interesting part about the assembly with `volatile` is that it can help us +figure out what corresponds to `imeiIdx` in the assembly, here we can see the +instruction: + +``` +flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008 +``` + +Just before the last stages of address computation before the +`global_load_dword` and that instruction is not in the snippet without +`volatile`. In addition looking at the ISA document for `gfx906` you can see +that `glc` stands for `Globally Coherent` and marks that this instruction +bypasses the L1 cache. So it is quite likely that `imeiIdx`, our `volatile` +variable is being loaded by this instruction, and so that in this assembly +snippet it would correspond to the vector register `v11`. + +We can then see how `v11` is used: + +``` +v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683 +v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18 +v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07 +global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005 +``` + +It is shifted by 3 into `v[5:6]`, as a register pair `v[11:12]`, then `v5` and +`v6` are used in the addition instructions, which outputs are then used as the +address for `global_load_dword`. + +We can then go back to the assembly of the case with errors and look for this +address computation pattern, to find which register corresponds to `imeiIdx`, +and we can easily find: + +``` +[...] +v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083 +[...] +v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10 +v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000 +s_barrier // 0000000020E4: BF8A0000 +global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000 +s_barrier +``` + +The shift by 3, followed by the two add instructions and then the load, +therefore in the broken assembly `imeiIdx` is stored in `v8`. + +So now that we know `v8` is supposed to be `imeiIdx` we can scroll up in the +assembly without `volatile` and find out how it's calculated, knowing that `v0` +and `v1` contain thread indices: + +``` +// v3 = v1 +v_mov_b32_e32 v3, v1 // 000000001C90: 7E060301 +[...] +// v15 = (v3 << 3) + v0 +// v15 = idy * 8 + idx // index flipped +v_lshl_add_u32 v15, v3, 3, v0 // 000000001D30: D1FD000F 04010703 +[...] 
+// v8 = v15 >> 5
+// v8 = v15 / 32
+v_lshrrev_b32_e32 v8, 5, v15 // 000000001F3C: 20101E85
+```
+
+And we note that in the source `imeiIdx` is calculated by:
+
+```
+const unsigned tidxi = itemIdx.get_local_id(2);
+const unsigned tidxj = itemIdx.get_local_id(1);
+const unsigned tidx = tidxj * c_clSize + tidxi;
+const unsigned imeiIdx = tidx / prunedClusterPairSize;
+```
+
+With `c_clSize = 8` and `prunedClusterPairSize = 32`.
+
+So we can see that the calculation of `imeiIdx` before the loop is correct,
+which is consistent with our observation that the first iteration of the loop
+was always correct. But we can look further into the loop, after the load
+instruction, to see what happens to `v8`.
+
+And very quickly we can see that after the incorrect instruction `v8` is almost
+immediately re-used and its value erased, and it doesn't seem to be spilled to
+memory and re-loaded later, or to have its value re-calculated:
+
+```
+v_ashrrev_i32_e32 v8, 31, v7 // 00000000210C: 22100E9F
+```
+
+At this point the suspicion is clear: `v8`, which is supposed to contain
+`imeiIdx`, is overwritten when it shouldn't be. However it's hard to say for
+sure as the loop is thousands of instructions long. But we can try to confirm
+this by comparing further with other cases that we know are working.
+
+Now, comparing the same assembly for `gfx908`, which we also know to be
+working, we observe the following:
+
+```
+v_accvgpr_read_b32 v2, a8 // 000000002084: D3D84002 18000107
+v_add_co_u32_e64 v5, s[16:17], s16, v1 // 00000000208C: D1191005 00020210
+v_addc_co_u32_e64 v6, s[16:17], v6, v2, s[16:17] // 000000002094: D11C1006 00420506
+s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 00000000209C: BF8C0000
+s_barrier // 0000000020A0: BF8A0000
+global_load_dword v52, v[5:6], off offset:16 // 0000000020A4: DC508010 347F0005
+s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020AC: BF8C0000
+s_barrier
+```
+
+In the `gfx908` snippet we see that the shift by 3 we were observing previously
+is not near the load; instead there is a `v_accvgpr_read_b32` instruction.
+Looking further up in the assembly we can see that `imeiIdx` is calculated
+outside of the loop as previously, but then its value is stored in the register
+`a8` and loaded back just before the `global_load_dword`. And `a8` is not
+written to anywhere else in the program, so the value of `imeiIdx` will be
+correct for the entire loop.
+
+Looking at the ISA documents we can see that `a8` is an ACC vector register.
+These registers are part of the matrix multiplication unit, but here the
+compiler uses one to spill the value of `imeiIdx`. It is important to note that
+`gfx906` doesn't have this matrix multiplication unit or the ACC registers, and
+that `gfx90a` (MI200), which we know also fails, does have these registers.
+However, looking through the LLVM code base we could figure out that on
+`gfx90a` these can be used as general purpose vector registers, so they are not
+used for spilling like they are on `gfx908`.
+
+Furthermore, we can also look at the assembly for `gfx906` with loop unrolling
+disabled. In that scenario it looks just like the regular `gfx906` assembly,
+except that the register matching `imeiIdx` is not re-used in the body of the
+loop.
+
+So we have a fairly strong idea of why it is failing, that is to say `v8` is
+being incorrectly re-used, but we don't know why or how to fix it yet.
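+
+Before we dig into the compiler, here is a schematic C++ rendering of what the
+assembly above is computing, to keep in mind while reading the next sections.
+The function name and loop bounds are placeholders rather than the real GROMACS
+code; the strides simply restate the instructions we just looked at:
+
+```cpp
+// Schematic only: how the gfx906 assembly addresses
+//   a_plistCJ4[j4].imei[imeiIdx].imask
+// 32 bytes per cj4 entry (the s_lshl_b64 by 5), 8 bytes per imei entry (the
+// shift by 3) and a further 16-byte offset on the load (offset:16).
+unsigned checkImasks(const char *base, unsigned tidx, int jStart, int jEnd) {
+  const unsigned prunedClusterPairSize = 32;
+  const unsigned imeiIdx = tidx / prunedClusterPairSize; // kept in v8
+  const long imeiOffset = (long)imeiIdx << 3;            // the "shift by 3"
+  unsigned acc = 0;
+  for (int j4 = jStart; j4 < jEnd; ++j4) {
+    const char *entry = base + (long)j4 * 32;
+    acc |= *(const unsigned *)(entry + imeiOffset + 16);
+    // ...thousands of instructions in the real kernel; on gfx906 the register
+    // holding imeiIdx (v8) gets clobbered in here, so every iteration after
+    // the first computes a bogus address and eventually faults.
+  }
+  return acc;
+}
+```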
+ +## Diving further into the compiler + +Now we need to analyze what the compiler is doing and try to understand why `v8` +is re-used, but to do that we first need to narrow down a bit the compilation of +the kernels so we can look at what the compiler is doing without too much noise +from unrelated kernels. + +So first we build with verbose output to try and identify the compilation +commands for the `Nbnxm` kernel: + +``` +make VERBOSE=1 mdrun-pull-test -j$(nproc) +``` + +From that we see that the `Nbnxm` kernel is actually built in four different +configurations: + +``` +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_prune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_noprune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_noprune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o +``` + +Going into that build directory we can use `grep` again to figure out which of +these contain our offending kernel: + +``` +% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE * +Binary file nbnxm_sycl_kernel_body_fv_prune.cpp.o matches +``` + +Which means it's the `fv_prune` configuration that is causing issues. This `.o` +file here doesn't actually contain assembly, but LLVM IR bitcode, with DPC++ for +AMD the actual assembly is generated during the linking stage by `lld`. + +We can extract the bitcode with the `clang-offload-bundler`: + +``` +clang-offload-bundler --unbundle --type=o --targets=sycl-amdgcn-amd-amdhsa-gfx906 --input=nbnxm_sycl_kernel_body_fv_prune.cpp.o --output=fv_prune.bc +``` + +Now in theory we could then simply build this bitcode file with `llc` and get +similar assembly and use that to investigate the compiler, however trying that +produces assembly that is fairly different, so to manually reproduce what the +compiler is doing in the regular build we need to look further. + +As stated previously on AMD the assembly is generated during the link step by +`lld`, looking at the output of the verbose make command, we can identify the +linking stage: + +``` +/path/to/llvm/build/bin/clang++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libgromacs.so.7 -o ../../lib/libgromacs.so.7.0.0 @CMakeFiles/libgromacs.dir/objects1.rsp -Wl,-rpath,/path/to/gromacs/build2/lib: -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload -arch=gfx906 -lrt ../external/build-fftw/fftwBuild-prefix/lib/libfftw3f.a -lpthread -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lm ../../lib/libmuparser.so.2.3.2 -lm +``` + +The input files, including the `nbnxm_sycl_kernel_body_fv_prune.cpp.o` file +we're interested in are listed in the `objects1.rsp` file. We can then run this +command with `-###`, this will make the clang driver list all the underlying +commands that would be executed during this specific operation. 
+ +The output of this is quite large as GROMACS is a large application so we'll +just show the interesting lines, but looking at it you can find a command +similar to the one we've used above to extract the bitcode: + +``` +"/path/to/llvm/build/bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux-gnu,sycl-amdgcn-amd-amdhsa-gfx906" "-input=CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-b557cf.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" "-unbundle" "-allow-missing-bundles" +``` + +Then you can track the bitcode file being used in a very long `llvm-link` +command: + +``` +"/path/to/llvm/build/bin/llvm-link" [...] "/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" [...] "-o" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" "--suppress-warnings" +``` + +This command links all the bitcode files from all the kernels in GROMACS +together into one very large bitcode file `alignedallocator-gfx906.bc`, this is +because with `SYCL_EXPORT` some symbols may be defined in different translation +units. + +Then `sycl-post-link` is used and splits this large bitcode file per kernel +instead of originally per translation unit: + +``` +"/path/to/llvm/build/bin/sycl-post-link" "-split=kernel" "-symbols" "-emit-exported-symbols" "-lower-esimd" "-O3" "-spec-const=default" "-o" "/tmp/alignedallocator-c32144/alignedallocator-gfx906.bc" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" +``` + +Since this is splitting one large bitcode files into a number of other bitcode +files the following commands working on the bitcode files will be wrapped in +`llvm-foreach`. And this is where we reach the command we're actually interested +in: + +``` +"/path/to/llvm/build/bin/llvm-foreach" "--out-ext=out" "--in-file-list=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--in-replace=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--out-file-list=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--out-replace=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--" "/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" +``` + +This calls `lld` over all of the bitcode files, and this is the step that +actually ends up generating the assembly and object file. So we can try to +extract just the `lld` command and use that on our bitcode file we extracted +manually earlier: + +``` +"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" +``` + +And then we can disassemble `fv_prune.out` with `llvm-objdump` and as opposed to +`llc` this file is actually very similar to our problematic assembly and shows +the pattern we are looking for of `v8` being overridden. + +Now that we have narrowed down building the `Nbnxm` kernel to a fairly simple +command we can move on to using one of the most powerful LLVM debugging tool: +`-print-after-all`, with this flag the LLVM compiler will print the IR and +Machine IR after every single pass or stage of the compiler. 
This is extremely +helpful to debug however it does produce huge amounts of output which is why we +couldn't use it on the commands building the entirety of GROMACS and had to +narrow it down first. + +``` +"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" -mllvm -print-after-all -mllvm -filter-print-funcs=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE &> print-after-all.txt +``` + +And `-filter-print-funcs` narrows down the output further to only the function +we're interested in. + +Now we can look through the output of this command and look for our problematic +instructions, so again the pattern of a load between two barriers, and starting +from the bottom of the file we end up finding the following Machine IR: + +``` +renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc +renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc +renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc +renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec +renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit killed $scc +S_WAITCNT 49279 +$vgpr6 = V_MOV_B32_e32 killed $sgpr17, implicit $exec, implicit $exec +renamable $vgpr5, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, killed $vgpr7, 0, implicit $exec +renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec +S_WAITCNT 0 +S_BARRIER +renamable $vgpr50 = GLOBAL_LOAD_DWORD renamable $vgpr5_vgpr6, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +S_WAITCNT 0 +S_BARRIER +``` + +This is clearly our problematic code, we can see the two barriers, the load, the +shift by 3 and the additions, and we can see `vgpr8` being used, the important +instructions are the following: + +``` +[...] +renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec +[...] +renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec +``` + +What is very interesting here is that in the `V_ADDC` instruction the `$vgpr8` +operand is marked `killed` which tells the compiler that the register is not +used after this instruction and can be re-used. This is why `v8` is being +re-used in the body of the loop. In addition even the shift by 3 is overriding +`v8` so it definitely won't be correct in the next iteration. + +So we scroll back up our `print-after-all.txt` file up until we find the first +pass that introduced these seemingly incorrect instructions, and we track it +down to: + +``` +# *** IR Dump After Virtual Register Rewriter (virtregrewriter) ***: +``` + +This is the first pass that introduces this killed `$vgpr8`, but this pass is +also the first pass where the Machine IR has machine registers, it runs right +after the register allocator to do the actual replacement between the virtual +registers and the newly allocated physical registers. + +Scrolling up further to look at the code after the greedy register allocator, we +can see that the code at that point looks like: + +``` +# *** IR Dump After Greedy Register Allocator (greedy) ***: +[...] 
+19776B| S_WAITCNT 0 +19792B| S_BARRIER +19824B| renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc +19872B| renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc +19888B| renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc +19904B| renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit $scc +19936B| %3340:vgpr_32 = COPY killed renamable $sgpr17 +19944B| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +19952B| %3962:vreg_64 = COPY %3963:vreg_64 +19960B| undef %893.sub0:vreg_64, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, %3962.sub0:vreg_64, 0, implicit $exec +19976B| %893.sub1:vreg_64, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, killed $sgpr16_sgpr17, 0, implicit $exec +20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +20080B| S_WAITCNT 0 +20096B| S_BARRIER +``` + +We can see the shift by 3, with virtual register `%3078`, that goes into +`%3963`, which is then copied to `%3962`, and then used as operand for the +addition, but as you can see at this point it is not yet marked killed. + +And looking at the pass before the register allocation we see the following +code: + +``` +4376B| %889:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec↩ +[...] +19776B| S_WAITCNT 0 +19792B| S_BARRIER +19824B| %887.sub1:sreg_64 = S_ASHR_I32 %887.sub0:sreg_64, 31, implicit-def dead $scc +19872B| %891:sreg_64 = S_LSHL_B64 %887:sreg_64, 5, implicit-def dead $scc +19888B| %3326:sreg_32 = S_ADD_U32 %692.sub2:sgpr_128, %891.sub0:sreg_64, implicit-def $scc +19904B| %3327:sreg_32 = S_ADDC_U32 %692.sub3:sgpr_128, %891.sub1:sreg_64, implicit-def dead $scc, implicit $scc +19936B| %3340:vgpr_32 = COPY %3327:sreg_32 +19944B| undef %893.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %889.sub0:vreg_64, 0, implicit $exec +19952B| %893.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %889.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec +20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +20080B| S_WAITCNT 0 +20096B| S_BARRIER +``` + +Looking at this code it's important to note that the shift by 3 is actually not +inside of the loop, it's before the loop starts and its value is simply used +inside of the loop for the additions. This code should work absolutely fine, but +it seems that the greedy register allocator moves this shift inside of the loop, +and then the virtual register rewriter marks the `v8` operand as killed. + +Now at this point in our investigation, we need to start debugging the actual +code of these passes to understand why they are doing these transformations and +try to figure out what is going wrong. One good way to start with that is to use +`-debug-only=regalloc`, this will print debugging output for the register +allocation. However it doesn't support the `-filter-print-funcs=` flag like +`-print-after-all` which leads to huge output so we need to reduce our bitcode +file a little further. 
+
+To do that we can use `opt` to remove from the bitcode all the kernels we're not
+interested in:
+
+```
+/path/to/llvm/build/bin/opt --internalize-public-api-list=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE --internalize --globaldce fv_prune.bc -o fv_prune_trimmed.bc
+```
+
+This command will mark as `internal` all the functions that are not listed in
+the `--internalize-public-api-list` flag, and then run a `globaldce` pass,
+which will eliminate the internal symbols that are now unused. This trims down
+our bitcode file quite significantly.
+
+Then we can also run the compilation up until right before the greedy register
+allocator, so that later we only need to run the passes we're trying to debug:
+
+```
+"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "pre_greedy.mir" "fv_prune_trimmed.bc" -mllvm -stop-before=greedy
+```
+
+`lld` is not really set up to use this `stop-before` parameter, so it will
+crash, but not before giving us the output we want in `pre_greedy.mir`. This
+file now contains the Machine IR from just before the greedy register
+allocator, the version with the shift by 3 still outside of the loop.
+
+We can then use `llc` to run only the specific pass and get debug output from
+register allocation:
+
+```
+/path/to/llvm/build/bin/llc -start-before=greedy -stop-after=virtregrewriter -mcpu=gfx906 -debug-only=regalloc pre_greedy.mir -o post_regalloc.mir &> regalloc.txt
+```
+
+This command will run just the passes between `greedy` and `virtregrewriter`,
+and print debug output for the register allocation (`-debug-only=regalloc`).
+The short names for the passes can be seen in parentheses next to their longer
+names in the `-print-after-all` dumps. Note that this `regalloc` name can be
+found in the source of the passes mentioned above, for example in:
+
+```
+/path/to/llvm/llvm/lib/CodeGen/RegAllocGreedy.cpp
+```
+
+Under:
+
+```
+#define DEBUG_TYPE "regalloc"
+```
+
+This in turn controls the `LLVM_DEBUG` directives in that file and places them
+under the `regalloc` keyword.
+
+So now that we have the debug output from the register allocation we can look
+through it; to find what we're looking for we can simply search for the
+virtual registers we spotted above, such as `%3078`, and we find the following
+debug output:
+
+```
+Removing 1 back-copies.
+Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64 + blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc) + blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc) + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +queuing new interval: %3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[1 6192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +Enqueuing %3961 +queuing new interval: %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02↩ +Enqueuing %3962 + +selectOrSplit VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[1 6192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 w=3.681593e-04 +RS_Spill Cascade 0 +should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w= 6.930721e-05 +should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w= 6.930721e-05 +Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +From original %890 +| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64 + +All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +Remat created 1 dead defs. +Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16 +Enqueuing %3078 +Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B :0) 0@2240r weight:1.520717e-01 +``` + +Now this is quite a lot so first let's roll back to the beginning of the debug +output where the full kernel is shown, we can see that at that point, the code +looks like this: + +``` +3440B| %890:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +[...] +16320B| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %890.sub0:vreg_64, 0, implicit $exec↩ +16336B| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %890.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec↩ +16352B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %894:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.imask97.i, !tbaa !75, addrspace 1)↩ +16368B| S_WAITCNT 0↩ +16384B| S_BARRIER↩ +``` + +In this snippet we can see that the shift by 3 is at `3440` in the code and that +the loop is around `163XX`, and also that at the beginning of the register +allocation, the shift by 3 is indeed outside of the loop as expected, so coming +back to the debug output, we first have: + +``` +Removing 1 back-copies. 
+Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64 + blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc) + blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc) + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +``` + +Now this is in the middle of the register allocation modifications so the code +changed a little bit but looks sort of the same, it seems that there's now a +copy inside of the loop of `%890` into `%3961` and that this is trying to remove +it. And we can see the modified instructions in the `rewr` part: + +``` + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +``` + +So at this point the addition instructions that used to take `%890` take +`%3962`, the shift is now writing to `%3961` instead of `%890` and `%3961` is +copied into `%3962` before the additions. + +And at this stage the shift is still outside of the loop which seems fine, so we +can move on to the next part of the output, and namely the inline spilling part: + +``` +Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +From original %890 +| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64 +``` + +This is important because this is the first time we see the shift by 3 +instruction moved into the loop, as you can see in `16308` instead of `3440`. + +And you can then see the compiler delete the original shift instruction: + +``` +All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +Remat created 1 dead defs. 
+Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +``` + +And then restart the register allocation for it since the instruction was moved: + +``` +unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16 +Enqueuing %3078 +Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B :0) 0@2240r weight:1.520717e-01 +``` + +Now in practice it took a bit more debugging but here we can start to see where +the problem is coming from on the last line: + +``` +%3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r +``` + +What this represents is the live ranges for the virtual register `%3078`, which +means the parts of the code where the register is considered "alive", so +essentially all the code between the instruction that defines the register and +its last use, but let's break it down a bit further: + +``` +[2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r +``` + +This first part tracks the liveness of the full register, these live intervals +contain two "values" (and/or definitions), `0@2224r` and `1@2240r`, indeed if we +look at the full Machine IR: + +``` +2224B| undef %3078.sub0:vreg_64 = V_LSHRREV_B32_e32 5, %19:vgpr_32, implicit $exec +2240B| %3078.sub1:vreg_64 = V_MOV_B32_e32 0, implicit $exec↩ +``` + +This defines the two sub-registers of `%3078`. And so what the rest of the +notation shows us is that `%3078` is alive between `2224` and `2240` for value +`0`: `[2224r,2240r:0)`, then is alive between `2240` to `3488` for value `1`: +`[2240r,3488B:1)`, and then between `16192` to `38336` for value `1`: +`[16192B,38336B:1)`. + +Now as you can probably tell this notation of the live range seems a little +strange because it is for the full register, but here the values `0` and `1` are +only defining parts of the register, and so we need to look at the rest of the +representation of the live range which shows us information about the +sub-registers: + +``` +L0000000000000003 [2224r,3440r:0) 0@2224r +L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r +``` + +The notation here is similar to the notation above, with the initial value being +a mask representing a sub-register, in binary `0x3` is `0011` and `0xC` is +`1100`. And so looking at these two lines you can see that the first, `0x3` +represents `%3078.sub0` and `0xC` represents `%3078.sub1`. + +And so now we know that `%3078.sub0` is defined at `2224` and is alive between +`2224` and `3440`, which means it's alive until the shift by 3 but not further. +However `%3078.sub1` is defined at `2240` and is alive between `2240` and `3488` +which means it's alive past the shift, but not only that, it is also alive +between `16192` and `38336`, this second interval is important because it covers +the loop with our problematic load. 
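+
+To make the interval notation concrete, here is a tiny standalone model of
+those two sub-ranges, using nothing but the interval endpoints from the dump
+above (this is not LLVM's real `LiveInterval` machinery, just an illustration):
+
+```cpp
+#include <cstdio>
+#include <utility>
+#include <vector>
+
+// Toy model of a live range: a list of [start, end) slot-index segments.
+struct ToyRange {
+  std::vector<std::pair<unsigned, unsigned>> segments;
+  bool liveAt(unsigned idx) const {
+    for (const auto &s : segments)
+      if (idx >= s.first && idx < s.second)
+        return true;
+    return false;
+  }
+};
+
+int main() {
+  ToyRange sub0{{{2224, 3440}}};                 // L0003: %3078.sub0
+  ToyRange sub1{{{2240, 3488}, {16192, 38336}}}; // L000C: %3078.sub1
+  const unsigned rematPoint = 16308;             // where the shift gets moved
+  // sub1 is live at 16308 but sub0 is not, so a use of the full %3078 there
+  // reads a lane the allocator believes to be dead.
+  std::printf("sub0 live: %d, sub1 live: %d\n", sub0.liveAt(rematPoint),
+              sub1.liveAt(rematPoint));
+  return 0;
+}
+```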
+
+Now that we understand the live intervals of `%3078`, the problem becomes
+fairly obvious when looking back at the inline spilling:
+
+```
+Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
+From original %890
+| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
+| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64
+```
+
+This moved the shift by 3 inside the loop, at `16308`, where we know that
+`%3078.sub0` is not alive; and `%3078.sub0` is the virtual register that
+ultimately gets assigned to `v8`. So it makes sense that `v8` was being
+re-used: as far as the compiler is aware, it is not supposed to be in use at
+that point.
+
+Thus we're almost done: we know that the instruction is being incorrectly moved
+into the loop; we just need to find where in the compiler code this error comes
+from and how to fix it. To do that we can simply search for the printed
+keywords such as `remat: ` or `Inline spilling` and then follow the code from
+there. It took a little more time until reaching the following code in
+`LiveRangeEdit::allUsesAvailableAt`:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (MO.getSubReg()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  LaneBitmask LM = TRI->getSubRegIndexLaneMask(MO.getSubReg());
+  for (LiveInterval::SubRange &SR : li.subranges()) {
+    if ((SR.LaneMask & LM).none())
+      continue;
+    if (!SR.liveAt(UseIdx))
+      return false;
+
+    // Early exit if all used lanes are checked. No need to continue.
+    LM &= ~SR.LaneMask;
+    if (LM.none())
+      break;
+  }
+}
+```
+
+This function tries to determine whether all the values used by an instruction
+are available at a given point, which in turn is used to decide whether it is
+valid to move the instruction there. It checks the sub-register ranges, but
+only if the operand of the instruction is using a sub-register. This is not the
+case for us: our shift instruction uses the full register, yet one of its
+sub-registers is not alive at the destination, so this code ends up not
+checking the sub-ranges for our instruction and incorrectly allows the shift to
+be moved into the loop.
+
+The fix is then fairly simple: we just need to always check the sub-ranges when
+they're available, and tweaking the code above as follows resolved the issue:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (li.hasSubRanges()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  unsigned SubReg = MO.getSubReg();
+  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
+                          : MRI.getMaxLaneMaskForVReg(MO.getReg());
+```
+
+With this patch, instead of checking the sub-ranges only when the instruction
+uses a sub-register, the code checks them whenever they exist; and when the
+instruction uses the full register, it uses a mask covering both sub-registers
+so that both are checked.
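+
+Putting the two snippets together, the complete check with the tweak applied
+reads roughly as follows. This is a sketch that assumes the body of the loop is
+unchanged from the original code; the exact change that went upstream is in the
+review linked at the end of this post:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (li.hasSubRanges()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  unsigned SubReg = MO.getSubReg();
+  // Use the operand's lane mask, or every lane of the register if the operand
+  // reads the full register.
+  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
+                          : MRI.getMaxLaneMaskForVReg(MO.getReg());
+  for (LiveInterval::SubRange &SR : li.subranges()) {
+    if ((SR.LaneMask & LM).none())
+      continue;
+    if (!SR.liveAt(UseIdx))
+      return false;
+
+    // Early exit if all used lanes are checked. No need to continue.
+    LM &= ~SR.LaneMask;
+    if (LM.none())
+      break;
+  }
+}
+```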
+
+And so, with the patch applied, the inline spilling now fails to move the shift
+instruction into the loop:
+
+```
+Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
+From original %890
+| cannot remat for 16312e|%3962:vreg_64 = COPY %3961:vreg_64
+```
+
+And since it can't rematerialize the value inside the loop, it simply resorts
+to a normal stack spill and reload:
+
+```
+Merged spilled regs: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
+spillAroundUses %3961
+| rewrite: 3440r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
+
+| spill: 3448r| SI_SPILL_V64_SAVE killed %3963:vreg_64, %stack.2, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.2, align 4, addrspace 5)
+Checking redundant spills for 0@16312r in %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02
+Merged to stack int: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
+| folded: 16312r| %3962:vreg_64 = SI_SPILL_V64_RESTORE %stack.2, $sgpr32, 0, implicit $exec :: (load (s64) from %stack.2, align 4, addrspace 5)
+```
+
+And the code generated this way is now correct, meaning that the GROMACS test
+now passes on `gfx906`.
+
+## Conclusion
+
+Finding the root cause of this issue was an interesting journey that took us
+from a molecular dynamics kernel all the way down to the AMD GPU ISAs.
+
+By describing this journey, this blog post provides some insight into a number
+of techniques that can be used when debugging issues in the LLVM project or in
+DPC++, or more generally when working in an environment with limited debugging
+capabilities. It also gives a glimpse into the DPC++ and LLVM components used
+during the compilation of GPU kernels, which will hopefully be helpful to
+anyone wanting to learn more about these technologies.
+
+And in closing, the full patch fixing this specific issue was submitted to
+upstream LLVM and promptly merged:
+
+* https://reviews.llvm.org/D131884
diff --git a/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp b/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp
new file mode 100644
index 0000000..cceba26
Binary files /dev/null and b/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp differ