diff --git a/_collections/_authors/nicolas-miller.markdown b/_collections/_authors/nicolas-miller.markdown new file mode 100644 index 0000000..586737e --- /dev/null +++ b/_collections/_authors/nicolas-miller.markdown @@ -0,0 +1,7 @@ +--- +user_id: 5011213131 +disabled: 0 +title: "Nicolas Miller" +position: Senior Software Engineer +avatar: /assets/images/portal/authors/missing.png +--- diff --git a/_collections/_portal_posts/2025-08-20-gromacs.md b/_collections/_portal_posts/2025-08-20-gromacs.md new file mode 100644 index 0000000..883e6f5 --- /dev/null +++ b/_collections/_portal_posts/2025-08-20-gromacs.md @@ -0,0 +1,995 @@ +--- +category: blogs +date: '2025-08-20T02:00:00.0' +hidden: false +layout: portal/portal-article-view +thumbnail: /assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp +title: 'Tracking down a register allocator bug' +user_id: 5011213131 +--- + +This article describes the investigation of a bug found when running +[GROMACS](https://www.gromacs.org/) testing using SYCL with the [DPC++ +compiler](https://github.com/intel/llvm/) while targeting AMD GPUs. + +This was initially reported in the `intel/llvm` GitHub repository issue-tracker +[intel/llvm#6209](https://github.com/intel/llvm/issues/6209/) and was using a +[specific +version](https://gitlab.com/gromacs/gromacs/-/tree/aa-hwe-release-2022-dpcpp-hip) +of GROMACS with SYCL support. + +This bug will take us from the GROMACS source code all the way down to the +register allocation in the compiler backend. + +## Description of the issue + +When running one of the GROMACS tests built with DPC++ targeting AMD, the test +crashes with the following output: + +```console +Memory access fault by GPU node-4 (Agent handle: 0x215c1f0) on address 0x7ff5f0d6d000. Reason: Page not present or supervisor privilege. +Aborted (core dumped) +``` + +This error indicates that the kernel is accessing invalid memory, for example +this can happen in simple out-of-bounds accesses. + +It is reported that this issue appears on MI50 ([gfx906 +ISA](https://developer.amd.com/wp-content/resources/Vega_7nm_Shader_ISA.pdf)), +and MI200 ([gfx90a +ISA](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf)), +but that it doesn't fail on MI100 ([gfx908 +ISA](https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf)). + +The specific test is being run as follows: + +```console +SYCL_DEVICE_FILTER=hip:gpu ./bin/mdrun-pull-test +``` + +## Finding the kernel + +The GROMACS application contains a lot of kernels so the first step is to figure +out exactly which kernel is causing the issue. + +One easy way to do that is to use the ROCm debug environment variable +`AMD_LOG_LEVEL=4`. We can then find the last `ShaderName` debug output before +the crash: + +``` +:3:rocvirtual.cpp :2738: 6536108370963 us: 24086: [tid:0x7f5e78ab4740] ShaderName : _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE +``` + +Which gives us the name of the last kernel to run on the GPU, which is almost +certainly the kernel that caused the access fault. 
This debug output gives us +the kernel name in its mangled form, but using the `c++filt` tool we can easily +turn it into a more readable format, and so we get: + +``` +c++filt _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE +typeinfo name for NbnxmKernel +``` + +Alternatively we can use the ROCm debugger `rocgdb`, simply running the program +under `rocgdb` after the segfault the debugger will show which kernel it +happened in and also allow us to disassemble the kernel. Using the debugger can +unfortunately only go so far because debug information is not supported in DPC++ +for AMD targets. + +With the debugger we get the following: + +``` +Thread 3 "mdrun-pull-test" received signal SIGBUS, Bus error.↩ +[Switching to thread 3, lane 0 (AMDGPU Lane 4:13:1:1/0 (0,0,0)[0,0,0])]↩ +0x00007fffe5297078 in typeinfo name for NbnxmKernel () from file:///path/to/gromacs/build/lib/libgromacs.so.7 #offset=75955456&size=17664↩ +``` + +Using the `gdb` command `disas` we can then do our first proper analysis step +and look at the assembly dump showing on which instruction the memory access +fault is happening. + +``` +Dump of assembler code for function _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE: +[...] + 0x00007fffe5297064 <+1124>:| global_load_dwordx2 v[14:15], v[5:6], off offset:16 + 0x00007fffe529706c <+1132>:| s_waitcnt lgkmcnt(0) + 0x00007fffe5297070 <+1136>:| v_mov_b32_e32 v10, s41 + 0x00007fffe5297074 <+1140>:| s_mov_b32 s64, s80 +=> 0x00007fffe5297078 <+1144>:| s_waitcnt vmcnt(0) +[...] +``` + +The arrow indicates that the error is happening on the `s_waitcnt vmcnt(0)`. +This instruction waits for all the memory operations using vector registers to +be completed. Therefore the likely culprit in this case is actually the +`global_load_dwordx2` instruction above. + +At this point we know that the issue happens in the `NbnxmKernel` and from the +disassembly it seems likely to be caused by a load instruction, this is +consistent with the memory access fault issue we're seeing, the address used for +this load instruction must be incorrect in some way. + +## Comparing the assembly + +We know that the application works on `gfx908` but not on `gfx906` so a first +step can be to compare the assembly generated by both and see if any significant +differences can be identified. + +To obtain the assembly we can use the environment variable `SYCL_DUMP_IMAGES=1`, +this is a DPC++ environment variable which will dump all the available kernels +images when running an application. GROMACS is a fairly large application so we +end up with 309 `.bin` images. + +We can then use `grep` to search for the mangled kernel name we got out of +the debugger: + +``` +% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE *.bin +Binary file sycl_amdgcn142.bin matches +``` + +The `.bin` files are HIP fat binaries, we can then extract the actual device +object file from them with: + +``` +clang-offload-bundler --unbundle --type=o --targets=hipv4-amdgcn-amd-amdhsa--gfx906 --input=sycl_amdgcn142.bin --output=sycl_amdgcn142.bin.o +``` + +And then disassemble the device objects with: + +``` +llvm-objdump -d sycl_amdgcn142.bin.o &> sycl_amdgcn142.s +``` + +Doing that for both architectures we end up with the assembly for both and we +can then use a diff tool to compare them. 
+
+Unfortunately at this point the differences between the two are very
+significant, and it's very difficult to track down exactly where the offending
+instruction is in the `gfx908` assembly, so more work will be required to make
+use of this and we'll come back to it later.
+
+## Tracking down the issue in the source
+
+Knowing the kernel name we can find the source for it in GROMACS:
+
+```
+src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h
+```
+
+* https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L663
+
+Unfortunately for us this kernel is quite large and complicated: roughly 400
+lines of code with loops and many conditionals. Our debugging options are also
+very limited; as mentioned above, source-level debugging is not available, and
+neither is `printf`! Both of these are available in the regular AMD toolchain
+but were not yet enabled in DPC++ for AMD at the time of this investigation.
+
+So as a way to get source-level information on where this bug is happening we
+used the horrible snippet of code below:
+
+```cpp
+volatile int* crash = nullptr;
+*crash++;
+```
+
+Adding this to the body of a kernel causes it to crash with a memory access
+fault nearly identical to the one caused by our bug, but since the error
+message prints the address of the faulting access, it will always show `0` if
+this piece of code caused it. This tells us whether `*crash++` ran before or
+after the offending line of code, which means that by moving `*crash++` around
+in the kernel and looking at the resulting error we can eventually track down
+which line of code is causing the issue.
+
+After a lot of trial and error the [offending
+line](https://gitlab.com/gromacs/gromacs/-/blob/aa-hwe-release-2022-dpcpp-hip/src/gromacs/nbnxm/sycl/nbnxm_sycl_kernel_body.h#L793/)
+in the kernel source was identified to be:
+
+```
+unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask;
+```
+
+This matches what we were expecting, since this line loads from both
+`a_plistCJ4` and then from `imei`, so one of these two loads must be triggering
+the error.
+
+In addition, we also observed that the error never happens on the first
+iteration of the loop this line is in; this information will become important
+later on.
+
+At this stage the obvious next step is to check whether `j4` or `imeiIdx` are
+out of bounds for what they're indexing. A cursory look through the code didn't
+spot anything obvious that would suggest these indices may be wrong. By that
+point, even though we still didn't have access to a proper `printf`, we had
+figured out that it was possible to use the lower level `printf` primitives
+provided by the HIP toolchain from within the kernel, so we were able to try
+printing these indices:
+
+```
+long msg = __ockl_fprintf_stdout_begin();
+msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0);
+msg = __ockl_fprintf_append_args(msg, 2, j4, imeiIdx, 0, 0, 0, 0, 1);
+```
+
+Now, surprisingly, after adding this bit of code before the offending line the
+program started working correctly! This is usually indicative that whatever we
+did to the code caused the compiler to take a different path and avoided
+triggering the issue.
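+
+As an aside, these primitives are awkward to call inline, so it can be
+convenient to wrap them in a small helper. The declarations below are
+assumptions inferred purely from how the calls are used above (the real
+prototypes live in the ROCm device library and may differ), so treat this as
+an illustrative sketch rather than the official API:
+
+```cpp
+#include <cstdint>
+
+// Assumed prototypes for the ROCm device-library printf primitives, written
+// to match the way they are called in the snippet above.
+extern "C" uint64_t __ockl_fprintf_stdout_begin();
+extern "C" uint64_t __ockl_fprintf_append_string_n(uint64_t msg, const char *s,
+                                                   uint64_t len,
+                                                   uint32_t isLast);
+extern "C" uint64_t __ockl_fprintf_append_args(uint64_t msg, uint32_t numArgs,
+                                               uint64_t a0, uint64_t a1,
+                                               uint64_t a2, uint64_t a3,
+                                               uint64_t a4, uint64_t a5,
+                                               uint32_t isLast);
+
+// Print two unsigned values from device code, mirroring the snippet above.
+inline void debugPrintPair(uint64_t a, uint64_t b) {
+  uint64_t msg = __ockl_fprintf_stdout_begin();
+  msg = __ockl_fprintf_append_string_n(msg, "debug: %lu, %lu\n", 16, 0);
+  __ockl_fprintf_append_args(msg, 2, a, b, 0, 0, 0, 0, 1);
+}
+```
+
+In our case, simply adding the printing calls was enough to change the
+generated code and make the crash disappear.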
+ +Digging a little further around that idea, we found out that printing just +`imeiIdx` fixed the issue, and furthermore marking `imeiIdx` as `volatile` also +fixed the problem, and finally since we also know that the first iteration is +always correct, we also ended up figuring out that disabling loop unrolling with +`-fno-unroll-loops` also fixes the issue. + +So we now have a lot more information about what's going on and multiple ways of +avoiding the issue, so we're almost ready to dive back in the assembly to see +what we can figure out. But before we do, one of the helpful thing we can do is +to surround the offending line with barriers: + +``` +itemIdx.barrier(fence_space::local_space); +unsigned imask = a_plistCJ4[j4].imei[imeiIdx].imask; +itemIdx.barrier(fence_space::local_space); +``` + +Barriers are helpful because they're lowered to recognizable `s_barrier` +instructions and they limit how the compiler can re-organize the assembly around +them which makes it a lot easier to identify in the assembly where this line of +source code is. And luckily for us adding them doesn't fix the issue we're +seeing, so we can use them as a handy marker. + +## Diving back into the assembly + +Now we can look again at comparing assembly between all the different cases +we've identified, and see if we can spot what's incorrect: + +Assembly for `gfx906` with the extra barriers and no other modification, the +offending instruction is the `global_load_dword` at the bottom. The generated +assembly looks different than what we originally got in `rocgdb` but it still +triggers the error: + +``` +s_cbranch_scc1 2000 // 0000000020AC: BF8507D0 <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x23f0> +s_ashr_i32 s43, s42, 31 // 0000000020B0: 902B9F2A +s_lshl_b64 s[48:49], s[42:43], 5 // 0000000020B4: 8EB0852A +s_add_u32 s16, s40, s48 // 0000000020B8: 80103028 +v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083 +s_addc_u32 s17, s41, s49 // 0000000020C4: 82113129 +s_waitcnt lgkmcnt(0) // 0000000020C8: BF8CC07F +v_mov_b32_e32 v6, s17 // 0000000020CC: 7E0C0211 +v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10 +v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000 +s_barrier // 0000000020E4: BF8A0000 +global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000 +s_barrier +``` + +Assembly for `gfx906` with `imeiIdx` marked as `volatile` (test passing): + +``` +s_cbranch_scc1 2012 // 000000002114: BF8507DC <_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE+0x2488> +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +s_barrier // 00000000211C: BF8A0000 +buffer_load_dword v8, off, s[0:3], s33 offset:16 // 000000002120: E0500010 21000800 +s_nop 0 // 000000002128: BF800000 +buffer_load_dword v9, off, s[0:3], s33 offset:20 // 00000000212C: E0500014 21000900 +s_ashr_i32 s47, s46, 31 // 000000002134: 902F9F2E +s_lshl_b64 s[52:53], s[46:47], 5 // 000000002138: 8EB4852E +s_add_u32 s20, s44, s52 // 00000000213C: 8014342C +s_addc_u32 s21, s45, s53 // 000000002140: 8215352D +s_add_u32 s24, s20, 16 // 000000002144: 80189014 +s_addc_u32 s25, s21, 0 // 000000002148: 82198015 +v_mov_b32_e32 v7, s25 // 00000000214C: 7E0E0219 +s_mov_b32 s58, s81 // 000000002150: BEBA0051 +s_waitcnt vmcnt(0) // 000000002154: BF8C0F70 +flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008 
+s_waitcnt vmcnt(0) lgkmcnt(0) // 000000002160: BF8C0070 +v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683 +v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18 +v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07 +global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 000000002184: BF8C0000 +s_barrier // 000000002188: BF8A0000 +``` + +With `volatile` the code is changed a fair bit, in particular a lot of the +address calculation code is moved inside the two barriers. But looking at the +assembly closely we can see that it is doing the same operations and the +`global_load_dword` at the bottom of that assembly snippet corresponds to the +offending `global_load_dword` in the previous snippet. + +The interesting part about the assembly with `volatile` is that it can help us +figure out what corresponds to `imeiIdx` in the assembly, here we can see the +instruction: + +``` +flat_load_dword v11, v[8:9] glc // 000000002158: DC510000 0B000008 +``` + +Just before the last stages of address computation before the +`global_load_dword` and that instruction is not in the snippet without +`volatile`. In addition looking at the ISA document for `gfx906` you can see +that `glc` stands for `Globally Coherent` and marks that this instruction +bypasses the L1 cache. So it is quite likely that `imeiIdx`, our `volatile` +variable is being loaded by this instruction, and so that in this assembly +snippet it would correspond to the vector register `v11`. + +We can then see how `v11` is used: + +``` +v_lshlrev_b64 v[5:6], 3, v[11:12] // 000000002164: D28F0005 00021683 +v_add_co_u32_e64 v5, s[20:21], s24, v5 // 00000000216C: D1191405 00020A18 +v_addc_co_u32_e64 v6, s[20:21], v7, v6, s[20:21] // 000000002174: D11C1406 00520D07 +global_load_dword v51, v[5:6], off // 00000000217C: DC508000 337F0005 +``` + +It is shifted by 3 into `v[5:6]`, as a register pair `v[11:12]`, then `v5` and +`v6` are used in the addition instructions, which outputs are then used as the +address for `global_load_dword`. + +We can then go back to the assembly of the case with errors and look for this +address computation pattern, to find which register corresponds to `imeiIdx`, +and we can easily find: + +``` +[...] +v_lshlrev_b64 v[7:8], 3, v[8:9] // 0000000020BC: D28F0007 00021083 +[...] +v_add_co_u32_e64 v5, s[16:17], s16, v7 // 0000000020D0: D1191005 00020E10 +v_addc_co_u32_e64 v6, s[16:17], v6, v8, s[16:17] // 0000000020D8: D11C1006 00421106 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020E0: BF8C0000 +s_barrier // 0000000020E4: BF8A0000 +global_load_dword v52, v[5:6], off offset:16 // 0000000020E8: DC508010 347F0005 +s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020F0: BF8C0000 +s_barrier +``` + +The shift by 3, followed by the two add instructions and then the load, +therefore in the broken assembly `imeiIdx` is stored in `v8`. + +So now that we know `v8` is supposed to be `imeiIdx` we can scroll up in the +assembly without `volatile` and find out how it's calculated, knowing that `v0` +and `v1` contain thread indices: + +``` +// v3 = v1 +v_mov_b32_e32 v3, v1 // 000000001C90: 7E060301 +[...] +// v15 = (v3 << 3) + v0 +// v15 = idy * 8 + idx // index flipped +v_lshl_add_u32 v15, v3, 3, v0 // 000000001D30: D1FD000F 04010703 +[...] 
+// v8 = v15 >> 5
+// v8 = v15 / 32
+v_lshrrev_b32_e32 v8, 5, v15 // 000000001F3C: 20101E85
+```
+
+And we note that in the source `imeiIdx` is calculated by:
+
+```
+const unsigned tidxi = itemIdx.get_local_id(2);
+const unsigned tidxj = itemIdx.get_local_id(1);
+const unsigned tidx = tidxj * c_clSize + tidxi;
+const unsigned imeiIdx = tidx / prunedClusterPairSize;
+```
+
+With `c_clSize = 8` and `prunedClusterPairSize = 32`.
+
+So we can see that the calculation of `imeiIdx` before the loop is correct,
+which is consistent with our observation that the first iteration of the loop
+was always correct. But we can look further into the loop, after the load
+instruction, to see what happens to `v8`.
+
+And very quickly we can see that after the incorrect instruction `v8` is almost
+immediately re-used and its value erased, and it doesn't seem to be spilled to
+memory and re-loaded later, or to have its value re-calculated:
+
+```
+v_ashrrev_i32_e32 v8, 31, v7 // 00000000210C: 22100E9F
+```
+
+At this point the suspicion is clear: `v8`, which is supposed to contain
+`imeiIdx`, is overwritten when it shouldn't be. However it's hard to say for
+sure as the loop is thousands of instructions long. But we can try to confirm
+this by comparing further with other cases that we know are working.
+
+Now, comparing the same assembly for `gfx908`, which we also know to be
+working, we observe the following:
+
+```
+v_accvgpr_read_b32 v2, a8 // 000000002084: D3D84002 18000107
+v_add_co_u32_e64 v5, s[16:17], s16, v1 // 00000000208C: D1191005 00020210
+v_addc_co_u32_e64 v6, s[16:17], v6, v2, s[16:17] // 000000002094: D11C1006 00420506
+s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 00000000209C: BF8C0000
+s_barrier // 0000000020A0: BF8A0000
+global_load_dword v52, v[5:6], off offset:16 // 0000000020A4: DC508010 347F0005
+s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) // 0000000020AC: BF8C0000
+s_barrier
+```
+
+In the `gfx908` snippet we see that the shift by 3 we were observing previously
+is not near the load; instead there is a `v_accvgpr_read_b32` instruction.
+Looking further up in the assembly we can see that `imeiIdx` is calculated
+outside of the loop as previously, but then its value is stored in the register
+`a8` and loaded back just before the `global_load_dword`. And `a8` is not
+written to anywhere else in the program, so the value of `imeiIdx` will be
+correct for the entire loop.
+
+Looking at the ISA documents we can see that `a8` is an ACC vector register.
+These registers are part of the matrix multiplication unit, but here the
+compiler uses one to spill the value of `imeiIdx`. It is important to note that
+`gfx906` doesn't have this matrix multiplication unit or the ACC registers, and
+that `gfx90a` (MI200), which we know also fails, does have these registers.
+However, looking through the LLVM code base we could figure out that on
+`gfx90a` these can be used as general purpose vector registers, so they are not
+used for spilling like they are on `gfx908`.
+
+Furthermore, we can also look at the assembly for `gfx906` with loop unrolling
+disabled. In that scenario it looks just like the regular `gfx906` assembly,
+except that the register matching `imeiIdx` is not re-used in the body of the
+loop.
+
+So we have a fairly strong idea of why it is failing, that is to say `v8` is
+being incorrectly re-used, but we don't know why or how to fix it yet.
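+
+Before we dig into the compiler, here is a schematic C++ rendering of what the
+assembly above is computing, to keep in mind while reading the next sections.
+The function name and loop bounds are placeholders rather than the real GROMACS
+code; the strides simply restate the instructions we just looked at:
+
+```cpp
+// Schematic only: how the gfx906 assembly addresses
+//   a_plistCJ4[j4].imei[imeiIdx].imask
+// 32 bytes per cj4 entry (the s_lshl_b64 by 5), 8 bytes per imei entry (the
+// shift by 3) and a further 16-byte offset on the load (offset:16).
+unsigned checkImasks(const char *base, unsigned tidx, int jStart, int jEnd) {
+  const unsigned prunedClusterPairSize = 32;
+  const unsigned imeiIdx = tidx / prunedClusterPairSize; // kept in v8
+  const long imeiOffset = (long)imeiIdx << 3;            // the "shift by 3"
+  unsigned acc = 0;
+  for (int j4 = jStart; j4 < jEnd; ++j4) {
+    const char *entry = base + (long)j4 * 32;
+    acc |= *(const unsigned *)(entry + imeiOffset + 16);
+    // ...thousands of instructions in the real kernel; on gfx906 the register
+    // holding imeiIdx (v8) gets clobbered in here, so every iteration after
+    // the first computes a bogus address and eventually faults.
+  }
+  return acc;
+}
+```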
+ +## Diving further into the compiler + +Now we need to analyze what the compiler is doing and try to understand why `v8` +is re-used, but to do that we first need to narrow down a bit the compilation of +the kernels so we can look at what the compiler is doing without too much noise +from unrelated kernels. + +So first we build with verbose output to try and identify the compilation +commands for the `Nbnxm` kernel: + +``` +make VERBOSE=1 mdrun-pull-test -j$(nproc) +``` + +From that we see that the `Nbnxm` kernel is actually built in four different +configurations: + +``` +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_prune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_f_noprune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_noprune.cpp.o +[ 28%] Building CXX object src/gromacs/CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o +``` + +Going into that build directory we can use `grep` again to figure out which of +these contain our offending kernel: + +``` +% grep _ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE * +Binary file nbnxm_sycl_kernel_body_fv_prune.cpp.o matches +``` + +Which means it's the `fv_prune` configuration that is causing issues. This `.o` +file here doesn't actually contain assembly, but LLVM IR bitcode, with DPC++ for +AMD the actual assembly is generated during the linking stage by `lld`. + +We can extract the bitcode with the `clang-offload-bundler`: + +``` +clang-offload-bundler --unbundle --type=o --targets=sycl-amdgcn-amd-amdhsa-gfx906 --input=nbnxm_sycl_kernel_body_fv_prune.cpp.o --output=fv_prune.bc +``` + +Now in theory we could then simply build this bitcode file with `llc` and get +similar assembly and use that to investigate the compiler, however trying that +produces assembly that is fairly different, so to manually reproduce what the +compiler is doing in the regular build we need to look further. + +As stated previously on AMD the assembly is generated during the link step by +`lld`, looking at the output of the verbose make command, we can identify the +linking stage: + +``` +/path/to/llvm/build/bin/clang++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libgromacs.so.7 -o ../../lib/libgromacs.so.7.0.0 @CMakeFiles/libgromacs.dir/objects1.rsp -Wl,-rpath,/path/to/gromacs/build2/lib: -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload -arch=gfx906 -lrt ../external/build-fftw/fftwBuild-prefix/lib/libfftw3f.a -lpthread -ffast-math -fsycl -fsycl-device-code-split=per_kernel -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 -lm ../../lib/libmuparser.so.2.3.2 -lm +``` + +The input files, including the `nbnxm_sycl_kernel_body_fv_prune.cpp.o` file +we're interested in are listed in the `objects1.rsp` file. We can then run this +command with `-###`, this will make the clang driver list all the underlying +commands that would be executed during this specific operation. 
+ +The output of this is quite large as GROMACS is a large application so we'll +just show the interesting lines, but looking at it you can find a command +similar to the one we've used above to extract the bitcode: + +``` +"/path/to/llvm/build/bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux-gnu,sycl-amdgcn-amd-amdhsa-gfx906" "-input=CMakeFiles/libgromacs.dir/nbnxm/sycl/nbnxm_sycl_kernel_body_fv_prune.cpp.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-b557cf.o" "-output=/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" "-unbundle" "-allow-missing-bundles" +``` + +Then you can track the bitcode file being used in a very long `llvm-link` +command: + +``` +"/path/to/llvm/build/bin/llvm-link" [...] "/tmp/nbnxm_sycl_kernel_body_fv_prune-ec4a19/nbnxm_sycl_kernel_body_fv_prune-gfx906.o" [...] "-o" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" "--suppress-warnings" +``` + +This command links all the bitcode files from all the kernels in GROMACS +together into one very large bitcode file `alignedallocator-gfx906.bc`, this is +because with `SYCL_EXPORT` some symbols may be defined in different translation +units. + +Then `sycl-post-link` is used and splits this large bitcode file per kernel +instead of originally per translation unit: + +``` +"/path/to/llvm/build/bin/sycl-post-link" "-split=kernel" "-symbols" "-emit-exported-symbols" "-lower-esimd" "-O3" "-spec-const=default" "-o" "/tmp/alignedallocator-c32144/alignedallocator-gfx906.bc" "/tmp/alignedallocator-a0f82a/alignedallocator-gfx906.bc" +``` + +Since this is splitting one large bitcode files into a number of other bitcode +files the following commands working on the bitcode files will be wrapped in +`llvm-foreach`. And this is where we reach the command we're actually interested +in: + +``` +"/path/to/llvm/build/bin/llvm-foreach" "--out-ext=out" "--in-file-list=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--in-replace=/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" "--out-file-list=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--out-replace=/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "--" "/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "/tmp/alignedallocator-d6ebb0/alignedallocator-gfx906.out" "/tmp/alignedallocator-7746ed/alignedallocator-gfx906.o" +``` + +This calls `lld` over all of the bitcode files, and this is the step that +actually ends up generating the assembly and object file. So we can try to +extract just the `lld` command and use that on our bitcode file we extracted +manually earlier: + +``` +"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" +``` + +And then we can disassemble `fv_prune.out` with `llvm-objdump` and as opposed to +`llc` this file is actually very similar to our problematic assembly and shows +the pattern we are looking for of `v8` being overridden. + +Now that we have narrowed down building the `Nbnxm` kernel to a fairly simple +command we can move on to using one of the most powerful LLVM debugging tool: +`-print-after-all`, with this flag the LLVM compiler will print the IR and +Machine IR after every single pass or stage of the compiler. 
This is extremely +helpful to debug however it does produce huge amounts of output which is why we +couldn't use it on the commands building the entirety of GROMACS and had to +narrow it down first. + +``` +"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "fv_prune.out" "fv_prune.bc" -mllvm -print-after-all -mllvm -filter-print-funcs=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE &> print-after-all.txt +``` + +And `-filter-print-funcs` narrows down the output further to only the function +we're interested in. + +Now we can look through the output of this command and look for our problematic +instructions, so again the pattern of a load between two barriers, and starting +from the bottom of the file we end up finding the following Machine IR: + +``` +renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc +renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc +renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc +renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec +renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit killed $scc +S_WAITCNT 49279 +$vgpr6 = V_MOV_B32_e32 killed $sgpr17, implicit $exec, implicit $exec +renamable $vgpr5, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, killed $vgpr7, 0, implicit $exec +renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec +S_WAITCNT 0 +S_BARRIER +renamable $vgpr50 = GLOBAL_LOAD_DWORD renamable $vgpr5_vgpr6, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +S_WAITCNT 0 +S_BARRIER +``` + +This is clearly our problematic code, we can see the two barriers, the load, the +shift by 3 and the additions, and we can see `vgpr8` being used, the important +instructions are the following: + +``` +[...] +renamable $vgpr7_vgpr8 = V_LSHLREV_B64_e64 3, $vgpr8_vgpr9, implicit $exec +[...] +renamable $vgpr6, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 killed $vgpr6, killed $vgpr8, killed $sgpr16_sgpr17, 0, implicit $exec +``` + +What is very interesting here is that in the `V_ADDC` instruction the `$vgpr8` +operand is marked `killed` which tells the compiler that the register is not +used after this instruction and can be re-used. This is why `v8` is being +re-used in the body of the loop. In addition even the shift by 3 is overriding +`v8` so it definitely won't be correct in the next iteration. + +So we scroll back up our `print-after-all.txt` file up until we find the first +pass that introduced these seemingly incorrect instructions, and we track it +down to: + +``` +# *** IR Dump After Virtual Register Rewriter (virtregrewriter) ***: +``` + +This is the first pass that introduces this killed `$vgpr8`, but this pass is +also the first pass where the Machine IR has machine registers, it runs right +after the register allocator to do the actual replacement between the virtual +registers and the newly allocated physical registers. + +Scrolling up further to look at the code after the greedy register allocator, we +can see that the code at that point looks like: + +``` +# *** IR Dump After Greedy Register Allocator (greedy) ***: +[...] 
+19776B| S_WAITCNT 0 +19792B| S_BARRIER +19824B| renamable $sgpr57 = S_ASHR_I32 renamable $sgpr56, 31, implicit-def dead $scc +19872B| renamable $sgpr58_sgpr59 = S_LSHL_B64 renamable $sgpr56_sgpr57, 5, implicit-def dead $scc +19888B| renamable $sgpr16 = S_ADD_U32 renamable $sgpr42, renamable $sgpr58, implicit-def $scc +19904B| renamable $sgpr17 = S_ADDC_U32 renamable $sgpr43, renamable $sgpr59, implicit-def dead $scc, implicit $scc +19936B| %3340:vgpr_32 = COPY killed renamable $sgpr17 +19944B| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +19952B| %3962:vreg_64 = COPY %3963:vreg_64 +19960B| undef %893.sub0:vreg_64, renamable $sgpr16_sgpr17 = V_ADD_CO_U32_e64 killed $sgpr16, %3962.sub0:vreg_64, 0, implicit $exec +19976B| %893.sub1:vreg_64, dead renamable $sgpr16_sgpr17 = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, killed $sgpr16_sgpr17, 0, implicit $exec +20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +20080B| S_WAITCNT 0 +20096B| S_BARRIER +``` + +We can see the shift by 3, with virtual register `%3078`, that goes into +`%3963`, which is then copied to `%3962`, and then used as operand for the +addition, but as you can see at this point it is not yet marked killed. + +And looking at the pass before the register allocation we see the following +code: + +``` +4376B| %889:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec↩ +[...] +19776B| S_WAITCNT 0 +19792B| S_BARRIER +19824B| %887.sub1:sreg_64 = S_ASHR_I32 %887.sub0:sreg_64, 31, implicit-def dead $scc +19872B| %891:sreg_64 = S_LSHL_B64 %887:sreg_64, 5, implicit-def dead $scc +19888B| %3326:sreg_32 = S_ADD_U32 %692.sub2:sgpr_128, %891.sub0:sreg_64, implicit-def $scc +19904B| %3327:sreg_32 = S_ADDC_U32 %692.sub3:sgpr_128, %891.sub1:sreg_64, implicit-def dead $scc, implicit $scc +19936B| %3340:vgpr_32 = COPY %3327:sreg_32 +19944B| undef %893.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %889.sub0:vreg_64, 0, implicit $exec +19952B| %893.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %889.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec +20064B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %893:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.1262, !tbaa !84, addrspace 1) +20080B| S_WAITCNT 0 +20096B| S_BARRIER +``` + +Looking at this code it's important to note that the shift by 3 is actually not +inside of the loop, it's before the loop starts and its value is simply used +inside of the loop for the additions. This code should work absolutely fine, but +it seems that the greedy register allocator moves this shift inside of the loop, +and then the virtual register rewriter marks the `v8` operand as killed. + +Now at this point in our investigation, we need to start debugging the actual +code of these passes to understand why they are doing these transformations and +try to figure out what is going wrong. One good way to start with that is to use +`-debug-only=regalloc`, this will print debugging output for the register +allocation. However it doesn't support the `-filter-print-funcs=` flag like +`-print-after-all` which leads to huge output so we need to reduce our bitcode +file a little further. 
+
+To do that we can use `opt` to remove from the bitcode all the kernels we're not
+interested in:
+
+```
+/path/to/llvm/build/bin/opt --internalize-public-api-list=_ZTS11NbnxmKernelILb1ELb1ELN5Nbnxm8ElecTypeE1ELNS0_7VdwTypeE1EE --internalize --globaldce fv_prune.bc -o fv_prune_trimmed.bc
+```
+
+This command will mark as `internal` all the functions that are not listed in
+the `--internalize-public-api-list` flag, and then run a `globaldce` pass,
+which will eliminate the internal symbols that are now unused. This trims down
+our bitcode file quite significantly.
+
+Then we can also run the compilation up until right before the greedy register
+allocator, so that later we only need to run the passes we're trying to debug:
+
+```
+"/path/to/llvm/build/bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx906" "-plugin-opt=O3" "-o" "pre_greedy.mir" "fv_prune_trimmed.bc" -mllvm -stop-before=greedy
+```
+
+`lld` is not really set up to use this `stop-before` parameter, so it will
+crash, but not before giving us the output we want in `pre_greedy.mir`. This
+file now contains the Machine IR from just before the greedy register
+allocator, the version with the shift by 3 still outside of the loop.
+
+We can then use `llc` to run only the specific pass and get debug output from
+register allocation:
+
+```
+/path/to/llvm/build/bin/llc -start-before=greedy -stop-after=virtregrewriter -mcpu=gfx906 -debug-only=regalloc pre_greedy.mir -o post_regalloc.mir &> regalloc.txt
+```
+
+This command will run just the passes between `greedy` and `virtregrewriter`,
+and print debug output for the register allocation (`-debug-only=regalloc`).
+The short names for the passes can be seen in parentheses next to their longer
+names in the `-print-after-all` dumps. Note that this `regalloc` name can be
+found in the source of the passes mentioned above, for example in:
+
+```
+/path/to/llvm/llvm/lib/CodeGen/RegAllocGreedy.cpp
+```
+
+Under:
+
+```
+#define DEBUG_TYPE "regalloc"
+```
+
+This in turn controls the `LLVM_DEBUG` directives in that file and places them
+under the `regalloc` keyword.
+
+So now that we have the debug output from the register allocation we can look
+through it; to find what we're looking for we can simply search for the
+virtual registers we spotted above, such as `%3078`, and we find the following
+debug output:
+
+```
+Removing 1 back-copies.
+Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64 + blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc) + blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc) + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +queuing new interval: %3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[1 6192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +Enqueuing %3961 +queuing new interval: %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02↩ +Enqueuing %3962 + +selectOrSplit VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1)[1 6192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 w=3.681593e-04 +RS_Spill Cascade 0 +should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w= 6.930721e-05 +should evict: %677 [96r,13936r:0)[16144B,38336B:0) 0@96r weight:6.930721e-05 w= 6.930721e-05 +Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +From original %890 +| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64 + +All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +Remat created 1 dead defs. +Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16 +Enqueuing %3078 +Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B :0) 0@2240r weight:1.520717e-01 +``` + +Now this is quite a lot so first let's roll back to the beginning of the debug +output where the full kernel is shown, we can see that at that point, the code +looks like this: + +``` +3440B| %890:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +[...] +16320B| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %890.sub0:vreg_64, 0, implicit $exec↩ +16336B| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %890.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec↩ +16352B| %3772:vgpr_32 = GLOBAL_LOAD_DWORD %894:vreg_64, 16, 0, implicit $exec :: (load (s32) from %ir.imask97.i, !tbaa !75, addrspace 1)↩ +16368B| S_WAITCNT 0↩ +16384B| S_BARRIER↩ +``` + +In this snippet we can see that the shift by 3 is at `3440` in the code and that +the loop is around `163XX`, and also that at the beginning of the register +allocation, the shift by 3 is indeed outside of the loop as expected, so coming +back to the debug output, we first have: + +``` +Removing 1 back-copies. 
+Removing 16328r|%3961:vreg_64 = COPY %890:vreg_64 + blit [3440r,3488B:0): [3440r;3488B)=0(%3961)(recalc) + blit [16192B,38336B:0): [16192B;16312r)=0(%3961)(recalc) [16312r;16336r)=1(%3962)(recalc) [16336r;38336B)=0(%3961)(recalc) + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +``` + +Now this is in the middle of the register allocation modifications so the code +changed a little bit but looks sort of the same, it seems that there's now a +copy inside of the loop of `%890` into `%3961` and that this is trying to remove +it. And we can see the modified instructions in the `rewr` part: + +``` + rewr %bb.6| 3440r:0|%3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec + rewr %bb.90| 16320B:1| undef %894.sub0:vreg_64, %3334:sreg_64_xexec = V_ADD_CO_U32_e64 %3326:sreg_32, %3962.sub0:vreg_64, 0, implicit $exec + rewr %bb.90| 16336B:1| %894.sub1:vreg_64, dead %3335:sreg_64_xexec = V_ADDC_U32_e64 %3340:vgpr_32, %3962.sub1:vreg_64, %3334:sreg_64_xexec, 0, implicit $exec + rewr %bb.90| 16312B:0| %3962:vreg_64 = COPY %3961:vreg_64 +``` + +So at this point the addition instructions that used to take `%890` take +`%3962`, the shift is now writing to `%3961` instead of `%890` and `%3961` is +copied into `%3962` before the additions. + +And at this stage the shift is still outside of the loop which seems fine, so we +can move on to the next part of the output, and namely the inline spilling part: + +``` +Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04 +From original %890 +| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64 +``` + +This is important because this is the first time we see the shift by 3 +instruction moved into the loop, as you can see in `16308` instead of `3440`. + +And you can then see the compiler delete the original shift instruction: + +``` +All defs dead: dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +Remat created 1 dead defs. 
+Deleting dead def 3440r|dead %3961:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec +``` + +And then restart the register allocation for it since the instruction was moved: + +``` +unassigning %3078 from $vgpr8_vgpr9: VGPR8_LO16 VGPR8_HI16 VGPR9_LO16 VGPR9_HI16 +Enqueuing %3078 +Shrink: %3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B :0) 0@2240r weight:1.520717e-01 +``` + +Now in practice it took a bit more debugging but here we can start to see where +the problem is coming from on the last line: + +``` +%3078 [2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r L0000000000000003 [2224r,3440r:0) 0@2224r L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r +``` + +What this represents is the live ranges for the virtual register `%3078`, which +means the parts of the code where the register is considered "alive", so +essentially all the code between the instruction that defines the register and +its last use, but let's break it down a bit further: + +``` +[2224r,2240r:0)[2240r,3488B:1)[16192B,38336B:1) 0@2224r 1@2240r +``` + +This first part tracks the liveness of the full register, these live intervals +contain two "values" (and/or definitions), `0@2224r` and `1@2240r`, indeed if we +look at the full Machine IR: + +``` +2224B| undef %3078.sub0:vreg_64 = V_LSHRREV_B32_e32 5, %19:vgpr_32, implicit $exec +2240B| %3078.sub1:vreg_64 = V_MOV_B32_e32 0, implicit $exec↩ +``` + +This defines the two sub-registers of `%3078`. And so what the rest of the +notation shows us is that `%3078` is alive between `2224` and `2240` for value +`0`: `[2224r,2240r:0)`, then is alive between `2240` to `3488` for value `1`: +`[2240r,3488B:1)`, and then between `16192` to `38336` for value `1`: +`[16192B,38336B:1)`. + +Now as you can probably tell this notation of the live range seems a little +strange because it is for the full register, but here the values `0` and `1` are +only defining parts of the register, and so we need to look at the rest of the +representation of the live range which shows us information about the +sub-registers: + +``` +L0000000000000003 [2224r,3440r:0) 0@2224r +L000000000000000C [2240r,3488B:0)[16192B,38336B:0) 0@2240r +``` + +The notation here is similar to the notation above, with the initial value being +a mask representing a sub-register, in binary `0x3` is `0011` and `0xC` is +`1100`. And so looking at these two lines you can see that the first, `0x3` +represents `%3078.sub0` and `0xC` represents `%3078.sub1`. + +And so now we know that `%3078.sub0` is defined at `2224` and is alive between +`2224` and `3440`, which means it's alive until the shift by 3 but not further. +However `%3078.sub1` is defined at `2240` and is alive between `2240` and `3488` +which means it's alive past the shift, but not only that, it is also alive +between `16192` and `38336`, this second interval is important because it covers +the loop with our problematic load. 
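+
+To make the interval notation concrete, here is a tiny standalone model of
+those two sub-ranges, using nothing but the interval endpoints from the dump
+above (this is not LLVM's real `LiveInterval` machinery, just an illustration):
+
+```cpp
+#include <cstdio>
+#include <utility>
+#include <vector>
+
+// Toy model of a live range: a list of [start, end) slot-index segments.
+struct ToyRange {
+  std::vector<std::pair<unsigned, unsigned>> segments;
+  bool liveAt(unsigned idx) const {
+    for (const auto &s : segments)
+      if (idx >= s.first && idx < s.second)
+        return true;
+    return false;
+  }
+};
+
+int main() {
+  ToyRange sub0{{{2224, 3440}}};                 // L0003: %3078.sub0
+  ToyRange sub1{{{2240, 3488}, {16192, 38336}}}; // L000C: %3078.sub1
+  const unsigned rematPoint = 16308;             // where the shift gets moved
+  // sub1 is live at 16308 but sub0 is not, so a use of the full %3078 there
+  // reads a lane the allocator believes to be dead.
+  std::printf("sub0 live: %d, sub1 live: %d\n", sub0.liveAt(rematPoint),
+              sub1.liveAt(rematPoint));
+  return 0;
+}
+```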
+
+Now that we understand the live intervals of `%3078`, the problem becomes
+fairly obvious when looking back at the inline spilling:
+
+```
+Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
+From original %890
+| remat: 16308r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
+| 16312e| %3962:vreg_64 = COPY killed %3963:vreg_64
+```
+
+This moved the shift by 3 inside the loop, at `16308`, where we know that
+`%3078.sub0` is not alive; and `%3078.sub0` is the virtual register that
+ultimately gets assigned to `v8`. So it makes sense that `v8` was being
+re-used: as far as the compiler is aware, it is not supposed to be in use at
+that point.
+
+Thus we're almost done: we know that the instruction is being incorrectly moved
+into the loop; we just need to find where in the compiler code this error comes
+from and how to fix it. To do that we can simply search for the printed
+keywords such as `remat: ` or `Inline spilling` and then follow the code from
+there. It took a little more time until reaching the following code in
+`LiveRangeEdit::allUsesAvailableAt`:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (MO.getSubReg()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  LaneBitmask LM = TRI->getSubRegIndexLaneMask(MO.getSubReg());
+  for (LiveInterval::SubRange &SR : li.subranges()) {
+    if ((SR.LaneMask & LM).none())
+      continue;
+    if (!SR.liveAt(UseIdx))
+      return false;
+
+    // Early exit if all used lanes are checked. No need to continue.
+    LM &= ~SR.LaneMask;
+    if (LM.none())
+      break;
+  }
+}
+```
+
+This function tries to determine whether all the values used by an instruction
+are available at a given point, which in turn is used to decide whether it is
+valid to move the instruction there. It checks the sub-register ranges, but
+only if the operand of the instruction is using a sub-register. This is not the
+case for us: our shift instruction uses the full register, yet one of its
+sub-registers is not alive at the destination, so this code ends up not
+checking the sub-ranges for our instruction and incorrectly allows the shift to
+be moved into the loop.
+
+The fix is then fairly simple: we just need to always check the sub-ranges when
+they're available, and tweaking the code above as follows resolved the issue:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (li.hasSubRanges()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  unsigned SubReg = MO.getSubReg();
+  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
+                          : MRI.getMaxLaneMaskForVReg(MO.getReg());
+```
+
+With this patch, instead of checking the sub-ranges only when the instruction
+uses a sub-register, the code checks them whenever they exist; and when the
+instruction uses the full register, it uses a mask covering both sub-registers
+so that both are checked.
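+
+Putting the two snippets together, the complete check with the tweak applied
+reads roughly as follows. This is a sketch that assumes the body of the loop is
+unchanged from the original code; the exact change that went upstream is in the
+review linked at the end of this post:
+
+```cpp
+// Check that subrange is live at UseIdx.
+if (li.hasSubRanges()) {
+  const TargetRegisterInfo *TRI = MRI.getTargetRegisterInfo();
+  unsigned SubReg = MO.getSubReg();
+  // Use the operand's lane mask, or every lane of the register if the operand
+  // reads the full register.
+  LaneBitmask LM = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
+                          : MRI.getMaxLaneMaskForVReg(MO.getReg());
+  for (LiveInterval::SubRange &SR : li.subranges()) {
+    if ((SR.LaneMask & LM).none())
+      continue;
+    if (!SR.liveAt(UseIdx))
+      return false;
+
+    // Early exit if all used lanes are checked. No need to continue.
+    LM &= ~SR.LaneMask;
+    if (LM.none())
+      break;
+  }
+}
+```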
+
+And so, with the patch applied, the inline spilling now fails to move the shift
+instruction into the loop:
+
+```
+Inline spilling VReg_64:%3961 [3440r,3488B:0)[16192B,38336B:0) 0@3440r L000000000000000C [3440r,3488B:1)[16192B,38336B:1) 0@x 1@3440r L0000000000000003 [3440r,3488B:1) [16192B,38336B:1) 0@x 1@3440r weight:3.681593e-04
+From original %890
+| cannot remat for 16312e|%3962:vreg_64 = COPY %3961:vreg_64
+```
+
+And since it can't rematerialize the value inside the loop, it simply resorts
+to a normal stack spill and reload:
+
+```
+Merged spilled regs: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
+spillAroundUses %3961
+| rewrite: 3440r| %3963:vreg_64 = V_LSHLREV_B64_e64 3, %3078:vreg_64, implicit $exec
+
+| spill: 3448r| SI_SPILL_V64_SAVE killed %3963:vreg_64, %stack.2, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.2, align 4, addrspace 5)
+Checking redundant spills for 0@16312r in %3962 [16312r,16336r:0) 0@16312r L000000000000000C [16312r,16336r:0) 0@16312r L0000000000000003 [16312r,16320r:0) 0@16312r weight:5.705815e-02
+Merged to stack int: SS#2 [3440r,3488B:0)[16192B,38336B:0) 0@x weight:0.000000e+00
+| folded: 16312r| %3962:vreg_64 = SI_SPILL_V64_RESTORE %stack.2, $sgpr32, 0, implicit $exec :: (load (s64) from %stack.2, align 4, addrspace 5)
+```
+
+And the code generated this way is now correct, meaning that the GROMACS test
+now passes on `gfx906`.
+
+## Conclusion
+
+Finding the root cause of this issue was an interesting journey that took us
+from a molecular dynamics kernel all the way down to the AMD GPU ISAs.
+
+By describing this journey, this blog post provides some insight into a number
+of techniques that can be used when debugging issues in the LLVM project or in
+DPC++, or more generally when working in an environment with limited debugging
+capabilities. It also gives a glimpse into the DPC++ and LLVM components used
+during the compilation of GPU kernels, which will hopefully be helpful to
+anyone wanting to learn more about these technologies.
+
+And in closing, the full patch fixing this specific issue was submitted to
+upstream LLVM and promptly merged:
+
+* https://reviews.llvm.org/D131884
diff --git a/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp b/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp
new file mode 100644
index 0000000..cceba26
Binary files /dev/null and b/assets/images/portal/article-images/2025-08-20-gromacs/thumbnail.webp differ