[AMDGPU] Add intrinsic exposing s_alloc_vgpr #163951
base: main
Conversation
Make it possible to use `s_alloc_vgpr` at the IR level. This is a huge footgun, and using it for anything other than compiler-internal purposes is heavily discouraged. The calling code must make sure that it does not allocate fewer VGPRs than necessary - the intrinsic is NOT a request to the backend to limit the number of VGPRs it uses. (In essence, this is not so different from what we do with the dynamic VGPR flags of the `amdgcn.cs.chain` intrinsic; it just makes the same functionality usable in other scenarios.)
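For reference, a minimal usage sketch (the declaration and calling convention match the tests in this patch; the function name and the constant `32` are arbitrary example values):

```llvm
declare i1 @llvm.amdgcn.s.alloc.vgpr(i32)

define amdgpu_cs void @alloc_example(ptr addrspace(1) %out) {
entry:
  ; Ask for (at least) 32 VGPRs for the current wave; the hardware may round
  ; this up to a block boundary. The result is true iff the allocation succeeded.
  %ok = call i1 @llvm.amdgcn.s.alloc.vgpr(i32 32)
  %val = select i1 %ok, i32 1, i32 0
  store i32 %val, ptr addrspace(1) %out
  ret void
}
```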
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-ir
Author: Diana Picus (rovka)
Changes: Make it possible to use `s_alloc_vgpr` at the IR level.
Full diff: https://github.com/llvm/llvm-project/pull/163951.diff
7 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index ded00b1274670..9bb305823e932 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -391,6 +391,17 @@ def int_amdgcn_s_wait_loadcnt : AMDGPUWaitIntrinsic;
def int_amdgcn_s_wait_samplecnt : AMDGPUWaitIntrinsic;
def int_amdgcn_s_wait_storecnt : AMDGPUWaitIntrinsic;
+// Force the VGPR allocation of the current wave to (at least) the given value.
+// The actual number of allocated VGPRs may be rounded up to match hardware
+// block boundaries.
+// It is the responsibility of the calling code to ensure it does not allocate
+// below the VGPR requirements of the current shader.
+def int_amdgcn_s_alloc_vgpr :
+ Intrinsic<
+ [llvm_i1_ty], // Returns true if the allocation succeeded, false otherwise.
+ [llvm_i32_ty], // The number of VGPRs to allocate.
+ [NoUndef<RetIndex>, IntrNoMem, IntrHasSideEffects, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+
def int_amdgcn_div_scale : DefaultAttrsIntrinsic<
// 1st parameter: Numerator
// 2nd parameter: Denominator
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 12915c7344426..2f9c87cb5f20e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -2331,6 +2331,22 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
case Intrinsic::amdgcn_ds_bvh_stack_push8_pop1_rtn:
case Intrinsic::amdgcn_ds_bvh_stack_push8_pop2_rtn:
return selectDSBvhStackIntrinsic(I);
+ case Intrinsic::amdgcn_s_alloc_vgpr: {
+ // S_ALLOC_VGPR doesn't have a destination register, it just implicitly sets
+ // SCC. We then need to COPY it into the result vreg.
+ MachineBasicBlock *MBB = I.getParent();
+ const DebugLoc &DL = I.getDebugLoc();
+
+ Register ResReg = I.getOperand(0).getReg();
+
+ MachineInstr *AllocMI = BuildMI(*MBB, &I, DL, TII.get(AMDGPU::S_ALLOC_VGPR))
+ .add(I.getOperand(2));
+ MachineInstr *CopyMI = BuildMI(*MBB, &I, DL, TII.get(AMDGPU::COPY), ResReg)
+ .addReg(AMDGPU::SCC);
+ I.eraseFromParent();
+ return constrainSelectedInstRegOperands(*AllocMI, TII, TRI, RBI) &&
+ RBI.constrainGenericRegister(ResReg, AMDGPU::SReg_32RegClass, *MRI);
+ }
case Intrinsic::amdgcn_s_barrier_init:
case Intrinsic::amdgcn_s_barrier_signal_var:
return selectNamedBarrierInit(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 56807a475537d..dda73f13f7487 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -5359,6 +5359,10 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[6] = AMDGPU::getValueMapping(AMDGPU::VGPRRegBankID, 32);
OpdsMapping[8] = getSGPROpMapping(MI.getOperand(8).getReg(), MRI, *TRI);
break;
+ case Intrinsic::amdgcn_s_alloc_vgpr:
+ OpdsMapping[0] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 1);
+ OpdsMapping[2] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, 32);
+ break;
case Intrinsic::amdgcn_s_sendmsg:
case Intrinsic::amdgcn_s_sendmsghalt: {
// This must be an SGPR, but accept a VGPR.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
index 2393346839707..b82b2416a57f6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSearchableTables.td
@@ -409,6 +409,7 @@ def : AlwaysUniform<int_amdgcn_cluster_workgroup_max_flat_id>;
def : AlwaysUniform<int_amdgcn_workgroup_id_x>;
def : AlwaysUniform<int_amdgcn_workgroup_id_y>;
def : AlwaysUniform<int_amdgcn_workgroup_id_z>;
+def : AlwaysUniform<int_amdgcn_s_alloc_vgpr>;
def : AlwaysUniform<int_amdgcn_s_getpc>;
def : AlwaysUniform<int_amdgcn_s_getreg>;
def : AlwaysUniform<int_amdgcn_s_memrealtime>;
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 84287b621fe78..9496087aec20c 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -433,8 +433,10 @@ let SubtargetPredicate = isGFX11Plus in {
} // End SubtargetPredicate = isGFX11Plus
let SubtargetPredicate = isGFX12Plus in {
- let hasSideEffects = 1, Defs = [SCC] in {
- def S_ALLOC_VGPR : SOP1_0_32 <"s_alloc_vgpr">;
+ let hasSideEffects = 1, isConvergent = 1, Defs = [SCC] in {
+ def S_ALLOC_VGPR : SOP1_0_32 <"s_alloc_vgpr",
+ [(set SCC, (int_amdgcn_s_alloc_vgpr SSrc_b32:$src0))]
+ >;
}
} // End SubtargetPredicate = isGFX12Plus
diff --git a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
index 9ff670bee0f89..3f56f12f3cb34 100644
--- a/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
+++ b/llvm/test/Analysis/UniformityAnalysis/AMDGPU/always_uniform.ll
@@ -183,6 +183,15 @@ define void @cluster_workgroup_max_flat_id(ptr addrspace(1) inreg %out) {
ret void
}
+; CHECK-LABEL: for function 's_alloc_vgpr':
+; CHECK: ALL VALUES UNIFORM
+define void @s_alloc_vgpr(i32 inreg %n, ptr addrspace(1) inreg %out) {
+ %scc = call i1 @llvm.amdgcn.s.alloc.vgpr(i32 %n)
+ %sel = select i1 %scc, i32 1, i32 0
+ store i32 %sel, ptr addrspace(1) %out
+ ret void
+}
+
; CHECK-LABEL: for function 's_memtime':
; CHECK: ALL VALUES UNIFORM
define void @s_memtime(ptr addrspace(1) inreg %out) {
diff --git a/llvm/test/CodeGen/AMDGPU/intrinsic-amdgcn-s-alloc-vgpr.ll b/llvm/test/CodeGen/AMDGPU/intrinsic-amdgcn-s-alloc-vgpr.ll
new file mode 100644
index 0000000000000..74c42b7bffd04
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/intrinsic-amdgcn-s-alloc-vgpr.ll
@@ -0,0 +1,59 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -global-isel=1 -mtriple=amdgcn--amdpal -mcpu=gfx1250 < %s | FileCheck %s --check-prefix=GISEL
+; RUN: llc -global-isel=0 -mtriple=amdgcn--amdpal -mcpu=gfx1250 < %s | FileCheck %s --check-prefix=DAGISEL
+
+declare i1 @llvm.amdgcn.s.alloc.vgpr(i32)
+
+define amdgpu_cs void @test_alloc_vreg_const(ptr addrspace(1) %out) #0 {
+; GISEL-LABEL: test_alloc_vreg_const:
+; GISEL: ; %bb.0: ; %entry
+; GISEL-NEXT: s_alloc_vgpr 45
+; GISEL-NEXT: s_cselect_b32 s0, 1, 0
+; GISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GISEL-NEXT: s_and_b32 s0, s0, 1
+; GISEL-NEXT: v_mov_b32_e32 v2, s0
+; GISEL-NEXT: global_store_b32 v[0:1], v2, off
+; GISEL-NEXT: s_endpgm
+;
+; DAGISEL-LABEL: test_alloc_vreg_const:
+; DAGISEL: ; %bb.0: ; %entry
+; DAGISEL-NEXT: s_alloc_vgpr 45
+; DAGISEL-NEXT: s_cselect_b32 s0, -1, 0
+; DAGISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL-NEXT: v_cndmask_b32_e64 v2, 0, 1, s0
+; DAGISEL-NEXT: global_store_b32 v[0:1], v2, off
+; DAGISEL-NEXT: s_endpgm
+entry:
+ %scc = call i1 @llvm.amdgcn.s.alloc.vgpr(i32 45)
+ %sel = select i1 %scc, i32 1, i32 0
+ store i32 %sel, ptr addrspace(1) %out
+ ret void
+}
+
+define amdgpu_cs void @test_alloc_vreg_var(i32 inreg %n, ptr addrspace(1) %out) #0 {
+; GISEL-LABEL: test_alloc_vreg_var:
+; GISEL: ; %bb.0: ; %entry
+; GISEL-NEXT: s_alloc_vgpr s0
+; GISEL-NEXT: s_cselect_b32 s0, 1, 0
+; GISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GISEL-NEXT: s_and_b32 s0, s0, 1
+; GISEL-NEXT: v_mov_b32_e32 v2, s0
+; GISEL-NEXT: global_store_b32 v[0:1], v2, off
+; GISEL-NEXT: s_endpgm
+;
+; DAGISEL-LABEL: test_alloc_vreg_var:
+; DAGISEL: ; %bb.0: ; %entry
+; DAGISEL-NEXT: s_alloc_vgpr s0
+; DAGISEL-NEXT: s_cselect_b32 s0, -1, 0
+; DAGISEL-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
+; DAGISEL-NEXT: v_cndmask_b32_e64 v2, 0, 1, s0
+; DAGISEL-NEXT: global_store_b32 v[0:1], v2, off
+; DAGISEL-NEXT: s_endpgm
+entry:
+ %scc = call i1 @llvm.amdgcn.s.alloc.vgpr(i32 %n)
+ %sel = select i1 %scc, i32 1, i32 0
+ store i32 %sel, ptr addrspace(1) %out
+ ret void
+}
+
+attributes #0 = { "amdgpu-dynamic-vgpr-block-size" = "16" }
Ping
jasilvanus left a comment:
The TableGen and test changes look good to me, but I'm not familiar with the backend C++ changes, so I'm leaving approval for someone else.
I don't get why this would need direct exposure; it seems like under-defined ABI controls.
The plan is to give the ray tracing runtime code more control over dVGPRs, aiming at two potential use cases:
For 1., if a dVGPR allocation fails, a wave needs to wait, and while it is waiting it is still consuming some VGPRs. Because there might be live state, we usually can't allocate down to 16, but have to keep a certain safe value (say 64 VGPRs). These VGPRs are effectively wasted while the wave is waiting, reducing effective occupancy. However, there might be specific places at which dVGPR allocations happen (known to the runtime code) where we have less state and can therefore allocate down to a smaller value while waiting. It is thus beneficial for performance to wait at such places so that fewer VGPRs are wasted. But to move the waiting to another place, we need to somehow detect the presence of VGPR contention early, which isn't trivial, in particular if the waves on a SIMD are not part of the same workgroup. One option is to statistically sample for contention by allocating (but not using) many VGPRs.

For 2., to ensure that the active state of inactive lanes that do not participate in the current function is not deallocated, the dVGPR implementation might run a function with more VGPRs allocated than the function actually requires. Again, there might be cases in which the runtime code knows about this, can prove a limit on the active state of the inactive lanes, and can then safely allocate down to the current function's VGPR requirement.

In all cases, shader code is never allowed to allocate down below the containing function's VGPR requirement, and if we are allocating up, we are not actually using those VGPRs; we only care about whether the allocation succeeded or not.