@rampitec
Collaborator

MSG_DEALLOC_VGPRS slows down very small, waveslot-limited kernels. It has been identified that this message is only really needed for VGPR-limited kernels. A kernel is VGPR limited if the total number of VGPRs per SIMD divided by the number of VGPRs it uses is less than the number of wave slots.

@llvmbot
Member

llvmbot commented Oct 17, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

MSG_DEALLOC_VGPRS slows down very small, waveslot-limited kernels. It has been identified that this message is only really needed for VGPR-limited kernels. A kernel is VGPR limited if the total number of VGPRs per SIMD divided by the number of VGPRs it uses is less than the number of wave slots.
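
For illustration, the check this patch adds to SIInsertWaitcnts.cpp can be restated with plain integers. This is only a sketch: `shouldDeallocVGPRs` is a hypothetical helper, not an LLVM API, the zero-VGPR guard is an assumption that is not in the patch, and the numbers in the assertions are made up rather than taken from any particular subtarget.

```cpp
// Hypothetical helper (not part of LLVM). It mirrors the integer arithmetic of
// the condition added to SIInsertWaitcnts::runOnMachineFunction in this patch.
constexpr bool shouldDeallocVGPRs(unsigned TotalVGPRs, unsigned UsedVGPRs,
                                  unsigned MaxWaves, bool HasCalls) {
  // The patch keeps the message whenever the function has calls. Otherwise the
  // message is kept only if the VGPR budget admits fewer waves than there are
  // wave slots. The UsedVGPRs != 0 guard is an assumption added here so the
  // sketch never divides by zero.
  return HasCalls ||
         (UsedVGPRs != 0 && TotalVGPRs / UsedVGPRs < MaxWaves);
}

// Illustrative numbers only: 1536 VGPRs per SIMD and 16 wave slots.
static_assert(!shouldDeallocVGPRs(1536, 8, 16, false),
              "1536 / 8 = 192 waves fit: waveslot limited, message skipped");
static_assert(shouldDeallocVGPRs(1536, 128, 16, false),
              "1536 / 128 = 12 waves fit: VGPR limited, message kept");
static_assert(!shouldDeallocVGPRs(1536, 96, 16, false),
              "1536 / 96 = 16 waves fit: not below 16, message skipped");
static_assert(shouldDeallocVGPRs(1536, 8, 16, true),
              "functions with calls always keep the message");
```

With 96 VGPRs the quotient equals the wave slot count; because the comparison is strict, the message is skipped in that case.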


Patch is 2.08 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112765.diff

256 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+17-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll (-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll (-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll (-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i8.ll (-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.large.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.ballot.i32.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.scale.ll (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i32.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.global.atomic.csub.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.if.break.i32.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.store.2d.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mov.dpp.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_32.ll (-52)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wmma_64.ll (-52)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/load-unaligned.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/shl-ext-reduce.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-f16-f32-matrix-modifiers.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-imm.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-iu-modifiers.ll (-36)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32-swmmac-index_key.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w32.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-f16-f32-matrix-modifiers.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-imm.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-iu-modifiers.ll (-36)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64-swmmac-index_key.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/wmma-gfx12-w64.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/add.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/add.v2i16.ll (-26)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (-72)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (-128)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (-192)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (-64)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (-80)
  • (modified) llvm/test/CodeGen/AMDGPU/atomics_cond_sub.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/br_cc.f16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+21-57)
  • (modified) llvm/test/CodeGen/AMDGPU/bswap.ll (-14)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (-10)
  • (modified) llvm/test/CodeGen/AMDGPU/calling-conventions.ll (-62)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (-34)
  • (modified) llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp.ll (-196)
  • (modified) llvm/test/CodeGen/AMDGPU/commute-compares-scalar-float.ll (-64)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (-38)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-sub-offset.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-f16.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.f16.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/fadd.f16.ll (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll (-94)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.ll (-220)
  • (modified) llvm/test/CodeGen/AMDGPU/fcmp.f16.ll (-58)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (-38)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f64.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.f16.ll (-34)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (-36)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-nondeterminism.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (-24)
  • (modified) llvm/test/CodeGen/AMDGPU/fma-combine.ll (-90)
  • (modified) llvm/test/CodeGen/AMDGPU/fmax3.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fmed3.ll (-130)
  • (modified) llvm/test/CodeGen/AMDGPU/fmin3.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/fmul.f16.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/fmuladd.f16.ll (-98)
  • (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.f16.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx1200.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-atomics.ll (-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-ptr-atomics.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-num-global-atomics.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fp16_to_fp32.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fp16_to_fp64.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fp32_to_fp16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fpext.f16.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/fptosi.f16.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoui.f16.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (-14)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fsub.f16.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/gfx12_scalar_subword_loads.ll (-92)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-store.ll (-244)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64.ll (-114)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (-24)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/half.ll (-88)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4s.ll (-34)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (-62)
  • (modified) llvm/test/CodeGen/AMDGPU/image-load-d16-tfe.ll (-14)
  • (modified) llvm/test/CodeGen/AMDGPU/imm16.ll (-66)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll (-70)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ballot.i32.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.fp8.dpp.ll (-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.pkrtz.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.add.gs.reg.rtn.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.bvh.stack.rtn.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.sub.gs.reg.rtn.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fcmp.w32.ll (-188)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fcmp.w64.ll (-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.bf16.bf16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f16.f16.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.fdot2.f32.bf16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.atomic.ordered.add.b64.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.tr-w32.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.tr-w64.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.icmp.w32.ll (-128)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.icmp.w64.ll (-68)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.a16.dim.ll (+79-247)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.a16.encode.ll (-76)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.dim.ll (+83-266)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.store.a16.d16.ll (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.store.a16.ll (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.private.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.is.shared.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane.ll (-752)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane16.var.ll (-104)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.permlane64.ptr.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.quadmask.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.tfe.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.store.ll (-48)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.bf16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.d16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.d16.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.barrier.wait.ll (-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.rtn.ll (-24)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.tfe.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.store.ll (-34)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.tbuffer.store.d16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.tbuffer.store.ll (-28)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.store.d16.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.store.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wave.id.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_32.ll (-52)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wmma_64.ll (-52)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.writelane.ll (-76)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.ceil.f16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.cos.f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.floor.f16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.fmuladd.f16.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.get.fpmode.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.bf16.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.f16.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.is.fpclass.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.maxnum.f16.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.minnum.f16.ll (-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.mulo.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.rint.f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.round.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.set.rounding.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sin.f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sqrt.f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.trunc.f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-always-uniform.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-f32.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-f64.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (-88)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i16.ll (-78)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i32.ll (-50)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i64.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (-104)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-work-group-id-intrinsics-hsa.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/lower-work-group-id-intrinsics-pal.ll (+54-16)
  • (modified) llvm/test/CodeGen/AMDGPU/lshr.v2i16.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/mad.u16.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/madak.ll (-40)
  • (modified) llvm/test/CodeGen/AMDGPU/match-perm-extract-vector-elt-bug.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/max-hard-clause-length.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/min.ll (-70)
  • (modified) llvm/test/CodeGen/AMDGPU/minimummaximum.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/minmax.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (-76)
  • (modified) llvm/test/CodeGen/AMDGPU/offset-split-global.ll (-142)
  • (modified) llvm/test/CodeGen/AMDGPU/omod.ll (-90)
  • (modified) llvm/test/CodeGen/AMDGPU/release-vgprs-dbg-loc.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/release-vgprs.mir (+79-57)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/saddo.ll (-12)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar-float-sopc.ll (-112)
  • (modified) llvm/test/CodeGen/AMDGPU/select.f16.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/shl.v2i16.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll (-74)
  • (modified) llvm/test/CodeGen/AMDGPU/sint_to_fp.i64.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/sitofp.f16.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.ll (-26)
  • (modified) llvm/test/CodeGen/AMDGPU/sub.v2i16.ll (-26)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/uint_to_fp.i64.ll (-16)
  • (modified) llvm/test/CodeGen/AMDGPU/uitofp.f16.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/v_cndmask.ll (-46)
  • (modified) llvm/test/CodeGen/AMDGPU/v_madak_f16.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/v_sat_pk_u8_i16.ll (-8)
  • (modified) llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/vgpr-mark-last-scratch-load.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/wait-before-stores-with-scope_sys.ll (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll (-24)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-f16-f32-matrix-modifiers.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-imm.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-iu-modifiers.ll (-36)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32-swmmac-index_key.ll (-20)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w32.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-f16-f32-matrix-modifiers.ll (-56)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-imm.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-iu-modifiers.ll (-36)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64-swmmac-index_key.ll (-22)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma-gfx12-w64.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_32.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/wmma_multiple_64.ll (-44)
  • (modified) llvm/test/CodeGen/AMDGPU/workgroup-id-in-arch-sgprs.ll (-6)
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 8f1757db8a85f5..c2806f9cd88199 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -2606,15 +2606,24 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
 
   // Insert DEALLOC_VGPR messages before previously identified S_ENDPGM
   // instructions.
-  for (MachineInstr *MI : ReleaseVGPRInsts) {
-    if (ST->requiresNopBeforeDeallocVGPRs()) {
-      BuildMI(*MI->getParent(), MI, MI->getDebugLoc(), TII->get(AMDGPU::S_NOP))
-          .addImm(0);
+  // Skip deallocation if kernel is waveslot limited vs VGPR limited. A short
+  // waveslot limited kernel runs slower with the deallocation.
+  if (!ReleaseVGPRInsts.empty() &&
+      (MF.getFrameInfo().hasCalls() ||
+       AMDGPU::IsaInfo::getTotalNumVGPRs(ST) /
+               TRI->getNumUsedPhysRegs(*MRI, AMDGPU::VGPR_32RegClass) <
+           AMDGPU::IsaInfo::getMaxWavesPerEU(ST))) {
+    for (MachineInstr *MI : ReleaseVGPRInsts) {
+      if (ST->requiresNopBeforeDeallocVGPRs()) {
+        BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+                TII->get(AMDGPU::S_NOP))
+            .addImm(0);
+      }
+      BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+              TII->get(AMDGPU::S_SENDMSG))
+          .addImm(AMDGPU::SendMsg::ID_DEALLOC_VGPRS_GFX11Plus);
+      Modified = true;
     }
-    BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
-            TII->get(AMDGPU::S_SENDMSG))
-        .addImm(AMDGPU::SendMsg::ID_DEALLOC_VGPRS_GFX11Plus);
-    Modified = true;
   }
   ReleaseVGPRInsts.clear();
   PreheadersToFlush.clear();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
index 359c1e53de99e3..ad3c588f575512 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/addsubu64.ll
@@ -15,8 +15,6 @@ define amdgpu_kernel void @s_add_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
 ;
 ; GFX12-LABEL: s_add_u64:
@@ -30,8 +28,6 @@ define amdgpu_kernel void @s_add_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX12-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
-; GFX12-NEXT:    s_nop 0
-; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX12-NEXT:    s_endpgm
 entry:
   %add = add i64 %a, %b
@@ -45,8 +41,6 @@ define amdgpu_ps void @v_add_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GCN-NEXT:    v_add_co_u32 v2, vcc_lo, v2, v4
 ; GCN-NEXT:    v_add_co_ci_u32_e32 v3, vcc_lo, v3, v5, vcc_lo
 ; GCN-NEXT:    global_store_b64 v[0:1], v[2:3], off
-; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GCN-NEXT:    s_endpgm
 entry:
   %add = add i64 %a, %b
@@ -67,8 +61,6 @@ define amdgpu_kernel void @s_sub_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
 ;
 ; GFX12-LABEL: s_sub_u64:
@@ -82,8 +74,6 @@ define amdgpu_kernel void @s_sub_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX12-NEXT:    global_store_b64 v2, v[0:1], s[4:5]
-; GFX12-NEXT:    s_nop 0
-; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX12-NEXT:    s_endpgm
 entry:
   %sub = sub i64 %a, %b
@@ -97,8 +87,6 @@ define amdgpu_ps void @v_sub_u64(ptr addrspace(1) %out, i64 %a, i64 %b) {
 ; GCN-NEXT:    v_sub_co_u32 v2, vcc_lo, v2, v4
 ; GCN-NEXT:    v_sub_co_ci_u32_e32 v3, vcc_lo, v3, v5, vcc_lo
 ; GCN-NEXT:    global_store_b64 v[0:1], v[2:3], off
-; GCN-NEXT:    s_nop 0
-; GCN-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GCN-NEXT:    s_endpgm
 entry:
   %sub = sub i64 %a, %b
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
index 705bcbddf227a6..e28a1efb75404d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
@@ -84,8 +84,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr add
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw udec_wrap ptr addrspace(3) %ptr, i32 42 syncscope("agent") seq_cst, align 4
   store i32 %result, ptr addrspace(1) %out, align 4
@@ -163,8 +161,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i32_offset(ptr addrspace(1) %out,
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(3) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(3) %gep, i32 42 syncscope("agent") seq_cst, align 4
@@ -353,8 +349,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i32(ptr addrspace(1) %out, ptr
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw udec_wrap ptr addrspace(1) %ptr, i32 42 syncscope("agent") seq_cst, align 4
   store i32 %result, ptr addrspace(1) %out, align 4
@@ -431,8 +425,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i32_offset(ptr addrspace(1) %ou
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i32 42 syncscope("agent") seq_cst, align 4
@@ -510,8 +502,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i32_offset_system(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i32 42 seq_cst, align 4
@@ -797,8 +787,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i32_offset_addr64(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %id = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.tid = getelementptr i32, ptr addrspace(1) %ptr, i32 %id
@@ -2302,8 +2290,6 @@ define amdgpu_kernel void @atomic_dec_shl_base_lds_0(ptr addrspace(1) %out, ptr
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_store_b32 v2, v0, s[2:3]
 ; GFX11-NEXT:    global_store_b32 v2, v1, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %tid.x = tail call i32 @llvm.amdgcn.workitem.id.x() #2
   %idx.0 = add nsw i32 %tid.x, 2
@@ -2390,8 +2376,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i64(ptr addrspace(1) %out, ptr add
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw udec_wrap ptr addrspace(3) %ptr, i64 42 syncscope("agent") seq_cst, align 8
   store i64 %result, ptr addrspace(1) %out, align 4
@@ -2474,8 +2458,6 @@ define amdgpu_kernel void @lds_atomic_dec_ret_i64_offset(ptr addrspace(1) %out,
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(3) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(3) %gep, i64 42 syncscope("agent") seq_cst, align 8
@@ -2679,8 +2661,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i64(ptr addrspace(1) %out, ptr
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw udec_wrap ptr addrspace(1) %ptr, i64 42 syncscope("agent") seq_cst, align 8
   store i64 %result, ptr addrspace(1) %out, align 4
@@ -2762,8 +2742,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i64_offset(ptr addrspace(1) %ou
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i64 42 syncscope("agent") seq_cst, align 8
@@ -2846,8 +2824,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i64_offset_system(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw udec_wrap ptr addrspace(1) %gep, i64 42 seq_cst, align 8
@@ -3153,8 +3129,6 @@ define amdgpu_kernel void @global_atomic_dec_ret_i64_offset_addr64(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %id = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.tid = getelementptr i64, ptr addrspace(1) %ptr, i32 %id
@@ -3334,8 +3308,6 @@ define amdgpu_kernel void @atomic_dec_shl_base_lds_0_i64(ptr addrspace(1) %out,
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_store_b32 v3, v2, s[2:3]
 ; GFX11-NEXT:    global_store_b64 v3, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %tid.x = tail call i32 @llvm.amdgcn.workitem.id.x() #2
   %idx.0 = add nsw i32 %tid.x, 2
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll
index b3a7e65f771c43..d63044d7cec6d8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll
@@ -84,8 +84,6 @@ define amdgpu_kernel void @lds_atomic_inc_ret_i32(ptr addrspace(1) %out, ptr add
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw uinc_wrap ptr addrspace(3) %ptr, i32 42 syncscope("agent") seq_cst, align 4
   store i32 %result, ptr addrspace(1) %out, align 4
@@ -163,8 +161,6 @@ define amdgpu_kernel void @lds_atomic_inc_ret_i32_offset(ptr addrspace(1) %out,
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v1, 0
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(3) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(3) %gep, i32 42 syncscope("agent") seq_cst, align 4
@@ -353,8 +349,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i32(ptr addrspace(1) %out, ptr
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw uinc_wrap ptr addrspace(1) %ptr, i32 42 syncscope("agent") seq_cst, align 4
   store i32 %result, ptr addrspace(1) %out, align 4
@@ -431,8 +425,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i32_offset(ptr addrspace(1) %ou
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(1) %gep, i32 42 syncscope("agent") seq_cst, align 4
@@ -510,8 +502,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i32_offset_sistem(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i32, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(1) %gep, i32 42 seq_cst, align 4
@@ -797,8 +787,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i32_offset_addr64(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %id = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.tid = getelementptr i32, ptr addrspace(1) %ptr, i32 %id
@@ -967,8 +955,6 @@ define amdgpu_kernel void @atomic_inc_shl_base_lds_0_i32(ptr addrspace(1) %out,
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_store_b32 v2, v0, s[2:3]
 ; GFX11-NEXT:    global_store_b32 v2, v1, s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %tid.x = tail call i32 @llvm.amdgcn.workitem.id.x() #2
   %idx.0 = add nsw i32 %tid.x, 2
@@ -1055,8 +1041,6 @@ define amdgpu_kernel void @lds_atomic_inc_ret_i64(ptr addrspace(1) %out, ptr add
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw uinc_wrap ptr addrspace(3) %ptr, i64 42 syncscope("agent") seq_cst, align 8
   store i64 %result, ptr addrspace(1) %out, align 4
@@ -1139,8 +1123,6 @@ define amdgpu_kernel void @lds_atomic_inc_ret_i64_offset(ptr addrspace(1) %out,
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    v_mov_b32_e32 v2, 0
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(3) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(3) %gep, i64 42 syncscope("agent") seq_cst, align 8
@@ -1344,8 +1326,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i64(ptr addrspace(1) %out, ptr
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result = atomicrmw uinc_wrap ptr addrspace(1) %ptr, i64 42 syncscope("agent") seq_cst, align 8
   store i64 %result, ptr addrspace(1) %out, align 4
@@ -1427,8 +1407,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i64_offset(ptr addrspace(1) %ou
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(1) %gep, i64 42 syncscope("agent") seq_cst, align 8
@@ -1511,8 +1489,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i64_offset_system(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %gep = getelementptr i64, ptr addrspace(1) %ptr, i32 4
   %result = atomicrmw uinc_wrap ptr addrspace(1) %gep, i64 42 seq_cst, align 8
@@ -1818,8 +1794,6 @@ define amdgpu_kernel void @global_atomic_inc_ret_i64_offset_addr64(ptr addrspace
 ; GFX11-NEXT:    buffer_gl1_inv
 ; GFX11-NEXT:    buffer_gl0_inv
 ; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %id = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.tid = getelementptr i64, ptr addrspace(1) %ptr, i32 %id
@@ -2680,8 +2654,6 @@ define amdgpu_kernel void @atomic_inc_shl_base_lds_0_i64(ptr addrspace(1) %out,
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_store_b32 v3, v2, s[2:3]
 ; GFX11-NEXT:    global_store_b64 v3, v[0:1], s[0:1]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %tid.x = tail call i32 @llvm.amdgcn.workitem.id.x() #2
   %idx.0 = add nsw i32 %tid.x, 2
@@ -3541,8 +3513,6 @@ define amdgpu_kernel void @nocse_lds_atomic_inc_ret_i32(ptr addrspace(1) %out0,
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_store_b32 v1, v2, s[0:1]
 ; GFX11-NEXT:    global_store_b32 v1, v0, s[2:3]
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
   %result0 = atomicrmw uinc_wrap ptr addrspace(3) %ptr, i32 42 syncscope("agent") seq_cst, align 4
   %result1 = atomicrmw uinc_wrap ptr addrspace(3) %ptr, i32 42 syncscope("agent") seq_cst, align 4
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
index 34efb089b72bf1..ca6e5df43a0434 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.ll
@@ -480,8 +480,6 @@ define amdgpu_ps void @dyn_extract_v8i64_const_s_s(i32 inreg %sel) {
 ; GFX11-NEXT:    s_movrels_b64 s[0:1], s[4:5]
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_store_b64 v[0:1], v[0:1], off
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
 entry:
   %ext = extractelement <8 x i64> <i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8>, i32 %sel
@@ -627,8 +625,6 @@ define amdgpu_ps void @dyn_extract_v8i64_s_v(<8 x i64> inreg %vec, i32 %sel) {
 ; GFX11-NEXT:    v_cndmask_b32_e64 v0, v1, s16, vcc_lo
 ; GFX11-NEXT:    v_cndmask_b32_e64 v1, v2, s17, vcc_lo
 ; GFX11-NEXT:    global_store_b64 v[0:1], v[0:1], off
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
 entry:
   %ext = extractelement <8 x i64> %vec, i32 %sel
@@ -745,8 +741,6 @@ define amdgpu_ps void @dyn_extract_v8i64_v_s(<8 x i64> %vec, i32 inreg %sel) {
 ; GFX11-NEXT:    v_movrels_b32_e32 v16, v0
 ; GFX11-NEXT:    v_movrels_b32_e32 v17, v1
 ; GFX11-NEXT:    global_store_b64 v[0:1], v[16:17], off
-; GFX11-NEXT:    s_nop 0
-; GFX11-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
 ; GFX11-NEXT:    s_endpgm
 entry:
   %ext = extractelement <8 x i64> %vec, i32 %sel
@@ -852,8 +846,6 @@ define amdgpu_ps void @dyn_extract_v8i64_s_s(<8 x i64> inreg %vec, i32 inreg %se
 ; GFX11-NEXT:    s_movrels_b64 s[0:1], s[0:1]
 ; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
 ; GFX11-NEXT:    global_st...
[truncated]

Contributor

@jayfoad left a comment


I haven't tried to measure the performance impact of this myself, but the implementation LGTM.

@rampitec merged commit 3277c7c into llvm:main on Oct 21, 2024
8 checks passed
@rampitec deleted the dealloc-vgprs branch on October 21, 2024 at 16:39

