[AMDGPU] A SCHED_BARRIER in a bundle should not prevent other SCHED_BARRIERs from being considered #152627
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Yoonseo Choi (yoonseoch)

Changes

Patch is 27.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152627.diff

2 Files Affected:
 diff --git a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
index dbe74b1b08f8c..8c92060128a12 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
@@ -2508,7 +2508,9 @@ bool SchedGroup::canAddSU(SUnit &SU) const {
     ++E;
 
   // Return true if all of the bundled MIs can be added to this group.
-  return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });
+  return std::all_of(B, E, [this](MachineInstr &MI) {
+    return (MI.isMetaInstruction()) || canAddMI(MI);
+  });
 }
 
 void SchedGroup::initSchedGroup() {
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..a49d2188ce704 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -5,6 +5,10 @@
   define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
   define amdgpu_kernel void @sched_barrier_mask_0(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
   define amdgpu_kernel void @sched_barrier_mask_1(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @no_sched_barrier_many_loads(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @sched_barrier_mask_1924(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @extra_sched_barrier_in_bundle(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @extra_sched_barrier_in_bundle_2(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
 
   !0 = distinct !{!0}
   !1 = !{!1, !0}
@@ -120,3 +124,285 @@ body: |
     }
     S_ENDPGM 0
 ...
+
+
+# No SCHED_BARRIER - the global loads of the first two bundles get clustered and VALU instructions are scheduled later.
+
+---
+name: no_sched_barrier_many_loads
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: no_sched_barrier_many_loads
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    S_ENDPGM 0
+...
+
+# MASK = 0b 0111 1000 0100  SALU, All DS, All DS Read, All DS Write, All Trans may be
+#                     scheduled across SCHED_BARRIER. VALU and all VMEM, VMEM Read/Write cannot be
+#                     scheduled across SCHED_BARRIER. V_MULs are not scheduled after the second bundle of loads.
+
+---
+name: sched_barrier_mask_1924
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: sched_barrier_mask_1924
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: SCHED_BARRIER 1924
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+
+    SCHED_BARRIER 1924
+
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    S_ENDPGM 0
+...
+
+# MASK = 0b 0111 1000 0100  SALU, All DS, All DS Read, All DS Write, All Trans may be
+#                     scheduled across SCHED_BARRIER. VALU and all VMEM, VMEM Read/Write cannot be
+#                     scheduled across SCHED_BARRIER. V_MULs are not scheduled after the second bundle of loads.
+#
+#                     A SCHED_BARRIER before the second bundle is honored.
+
+
+---
+name: extra_sched_barrier_in_bundle
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: extra_sched_barrier_in_bundle
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: SCHED_BARRIER 1924
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   SCHED_BARRIER 1924
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+
+    SCHED_BARRIER 1924
+
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      SCHED_BARRIER 1924
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamab...
[truncated]
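The tests above drive SCHED_BARRIER with mask 1924 (0b 0111 1000 0100). As a quick aid, here is a minimal, self-contained C++ sketch that decodes such a mask; the bit assignments are restated from the llvm.amdgcn.sched.barrier intrinsic documentation rather than from this patch, so treat the mapping as an assumption and consult IntrinsicsAMDGPU.td for the authoritative list.

  #include <cstdint>
  #include <cstdio>

  // Documented llvm.amdgcn.sched.barrier mask bits (restated here as an
  // assumption; a set bit means that instruction class MAY be scheduled
  // across the barrier).
  struct MaskBit { uint32_t Bit; const char *Name; };
  static const MaskBit SchedBarrierBits[] = {
      {0x0001, "non-memory, non-side-effecting"},
      {0x0002, "VALU"},
      {0x0004, "SALU"},
      {0x0008, "MFMA/WMMA"},
      {0x0010, "all VMEM"},
      {0x0020, "VMEM read"},
      {0x0040, "VMEM write"},
      {0x0080, "all DS"},
      {0x0100, "DS read"},
      {0x0200, "DS write"},
      {0x0400, "transcendental"},
  };

  // Print the classes a sched_barrier mask allows to cross it.
  static void decodeSchedBarrierMask(uint32_t Mask) {
    std::printf("mask 0x%03x may schedule across:", Mask);
    for (const MaskBit &B : SchedBarrierBits)
      if (Mask & B.Bit)
        std::printf(" [%s]", B.Name);
    std::printf("\n");
  }

  int main() {
    decodeSchedBarrierMask(1924); // the mask used in the tests above
    return 0;
  }

Under this mapping, 1924 (0x784) decodes to SALU, all DS, DS read, DS write, and transcendental; VALU (0x2) and the VMEM bits (0x10/0x20/0x40) are clear, which is why the V_MULs and the global loads/stores may not cross the barrier in the tests above.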
Review thread on the canAddSU hunk:

    // Return true if all of the bundled MIs can be added to this group.
  - return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });
  + return std::all_of(B, E, [this](MachineInstr &MI) {
  +   return (MI.isMetaInstruction()) || canAddMI(MI);
Suggested change:

  - return (MI.isMetaInstruction()) || canAddMI(MI);
  + return MI.isMetaInstruction() || canAddMI(MI);
The basic change LGTM, though the code structure here is now unnecessarily confusing and could be cleaned up in a follow-up. canAddMI already tries to handle isMetaInstruction internally, and the code here pre-finds the bounds of the bundle before doing the all_of. You can remove the isMetaInstruction check inside canAddMI and directly check canAddMI in the initial walk through the bundle above.
I see, will do so.
Initially I did not remove the isMetaInstruction check inside canAddMI because that could create edges between two SCHED_BARRIERs, which I thought was unintentional (lines 2501, 2502).
Making canAddMI() return true for a meta instruction in effect resulted in more intrusive changes to existing lit tests: about 14 lit-test failures in CodeGen/AMDGPU, mostly in tests of iglp_opt, sched-barrier, and sched-group-barrier.
For example, in llvm.amdgcn.sched.group.barrier.ll:
// sequence after making canAddMI return true on a meta instruction
385 define amdgpu_kernel void @test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE(ptr addrspace(1) noalias %in,
386 ; GCN-LABEL: test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE:                                           
387 ; GCN:       ; %bb.0:                                                                                                 
388 ; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24                                                                    
389 ; GCN-NEXT:    v_and_b32_e32 v0, 0x3ff, v0                                                                            
390 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 7, v0                                                                           
    ----------------------------------------------------------------------------------------------------------------------
391 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)                                                                                   
392 ; GCN-NEXT:    global_load_dwordx4 v[12:15], v16, s[0:1] offset:32                                                    
393 ; GCN-NEXT:    ; kill: killed $sgpr0_sgpr1                                                                            
394 ; GCN-NEXT:    global_load_dwordx4 v[0:3], v16, s[0:1]                                                                
395 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)                                               
396 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000002) size(2) SyncID(0)                                               
397 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000040) size(1) SyncID(0)                                               
398 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)                                               
399 ; GCN-NEXT:    s_waitcnt vmcnt(1)                                                                                     
400 ; GCN-NEXT:    v_mul_lo_u32 v13, v13, v13                                                                             
401 ; GCN-NEXT:    v_mul_lo_u32 v12, v12, v12                                                                             
402 ; GCN-NEXT:    v_mul_lo_u32 v15, v15, v15                                                                             
403 ; GCN-NEXT:    v_mul_lo_u32 v14, v14, v14                                                                             
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
404 ; GCN-NEXT:    s_waitcnt vmcnt(0)                                                                                     
405 ; GCN-NEXT:    v_mul_lo_u32 v3, v3, v3                                                                                
406 ; GCN-NEXT:    v_mul_lo_u32 v2, v2, v2                                                                                
407 ; GCN-NEXT:    global_store_dwordx4 v16, v[12:15], s[2:3] offset:32                                                   
408 ; GCN-NEXT:    global_load_dwordx4 v[4:7], v16, s[0:1] offset:48                                                      
409 ; GCN-NEXT:    v_mul_lo_u32 v1, v1, v1             
// original sequence in the test
 385 define amdgpu_kernel void @test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE(ptr addrspace(1) noalias %in
 386 ; GCN-LABEL: test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE:
 387 ; GCN:       ; %bb.0:
 388 ; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 389 ; GCN-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 390 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 7, v0
 391 ; GCN-NEXT:    ; kill: killed $sgpr0_sgpr1
 392 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 393 ; GCN-NEXT:    global_load_dwordx4 v[12:15], v16, s[0:1] offset:32
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 394 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)
 395 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000002) size(2) SyncID(0)
 396 ; GCN-NEXT:    s_waitcnt vmcnt(0)
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 397 ; GCN-NEXT:    v_mul_lo_u32 v13, v13, v13
 398 ; GCN-NEXT:    v_mul_lo_u32 v12, v12, v12
 399 ; GCN-NEXT:    v_mul_lo_u32 v15, v15, v15
 400 ; GCN-NEXT:    v_mul_lo_u32 v14, v14, v14
 401 ; GCN-NEXT:    global_store_dwordx4 v16, v[12:15], s[2:3] offset:32
 402 ; GCN-NEXT:    global_load_dwordx4 v[0:3], v16, s[0:1]
 403 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000040) size(1) SyncID(0)
 404 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)
 405 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 406 ; GCN-NEXT:    v_mul_lo_u32 v3, v3, v3
 407 ; GCN-NEXT:    v_mul_lo_u32 v2, v2, v2
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 408 ; GCN-NEXT:    v_mul_lo_u32 v1, v1, v1      
I think keeping the original logic of canAddMI is safer, given that it is also used for single, non-bundled MIs.
I can still simplify the logic by moving the check into the scan over the bundle range, avoiding the second all_of pass.
  // Initially, the iterator is on the bundle header.
  MachineBasicBlock::instr_iterator B = std::next(MI.getIterator());
  while (B != MBB->end() && B->isBundledWithPred()) {
    if (!B->isMetaInstruction() && !canAddMI(*B))
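Putting the quoted pieces together, a minimal sketch of the simplified canAddSU shape under discussion; the surrounding declarations (SchedGroup, canAddMI) are assumed from the quoted hunks, so this is illustrative rather than the verbatim upstream code:

  bool SchedGroup::canAddSU(SUnit &SU) const {
    MachineInstr &MI = *SU.getInstr();
    // Non-bundled MIs are classified directly.
    if (!MI.isBundle())
      return canAddMI(MI);

    // Initially, the iterator is on the bundle header; the bundled MIs
    // follow it and stay linked through isBundledWithPred().
    const MachineBasicBlock *MBB = MI.getParent();
    MachineBasicBlock::instr_iterator B = std::next(MI.getIterator());
    while (B != MBB->end() && B->isBundledWithPred()) {
      // Skip meta instructions (e.g. a SCHED_BARRIER inside the bundle) so
      // they cannot veto the whole bundle; reject only when a real MI
      // cannot join this group.
      if (!B->isMetaInstruction() && !canAddMI(*B))
        return false;
      ++B;
    }
    return true;
  }

This folds the meta-instruction check into the single walk over the bundle, so no separate all_of pass is needed.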
Should also remove the isMetaInstruction handling inside canAddMI
@arsenm, did you mean canAddMI should return true on isMetaInstruction()? Or an NFC change removing lines 3 and 4 below?
1 bool SchedGroup::canAddMI(const MachineInstr &MI) const {
2  bool Result = false;
3  if (MI.isMetaInstruction())
4    Result = false;
5  ...
6  return Result;
As I addressed in my last comment, #152627 (comment), making canAddMI return true on meta instructions affected the results of tests related to sched_group_barrier/iglp_opt. This is due to lines 2501 and 2502.
In effect, making canAddMI return true on meta instructions will add artificial dependence edges among sched_barriers/sched_group_barriers/iglp_opts. Let me know if those potential changes are alright. Then, I will make canAddMI return true on meta instructions and further simplify line 2511 above.
You also need to adjust the above usage in the unbundled case, but I meant make it not consider metainstructions at all (i.e. fully pull the meta instruction handling into this function here)
Will follow up on an alternative approach of not bundling meta instructions with memory operations.
A SCHED_BARRIER within a bundle prevented another SCHED_BARRIER outside of the bundle from being considered.
This was because, when the src or dest SUnits of a target SCHED_BARRIER are collected and such an SUnit is a bundle, all of the constituent MIs must be classified against the target SCHED_BARRIER. A SCHED_BARRIER inside a bundle is never classified against the target SCHED_BARRIER, so the target SCHED_BARRIER was not honored even though all non-meta instructions in the bundle were classified.
In a new test function extra_sched_barrier_in_bundle in llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir, without the changes in this PR, the second bundle of global loads was scheduled before the V_MULs.

An alternative fix is not to bundle a SCHED_BARRIER with memory operations in the post-RA bundler. That alternative was not pursued because the post-RA bundler already does not bundle sched_group_barriers or iglp_opt.
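For quick reference, the core predicate change from the patch (the lambda in SchedGroup::canAddSU, with the parenthesis cleanup from the review suggestion and explanatory comments added here):

  // Before: every bundled MI, including a bundled SCHED_BARRIER, had to
  // pass canAddMI; since a SCHED_BARRIER never classifies into a group,
  // the whole bundle was rejected.
  return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });

  // After: meta instructions are skipped, so the bundle is judged only by
  // its real (non-meta) instructions.
  return std::all_of(B, E, [this](MachineInstr &MI) {
    return MI.isMetaInstruction() || canAddMI(MI);
  });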