Skip to content

Conversation

@changpeng
Copy link
Contributor

@changpeng changpeng commented Nov 12, 2025

LDS block size should be 2048 bytes (512 dwords) based on current spec.

  Should be 2056 bytes (512 dwords) based on current spec.
@llvmbot
Copy link
Member

llvmbot commented Nov 12, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

LDS block size should be 2056 bytes (512 dwords) based on current spec.


Full diff: https://github.com/llvm/llvm-project/pull/167614.diff

5 Files Affected:

  • (modified) llvm/docs/AMDGPUUsage.rst (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp (+3-6)
  • (modified) llvm/test/CodeGen/AMDGPU/extra-lds-size.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-size-hsa-gfx1250.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/pal-metadata-3.0.gfx1250.ll (+1-1)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index ba0e53bceade8..98aaead96b766 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -5855,7 +5855,7 @@ The fields used by CP for code objects before V3 also match those specified in
                                                      GFX950
                                                        roundup(lds-size / (320 * 4))
                                                      GFX125*
-                                                       roundup(lds-size / (256 * 4))
+                                                       roundup(lds-size / (512 * 4))
 
      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                      _INVALID_OPERATION              with specified exceptions
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
index 29f8f9bc8b54c..f47eaea9b136c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -1161,12 +1161,9 @@ void AMDGPUAsmPrinter::getSIProgramInfo(SIProgramInfo &ProgInfo,
   ProgInfo.DX10Clamp = Mode.DX10Clamp;
 
   unsigned LDSAlignShift;
-  if (STM.getFeatureBits().test(FeatureAddressableLocalMemorySize327680)) {
-    // LDS is allocated in 256 dword blocks.
-    LDSAlignShift = 10;
-  } else if (STM.getFeatureBits().test(
-                 FeatureAddressableLocalMemorySize163840)) {
-    // LDS is allocated in 320 dword blocks.
+  if (STM.getFeatureBits().test(FeatureAddressableLocalMemorySize327680) ||
+      STM.getFeatureBits().test(FeatureAddressableLocalMemorySize163840)) {
+    // LDS is allocated in 512 or 320 dword blocks.
     LDSAlignShift = 11;
   } else if (STM.getFeatureBits().test(
                  FeatureAddressableLocalMemorySize65536)) {
diff --git a/llvm/test/CodeGen/AMDGPU/extra-lds-size.ll b/llvm/test/CodeGen/AMDGPU/extra-lds-size.ll
index 4349b18fd394c..b31e87c54b563 100644
--- a/llvm/test/CodeGen/AMDGPU/extra-lds-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/extra-lds-size.ll
@@ -31,10 +31,10 @@
 ; GFX1200-MESA: .long 45100
 ; GFX1200-MESA-NEXT: .long 1024
 
-; GFX1250-PAL: '0x2c0b (SPI_SHADER_PGM_RSRC2_PS)': 0x200
+; GFX1250-PAL: '0x2c0b (SPI_SHADER_PGM_RSRC2_PS)': 0x100
 
 ; GFX1250-MESA: .long 45100
-; GFX1250-MESA-NEXT: .long 512
+; GFX1250-MESA-NEXT: .long 256
 
 @lds = internal addrspace(3) global [4096 x i8] poison
 
diff --git a/llvm/test/CodeGen/AMDGPU/lds-size-hsa-gfx1250.ll b/llvm/test/CodeGen/AMDGPU/lds-size-hsa-gfx1250.ll
index 3db0fa8f21759..7e8d5e0f30b9e 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-size-hsa-gfx1250.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-size-hsa-gfx1250.ll
@@ -41,7 +41,7 @@ define amdgpu_kernel void @test_lds_i32(i32 %val) {
 ; GCN-LABEL: test_lds_array_i8:
 ; GCN: .amdhsa_group_segment_fixed_size 327680
 ; GCN: ; LDSByteSize: 327680 bytes/workgroup
-; MESA: granulated_lds_size = 320
+; MESA: granulated_lds_size = 160
 define amdgpu_kernel void @test_lds_array_i8() {
   %gep = getelementptr inbounds [327679 x i8], ptr addrspace(3) @lds.array.i8, i32 0, i32 5
   %val = load i8, ptr addrspace(3) %gep
@@ -52,7 +52,7 @@ define amdgpu_kernel void @test_lds_array_i8() {
 ; GCN-LABEL: test_lds_array_i16:
 ; GCN: .amdhsa_group_segment_fixed_size 327680
 ; GCN: ; LDSByteSize: 327680 bytes/workgroup
-; MESA: granulated_lds_size = 320
+; MESA: granulated_lds_size = 160
 define amdgpu_kernel void @test_lds_array_i16() {
   %gep = getelementptr inbounds [163839 x i16], ptr addrspace(3) @lds.array.i16, i32 0, i32 10
   %val = load i16, ptr addrspace(3) %gep
@@ -63,7 +63,7 @@ define amdgpu_kernel void @test_lds_array_i16() {
 ; GCN-LABEL: test_lds_array_i32:
 ; GCN: .amdhsa_group_segment_fixed_size 327680
 ; GCN: ; LDSByteSize: 327680 bytes/workgroup
-; MESA: granulated_lds_size = 320
+; MESA: granulated_lds_size = 160
 define amdgpu_kernel void @test_lds_array_i32() {
   %gep = getelementptr inbounds [81919 x i32], ptr addrspace(3) @lds.array.i32, i32 0, i32 20
   %val = load i32, ptr addrspace(3) %gep
diff --git a/llvm/test/CodeGen/AMDGPU/pal-metadata-3.0.gfx1250.ll b/llvm/test/CodeGen/AMDGPU/pal-metadata-3.0.gfx1250.ll
index f934c85f68e0f..0df20f6d349f7 100644
--- a/llvm/test/CodeGen/AMDGPU/pal-metadata-3.0.gfx1250.ll
+++ b/llvm/test/CodeGen/AMDGPU/pal-metadata-3.0.gfx1250.ll
@@ -126,7 +126,7 @@
 ; CHECK-NEXT:        .entry_point:    _amdgpu_hs
 ; CHECK-NEXT:        .entry_point_symbol:    hs_shader
 ; CHECK-NEXT:        .forward_progress: true
-; CHECK-NEXT:        .lds_size:       0x1000
+; CHECK-NEXT:        .lds_size:       0x800
 ; CHECK-NEXT:        .mem_ordered:    true
 ; CHECK-NEXT:        .scratch_en:     false
 ; CHECK-NEXT:        .scratch_memory_size: 0

Comment on lines +1164 to +1165
if (STM.getFeatureBits().test(FeatureAddressableLocalMemorySize327680) ||
STM.getFeatureBits().test(FeatureAddressableLocalMemorySize163840)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be modeled more directly rather than inferring the allocation size by the target that supports this allocation size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be modeled more directly rather than inferring the allocation size by the target that supports this allocation size

Do you mean something directly like?:
if (gfx1250)
LDSAlignShift = 11;

Copy link
Contributor Author

@changpeng changpeng Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arsenm : should we get the lds block size from the target directly?

; GFX1200-MESA-NEXT: .long 1024

; GFX1250-PAL: '0x2c0b (SPI_SHADER_PGM_RSRC2_PS)': 0x200
; GFX1250-PAL: '0x2c0b (SPI_SHADER_PGM_RSRC2_PS)': 0x100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this test ends up showing the allocation granularity

Copy link
Collaborator

@rampitec rampitec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not it also affect llvm/test/MC/AMDGPU/hsa-gfx1250-v4.s? That is where the size is really recorded, not just printed.

@changpeng
Copy link
Contributor Author

Should not it also affect llvm/test/MC/AMDGPU/hsa-gfx1250-v4.s? That is where the size is really recorded, not just printed.
Not sure. The granularity is not checked at all in the file. Is there a directive for it that we can check?

By the way, it seems LDS size is not appropriate in the file (should be 320K)
// ASM: .amdhsa_group_segment_fixed_size 393216

@rampitec
Copy link
Collaborator

Should not it also affect llvm/test/MC/AMDGPU/hsa-gfx1250-v4.s? That is where the size is really recorded, not just printed.
Not sure. The granularity is not checked at all in the file. Is there a directive for it that we can check?

It shall change the hex dump of KD.

By the way, it seems LDS size is not appropriate in the file (should be 320K) // ASM: .amdhsa_group_segment_fixed_size 393216

Must be leftover from the moment that was the maximum.


unsigned getLdsDwGranularity(const MCSubtargetInfo &ST) {
return ST.hasFeature(AMDGPU::FeatureAddressableLocalMemorySize327680) ? 256
return ST.hasFeature(AMDGPU::FeatureAddressableLocalMemorySize327680) ? 512
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That becomes more and more strange, doc lists other sizes, but here we have only 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That becomes more and more strange, doc lists other sizes, but here we have only 2.
Maybe it is missing for gfx950 (320 byte block),
And R600 was possibly missed intentionally.
We should add them, also call AMDGPU::getLdsDwGranularity in AMDGPUAsmPrinter::getSIProgramInfo too.

But this needs additional investigation and LIT tests, and should fall into separate PRs, I think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. But I'd still expect hsa-gfx1250-v4.s to show some changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. But I'd still expect hsa-gfx1250-v4.s to show some changes.

Do you mean we need to add additional tests to expose the change, or something is missed updating when we change the LDS block size in this PR? Thanks; ( I will change the maximum lds size in the test though)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably some other place is not changed. Try changing max size in the test, I'd assume hex dump will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably some other place is not changed. Try changing max size in the test, I'd assume hex dump will change.

maximum lds size updated, and the hex dump changes accordingly. I am going to double check what we possibly missed. Thanks.

@changpeng changpeng requested review from arsenm and rampitec November 13, 2025 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants