
Conversation

@Andres-Salamanca
Contributor

This PR addresses the feedback from llvm/llvm-project#142041 (comment). Our algorithm for accumulating bitfields has diverged from CodeGen since Clang 19.

There is one key difference: in CIR, we use the function `getBitfieldStorageType`, which checks whether the bit width of the current accumulation run is a valid fundamental width (i.e., a power of two: 8, 16, 32, 64). If it is, it returns a CIR integer type of that size; otherwise, it returns a byte array rounded up to the nearest whole-byte size.

For example, given the following struct:

```cpp
struct S {
  int a : 4;
  int b : 27;
  int c : 17;
  int d : 2;
  int e : 15;
  unsigned f;
};
```

The CodeGen output is:

```llvm
%struct.S = type { i64, i16, i32 }
```

Whereas the new CIR algorithm produces:

```mlir
!cir.record<struct "S" {!cir.array<!u8i x 7>, !u16i, !u32i}>
```

In CIR, the algorithm accumulates up to field d, resulting in 50 accumulated bits. Since 50 is not a fundamental width, the closest alignment is 56 bits, which leads to the type !cir.array<!u8i x 7>. The algorithm stops before accumulating field e because including it would exceed the register size (64), which is not ideal.
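The stopping rule described above can be sketched as follows. This is a simplified model of the accumulation loop, not the actual CIRRecordLowering code; `accumulateBits` is a hypothetical helper written for illustration:

```cpp
#include <cstddef>

// Simplified model of the accumulation rule described above: keep adding
// bitfield widths to the current run and stop before the first field that
// would push the run past the register size.
int accumulateBits(const int *widths, size_t n, int regBits) {
  int run = 0;
  for (size_t i = 0; i < n; ++i) {
    if (run + widths[i] > regBits)
      break; // e.g. adding e (15 bits) to 50 would give 65 > 64
    run += widths[i];
  }
  return run;
}
```

For `struct S`, the widths {4, 27, 17, 2, 15} accumulate to 50 bits before `e` is rejected.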

At this point, it's unclear whether this divergence from CodeGen represents an improvement. If we wanted to match CodeGen exactly, we would need to replace the use of getBitfieldStorageType with getUIntNType. The difference is that getUIntNType always returns the closest power-of-two integer type instead of falling back to an array when the size is not a fundamental width.
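The difference between the two policies comes down to how a non-fundamental run width is rounded. A minimal sketch of the arithmetic, under the assumption that only the resulting size matters (these are hypothetical helpers; the real functions return MLIR types, not bit counts):

```cpp
#include <cstdint>

// Mimics getUIntNType's sizing: always round the run up to the next
// power-of-two integer width.
uint64_t nextPow2Bits(uint64_t bits) {
  uint64_t w = 8;
  while (w < bits)
    w *= 2;
  return w; // 50 -> 64, i.e. a single i64
}

// Mimics getBitfieldStorageType's sizing: keep fundamental widths as-is,
// otherwise round up to whole bytes and fall back to a byte array.
uint64_t storageBits(uint64_t bits) {
  if (bits == 8 || bits == 16 || bits == 32 || bits == 64)
    return bits;
  return ((bits + 7) / 8) * 8; // 50 -> 56, i.e. !cir.array<!u8i x 7>
}
```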

With this change, CIR would match CodeGen's layout exactly. It would require the following small code change:

```diff
diff --git a/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp b/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
index 7c1802bff077..17538b191738 100644
--- a/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
+++ b/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
@@ -616,7 +616,7 @@ CIRRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
       if (!InstallBest) {
         // Determine if accumulating the just-seen span will create an expensive
         // access unit or not.
-        mlir::Type Type = getBitfieldStorageType(astContext.toBits(AccessSize));
+        mlir::Type Type = getUIntNType(astContext.toBits(AccessSize));
         if (!astContext.getTargetInfo().hasCheapUnalignedBitFieldAccess())
           llvm_unreachable("NYI");

@@ -674,12 +674,12 @@ CIRRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
         // remain there after a stable sort.
         mlir::Type Type;
         if (BestClipped) {
-          assert(getSize(getBitfieldStorageType(
+          assert(getSize(getUIntNType(
                      astContext.toBits(AccessSize))) > AccessSize &&
                  "Clipped access need not be clipped");
           Type = getByteArrayType(AccessSize);
         } else {
-          Type = getBitfieldStorageType(astContext.toBits(AccessSize));
+          Type = getUIntNType(astContext.toBits(AccessSize));
           assert(getSize(Type) == AccessSize &&
                  "Unclipped access must be clipped");
         }
```

You can see a comparison between the two functions at https://godbolt.org/z/qjx1MaEWG.

I'm currently unsure whether using one function over the other has performance implications. Regarding the ARM error I mentioned in the meeting: it was an `assert` I had forgotten to update. It's now fixed; sorry for the confusion.

@andykaylor
Collaborator

It's difficult to compare the implementations without seeing a side-by-side comparison of the resulting LLVM IR for accessing elements in a structure. I suspect that representing the memory using an array of bytes is going to be bad compared to using power-of-two integers because they result in different physical representations on little-endian machines.

Consider this example:

https://godbolt.org/z/3eWjee55f

```cpp
struct S {
  int a : 4;
  int b : 27;
  int c : 17;
  int d : 2;
  int e : 15;
  unsigned f;
};

S s;

int f() {
  return s.c;
}
```

The current incubator produces

```llvm
%struct.S = type { i32, [3 x i8], i16, i32 }
```

I think we could get similar results with your PR by replacing a and b with a single integer field.

The LLVM IR generated looks like this:

```llvm
define dso_local range(i32 -65536, 65536) i32 @_Z1fv() local_unnamed_addr #0 !dbg !4 {
  %1 = load i24, ptr getelementptr inbounds nuw (i8, ptr @s, i64 4), align 4, !dbg !7
  %2 = shl i24 %1, 7, !dbg !7
  %3 = ashr exact i24 %2, 7, !dbg !7
  %4 = sext i24 %3 to i32, !dbg !7
  ret i32 %4, !dbg !8
}
```

That doesn't seem too bad until you consider that c spans all three bytes of the i24 unit at struct offset 4, and a 24-bit load gets legalized into separate 16-bit and 8-bit loads that must be recombined. The x86-64 assembly generated looks like this:

```asm
_Z1fv:
.L_Z1fv$local:
  movzx eax, byte ptr [rip + .Ls$local+6]
  movzx ecx, word ptr [rip + .Ls$local+4]
  shl ecx, 15
  shl eax, 31
  or eax, ecx
  sar eax, 15
  ret
```

We had to load two different values and combine them to get the correct result!

By contrast, classic codegen emits this:

```llvm
%struct.S = type { i32, i24, i16, i32 }
```

```llvm
define dso_local noundef i32 @_Z1fv() local_unnamed_addr #0 !dbg !30 {
  %1 = load i32, ptr getelementptr inbounds (%struct.S, ptr @s, i64 0, i32 1), align 4, !dbg !33
  %2 = shl i32 %1, 15, !dbg !33
  %3 = ashr exact i32 %2, 15, !dbg !33
  ret i32 %3, !dbg !34
}
```

And the generated x86-64 assembly is this:

```asm
f():
  mov eax, dword ptr [rip + s+4]
  shl eax, 15
  sar eax, 15
  ret
```

It almost looks like classic codegen is cheating by telling LLVM IR to load an i32 instead of an i24, but it can do that because using an i24 in the structure definition means that the endianness is baked into the structure.
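The shl/ashr pair in the IR above is just a sign extension from bit 16. A sketch of the same extraction in C++ (`extractC` is a hypothetical helper; it assumes c sits in the low 17 bits of a 32-bit access unit, as in the classic layout):

```cpp
#include <cstdint>

// Recover the signed 17-bit field c from a 32-bit access unit, mirroring
// the i32 load + shl 15 + ashr 15 sequence classic codegen emits.
int extractC(uint32_t unit) {
  // Shift c's sign bit (bit 16) up to bit 31, then arithmetic-shift back.
  return (int32_t)(unit << 15) >> 15;
}
```

Right-shifting a negative value is arithmetic on all mainstream targets (and guaranteed since C++20), which is exactly what `ashr` gives the IR.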

@andykaylor
Collaborator

That is to say, I think you should make the change to get this to emit the same structure that classic codegen emits.

@Andres-Salamanca
Contributor Author

Yes, @andykaylor, you're 100% correct.
Previously, with the array, it looked like this on my machine:

```asm
_Z1fv:                                  # @_Z1fv
.L_Z1fv$local:
    .type   .L_Z1fv$local,@function
    .cfi_startproc
# %bb.0:
    movzbl  .Ls$local+6(%rip), %eax
    shll    $16, %eax
    movzwl  .Ls$local+4(%rip), %ecx
    orl     %eax, %ecx

    movl    %ecx, %eax
    shlq    $47, %rax
    sarq    $47, %rax

    movl    %eax, -4(%rsp)
    movl    -4(%rsp), %eax
    retq
```

And now, using power-of-two types, it looks much better, exactly like CodeGen:

```asm
_Z1fv:                                  # @_Z1fv
.L_Z1fv$local:
    .type   .L_Z1fv$local,@function
    .cfi_startproc
# %bb.0:
    movq    .Ls$local(%rip), %rax
    shlq    $15, %rax
    sarq    $47, %rax

    retq
```

Thanks for the detailed explanation, Andy ;)

@bcardosolopes
Member

Thanks for fixing this. I agree with Andy here too, matching classic codegen is the way to go!

@bcardosolopes bcardosolopes merged commit 6ddc48e into llvm:main Jun 17, 2025
9 checks passed
terapines-osc-cir pushed a commit to Terapines/clangir that referenced this pull request Sep 2, 2025
