@paschalis-mpeis
Member

@paschalis-mpeis paschalis-mpeis commented Jun 24, 2025

In gs-pacret-autiasp.s, the undefined call `bl g` causes inconsistent
basic block splitting: on some platforms BOLT emits two blocks, on
others one.

Defining a dummy `g` symbol forces a single basic block everywhere.

@github-actions

github-actions bot commented Jun 24, 2025

✅ With the latest revision this PR passed the Python code formatter.

@paschalis-mpeis paschalis-mpeis force-pushed the users/paschalis-mpeis/rhel8-gadget-scanner branch from 8554d09 to f324145 Compare June 24, 2025 15:19
@paschalis-mpeis
Member Author

Hey @atrosinenko and @kbeyls,

This test fails on RHEL8. I am disabling it on that platform and letting you know in case it needs your attention.

I don't have a full RHEL8 environment, but I tested the os+version check I introduced in isolation.

Some (truncated) error logs I received:

**00:15:27**  ********************
**00:15:27**  Testing: 
**00:15:27**  FAIL: BOLT :: binary-analysis/AArch64/gs-pacret-autiasp.s (551 of 92523)
**00:15:27**  ******************** TEST 'BOLT :: binary-analysis/AArch64/gs-pacret-autiasp.s' FAILED ********************
**00:15:27**  Exit Code: 1
**00:15:27**  
**00:15:27**  Command Output (stderr):
**00:15:27**  + /workspace/build/stage/shared_lib_build/bin/llvm-bolt-binary-analysis --scanners=pacret /workspace/build/stage/shared_lib_build/tools/bolt/test/binary-analysis/AArch64/Output/gs-pacret-autiasp.s.tmp.exe
**00:15:27**  /workspace/src/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s:21:16: error: CHECK-NEXT: is not on the line after the previous match
**00:15:27**  // CHECK-NEXT: {{[0-9a-f]+}}: add x0, x0, #0x3
**00:15:27**                 ^
**00:15:27**  <stdin>:17:2: note: 'next' match was here
**00:15:27**   00010308: add x0, x0, #0x3
**00:15:27**   ^
**00:15:27**  <stdin>:12:44: note: previous match ended here
**00:15:27**   This happens in the following basic block:
**00:15:27**                                             ^
**00:15:27**  <stdin>:13:1: note: non-matching line after previous match is here
**00:15:27**   000102f8: paciasp
**00:15:27**  ^
**00:15:27**  /workspace/src/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s:43:16: error: CHECK-NEXT: is not on the line after the previous match
**00:15:27**  // CHECK-NEXT: {{[0-9a-f]+}}: add x0, x0, #0x3
**00:15:27**                 ^
**00:15:27**  <stdin>:30:2: note: 'next' match was here
**00:15:27**   00010324: add x0, x0, #0x3
**00:15:27**   ^
**00:15:27**  <stdin>:25:44: note: previous match ended here
**00:15:27**   This happens in the following basic block:
**00:15:27**                                             ^
**00:15:27**  <stdin>:26:1: note: non-matching line after previous match is here
**00:15:27**   00010314: paciasp
**00:15:27**  ^
... (truncated)

(cc: @pawosm-arm)

@paschalis-mpeis paschalis-mpeis marked this pull request as ready for review June 24, 2025 15:35
@llvmbot llvmbot added the BOLT label Jun 24, 2025
@llvmbot
Member

llvmbot commented Jun 24, 2025

@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/145527.diff

2 Files Affected:

  • (modified) bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s (+2-1)
  • (modified) bolt/test/lit.cfg.py (+4)
diff --git a/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s b/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s
index 284f0bea607a5..855fdb6465479 100644
--- a/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s
+++ b/bolt/test/binary-analysis/AArch64/gs-pacret-autiasp.s
@@ -1,3 +1,5 @@
+// REQUIRES: !rhel8
+
 // RUN: %clang %cflags -march=armv9.5-a+pauth-lr -mbranch-protection=pac-ret %s %p/../../Inputs/asm_main.c -o %t.exe
 // RUN: llvm-bolt-binary-analysis --scanners=pacret %t.exe 2>&1 | FileCheck %s
 
@@ -883,4 +885,3 @@ f_autib171615:
 // CHECK-NEXT: {{[0-9a-f]+}}:   ret
         ret
         .size f_autib171615, .-f_autib171615
-
diff --git a/bolt/test/lit.cfg.py b/bolt/test/lit.cfg.py
index 0d05229be2bf3..25c9e52d2e26f 100644
--- a/bolt/test/lit.cfg.py
+++ b/bolt/test/lit.cfg.py
@@ -75,6 +75,10 @@
 if lit.util.which("fuser"):
     config.available_features.add("fuser")
 
+rhel_release = "/etc/redhat-release"
+if os.path.exists(rhel_release) and "release 8" in open(rhel_release).read().lower():
+    config.available_features.add("rhel8")
+
 llvm_config.use_default_substitutions()
 
 llvm_config.config.environment["CLANG"] = config.bolt_clang
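
The feature check in the `lit.cfg.py` hunk can be tried without a RHEL8 machine by pointing it at a temporary file. A minimal sketch, assuming a hypothetical helper `is_rhel8` (not part of the patch) that takes the release-file path as a parameter; the sample release strings used with it are illustrative, not copied from a real system:

```python
import os
import tempfile


def is_rhel8(release_file="/etc/redhat-release"):
    """Return True if release_file exists and mentions 'release 8'.

    Hypothetical helper mirroring the condition in the lit.cfg.py hunk.
    """
    if not os.path.exists(release_file):
        return False
    with open(release_file) as f:
        return "release 8" in f.read().lower()


def check_with_contents(contents):
    """Write contents to a temporary file and run is_rhel8 on it."""
    with tempfile.NamedTemporaryFile("w", suffix=".release", delete=False) as tmp:
        tmp.write(contents)
        path = tmp.name
    try:
        return is_rhel8(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Illustrative release strings (assumed, not taken from the PR).
    print(check_with_contents("Red Hat Enterprise Linux release 8.10 (Ootpa)\n"))
    print(check_with_contents("Red Hat Enterprise Linux release 9.4 (Plow)\n"))
```

Because this is a plain substring test, it would also report `rhel8` for any other system whose `/etc/redhat-release` contains "release 8".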

@kbeyls
Collaborator

kbeyls commented Jun 24, 2025

If I understand correctly, this test has assembly as input, where we expect the exact same binary to be produced on all platforms. Also, the analyzer is expected to produce the same results, given a particular input binary, on all platforms.

For that reason, it seems to me that disabling this test on RHEL8 is the wrong fix. Would you be able to show the input to FileCheck, so that a more meaningful diff can be produced, and investigate the root cause of why this test produces a different result on RHEL8?

@paschalis-mpeis paschalis-mpeis marked this pull request as draft June 25, 2025 07:32
@paschalis-mpeis
Member Author

Hey Kristof,

Thanks for your review, I agree.
I'm marking this PR as draft and I'll ask internally for more details on the root cause.
I'll follow up once there's more clarity.

@atrosinenko
Contributor

Given that CHECK-NEXT on line 21 matched, but the match "is not on the line after the previous match" and "note: previous match ended here", I assume that some report was generated for f1 and

<stdin>:12:44: note: previous match ended here
 This happens in the following basic block:

corresponds to that report (specifically, it corresponds to CHECK-NEXT on line 20 of the test source). Then the f1 function should start at

000102f8: paciasp

With all the above, it looks like the report on RHEL8 would match something along these lines:

// CHECK-LABEL: GS-PAUTH: non-protected ret found in function f1, basic block {{[0-9a-zA-Z.]+}}, at address
// CHECK-NEXT:    The instruction is     {{[0-9a-f]+}}:       ret
// CHECK-NEXT:    The 1 instructions that write to the affected registers after any authentication are:
// CHECK-NEXT:    1. {{[0-9a-f]+}}: ldp     x29, x30, [sp], #0x10
// CHECK-NEXT:  This happens in the following basic block:
// CHECK-NEXT: {{[0-9a-f]+}}:   paciasp
// CHECK-NEXT: {{[0-9a-f]+}}:   stp     x29, x30, [sp, #-16]!
// CHECK-NEXT: {{[0-9a-f]+}}:   mov     x29, sp
// CHECK-NEXT: {{[0-9a-f]+}}:   bl      g
// CHECK-NEXT: {{[0-9a-f]+}}:   add     x0, x0, #0x3
// CHECK-NEXT: {{[0-9a-f]+}}:   ldp     x29, x30, [sp], #0x10
// CHECK-NEXT: {{[0-9a-f]+}}:   ret

Obviously, I don't suggest inserting these four lines into the test (as it would break other runners), but I wonder why bl g ends the basic block on other platforms in the first place.

@paschalis-mpeis
Member Author

Hey @atrosinenko,

Thanks a lot for your comment. Reading this back to see if I got it right.
(1) you are saying the following should (probably) work on RHEL8:

// CHECK-LABEL: GS-PAUTH: non-protected ret found in function f1, basic block {{[0-9a-zA-Z.]+}}, at address
// CHECK-NEXT:    The instruction is     {{[0-9a-f]+}}:       ret
// CHECK-NEXT:    The 1 instructions that write to the affected registers after any authentication are:
// CHECK-NEXT:    1. {{[0-9a-f]+}}: ldp     x29, x30, [sp], #0x10
// CHECK-NEXT:  This happens in the following basic block:
// CHECK-NEXT: {{[0-9a-f]+}}:   paciasp
+ // CHECK-NEXT: {{[0-9a-f]+}}:   stp     x29, x30, [sp, #-16]!
+ // CHECK-NEXT: {{[0-9a-f]+}}:   mov     x29, sp
+ // CHECK-NEXT: {{[0-9a-f]+}}:   bl      g
// CHECK-NEXT: {{[0-9a-f]+}}:   add     x0, x0, #0x3
// CHECK-NEXT: {{[0-9a-f]+}}:   ldp     x29, x30, [sp], #0x10
// CHECK-NEXT: {{[0-9a-f]+}}:   ret
        ret

(2) you're asking why on other platforms (non-RHEL8) a bl call splits the basic block?

So we should expect BOLT IR to produce a single block for f1. Given we achieve this consistently, we could add the above extra 3 lines to the test?

I can confirm that on my platform (a non-RHEL8 Linux) I get two basic blocks in f1, where the g symbol is undefined. If I replace the call target with something that exists (i.e. f_intermediate_overwrite1), I get a single block.

@kbeyls
Collaborator

kbeyls commented Jun 25, 2025

Thanks for the continued analysis @atrosinenko and @paschalis-mpeis !
This rings a bell to me now.
When I landed the initial pac-ret scanner, the bots complained because basic blocks were split differently compared to what I observed at the time locally on my slightly older version of BOLT. See the discussion at #128576

@paschalis-mpeis paschalis-mpeis changed the title [BOLT][AArch64] Skip gadget pacret test on RHEL8 [BOLT][AArch64] Make gs-pacret-autiasp.s deterministic Jun 25, 2025
@paschalis-mpeis paschalis-mpeis marked this pull request as ready for review June 25, 2025 12:46
Collaborator

@kbeyls kbeyls left a comment


Thanks, this looks like an acceptable work-around!

@paschalis-mpeis
Member Author

Hey @kbeyls and @atrosinenko,

That was really helpful, thank you! 🙏

I've added the g symbol and updated each test case affected to make the test deterministic.

The root cause of the unresolved symbol behaviour is still unclear. Perhaps BOLT treats that undefined call as non-external and assumes it cannot return/recover, causing the block to split?

We should keep this in mind and at some point investigate it further.
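
To make the hypothesis concrete, here is a toy model of how treating a call as a block terminator changes the split of f1. It is an illustration only: the function `split_basic_blocks` and its `call_ends_block` flag are made up for this sketch and say nothing about how BOLT actually builds its CFG. The instruction list is the f1 body quoted earlier in this thread.

```python
def split_basic_blocks(instructions, call_ends_block):
    """Split a straight-line instruction list into basic blocks.

    Toy model only: `ret` always ends a block, and `bl` ends a block when
    call_ends_block is True. This is an assumption for illustration, not a
    description of BOLT's CFG construction.
    """
    blocks, current = [], []
    for insn in instructions:
        current.append(insn)
        mnemonic = insn.split()[0]
        if mnemonic == "ret" or (mnemonic == "bl" and call_ends_block):
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Body of f1 as listed in the CHECK lines earlier in the thread.
F1 = [
    "paciasp",
    "stp x29, x30, [sp, #-16]!",
    "mov x29, sp",
    "bl g",
    "add x0, x0, #0x3",
    "ldp x29, x30, [sp], #0x10",
    "ret",
]

if __name__ == "__main__":
    for flag in (False, True):
        blocks = split_basic_blocks(F1, call_ends_block=flag)
        print(flag, len(blocks), [block[0] for block in blocks])
```

With `call_ends_block=False` the model yields one block starting at `paciasp`; with `True` it yields two, the second starting at `add x0, x0, #0x3`, which is the split the original CHECK lines expected.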

Contributor

@atrosinenko atrosinenko left a comment


LGTM as long as this still passes on other builders, thank you!

I observed some issues like this in the past, too. IIRC there were problems with calls performed via PLT. Furthermore, IIRC it may behave differently depending on whether --emit-relocs is passed to the linker, as suggested in the documentation of BOLT.

@paschalis-mpeis paschalis-mpeis merged commit 249f074 into main Jun 26, 2025
9 checks passed
@paschalis-mpeis paschalis-mpeis deleted the users/paschalis-mpeis/rhel8-gadget-scanner branch June 26, 2025 08:33
@paschalis-mpeis
Member Author

Thanks Kristof and Anatoly for the help and reviews!

cc @maksfb so he's aware of this non-determinism (here's the previous related discussion with @kbeyls).

anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025