Skip to content

microblaze: Fix -Os right shift optimization is allowed into delay slot #58

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: zephyr-gcc-14.3.0
Choose a base branch
from

Conversation

alpsayin
Copy link

@alpsayin alpsayin commented Jul 26, 2025

microblaze: Fix -Os right shift optimization is allowed into delay slot

During picolibc testing, it's found that -Os produces assembly code that
compiler squeezes into a single delay slot. Thus, only the first
instruction (the one in delay slot) emitted by this optimization is
executed and the rest is skipped.

This is a regression introduced by applying Microblaze gcc patches #24.
This patch is a 14.3.0 equivalent of #37.

Signed-off-by: Alp Sayin [email protected]


During picolibc testing, it's found that -Os produces assembly code that compiler
squeezes into a single delay slot. Thus, only the first instruction emitted by this optimization is run and the rest is skipped.
Optimization is generated by

  [(set (match_operand:SI 0 "register_operand" "=&d")
       (ashiftrt:SI (match_operand:SI 1 "register_operand"  "d")
                   (match_operand:SI 2 "immediate_operand" "I")))]
  "(INTVAL (operands[2]) > 5 && optimize_size)"
  {
    operands[3] = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);

    output_asm_insn ("ori\t%3,r0,%2", operands);
    if (REGNO (operands[0]) != REGNO (operands[1]))
        output_asm_insn ("addk\t%0,%1,r0", operands);

    output_asm_insn ("addik\t%3,%3,-1", operands);
    output_asm_insn ("bneid\t%3,.-4", operands);
    return "sra\t%0,%0";
  }
  [(set_attr "type"    "arith")
  (set_attr "mode"    "SI")
  (set_attr "length"  "20")]

But arith type is NOT disallowed from going into delay slot (see below):

[(and (eq_attr "type" "!branch,call,jump,icmp,multi,no_delay_arith,no_delay_load,no_delay_store,no_delay_imul,no_delay_move,darith")

"Optimized" code is between [191b8-191c8]

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

where operands are:

operands[0] = r24
operands[1] = r29
operands[2] = 20
operands[3] = r18

As a result, this code returns a iy (r24) value of (whatever was in r24) - 1023`

The fix is simple. I've redeclared this size-optimization as multi which is

  1. Not delay-slot allowed
  2. Also the same type for other shift optimizations (they're left shift optimizations)
  3. [(set_attr "type" "multi")
  4. [(set_attr "type" "multi")

For context, non-size-optimized code is N copies of:

sra	r24, r24 /* arithmetic 1-bit shift r24 and put result into r24 */

And as per the above optimization rule, if N <= 5 we'll still get 5 copies of sra instruction.

Originally tested via zephyrproject-rtos/sdk-ng#647
Fix seems to have worked, new disassembly as below:

previously now
if (subnormal_y) { /* subnormal y */ if (subnormal_y) { /* subnormal y */
191ac: b0007ff0 imm 32752 19198: b0007ff0 imm 32752
191b0: a47c0000 andi r3, r28, 0 1919c: a47c0000 andi r3, r28, 0
191b4: be2301ec bneid r3, 492 // 193a0 191a0: bc2301d8 bnei r3, 472 // 19378
/* jump to else but also somehow loop 20x single bit right shift */ /* jump to else no delay slot */
191b8: a2400014 ori r18, r0, 20
191bc: 131e0000 addk r24, r30, r0
191c0: 3252ffff addik r18, r18, -1
191c4: be32fffc bneid r18, -4 // 191c0
191c8: 93180001 sra r24, r24

After the jump

previously now
iy = (hy >> 20) - 1023; iy = (hy >> 20) - 1023;
/*jump to whatever but with delay-slot */ /* loop 20: single bit right shift */
193a0: b810fe40 brid -448 // 191e0 19378: a2400014 ori r18, r0, 20
193a4: 3318fc01 addik r24, r24, -1023 1937c: 131e0000 addk r24, r30, r0
/* -1023 executed and branched */ 19380: 3252ffff addik r18, r18, -1
19384: be32fffc bneid r18, -4 // 19380
19388: 93180001 sra r24, r24
1938c: b810fe2c brid -468 // 191b8
19390: 3318fc01 addik r24, r24, -1023

CC @keith-packard you can grab a copy of the toolchain from https://github.com/zephyrproject-rtos/sdk-ng/actions/runs/11407925329?pr=647 if it's of any interest.

p.s. new disassembly again but monospace cuz it hurts my eyes:

   19194:	bc830180 	bgti	r3, 384		// 19314
    if (subnormal_y) { /* subnormal y */
   19198:	b0007ff0 	imm	32752
   1919c:	a47c0000 	andi	r3, r28, 0
   191a0:	bc2301d8 	bnei	r3, 472		// 19378
...
        iy = (hy >> 20) - 1023;
   19378:	a2400014 	ori	r18, r0, 20
   1937c:	131e0000 	addk	r24, r30, r0
   19380:	3252ffff 	addik	r18, r18, -1
   19384:	be32fffc 	bneid	r18, -4		// 19380
   19388:	93180001 	sra	r24, r24
   1938c:	b810fe2c 	brid	-468		// 191b8
   19390:	3318fc01 	addik	r24, r24, -1023

compared to:

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

During picolibc testing, it's found that `-Os` produces assembly code that
compiler squeezes into a single delay slot. Thus, only the first
instruction (the one in delay slot) emitted by this optimization is
executed and the rest is skipped.

This is a regression introduced by applying Microblaze gcc patches zephyrproject-rtos#24.
This patch is a 14.3.0 equivalent of zephyrproject-rtos#37.

Signed-off-by: Alp Sayin <[email protected]>
@alpsayin alpsayin force-pushed the zephyr-gcc-14.3.0-bad-Os-shift-optimisation branch from 4f30e26 to 80f3cbf Compare July 26, 2025 09:15
@alpsayin alpsayin requested a review from stephanosio July 26, 2025 09:15
@alpsayin alpsayin added the bug Something isn't working label Jul 26, 2025
Copy link
Member

@stephanosio stephanosio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open an sdk-ng PR that pulls this patch.

Copy link
Member

@stephanosio stephanosio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lack the expertise on the microblaze arch to properly review this; but, based on the description provided, it looks reasonable.

@alpsayin
Copy link
Author

alpsayin commented Jul 26, 2025

Please open an sdk-ng PR that pulls this patch.

Will do, I'll need an SDK to run ALL the tests as before anyways. All zephyr tests + picolibc test-suite.
Working on binutils at the moment. The patchset there is a bit of a beast.

I lack the expertise on the microblaze arch to properly review this; but, based on the description provided, it looks reasonable.

Honestly, it's just assembly delay-slot out-of-order execution nonsense. Honestly I'm no compiler/linker expert either I just found out what's causing the issue, made some educated guesses and flipped a switch.
My understanding is as below, in case you're (anyone's) curious:

Machine instructions for branch could be suffixed with a d to indicate they'll execute the next-instruction-in-memory BEFORE the jump is complete.
Branches with delays have a fixed 2-cycle cost which is the reason delay-slot exists.
Branches without delays has a 3-cycle cost if target-speculation fails but success means 1-cycle.
Above we have bnei vs bneid : branch if not equal to 0 immediate (delay)

If I remember correctly, when this optimization isn't enabled (no -Os or when N<5) the delay-slot optimization actually takes care of multi-line-shift-1 emissions by placing 1 shift-1 into delay-slot and moving the N-1 of the shift-1 s to the branch-target. Branch penalty becomes 1 cycle effectively.

Apart from that;
When MB has a barrel-shifter, shift-N operations are single machine instruction.
When it lacks a barrel-shifter assembler generates a loop-N with single-shift (if N >= 5).

When -Os is enabled, it enables this ^^ and delay-slot-optimization (which probably is enabled via -O2 already) which then tries to slide-back delay-slottable next-line compiler-emissions into delay slots.

The problem here is that the emission of the loop is marked as arithmetic which is delay-slottable when in fact this optimization emission is not. So I marked it as multi-line so that linker wouldn't try to relocate the generated loop into delay slots.

And because -Os is enabled, seems linker preferred a single branch-no-delay instruction over 2-instructions of branchD&NOP, which tracks.
Normally, with -O2, I think compiler usually emits branchD&Nop, hoping that linker would clean it up.
And then I'm guessing -Os cleaned up the remaining branch&NOPs into branch-no-delays. ¯_(ツ)_/¯

And so result of this patch is that a no-delay-branch machine instruction (bnei) is used for branch because it's now understood that post-branch target-code is not delay-slottable.

@keith-packard
Copy link

The patch looks good; didn't we do the same thing in the old SDK? Picolibc testing only builds microblazeel, it doesn't have crt0 code, semihosting code or a qemu script to run them. Let's get those implemented so we can get better test coverage before SDK 0.18 gets released.

And, yes, running the picolibc tests during SDK build would be awesome. All that requires is access to the relevant qemu binary; picolibc contains all of the necessary code otherwise.

@stephanosio
Copy link
Member

And, yes, running the picolibc tests during SDK build would be awesome. All that requires is access to the relevant qemu binary; picolibc contains all of the necessary code otherwise.

There is currently QEMU 8.2.2 included in the sdk-build Docker image for libc testing (used by the LLVM build scripts). We could re-use this for GCC libc testing as well.

@keith-packard
Copy link

There is currently QEMU 8.2.2 included in the sdk-build Docker image for libc testing (used by the LLVM build scripts). We could re-use this for GCC libc testing as well.

To get the picolibc tests working, we'll need primitive crt0 and semihosting support that works with this qemu instance. I attempted to build that llvm version, but it's all a yocto mess, which makes compiling locally difficult. Is there a simple git repository with the qemu source code I can just pull and build?

@alpsayin
Copy link
Author

alpsayin commented Jul 27, 2025

@keith-packard the qemu binary should still be inside the SDK produced here? no?:
https://github.com/zephyrproject-rtos/sdk-ng/actions/runs/16540787327?pr=984
hosttools_linux-x86_64 - 66.1 MB
zephyr-sdk-0.18.0-alpha4-26-g76ff565_linux-x86_64 - 2.27 GB

we'll need primitive crt0 and semihosting support that works with this qemu instance

For some time I've been using https://github.com/alpsayin/picolibc/tree/microblaze-hack-rebased to run the picolibc tests.
This is still based on your initial work on https://github.com/picolibc/picolibc/tree/microblaze-hacks
Now that gcc & friends, and also picolibc have updated I'm not yet sure how well it'll play but in April it was A-OK and I think it will be OK.
I will get to run this tomorrow evening after-work (in about 18 hrs).

The patch looks good; didn't we do the same thing in the old SDK?

We never got to merge the fix ¯_(ツ)_/¯

@keith-packard
Copy link

For some time I've been using https://github.com/alpsayin/picolibc/tree/microblaze-hack-rebased to run the picolibc tests. This is still based on your initial work on https://github.com/picolibc/picolibc/tree/microblaze-hacks Now that gcc & friends, and also picolibc have updated I'm not yet sure how well it'll play but in April it was A-OK and I think it will be OK. I will get to run this tomorrow evening after-work (in about 18 hrs).

Would you believe I didn't remember having done that? So many CPUs, so little time...

Picolibc CI can either consume a Zephyr toolchain or a hand-built one from https://github.com/picolibc/picolibc-ci-tools. In a pinch, it can also use pre-built bits in a tarball it can download. I suspect it will be easiest to do the latter for now?

Let's find a way to hand picolibc CI a toolchain and get that running.

@stephanosio
Copy link
Member

. I attempted to build that llvm version, but it's all a yocto mess, which makes compiling locally difficult.

Not really sure what you mean by this; do you mean the one in the Zephyr SDK host tools, which used to be indeed a Yocto mess until I moved all the patches back to the fork.

Is there a simple git repository with the qemu source code I can just pull and build?

https://github.com/zephyrproject-rtos/qemu/tree/zephyr-qemu-v10.0.2

@keith-packard
Copy link

I wasn't remembering that picolibc already had all of the necessary bits to run the test suite and expected I'd need to debug picolibc, gcc and qemu all together, so I thought I needed to build them all from source locally. Now that @alpsayin reminded me about the existing picolibc support, none of that is required.

@alpsayin alpsayin force-pushed the zephyr-gcc-14.3.0-bad-Os-shift-optimisation branch 2 times, most recently from 227ce1e to 80f3cbf Compare August 5, 2025 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants