microblaze: Fix -Os right shift optimization is allowed into delay slot #58

alpsayin · 2025-07-26T09:13:33Z

microblaze: Fix -Os right shift optimization is allowed into delay slot

During picolibc testing, it's found that -Os produces assembly code that
compiler squeezes into a single delay slot. Thus, only the first
instruction (the one in delay slot) emitted by this optimization is
executed and the rest is skipped.

This is a regression introduced by applying Microblaze gcc patches #24.
This patch is a 14.3.0 equivalent of #37.

Signed-off-by: Alp Sayin [email protected]

During picolibc testing, it's found that -Os produces assembly code that compiler
squeezes into a single delay slot. Thus, only the first instruction emitted by this optimization is run and the rest is skipped.
Optimization is generated by

  [(set (match_operand:SI 0 "register_operand" "=&d")
       (ashiftrt:SI (match_operand:SI 1 "register_operand"  "d")
                   (match_operand:SI 2 "immediate_operand" "I")))]
  "(INTVAL (operands[2]) > 5 && optimize_size)"
  {
    operands[3] = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);

    output_asm_insn ("ori\t%3,r0,%2", operands);
    if (REGNO (operands[0]) != REGNO (operands[1]))
        output_asm_insn ("addk\t%0,%1,r0", operands);

    output_asm_insn ("addik\t%3,%3,-1", operands);
    output_asm_insn ("bneid\t%3,.-4", operands);
    return "sra\t%0,%0";
  }
  [(set_attr "type"    "arith")
  (set_attr "mode"    "SI")
  (set_attr "length"  "20")]

But arith type is NOT disallowed from going into delay slot (see below):

gcc/gcc/config/microblaze/microblaze.md

Line 466 in 428d8d7

    
           [(and (eq_attr "type" "!branch,call,jump,icmp,multi,no_delay_arith,no_delay_load,no_delay_store,no_delay_imul,no_delay_move,darith")

"Optimized" code is between [191b8-191c8]

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

where operands are:

operands[0] = r24
operands[1] = r29
operands[2] = 20
operands[3] = r18

As a result, this code returns a iy (r24) value of (whatever was in r24) - 1023`

The fix is simple. I've redeclared this size-optimization as multi which is

Not delay-slot allowed
Also the same type for other shift optimizations (they're left shift optimizations)
gcc/gcc/config/microblaze/microblaze.md

Line 2070 in 428d8d7

[(set_attr "type" "multi")
gcc/gcc/config/microblaze/microblaze.md

Line 2483 in 428d8d7

[(set_attr "type" "multi")

For context, non-size-optimized code is N copies of:

sra	r24, r24 /* arithmetic 1-bit shift r24 and put result into r24 */

And as per the above optimization rule, if N <= 5 we'll still get 5 copies of sra instruction.

Originally tested via zephyrproject-rtos/sdk-ng#647
Fix seems to have worked, new disassembly as below:

previously	now
if (subnormal_y) { /* subnormal y */	if (subnormal_y) { /* subnormal y */
191ac: b0007ff0 imm 32752	19198: b0007ff0 imm 32752
191b0: a47c0000 andi r3, r28, 0	1919c: a47c0000 andi r3, r28, 0
191b4: be2301ec bneid r3, 492 // 193a0	191a0: bc2301d8 bnei r3, 472 // 19378
/* jump to `else` but also somehow loop 20x single bit right shift */	/* jump to `else` no delay slot */
191b8: a2400014 ori r18, r0, 20
191bc: 131e0000 addk r24, r30, r0
191c0: 3252ffff addik r18, r18, -1
191c4: be32fffc bneid r18, -4 // 191c0
191c8: 93180001 sra r24, r24

After the jump

previously	now
iy = (hy >> 20) - 1023;	iy = (hy >> 20) - 1023;
/jump to whatever but with delay-slot /	/* loop 20: single bit right shift */
193a0: b810fe40 brid -448 // 191e0	19378: a2400014 ori r18, r0, 20
193a4: 3318fc01 addik r24, r24, -1023	1937c: 131e0000 addk r24, r30, r0
/* -1023 executed and branched */	19380: 3252ffff addik r18, r18, -1
	19384: be32fffc bneid r18, -4 // 19380
	19388: 93180001 sra r24, r24
	1938c: b810fe2c brid -468 // 191b8
	19390: 3318fc01 addik r24, r24, -1023

CC @keith-packard you can grab a copy of the toolchain from https://github.com/zephyrproject-rtos/sdk-ng/actions/runs/11407925329?pr=647 if it's of any interest.

p.s. new disassembly again but monospace cuz it hurts my eyes:

   19194:	bc830180 	bgti	r3, 384		// 19314
    if (subnormal_y) { /* subnormal y */
   19198:	b0007ff0 	imm	32752
   1919c:	a47c0000 	andi	r3, r28, 0
   191a0:	bc2301d8 	bnei	r3, 472		// 19378
...
        iy = (hy >> 20) - 1023;
   19378:	a2400014 	ori	r18, r0, 20
   1937c:	131e0000 	addk	r24, r30, r0
   19380:	3252ffff 	addik	r18, r18, -1
   19384:	be32fffc 	bneid	r18, -4		// 19380
   19388:	93180001 	sra	r24, r24
   1938c:	b810fe2c 	brid	-468		// 191b8
   19390:	3318fc01 	addik	r24, r24, -1023

compared to:

   191a8:	bc830194 	bgti	r3, 404		// 1933c
    if (subnormal_y) { /* subnormal y */
   191ac:	b0007ff0 	imm	32752
   191b0:	a47c0000 	andi	r3, r28, 0
   191b4:	be2301ec 	bneid	r3, 492		// 193a0
   191b8:	a2400014 	ori	r18, r0, 20
   191bc:	131e0000 	addk	r24, r30, r0
   191c0:	3252ffff 	addik	r18, r18, -1
   191c4:	be32fffc 	bneid	r18, -4		// 191c0
   191c8:	93180001 	sra	r24, r24
...
        iy = (hy >> 20) - 1023;
   193a0:	b810fe40 	brid	-448		// 191e0
   193a4:	3318fc01 	addik	r24, r24, -1023

During picolibc testing, it's found that `-Os` produces assembly code that compiler squeezes into a single delay slot. Thus, only the first instruction (the one in delay slot) emitted by this optimization is executed and the rest is skipped. This is a regression introduced by applying Microblaze gcc patches zephyrproject-rtos#24. This patch is a 14.3.0 equivalent of zephyrproject-rtos#37. Signed-off-by: Alp Sayin <[email protected]>

stephanosio

Please open an sdk-ng PR that pulls this patch.

stephanosio

I lack the expertise on the microblaze arch to properly review this; but, based on the description provided, it looks reasonable.

alpsayin · 2025-07-26T11:47:55Z

Please open an sdk-ng PR that pulls this patch.

Will do, I'll need an SDK to run ALL the tests as before anyways. All zephyr tests + picolibc test-suite.
Working on binutils at the moment. The patchset there is a bit of a beast.

I lack the expertise on the microblaze arch to properly review this; but, based on the description provided, it looks reasonable.

Honestly, it's just assembly delay-slot out-of-order execution nonsense. Honestly I'm no compiler/linker expert either I just found out what's causing the issue, made some educated guesses and flipped a switch.
My understanding is as below, in case you're (anyone's) curious:

Machine instructions for branch could be suffixed with a d to indicate they'll execute the next-instruction-in-memory BEFORE the jump is complete.
Branches with delays have a fixed 2-cycle cost which is the reason delay-slot exists.
Branches without delays has a 3-cycle cost if target-speculation fails but success means 1-cycle.
Above we have bnei vs bneid : branch if not equal to 0 immediate (delay)

If I remember correctly, when this optimization isn't enabled (no -Os or when N<5) the delay-slot optimization actually takes care of multi-line-shift-1 emissions by placing 1 shift-1 into delay-slot and moving the N-1 of the shift-1 s to the branch-target. Branch penalty becomes 1 cycle effectively.

Apart from that;
When MB has a barrel-shifter, shift-N operations are single machine instruction.
When it lacks a barrel-shifter assembler generates a loop-N with single-shift (if N >= 5).

When -Os is enabled, it enables this ^^ and delay-slot-optimization (which probably is enabled via -O2 already) which then tries to slide-back delay-slottable next-line compiler-emissions into delay slots.

The problem here is that the emission of the loop is marked as arithmetic which is delay-slottable when in fact this optimization emission is not. So I marked it as multi-line so that linker wouldn't try to relocate the generated loop into delay slots.

And because -Os is enabled, seems linker preferred a single branch-no-delay instruction over 2-instructions of branchD&NOP, which tracks.
Normally, with -O2, I think compiler usually emits branchD&Nop, hoping that linker would clean it up.
And then I'm guessing -Os cleaned up the remaining branch&NOPs into branch-no-delays. ¯_(ツ)_/¯

And so result of this patch is that a no-delay-branch machine instruction (bnei) is used for branch because it's now understood that post-branch target-code is not delay-slottable.

keith-packard · 2025-07-27T04:17:14Z

The patch looks good; didn't we do the same thing in the old SDK? Picolibc testing only builds microblazeel, it doesn't have crt0 code, semihosting code or a qemu script to run them. Let's get those implemented so we can get better test coverage before SDK 0.18 gets released.

And, yes, running the picolibc tests during SDK build would be awesome. All that requires is access to the relevant qemu binary; picolibc contains all of the necessary code otherwise.

stephanosio · 2025-07-27T05:13:05Z

And, yes, running the picolibc tests during SDK build would be awesome. All that requires is access to the relevant qemu binary; picolibc contains all of the necessary code otherwise.

There is currently QEMU 8.2.2 included in the sdk-build Docker image for libc testing (used by the LLVM build scripts). We could re-use this for GCC libc testing as well.

keith-packard · 2025-07-27T16:33:03Z

There is currently QEMU 8.2.2 included in the sdk-build Docker image for libc testing (used by the LLVM build scripts). We could re-use this for GCC libc testing as well.

To get the picolibc tests working, we'll need primitive crt0 and semihosting support that works with this qemu instance. I attempted to build that llvm version, but it's all a yocto mess, which makes compiling locally difficult. Is there a simple git repository with the qemu source code I can just pull and build?

alpsayin · 2025-07-27T22:42:03Z

@keith-packard the qemu binary should still be inside the SDK produced here? no?:
https://github.com/zephyrproject-rtos/sdk-ng/actions/runs/16540787327?pr=984
hosttools_linux-x86_64 - 66.1 MB
zephyr-sdk-0.18.0-alpha4-26-g76ff565_linux-x86_64 - 2.27 GB

we'll need primitive crt0 and semihosting support that works with this qemu instance

For some time I've been using https://github.com/alpsayin/picolibc/tree/microblaze-hack-rebased to run the picolibc tests.
This is still based on your initial work on https://github.com/picolibc/picolibc/tree/microblaze-hacks
Now that gcc & friends, and also picolibc have updated I'm not yet sure how well it'll play but in April it was A-OK and I think it will be OK.
I will get to run this tomorrow evening after-work (in about 18 hrs).

The patch looks good; didn't we do the same thing in the old SDK?

We never got to merge the fix ¯_(ツ)_/¯

keith-packard · 2025-07-27T23:37:46Z

For some time I've been using https://github.com/alpsayin/picolibc/tree/microblaze-hack-rebased to run the picolibc tests. This is still based on your initial work on https://github.com/picolibc/picolibc/tree/microblaze-hacks Now that gcc & friends, and also picolibc have updated I'm not yet sure how well it'll play but in April it was A-OK and I think it will be OK. I will get to run this tomorrow evening after-work (in about 18 hrs).

Would you believe I didn't remember having done that? So many CPUs, so little time...

Picolibc CI can either consume a Zephyr toolchain or a hand-built one from https://github.com/picolibc/picolibc-ci-tools. In a pinch, it can also use pre-built bits in a tarball it can download. I suspect it will be easiest to do the latter for now?

Let's find a way to hand picolibc CI a toolchain and get that running.

stephanosio · 2025-07-28T13:43:16Z

. I attempted to build that llvm version, but it's all a yocto mess, which makes compiling locally difficult.

Not really sure what you mean by this; do you mean the one in the Zephyr SDK host tools, which used to be indeed a Yocto mess until I moved all the patches back to the fork.

Is there a simple git repository with the qemu source code I can just pull and build?

https://github.com/zephyrproject-rtos/qemu/tree/zephyr-qemu-v10.0.2

keith-packard · 2025-07-28T17:50:11Z

I wasn't remembering that picolibc already had all of the necessary bits to run the test suite and expected I'd need to debug picolibc, gcc and qemu all together, so I thought I needed to build them all from source locally. Now that @alpsayin reminded me about the existing picolibc support, none of that is required.

github-project-automation bot added this to SDK / Docker / Toolchains Jul 26, 2025

alpsayin self-assigned this Jul 26, 2025

alpsayin force-pushed the zephyr-gcc-14.3.0-bad-Os-shift-optimisation branch from 4f30e26 to 80f3cbf Compare July 26, 2025 09:15

alpsayin requested a review from stephanosio July 26, 2025 09:15

alpsayin added the bug Something isn't working label Jul 26, 2025

alpsayin mentioned this pull request Jul 26, 2025

microblaze: Fix -Os right shift optimization is allowed into delay slot #37

Closed

stephanosio requested a review from keith-packard July 26, 2025 10:22

stephanosio reviewed Jul 26, 2025

View reviewed changes

stephanosio approved these changes Jul 26, 2025

View reviewed changes

alpsayin mentioned this pull request Jul 26, 2025

microblaze: apply toolchain patches from meta-xilinx repository zephyrproject-rtos/sdk-ng#984

Closed

alpsayin force-pushed the zephyr-gcc-14.3.0-bad-Os-shift-optimisation branch 2 times, most recently from 227ce1e to 80f3cbf Compare August 5, 2025 21:45

alpsayin mentioned this pull request Aug 5, 2025

config/microblaze: fix whitespace in LINK_SPEC is breaking linking #59

Merged

alpsayin mentioned this pull request Sep 5, 2025

Revert "ci: add microblazeel-zephyr-elf to target platforms" zephyrproject-rtos/sdk-ng#1011

Open

stephanosio approved these changes Sep 9, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

microblaze: Fix -Os right shift optimization is allowed into delay slot #58

microblaze: Fix -Os right shift optimization is allowed into delay slot #58

Uh oh!

alpsayin commented Jul 26, 2025 •

edited

Loading

Uh oh!

stephanosio left a comment

Uh oh!

stephanosio left a comment

Uh oh!

alpsayin commented Jul 26, 2025 •

edited

Loading

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

stephanosio commented Jul 27, 2025

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

alpsayin commented Jul 27, 2025 •

edited

Loading

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

stephanosio commented Jul 28, 2025

Uh oh!

keith-packard commented Jul 28, 2025

Uh oh!

Uh oh!

microblaze: Fix -Os right shift optimization is allowed into delay slot #58

Are you sure you want to change the base?

microblaze: Fix -Os right shift optimization is allowed into delay slot #58

Uh oh!

Conversation

alpsayin commented Jul 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephanosio left a comment

Choose a reason for hiding this comment

Uh oh!

stephanosio left a comment

Choose a reason for hiding this comment

Uh oh!

alpsayin commented Jul 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

stephanosio commented Jul 27, 2025

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

alpsayin commented Jul 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keith-packard commented Jul 27, 2025

Uh oh!

stephanosio commented Jul 28, 2025

Uh oh!

keith-packard commented Jul 28, 2025

Uh oh!

Uh oh!

alpsayin commented Jul 26, 2025 •

edited

Loading

alpsayin commented Jul 26, 2025 •

edited

Loading

alpsayin commented Jul 27, 2025 •

edited

Loading