@AWSjswinney
Contributor

aarch64: Optimize instruction scheduling in gf_5vect_dot_prod_neon

Implement advanced register allocation strategy that:

  • Allocates additional stack space for temporary register spilling
  • Uses shared temporary registers between adjacent sections (p4 for sections 1-2, p1 for sections 3-4, p2 for section 5)
  • Groups table lookup operations to improve instruction-level parallelism
  • Replaces individual loads with vector loads for better memory access patterns
  • Removes unnecessary prefetch instructions

This optimization improves encode performance by approximately 9.4%.
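As a rough illustration of the grouping idea (the register numbers and exact grouping below are schematic, not the actual assignment used in the patch), the GF(2^8) partial products are formed with nibble table lookups, and the change batches several independent tbl lookups ahead of the dependent eor accumulations:

```asm
// Schematic only: v0/v1 hold the low/high nibbles of the source data,
// v8-v11 hold the lookup tables for two destinations. The independent
// tbl lookups are grouped so a core with several vector pipes can issue
// them in parallel; the dependent eor accumulations follow as a group.
tbl     v20.16b, {v8.16b},  v0.16b      // dest1, low-nibble lookup
tbl     v21.16b, {v9.16b},  v1.16b      // dest1, high-nibble lookup
tbl     v22.16b, {v10.16b}, v0.16b      // dest2, low-nibble lookup
tbl     v23.16b, {v11.16b}, v1.16b      // dest2, high-nibble lookup

eor     v4.16b, v4.16b, v20.16b         // accumulate into dest1
eor     v4.16b, v4.16b, v21.16b
eor     v5.16b, v5.16b, v22.16b         // accumulate into dest2
eor     v5.16b, v5.16b, v23.16b
```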

aarch64: Optimize instruction scheduling in gf_4vect_dot_prod_neon

Improve performance by:

  • Grouping table lookup (tbl) instructions to enhance instruction-level parallelism
  • Replacing individual loads with paired loads (ldp) for better memory access patterns
  • Removing unnecessary prefetch instructions
  • Reordering operations to reduce pipeline stalls and data dependencies

This optimization improves decode performance by approximately 6.6%.
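For reference, the paired-load change looks roughly like the following (pointer and register names are illustrative): two adjacent 16-byte loads from the same source buffer are fused into a single ldp, halving the number of load instructions per iteration:

```asm
// Before (schematic): two separate 16-byte loads with post-increment
ldr     q0, [x4], #16
ldr     q1, [x4], #16

// After (schematic): one paired load fetches both vectors in a single
// instruction and advances the source pointer by 32 bytes
ldp     q0, q1, [x4], #32
```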

@AWSjswinney changed the title from "Jswinney/2025 07 21 scheduling cleanup" to "Optimize instruction scheduling in gf_5vect_dot_prod_neon and gf_4vect_dot_prod_neon" on Jul 21, 2025
@pablodelara
Contributor

@AWSjswinney thanks for the PR. Can you sign off the commits, as per the contribution guidelines? Just add "Signed-off-by: ...." in both, thanks!

Improve performance by:
- Grouping table lookup (tbl) instructions to enhance instruction-level parallelism
- Replacing individual loads with paired loads (ldp) for better memory access patterns
- Removing unnecessary prefetch instructions
- Reordering operations to reduce pipeline stalls and data dependencies

This optimization improves decode performance by approximately 6.6%.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement advanced register allocation strategy that:
- Allocates additional stack space for temporary register spilling
- Uses shared temporary registers between adjacent sections (p4 for sections 1-2, p1 for sections 3-4, p2 for section 5)
- Groups table lookup operations to improve instruction-level parallelism
- Replaces individual loads with vector loads for better memory access patterns
- Removes unnecessary prefetch instructions

This optimization improves encode performance by approximately 9.4%.

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney force-pushed the jswinney/2025-07-21-scheduling-cleanup branch from 2022b76 to a4f7b40 on July 24, 2025, 21:45
@AWSjswinney
Contributor Author

Sorry! In my rush to get this submitted I forgot to add the signed-off flag in the commit. I've fixed it now.

@pablodelara
Contributor

What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?

@liuqinfei
Contributor

Could you share your performance benchmark results? I recently ran verification tests on an ARM-based Kunpeng 920 server and observed no measurable benefit from the patch; performance actually declined in my trials.

```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: Kunpeng 920-6426

[root@node1 isa-l]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03 (LTS-SP3)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP3)"
ANSI_COLOR="0;31"

base commit 4e27f0b
erasure_code_base_perf: 14x9344 4
erasure_code_base_encode_warm: runtime = 3000138 usecs, bandwidth 237 MB in 3.0001 sec = 79.10 MB/s
erasure_code_base_decode_warm: runtime = 3000792 usecs, bandwidth 235 MB in 3.0008 sec = 78.51 MB/s
done all: Pass
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 6 4 2 7 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime = 3062711 usecs, bandwidth 9873 MB in 3.0627 sec = 3223.77 MB/s
erasure_code_decode_warm: runtime = 3062709 usecs, bandwidth 14251 MB in 3.0627 sec = 4653.26 MB/s
done all: Pass
Testing with 10 data buffers and 4 parity buffers (num errors = 4, in [ 4 6 2 9 ])
ec_encode_data_update_perf: 14x9344 4
ec_encode_data_update_warm: runtime = 3006088 usecs, bandwidth 15596 MB in 3.0061 sec = 5188.14 MB/s
ec_encode_data_update_single_src_warm: runtime = 3062509 usecs, bandwidth 53423 MB in 3.0625 sec = 17444.47 MB/s
ec_encode_data_update_single_src_simple_warm: runtime = 3000585 usecs, bandwidth 62951 MB in 3.0006 sec = 20979.88 MB/s
ec_encode_data_update_decode_warm: runtime = 3032860 usecs, bandwidth 11727 MB in 3.0329 sec = 3866.91 MB/s
done all: Pass
gf_vect_dot_prod: 10x8192
gf_vect_dot_prod_warm: runtime = 3062101 usecs, bandwidth 30699 MB in 3.0621 sec = 10025.64 MB/s
pass perf check
gf_vect_mul_perf:
Start timed tests
gf_vect_mul_warm: runtime = 3000763 usecs, bandwidth 36979 MB in 3.0008 sec = 12323.31 MB/s

update commit a4f7b40
erasure_code_base_perf: 14x9344 4
erasure_code_base_encode_warm: runtime = 3000095 usecs, bandwidth 237 MB in 3.0001 sec = 79.10 MB/s
erasure_code_base_decode_warm: runtime = 3000941 usecs, bandwidth 235 MB in 3.0009 sec = 78.51 MB/s
done all: Pass
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 6 4 2 7 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime = 3062477 usecs, bandwidth 8672 MB in 3.0625 sec = 2831.93 MB/s
erasure_code_decode_warm: runtime = 3024886 usecs, bandwidth 14104 MB in 3.0249 sec = 4662.70 MB/s
done all: Pass
Testing with 10 data buffers and 4 parity buffers (num errors = 4, in [ 4 6 2 9 ])
ec_encode_data_update_perf: 14x9344 4
ec_encode_data_update_warm: runtime = 3013910 usecs, bandwidth 15551 MB in 3.0139 sec = 5159.75 MB/s
ec_encode_data_update_single_src_warm: runtime = 3062573 usecs, bandwidth 53450 MB in 3.0626 sec = 17452.68 MB/s
ec_encode_data_update_single_src_simple_warm: runtime = 3000914 usecs, bandwidth 62382 MB in 3.0009 sec = 20787.86 MB/s
ec_encode_data_update_decode_warm: runtime = 3002366 usecs, bandwidth 11486 MB in 3.0024 sec = 3825.71 MB/s
done all: Pass
gf_vect_dot_prod: 10x8192
gf_vect_dot_prod_warm: runtime = 3000162 usecs, bandwidth 29923 MB in 3.0002 sec = 9973.87 MB/s
pass perf check
gf_vect_mul_perf:
Start timed tests
gf_vect_mul_warm: runtime = 3021651 usecs, bandwidth 37271 MB in 3.0217 sec = 12334.78 MB/s
```

@AWSjswinney
Contributor Author

> What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?

I have another patch coming for 1, 2, and 3. I just completed it today, but I still have some cleanup to do before I post it.

@pablodelara
Contributor

> What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?
>
> I have another patch coming for 1, 2, and 3. I just completed it today, but I still have some cleanup to do before I post it.

Hi @AWSjswinney. Will you send another patch soon? Thanks for your contributions!

Implement instruction scheduling optimization using strategic register reuse:

- Load data registers just-in-time before processing each section
- Reuse other data registers as temporaries for table lookups
- Group table lookup instructions together for better parallelism
- Group eor instructions together to reduce pipeline stalls
- Remove unnecessary prefetch instructions

This approach achieves instruction-level parallelism benefits without
stack spilling overhead by cleverly reusing data registers that are
not currently being processed as temporary storage.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement comprehensive optimization using advanced register reuse and
efficient memory access patterns:

- Use ld1 4-register loads for maximum memory bandwidth utilization
- Delay loading of data_4-7 until needed after processing data_0-3
- Reuse unloaded data registers as temporaries for table lookups
- Group table lookup and eor instructions for better parallelism
- Remove unnecessary prefetch instructions

This approach achieves optimal instruction scheduling without stack
spilling overhead by strategically timing data loads and reusing
registers as temporaries when they are not needed.

Signed-off-by: Jonathan Swinney <[email protected]>
Improve instruction-level parallelism through strategic instruction reordering:

- Remove unnecessary prefetch instructions
- Reorder dependent eor instruction pairs for better pipeline utilization
- Group independent operations together to reduce pipeline stalls
- Separate dependent instructions to allow parallel execution

This optimization reduces pipeline stalls by allowing the CPU to execute
more instructions in parallel, improving overall performance through
better utilization of the instruction pipeline.

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney changed the title from "Optimize instruction scheduling in gf_5vect_dot_prod_neon and gf_4vect_dot_prod_neon" to "Optimize instruction scheduling in gf_*vect_dot_prod_neon" on Sep 2, 2025
@AWSjswinney
Contributor Author

> Hi @AWSjswinney. Will you send another patch soon? Thanks for your contributions!

Apologies for the delay. I've added the patches for the other three functions.

aarch64: Optimize instruction scheduling in gf_vect_dot_prod_neon

Improve instruction-level parallelism through strategic instruction reordering:

  • Remove unnecessary prefetch instructions
  • Reorder dependent eor instruction pairs for better pipeline utilization
  • Group independent operations together to reduce pipeline stalls
  • Separate dependent instructions to allow parallel execution

This optimization reduces pipeline stalls by allowing the CPU to execute
more instructions in parallel, improving overall performance through
better utilization of the instruction pipeline.
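A minimal sketch of the reordering (registers are illustrative, not the exact sequence from the patch): interleaving each tbl with the eor that consumes it serializes the loop on the accumulator, while hoisting the independent lookups ahead of the accumulations exposes more parallelism:

```asm
// Before (schematic): each eor waits on the tbl immediately above it,
// and both eor instructions write the same accumulator, so the sequence
// executes as one long dependency chain.
tbl     v4.16b, {v16.16b}, v2.16b
eor     v0.16b, v0.16b, v4.16b
tbl     v5.16b, {v17.16b}, v3.16b
eor     v0.16b, v0.16b, v5.16b

// After (schematic): the two independent lookups can issue together,
// and only the final accumulation chain remains dependent.
tbl     v4.16b, {v16.16b}, v2.16b
tbl     v5.16b, {v17.16b}, v3.16b
eor     v0.16b, v0.16b, v4.16b
eor     v0.16b, v0.16b, v5.16b
```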

aarch64: Optimize instruction scheduling in gf_2vect_dot_prod_neon

Implement comprehensive optimization using advanced register reuse and
efficient memory access patterns:

  • Use ld1 4-register loads for maximum memory bandwidth utilization
  • Delay loading of data_4-7 until needed after processing data_0-3
  • Reuse unloaded data registers as temporaries for table lookups
  • Group table lookup and eor instructions for better parallelism
  • Remove unnecessary prefetch instructions

This approach achieves optimal instruction scheduling without stack
spilling overhead by strategically timing data loads and reusing
registers as temporaries when they are not needed.
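Sketched roughly (register and pointer names are illustrative), the idea is to pull in four source vectors with one ld1, and to postpone the second four-register load so those registers can serve as tbl scratch space in the meantime:

```asm
// Load data_0-3 with a single 4-register ld1 (one instruction, 64 bytes)
ld1     {v0.16b, v1.16b, v2.16b, v3.16b}, [x4], #64

// ... process data_0-3 here, using v4-v7 (not yet loaded with data_4-7)
// as scratch registers for the tbl results ...

// Only now load data_4-7, once the scratch use of v4-v7 is finished
ld1     {v4.16b, v5.16b, v6.16b, v7.16b}, [x4], #64
```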

aarch64: Optimize instruction scheduling in gf_3vect_dot_prod_neon

Implement instruction scheduling optimization using strategic register reuse:

  • Load data registers just-in-time before processing each section
  • Reuse other data registers as temporaries for table lookups
  • Group table lookup instructions together for better parallelism
  • Group eor instructions together to reduce pipeline stalls
  • Remove unnecessary prefetch instructions

This approach achieves instruction-level parallelism benefits without
stack spilling overhead by cleverly reusing data registers that are
not currently being processed as temporary storage.

Benchmark data

I've attached benchmarks across the following parameters:
k_range = 4,6,8,10,12,14,16,17,20,25
p_range = 1,2,3,4,5,6
e_range = 1,2,3,4,5,6

2025-09-02-vector-function-improvements.pdf

@AWSjswinney
Contributor Author

I reorganized the plots to separate encode and decode. It's a little easier to read that way.
2025-09-02-vector-function-improvements.pdf

@liuqinfei
Contributor

I performed local validation of the patch and observed that most results on the Kunpeng 920 platform regressed. A thorough review of the validation process and results is warranted to identify the root cause.

Relative change, optimized/base - 1:

kunpeng920 1 2 3 4 5 6
1-4-decode: -0.42% -2.90% -0.97% 2.44% -1.71% -2.76%
1-4-encode: 0.46% -22.94% -38.46% -13.78% -13.51% -3.62%
1-6-decode: -0.34% -4.43% 0.04% -4.27% -1.67% -0.35%
1-6-encode: 0.33% -16.71% -28.97% -16.17% -12.03% -13.01%
1-8-decode: 1.46% 0.47% 4.53% 0.70% 0.10% -1.19%
1-8-encode: 2.51% -24.35% -29.83% -16.95% -20.04% -13.16%
1-10-decode: 1.28% 0.06% 0.74% 0.61% -15.24% -1.59%
1-10-encode: 1.96% -21.04% -29.15% -19.61% -19.35% -13.27%
1-12-decode: 4.28% 0.73% 9.52% -1.30% -0.65% -2.59%
1-12-encode: 0.51% -24.82% -20.21% -16.86% -20.12% -13.48%
1-14-decode: 0.73% -7.61% -1.65% 0.97% -0.20% -1.24%
1-14-encode: 2.03% -15.94% -30.11% -17.61% -14.10% -14.56%
1-16-decode: -7.58% 6.96% 9.88% 1.70% 0.57% 0.43%
1-16-encode: -4.05% -18.91% -23.61% -15.47% -16.83% -12.79%
1-17-decode: -5.33% -6.53% 1.58% -0.49% 1.66% 0.11%
1-17-encode: -4.99% -26.57% -32.99% -13.31% -14.58% -12.30%
1-20-decode: 2.94% 2.07% -5.66% 2.95% -0.66% 1.04%
1-20-encode: 1.55% -24.20% -29.37% -17.25% -16.75% -11.43%
1-25-decode: 5.47% -3.53% -4.85% 4.53% -1.03% 0.48%
1-25-encode: 5.18% -23.33% -24.90% -17.31% -16.72% -12.03%
2-4-decode:   -23.35% -21.84% -18.16% -21.08% -18.23%
2-4-encode:   -23.27% -27.19% -13.52% -14.88% -11.59%
2-6-decode:   -3.92% -13.92% -23.40% -20.65% -20.51%
2-6-encode:   -3.37% -14.58% -17.65% -14.93% -10.73%
2-8-decode:   -23.26% -23.57% -20.71% -24.10% -22.48%
2-8-encode:   -23.26% -28.96% -16.80% -9.18% -12.58%
2-10-decode:   -19.31% -24.54% -24.63% -23.64% -9.60%
2-10-encode:   -20.02% -28.91% -18.53% -15.73% -8.63%
2-12-decode:   -20.94% -24.10% -18.49% -24.06% -24.10%
2-12-encode:   -20.96% -29.78% -18.20% -11.60% -12.06%
2-14-decode:   -19.51% -24.19% -24.89% -17.94% -21.38%
2-14-encode:   -19.77% -29.38% -17.04% -12.42% -12.78%
2-16-decode:   -25.66% -23.48% -23.19% -22.41% -18.56%
2-16-encode:   -26.76% -25.93% -13.24% -12.58% -12.26%
2-17-decode:   -22.81% -20.93% -25.15% -20.66% -19.56%
2-17-encode:   -21.56% -27.93% -22.59% -12.22% -11.99%
2-20-decode:   -23.59% -19.34% -14.81% -23.81% -23.80%
2-20-encode:   -25.10% -24.23% -17.25% -12.91% -12.52%
2-25-decode:   -23.43% -24.12% -22.56% -26.40% -22.42%
2-25-encode:   -23.55% -28.44% -19.20% -14.51% -12.35%
3-4-decode:     -28.62% -26.87% -27.24% -25.38%
3-4-encode:     -29.19% -13.88% -0.94% -12.56%
3-6-decode:     -29.51% -27.94% -12.74% -27.23%
3-6-encode:     -29.65% -16.57% -5.62% -12.28%
3-8-decode:     -29.00% -27.34% -27.75% -28.94%
3-8-encode:     -29.46% -5.58% -12.37% -14.33%
3-10-decode:     -29.55% -28.43% -19.26% -31.02%
3-10-encode:     -29.38% -17.35% -8.82% -14.48%
3-12-decode:     -30.01% -28.51% -33.46% -29.25%
3-12-encode:     -29.92% -16.99% -18.28% -15.99%
3-14-decode:     -30.77% -29.50% -30.23% -24.78%
3-14-encode:     -31.24% -17.99% -16.08% -9.96%
3-16-decode:     -21.09% -28.91% -31.30% -24.24%
3-16-encode:     -24.77% -22.74% -16.17% -11.41%
3-17-decode:     -22.21% -27.67% -27.81% -23.38%
3-17-encode:     -22.14% -18.73% -17.79% -12.52%
3-20-decode:     -22.76% -31.31% -22.93% -24.08%
3-20-encode:     -24.54% -19.24% -12.14% -10.56%
3-25-decode:     -21.45% -22.97% -23.75% -28.02%
3-25-encode:     -22.27% -14.83% -14.84% -12.91%
4-4-decode:       -12.40% -27.83% -13.18%
4-4-encode:       -13.39% -25.09% -11.51%
4-6-decode:       -16.76% -15.74% -15.79%
4-6-encode:       -16.87% -20.02% -12.56%
4-8-decode:       -15.93% -27.19% -6.36%
4-8-encode:       -16.51% -16.22% -8.84%
4-10-decode:       -6.84% -21.81% -7.74%
4-10-encode:       -6.75% -19.03% -7.16%
4-12-decode:       -23.25% -16.99% -9.83%
4-12-encode:       -24.10% -14.32% -12.39%
4-14-decode:       -18.66% -10.80% -18.08%
4-14-encode:       -17.14% -11.08% -11.83%
4-16-decode:       -18.40% -16.07% -20.90%
4-16-encode:       -17.41% -15.68% -14.24%
4-17-decode:       -22.17% -18.08% -15.79%
4-17-encode:       -22.46% -12.19% -12.00%
4-20-decode:       -17.82% -17.15% -16.53%
4-20-encode:       -18.04% -13.34% -13.62%
4-25-decode:       -16.04% -16.85% -22.44%
4-25-encode:       -16.40% -12.75% -10.25%
5-6-decode:         -14.70% -9.45%
5-6-encode:         -15.95% -9.19%
5-8-decode:         -13.64% -13.72%
5-8-encode:         -14.73% -12.24%
5-10-decode:         -12.75% -8.67%
5-10-encode:         -12.21% -6.03%
5-12-decode:         -13.83% -8.53%
5-12-encode:         -14.11% -8.12%
5-14-decode:         -16.70% -11.63%
5-14-encode:         -17.47% -8.64%
5-16-decode:         -18.65% -13.81%
5-16-encode:         -19.15% -11.52%
5-17-decode:         -13.62% -10.81%
5-17-encode:         -12.83% -9.45%
5-20-decode:         -15.79% -15.97%
5-20-encode:         -16.17% -12.99%
5-25-decode:         -14.65% -12.04%
5-25-encode:           -9.28%
6-6-decode:           -11.53%
6-6-encode:           -12.01%
6-8-decode:           -14.21%
6-8-encode:           -14.47%
6-10-decode:           -8.78%
6-10-encode:           -9.69%
6-12-decode:           -9.51%
6-12-encode:           -10.02%
6-14-decode:           -11.76%
6-14-encode:           -12.61%
6-16-decode:           -13.91%
6-16-encode:           -14.03%
6-17-decode:           -11.99%
6-17-encode:           -13.10%
6-20-decode:           -14.11%
6-20-encode:           -14.52%
6-25-decode:           -10.44%
6-25-encode:           -11.22%

@AWSjswinney
Contributor Author

It's probably because I optimized this for the 4 parallel vector execution units in the Neoverse V2 cores in AWS Graviton4. In a quick search I found that the Kunpeng 920 has 2 vector execution units, so it doesn't benefit from the scheduling changes that I made.

It's possible that this could be reimplemented with intrinsics, which would allow the compiler to do the scheduling based on flags like -mcpu=neoverse-v2 or -march=armv8.2-a. This would be a pretty big change and I'm not confident we could get the same performance out of a solution like that, but it's a possible path of exploration if the regression on the Kunpeng 920 makes this a non-starter.

@AWSjswinney
Contributor Author

I benchmarked on Graviton2, which uses Neoverse N1 cores and also has 2 vector pipelines, similar to the Kunpeng 920. Would you be open to dispatching a second version of the Neon functions, optimized for wider vector throughput?

@liuqinfei
Contributor

> I benchmarked on Graviton2, which uses Neoverse N1 cores and also has 2 vector pipelines, similar to the Kunpeng 920. Would you be open to dispatching a second version of the Neon functions, optimized for wider vector throughput?

I support this, to fully leverage the computational power of processors with varying levels of vector parallelism. Additionally, the proposed solution should account for the maintainability of the additional code versions.
