@AWSjswinney
Contributor

aarch64: Optimize instruction scheduling in gf_5vect_dot_prod_neon

Implement advanced register allocation strategy that:

  • Allocates additional stack space for temporary register spilling
  • Uses shared temporary registers between adjacent sections (p4 for sections 1-2, p1 for sections 3-4, p2 for section 5)
  • Groups table lookup operations to improve instruction-level parallelism
  • Replaces individual loads with vector loads for better memory access patterns
  • Removes unnecessary prefetch instructions

This optimization improves encode performance by approximately 9.4%.
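As a rough illustration of the grouping idea (the register numbers and exact grouping below are schematic, not the actual assignment used in the patch), the GF(2^8) partial products are formed with nibble table lookups, and the change batches several independent tbl lookups ahead of the dependent eor accumulations:

```asm
// Schematic only: v0/v1 hold the low/high nibbles of the source data,
// v8-v11 hold the lookup tables for two destinations. The independent
// tbl lookups are grouped so a core with several vector pipes can issue
// them in parallel; the dependent eor accumulations follow as a group.
tbl     v20.16b, {v8.16b},  v0.16b      // dest1, low-nibble lookup
tbl     v21.16b, {v9.16b},  v1.16b      // dest1, high-nibble lookup
tbl     v22.16b, {v10.16b}, v0.16b      // dest2, low-nibble lookup
tbl     v23.16b, {v11.16b}, v1.16b      // dest2, high-nibble lookup

eor     v4.16b, v4.16b, v20.16b         // accumulate into dest1
eor     v4.16b, v4.16b, v21.16b
eor     v5.16b, v5.16b, v22.16b         // accumulate into dest2
eor     v5.16b, v5.16b, v23.16b
```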

aarch64: Optimize instruction scheduling in gf_4vect_dot_prod_neon

Improve performance by:

  • Grouping table lookup (tbl) instructions to enhance instruction-level parallelism
  • Replacing individual loads with paired loads (ldp) for better memory access patterns
  • Removing unnecessary prefetch instructions
  • Reordering operations to reduce pipeline stalls and data dependencies

This optimization improves decode performance by approximately 6.6%.
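For reference, the paired-load change looks roughly like the following (pointer and register names are illustrative): two adjacent 16-byte loads from the same source buffer are fused into a single ldp, halving the number of load instructions per iteration:

```asm
// Before (schematic): two separate 16-byte loads with post-increment
ldr     q0, [x4], #16
ldr     q1, [x4], #16

// After (schematic): one paired load fetches both vectors in a single
// instruction and advances the source pointer by 32 bytes
ldp     q0, q1, [x4], #32
```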

@AWSjswinney changed the title from "Jswinney/2025 07 21 scheduling cleanup" to "Optimize instruction scheduling in gf_5vect_dot_prod_neon and gf_4vect_dot_prod_neon" on Jul 21, 2025
@pablodelara
Contributor

@AWSjswinney thanks for the PR. Can you sign off the commits, as per the contribution guidelines? Just add "Signed-off-by: ...." in both, thanks!

Improve performance by:
- Grouping table lookup (tbl) instructions to enhance instruction-level parallelism
- Replacing individual loads with paired loads (ldp) for better memory access patterns
- Removing unnecessary prefetch instructions
- Reordering operations to reduce pipeline stalls and data dependencies

This optimization improves decode performance by approximately 6.6%.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement advanced register allocation strategy that:
- Allocates additional stack space for temporary register spilling
- Uses shared temporary registers between adjacent sections (p4 for sections 1-2, p1 for sections 3-4, p2 for section 5)
- Groups table lookup operations to improve instruction-level parallelism
- Replaces individual loads with vector loads for better memory access patterns
- Removes unnecessary prefetch instructions

This optimization improves encode performance by approximately 9.4%.

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney force-pushed the jswinney/2025-07-21-scheduling-cleanup branch from 2022b76 to a4f7b40 on July 24, 2025, 21:45
@AWSjswinney
Contributor Author

Sorry! In my rush to get this submitted I forgot to add the signed-off flag in the commit. I've fixed it now.

@pablodelara
Contributor

What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?

@liuqinfei
Contributor

Could you share your performance benchmark results? I recently ran verification tests on an ARM-based Kunpeng 920 server and observed no measurable benefit from the patch; performance actually declined in my trials.

```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: Kunpeng 920-6426

[root@node1 isa-l]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03 (LTS-SP3)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP3)"
ANSI_COLOR="0;31"

base commit 4e27f0b
erasure_code_base_perf: 14x9344 4
erasure_code_base_encode_warm: runtime = 3000138 usecs, bandwidth 237 MB in 3.0001 sec = 79.10 MB/s
erasure_code_base_decode_warm: runtime = 3000792 usecs, bandwidth 235 MB in 3.0008 sec = 78.51 MB/s
done all: Pass
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 6 4 2 7 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime = 3062711 usecs, bandwidth 9873 MB in 3.0627 sec = 3223.77 MB/s
erasure_code_decode_warm: runtime = 3062709 usecs, bandwidth 14251 MB in 3.0627 sec = 4653.26 MB/s
done all: Pass
Testing with 10 data buffers and 4 parity buffers (num errors = 4, in [ 4 6 2 9 ])
ec_encode_data_update_perf: 14x9344 4
ec_encode_data_update_warm: runtime = 3006088 usecs, bandwidth 15596 MB in 3.0061 sec = 5188.14 MB/s
ec_encode_data_update_single_src_warm: runtime = 3062509 usecs, bandwidth 53423 MB in 3.0625 sec = 17444.47 MB/s
ec_encode_data_update_single_src_simple_warm: runtime = 3000585 usecs, bandwidth 62951 MB in 3.0006 sec = 20979.88 MB/s
ec_encode_data_update_decode_warm: runtime = 3032860 usecs, bandwidth 11727 MB in 3.0329 sec = 3866.91 MB/s
done all: Pass
gf_vect_dot_prod: 10x8192
gf_vect_dot_prod_warm: runtime = 3062101 usecs, bandwidth 30699 MB in 3.0621 sec = 10025.64 MB/s
pass perf check
gf_vect_mul_perf:
Start timed tests
gf_vect_mul_warm: runtime = 3000763 usecs, bandwidth 36979 MB in 3.0008 sec = 12323.31 MB/s

update commit a4f7b40
erasure_code_base_perf: 14x9344 4
erasure_code_base_encode_warm: runtime = 3000095 usecs, bandwidth 237 MB in 3.0001 sec = 79.10 MB/s
erasure_code_base_decode_warm: runtime = 3000941 usecs, bandwidth 235 MB in 3.0009 sec = 78.51 MB/s
done all: Pass
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 6 4 2 7 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime = 3062477 usecs, bandwidth 8672 MB in 3.0625 sec = 2831.93 MB/s
erasure_code_decode_warm: runtime = 3024886 usecs, bandwidth 14104 MB in 3.0249 sec = 4662.70 MB/s
done all: Pass
Testing with 10 data buffers and 4 parity buffers (num errors = 4, in [ 4 6 2 9 ])
ec_encode_data_update_perf: 14x9344 4
ec_encode_data_update_warm: runtime = 3013910 usecs, bandwidth 15551 MB in 3.0139 sec = 5159.75 MB/s
ec_encode_data_update_single_src_warm: runtime = 3062573 usecs, bandwidth 53450 MB in 3.0626 sec = 17452.68 MB/s
ec_encode_data_update_single_src_simple_warm: runtime = 3000914 usecs, bandwidth 62382 MB in 3.0009 sec = 20787.86 MB/s
ec_encode_data_update_decode_warm: runtime = 3002366 usecs, bandwidth 11486 MB in 3.0024 sec = 3825.71 MB/s
done all: Pass
gf_vect_dot_prod: 10x8192
gf_vect_dot_prod_warm: runtime = 3000162 usecs, bandwidth 29923 MB in 3.0002 sec = 9973.87 MB/s
pass perf check
gf_vect_mul_perf:
Start timed tests
gf_vect_mul_warm: runtime = 3021651 usecs, bandwidth 37271 MB in 3.0217 sec = 12334.78 MB/s
```

@AWSjswinney
Contributor Author

> What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?

I have another patch coming for 1, 2, and 3. I just completed it today, but I still have some cleanup to do before I post it.

@pablodelara
Contributor

> What about the other functions (1,2,3 vect)? Any reason why you are only targeting 4 and 5?
>
> I have another patch coming for 1, 2, and 3. I just completed it today, but I still have some cleanup to do before I post it.

Hi @AWSjswinney. Will you send another patch soon? Thanks for your contributions!

Implement instruction scheduling optimization using strategic register reuse:

- Load data registers just-in-time before processing each section
- Reuse other data registers as temporaries for table lookups
- Group table lookup instructions together for better parallelism
- Group eor instructions together to reduce pipeline stalls
- Remove unnecessary prefetch instructions

This approach achieves instruction-level parallelism benefits without
stack spilling overhead by cleverly reusing data registers that are
not currently being processed as temporary storage.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement comprehensive optimization using advanced register reuse and
efficient memory access patterns:

- Use ld1 4-register loads for maximum memory bandwidth utilization
- Delay loading of data_4-7 until needed after processing data_0-3
- Reuse unloaded data registers as temporaries for table lookups
- Group table lookup and eor instructions for better parallelism
- Remove unnecessary prefetch instructions

This approach achieves optimal instruction scheduling without stack
spilling overhead by strategically timing data loads and reusing
registers as temporaries when they are not needed.

Signed-off-by: Jonathan Swinney <[email protected]>
Improve instruction-level parallelism through strategic instruction reordering:

- Remove unnecessary prefetch instructions
- Reorder dependent eor instruction pairs for better pipeline utilization
- Group independent operations together to reduce pipeline stalls
- Separate dependent instructions to allow parallel execution

This optimization reduces pipeline stalls by allowing the CPU to execute
more instructions in parallel, improving overall performance through
better utilization of the instruction pipeline.

Signed-off-by: Jonathan Swinney <[email protected]>
@AWSjswinney changed the title from "Optimize instruction scheduling in gf_5vect_dot_prod_neon and gf_4vect_dot_prod_neon" to "Optimize instruction scheduling in gf_*vect_dot_prod_neon" on Sep 2, 2025
@AWSjswinney
Contributor Author

> Hi @AWSjswinney. Will you send another patch soon? Thanks for your contributions!

Apologies for the delay. I've added the patches for the other three functions.

aarch64: Optimize instruction scheduling in gf_vect_dot_prod_neon

Improve instruction-level parallelism through strategic instruction reordering:

  • Remove unnecessary prefetch instructions
  • Reorder dependent eor instruction pairs for better pipeline utilization
  • Group independent operations together to reduce pipeline stalls
  • Separate dependent instructions to allow parallel execution

This optimization reduces pipeline stalls by allowing the CPU to execute
more instructions in parallel, improving overall performance through
better utilization of the instruction pipeline.
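A minimal sketch of the reordering (registers are illustrative, not the exact sequence from the patch): interleaving each tbl with the eor that consumes it serializes the loop on the accumulator, while hoisting the independent lookups ahead of the accumulations exposes more parallelism:

```asm
// Before (schematic): each eor waits on the tbl immediately above it,
// and both eor instructions write the same accumulator, so the sequence
// executes as one long dependency chain.
tbl     v4.16b, {v16.16b}, v2.16b
eor     v0.16b, v0.16b, v4.16b
tbl     v5.16b, {v17.16b}, v3.16b
eor     v0.16b, v0.16b, v5.16b

// After (schematic): the two independent lookups can issue together,
// and only the final accumulation chain remains dependent.
tbl     v4.16b, {v16.16b}, v2.16b
tbl     v5.16b, {v17.16b}, v3.16b
eor     v0.16b, v0.16b, v4.16b
eor     v0.16b, v0.16b, v5.16b
```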

aarch64: Optimize instruction scheduling in gf_2vect_dot_prod_neon

Implement comprehensive optimization using advanced register reuse and
efficient memory access patterns:

  • Use ld1 4-register loads for maximum memory bandwidth utilization
  • Delay loading of data_4-7 until needed after processing data_0-3
  • Reuse unloaded data registers as temporaries for table lookups
  • Group table lookup and eor instructions for better parallelism
  • Remove unnecessary prefetch instructions

This approach achieves optimal instruction scheduling without stack
spilling overhead by strategically timing data loads and reusing
registers as temporaries when they are not needed.
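Sketched roughly (register and pointer names are illustrative), the idea is to pull in four source vectors with one ld1, and to postpone the second four-register load so those registers can serve as tbl scratch space in the meantime:

```asm
// Load data_0-3 with a single 4-register ld1 (one instruction, 64 bytes)
ld1     {v0.16b, v1.16b, v2.16b, v3.16b}, [x4], #64

// ... process data_0-3 here, using v4-v7 (not yet loaded with data_4-7)
// as scratch registers for the tbl results ...

// Only now load data_4-7, once the scratch use of v4-v7 is finished
ld1     {v4.16b, v5.16b, v6.16b, v7.16b}, [x4], #64
```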

aarch64: Optimize instruction scheduling in gf_3vect_dot_prod_neon

Implement instruction scheduling optimization using strategic register reuse:

  • Load data registers just-in-time before processing each section
  • Reuse other data registers as temporaries for table lookups
  • Group table lookup instructions together for better parallelism
  • Group eor instructions together to reduce pipeline stalls
  • Remove unnecessary prefetch instructions

This approach achieves instruction-level parallelism benefits without
stack spilling overhead by cleverly reusing data registers that are
not currently being processed as temporary storage.

Benchmark data

I've attached benchmarks across the following parameters:
k_range = 4,6,8,10,12,14,16,17,20,25
p_range = 1,2,3,4,5,6
e_range = 1,2,3,4,5,6

2025-09-02-vector-function-improvements.pdf

@AWSjswinney
Contributor Author

I reorganized the plots to separate encode and decode. It's a little easier to read that way.
2025-09-02-vector-function-improvements.pdf

@liuqinfei
Contributor

I performed local validation of the patch and observed that most results on the Kunpeng 920 platform regressed. A thorough review of the validation process and results is warranted to identify the root cause.

Relative change, optimized/base - 1:

kunpeng920 1 2 3 4 5 6
1-4-decode: -0.42% -2.90% -0.97% 2.44% -1.71% -2.76%
1-4-encode: 0.46% -22.94% -38.46% -13.78% -13.51% -3.62%
1-6-decode: -0.34% -4.43% 0.04% -4.27% -1.67% -0.35%
1-6-encode: 0.33% -16.71% -28.97% -16.17% -12.03% -13.01%
1-8-decode: 1.46% 0.47% 4.53% 0.70% 0.10% -1.19%
1-8-encode: 2.51% -24.35% -29.83% -16.95% -20.04% -13.16%
1-10-decode: 1.28% 0.06% 0.74% 0.61% -15.24% -1.59%
1-10-encode: 1.96% -21.04% -29.15% -19.61% -19.35% -13.27%
1-12-decode: 4.28% 0.73% 9.52% -1.30% -0.65% -2.59%
1-12-encode: 0.51% -24.82% -20.21% -16.86% -20.12% -13.48%
1-14-decode: 0.73% -7.61% -1.65% 0.97% -0.20% -1.24%
1-14-encode: 2.03% -15.94% -30.11% -17.61% -14.10% -14.56%
1-16-decode: -7.58% 6.96% 9.88% 1.70% 0.57% 0.43%
1-16-encode: -4.05% -18.91% -23.61% -15.47% -16.83% -12.79%
1-17-decode: -5.33% -6.53% 1.58% -0.49% 1.66% 0.11%
1-17-encode: -4.99% -26.57% -32.99% -13.31% -14.58% -12.30%
1-20-decode: 2.94% 2.07% -5.66% 2.95% -0.66% 1.04%
1-20-encode: 1.55% -24.20% -29.37% -17.25% -16.75% -11.43%
1-25-decode: 5.47% -3.53% -4.85% 4.53% -1.03% 0.48%
1-25-encode: 5.18% -23.33% -24.90% -17.31% -16.72% -12.03%
2-4-decode:   -23.35% -21.84% -18.16% -21.08% -18.23%
2-4-encode:   -23.27% -27.19% -13.52% -14.88% -11.59%
2-6-decode:   -3.92% -13.92% -23.40% -20.65% -20.51%
2-6-encode:   -3.37% -14.58% -17.65% -14.93% -10.73%
2-8-decode:   -23.26% -23.57% -20.71% -24.10% -22.48%
2-8-encode:   -23.26% -28.96% -16.80% -9.18% -12.58%
2-10-decode:   -19.31% -24.54% -24.63% -23.64% -9.60%
2-10-encode:   -20.02% -28.91% -18.53% -15.73% -8.63%
2-12-decode:   -20.94% -24.10% -18.49% -24.06% -24.10%
2-12-encode:   -20.96% -29.78% -18.20% -11.60% -12.06%
2-14-decode:   -19.51% -24.19% -24.89% -17.94% -21.38%
2-14-encode:   -19.77% -29.38% -17.04% -12.42% -12.78%
2-16-decode:   -25.66% -23.48% -23.19% -22.41% -18.56%
2-16-encode:   -26.76% -25.93% -13.24% -12.58% -12.26%
2-17-decode:   -22.81% -20.93% -25.15% -20.66% -19.56%
2-17-encode:   -21.56% -27.93% -22.59% -12.22% -11.99%
2-20-decode:   -23.59% -19.34% -14.81% -23.81% -23.80%
2-20-encode:   -25.10% -24.23% -17.25% -12.91% -12.52%
2-25-decode:   -23.43% -24.12% -22.56% -26.40% -22.42%
2-25-encode:   -23.55% -28.44% -19.20% -14.51% -12.35%
3-4-decode:     -28.62% -26.87% -27.24% -25.38%
3-4-encode:     -29.19% -13.88% -0.94% -12.56%
3-6-decode:     -29.51% -27.94% -12.74% -27.23%
3-6-encode:     -29.65% -16.57% -5.62% -12.28%
3-8-decode:     -29.00% -27.34% -27.75% -28.94%
3-8-encode:     -29.46% -5.58% -12.37% -14.33%
3-10-decode:     -29.55% -28.43% -19.26% -31.02%
3-10-encode:     -29.38% -17.35% -8.82% -14.48%
3-12-decode:     -30.01% -28.51% -33.46% -29.25%
3-12-encode:     -29.92% -16.99% -18.28% -15.99%
3-14-decode:     -30.77% -29.50% -30.23% -24.78%
3-14-encode:     -31.24% -17.99% -16.08% -9.96%
3-16-decode:     -21.09% -28.91% -31.30% -24.24%
3-16-encode:     -24.77% -22.74% -16.17% -11.41%
3-17-decode:     -22.21% -27.67% -27.81% -23.38%
3-17-encode:     -22.14% -18.73% -17.79% -12.52%
3-20-decode:     -22.76% -31.31% -22.93% -24.08%
3-20-encode:     -24.54% -19.24% -12.14% -10.56%
3-25-decode:     -21.45% -22.97% -23.75% -28.02%
3-25-encode:     -22.27% -14.83% -14.84% -12.91%
4-4-decode:       -12.40% -27.83% -13.18%
4-4-encode:       -13.39% -25.09% -11.51%
4-6-decode:       -16.76% -15.74% -15.79%
4-6-encode:       -16.87% -20.02% -12.56%
4-8-decode:       -15.93% -27.19% -6.36%
4-8-encode:       -16.51% -16.22% -8.84%
4-10-decode:       -6.84% -21.81% -7.74%
4-10-encode:       -6.75% -19.03% -7.16%
4-12-decode:       -23.25% -16.99% -9.83%
4-12-encode:       -24.10% -14.32% -12.39%
4-14-decode:       -18.66% -10.80% -18.08%
4-14-encode:       -17.14% -11.08% -11.83%
4-16-decode:       -18.40% -16.07% -20.90%
4-16-encode:       -17.41% -15.68% -14.24%
4-17-decode:       -22.17% -18.08% -15.79%
4-17-encode:       -22.46% -12.19% -12.00%
4-20-decode:       -17.82% -17.15% -16.53%
4-20-encode:       -18.04% -13.34% -13.62%
4-25-decode:       -16.04% -16.85% -22.44%
4-25-encode:       -16.40% -12.75% -10.25%
5-6-decode:         -14.70% -9.45%
5-6-encode:         -15.95% -9.19%
5-8-decode:         -13.64% -13.72%
5-8-encode:         -14.73% -12.24%
5-10-decode:         -12.75% -8.67%
5-10-encode:         -12.21% -6.03%
5-12-decode:         -13.83% -8.53%
5-12-encode:         -14.11% -8.12%
5-14-decode:         -16.70% -11.63%
5-14-encode:         -17.47% -8.64%
5-16-decode:         -18.65% -13.81%
5-16-encode:         -19.15% -11.52%
5-17-decode:         -13.62% -10.81%
5-17-encode:         -12.83% -9.45%
5-20-decode:         -15.79% -15.97%
5-20-encode:         -16.17% -12.99%
5-25-decode:         -14.65% -12.04%
5-25-encode:           -9.28%
6-6-decode:           -11.53%
6-6-encode:           -12.01%
6-8-decode:           -14.21%
6-8-encode:           -14.47%
6-10-decode:           -8.78%
6-10-encode:           -9.69%
6-12-decode:           -9.51%
6-12-encode:           -10.02%
6-14-decode:           -11.76%
6-14-encode:           -12.61%
6-16-decode:           -13.91%
6-16-encode:           -14.03%
6-17-decode:           -11.99%
6-17-encode:           -13.10%
6-20-decode:           -14.11%
6-20-encode:           -14.52%
6-25-decode:           -10.44%
6-25-encode:           -11.22%

@AWSjswinney
Contributor Author

It's probably because I optimized this for the 4 parallel vector execution units in the Neoverse V2 cores in AWS Graviton4. In a quick search I found that the Kunpeng 920 has 2 vector execution units, so it doesn't benefit from the scheduling changes that I made.

It's possible that this could be reimplemented with intrinsics, which would allow the compiler to do the scheduling based on flags like -mcpu=neoverse-v2 or -march=armv8.2-a. This would be a pretty big change and I'm not confident we could get the same performance out of a solution like that, but it's a possible path of exploration if the regression on the Kunpeng 920 makes this a non-starter.

@AWSjswinney
Contributor Author

I benchmarked on Graviton2, which uses Neoverse N1 cores and also has 2 vector pipelines, similar to the Kunpeng 920. Would you be open to dispatching a second version of the Neon functions, optimized for wider vector throughput?

@liuqinfei
Contributor

> I benchmarked on Graviton2, which uses Neoverse N1 cores and also has 2 vector pipelines, similar to the Kunpeng 920. Would you be open to dispatching a second version of the Neon functions, optimized for wider vector throughput?

I support this, to fully leverage the computational power of processors with varying levels of vector parallelism. Additionally, the proposed solution should account for the maintainability of the additional code versions.
