Optimize instruction scheduling in gf_*vect_dot_prod_neon #349
base: master
Conversation
@AWSjswinney thanks for the PR. Can you sign off the commits, as per the contribution guidelines? Just add "Signed-off-by: ..." in both, thanks!
Improve performance by:
- Grouping table lookup (tbl) instructions to enhance instruction-level parallelism
- Replacing individual loads with paired loads (ldp) for better memory access patterns
- Removing unnecessary prefetch instructions
- Reordering operations to reduce pipeline stalls and data dependencies

This optimization improves decode performance by approximately 6.6%.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement advanced register allocation strategy that:
- Allocates additional stack space for temporary register spilling
- Uses shared temporary registers between adjacent sections (p4 for sections 1-2, p1 for sections 3-4, p2 for section 5)
- Groups table lookup operations to improve instruction-level parallelism
- Replaces individual loads with vector loads for better memory access patterns
- Removes unnecessary prefetch instructions

This optimization improves encode performance by approximately 9.4%.

Signed-off-by: Jonathan Swinney <[email protected]>
Force-pushed from 2022b76 to a4f7b40
Sorry! In my rush to get this submitted I forgot to add the Signed-off-by line in the commits. I've fixed it now.
What about the other functions (1, 2, 3 vect)? Any reason why you are only targeting 4 and 5?
Could you share the performance benchmark results? I recently ran verification tests on an ARM-based Kunpeng 920 server and observed no measurable benefit; performance actually declined in the trials. Tested base commit 4e27f0b against update commit a4f7b40.
I have another patch coming for 1, 2, and 3. I just completed it today, but I still have some cleanup to do before I post it.
Hi @AWSjswinney. Will you send another patch soon? Thanks for your contributions!
Implement instruction scheduling optimization using strategic register reuse:
- Load data registers just-in-time before processing each section
- Reuse other data registers as temporaries for table lookups
- Group table lookup instructions together for better parallelism
- Group eor instructions together to reduce pipeline stalls
- Remove unnecessary prefetch instructions

This approach achieves instruction-level parallelism benefits without stack spilling overhead by reusing data registers that are not currently being processed as temporary storage.

Signed-off-by: Jonathan Swinney <[email protected]>
Implement comprehensive optimization using advanced register reuse and efficient memory access patterns:
- Use ld1 4-register loads for maximum memory bandwidth utilization
- Delay loading of data_4-7 until needed, after processing data_0-3
- Reuse unloaded data registers as temporaries for table lookups
- Group table lookup and eor instructions for better parallelism
- Remove unnecessary prefetch instructions

This approach achieves optimal instruction scheduling without stack spilling overhead by strategically timing data loads and reusing registers as temporaries when they are not needed.

Signed-off-by: Jonathan Swinney <[email protected]>
Improve instruction-level parallelism through strategic instruction reordering:
- Remove unnecessary prefetch instructions
- Reorder dependent eor instruction pairs for better pipeline utilization
- Group independent operations together to reduce pipeline stalls
- Separate dependent instructions to allow parallel execution

This optimization reduces pipeline stalls by allowing the CPU to execute more instructions in parallel, improving overall performance through better utilization of the instruction pipeline.

Signed-off-by: Jonathan Swinney <[email protected]>
Apologies for the delay. I've added the patches for the other three functions:

- aarch64: Optimize instruction scheduling in gf_vect_dot_prod_neon
- aarch64: Optimize instruction scheduling in gf_2vect_dot_prod_neon
- aarch64: Optimize instruction scheduling in gf_3vect_dot_prod_neon

Benchmark data: I've attached benchmarks across the following parameters:
I reorganized the plots to separate encode and decode. It's a little easier to read that way.
I performed local validation of the patch but observed that most test results on the Kunpeng 920 platform regressed. A thorough review of the validation process and results is warranted to identify root causes. (Attached results are reported as optimize/base - 1.)
It's probably because I optimized this for the 4 parallel vector execution units in the Neoverse-V2 cores in AWS Graviton4. In a quick search I found that the Kunpeng 920 has 2 vector execution units, so it doesn't benefit from the scheduling changes that I made. It's possible that this could be reimplemented with intrinsics, which would allow the compiler to do scheduling based on the target-CPU compiler flags.
I benchmarked on Graviton2, which is a Neoverse-N1 and also has 2 vector pipelines, similar to the Kunpeng 920. Would you be open to dispatching a second version of the NEON functions that are optimized for wider vector throughput?
I support this, to fully leverage the computational power of processors across varying concurrency levels. Additionally, the proposed solution should account for the maintainability of the code versions.
aarch64: Optimize instruction scheduling in gf_5vect_dot_prod_neon
This optimization improves encode performance by approximately 9.4%.

aarch64: Optimize instruction scheduling in gf_4vect_dot_prod_neon
This optimization improves decode performance by approximately 6.6%.