Add RVV vectorized optimization for omatcopy function on RISC-V64 architecture #5445
Conversation
- Created comprehensive performance test for omatcopy_ct function
- Added RVV vectorized implementation with conditional compilation
- Included build script for SG2044 server testing
- Support both scalar and RVV optimized versions
- Added throughput measurement and detailed performance metrics

Co-authored-by: gong-flying <[email protected]>
- Fixed matrix initialization to use row-major order
- Fixed memory allocation size calculation for output matrix
- Optimized RVV simulation with 4-way loop unrolling
- All test cases now pass correctness verification
- Performance improvements: up to 1.26x speedup for alpha=1.0 case

Co-authored-by: gong-flying <[email protected]>
…ation
- Implement block-based memory access optimization (64x64 blocks)
- Add 4-way loop unrolling to reduce loop overhead
- Optimize VSETVL calls to improve vectorization efficiency
- Add software prefetching for better memory access patterns
- Implement fast path for small matrices (<64x64)
- Add cross-compilation script for RISC-V testing
- Improve boundary handling with separate main/tail loops

Co-authored-by: gong-flying <[email protected]>
…-authored-by: gong-flying <[email protected]>
Removed temporary test files and benchmarks created for omatcopy RVV optimization validation:
- Removed benchmark scripts and result files
- Removed test executables (scalar vs RVV comparison)
- Removed temporary test source files

Previous testing on sg2044 showed the RVV implementation achieved ~30% performance improvement over the scalar version for 3000x4000 matrix operations (0.384s vs 0.552s per iteration).

Co-authored-by: gong-flying <[email protected]>
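The commits above describe strip-mined RVV loops with blocking, unrolling, and separate main/tail handling. As a rough illustration of the loop shape only (not the kernel added by this PR, and without the 64x64 blocking, 4-way unrolling, or prefetching), an alpha-scaled transposed copy could be strip-mined with the standard RVV C intrinsics as sketched below; the `__riscv_`-prefixed names follow the v1.0 intrinsics spec, and the feature-macro guard is one way to get the conditional compilation the commits mention.

```c
/* Illustrative sketch only, not the PR's kernel: a strip-mined RVV loop for
 * B = alpha * A^T, with unit-stride loads from a row of A and strided stores
 * into a column of B. The tail is handled automatically by the shorter vl
 * returned on the last strip. */
#if defined(__riscv_v_intrinsic)   /* guard is illustrative; macro depends on toolchain */
#include <riscv_vector.h>
#include <stddef.h>

void omatcopy_t_rvv_sketch(size_t rows, size_t cols, double alpha,
                           const double *a, size_t lda,
                           double *b, size_t ldb)
{
    for (size_t i = 0; i < rows; i++) {
        const double *src = a + i * lda;   /* row i of A, unit stride            */
        double *dst = b + i;               /* column i of B, stride ldb elements */
        size_t j = 0;
        while (j < cols) {
            size_t vl = __riscv_vsetvl_e64m8(cols - j);          /* strip-mining: vl <= remaining */
            vfloat64m8_t v = __riscv_vle64_v_f64m8(src + j, vl); /* unit-stride load              */
            v = __riscv_vfmul_vf_f64m8(v, alpha, vl);            /* scale by alpha                */
            __riscv_vsse64_v_f64m8(dst + j * ldb,                /* strided store into B          */
                                   (ptrdiff_t)(ldb * sizeof(double)), v, vl);
            j += vl;
        }
    }
}
#endif /* __riscv_v_intrinsic */
```

As the commit messages note, the actual kernel layers 64x64 blocking, 4-way unrolling, and software prefetching on top of a loop of this general shape.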
Thank you - the speedup looks very impressive. I do wonder a bit if it is also apparent at very small matrix sizes, or if loading the vector registers becomes too expensive compared to the actual operation?
Thanks for your interest! You're right to think about small matrix sizes. In our tests, for tiny matrices (much smaller than the vector length), the overhead of setting up the vector operations can erase the RVV speedup or even make the scalar version faster. Once the matrix grows past a certain threshold, where the actual computation outweighs the vector setup cost, RVV starts to pay off, as in the larger tests (e.g. 3000×4000, where we saw a ~30% speedup). So it comes down to the balance between the vector setup cost and the amount of work done with the vectors.
If you know from your testing where the threshold for "super small ones" lies, perhaps it would make sense to have a size check in the code that makes it use the scalar code? I notice that you threw out even the existing small_matrix_transpose() fast path for 8x8 and under, where I would not expect RVV to help at all.
Your suggestion is very valuable and is exactly the direction we plan to optimize next. In fact, after completing the initial RVV optimization, we had already noticed this performance break-even problem in small-matrix scenarios.
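To make the reviewer's suggestion concrete, a size-based dispatch could look like the hypothetical sketch below. The threshold, function names, and 8x8 cutoff (mirroring the existing small_matrix_transpose() fast path mentioned above) are placeholders, not code from this PR; the right cutoff would have to come from benchmarking on the target core (e.g. the SG2044 used in the tests above).

```c
#include <stddef.h>

/* Hypothetical cutoff below which RVV setup cost is assumed to outweigh the work. */
#define OMATCOPY_RVV_MIN_DIM 8

/* Scalar reference path: b[j*ldb + i] = alpha * a[i*lda + j]. */
void omatcopy_t_scalar(size_t rows, size_t cols, double alpha,
                       const double *a, size_t lda, double *b, size_t ldb)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            b[j * ldb + i] = alpha * a[i * lda + j];
}

/* Vectorized path, e.g. the strip-mined RVV loop sketched earlier (declaration only). */
void omatcopy_t_rvv_sketch(size_t rows, size_t cols, double alpha,
                           const double *a, size_t lda, double *b, size_t ldb);

/* Dispatch: keep tiny matrices on the scalar code, route the rest to the RVV kernel. */
void omatcopy_t_dispatch(size_t rows, size_t cols, double alpha,
                         const double *a, size_t lda, double *b, size_t ldb)
{
    if (rows <= OMATCOPY_RVV_MIN_DIM && cols <= OMATCOPY_RVV_MIN_DIM)
        omatcopy_t_scalar(rows, cols, alpha, a, lda, b, ldb);
    else
        omatcopy_t_rvv_sketch(rows, cols, alpha, a, lda, b, ldb);
}
```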
Overview
This PR implements an RVV (RISC-V Vector Extension) optimized version of the omatcopy function for OpenBLAS on the RISC-V64 architecture, significantly improving the performance of matrix copy operations.
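For context, the user-facing entry point backed by this kernel is OpenBLAS's omatcopy extension (cblas_domatcopy for double precision). A minimal call exercising the column-major transposed path might look like the sketch below; the test values and build command are illustrative, not taken from this PR.

```c
/* B = alpha * A^T out of place via OpenBLAS's cblas_domatcopy extension.
 * Build with e.g. `gcc demo.c -lopenblas` (flags are illustrative). */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* A is 3x2, column-major, lda = 3. */
    double a[6] = { 1, 2, 3,    /* column 0 */
                    4, 5, 6 };  /* column 1 */
    double b[6] = { 0 };        /* B = 2.0 * A^T is 2x3, column-major, ldb = 2. */

    cblas_domatcopy(CblasColMajor, CblasTrans,
                    3, 2,       /* rows and cols of the source A    */
                    2.0,        /* alpha                            */
                    a, 3,       /* source and its leading dimension */
                    b, 2);      /* destination and its leading dim  */

    for (int i = 0; i < 6; i++)
        printf("%.1f ", b[i]);  /* expected: 2.0 8.0 4.0 10.0 6.0 12.0 */
    printf("\n");
    return 0;
}
```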
Main Contributions
1. RVV Optimization Implementation
omatcopy_ct_rvv.c in the kernel/riscv64/ directory
2. Performance Testing and Validation
Performance Improvements
Test Environment
Performance Comparison
Key Improvements
Technical Features
Test Coverage
Co-authored-by: gong-flying [email protected]