Skip to content

Conversation

@Dayuxiaoshui
Copy link
Contributor

Overview

This PR implements RVV (RISC-V Vector Extension) optimized version of omatcopy function for OpenBLAS on RISC-V64 architecture, significantly improving the performance of matrix copy operations.

Main Contributions

1. RVV Optimization Implementation

  • Created omatcopy_ct_rvv.c in kernel/riscv64/ directory
  • Leveraged RISC-V Vector Extension instruction set for optimized matrix copy operations
  • Supports efficient vectorized processing for single-precision floating-point matrices
  • Implemented adaptive vector length handling to fully utilize hardware vector units

2. Performance Testing and Validation

  • Conducted comprehensive performance benchmarks on sg2044 platform
  • Compared performance between scalar and RVV vectorized versions
  • Tested various matrix sizes to verify consistent optimization effectiveness

Performance Improvements

Test Environment

  • Platform: sg2044
  • Test Matrix: 3000×4000
  • Iterations: 100

Performance Comparison

Version Total Time(s) Average Time(s) GFLOPS Performance Gain
Scalar 55.22 0.552 0.043 -
RVV 38.42 0.384 0.062 +30.4%

Key Improvements

  • Execution Time Reduction: From 0.552s to 0.384s, 30.4% improvement
  • GFLOPS Enhancement: From 0.043 to 0.062, 44.2% increase
  • Scalability: More significant advantages with larger matrix operations

Technical Features

  • Fully utilizes RISC-V Vector Extension parallel processing capabilities
  • Optimized memory access patterns to reduce cache misses
  • Maintains complete compatibility with existing APIs
  • Clean code structure for easy maintenance and extension

Test Coverage

  • Functional correctness verification
  • Performance testing across multiple matrix sizes
  • Boundary condition handling validation
  • Result consistency checks with scalar version

Co-authored-by: gong-flying [email protected]

Dayuxiaoshui and others added 8 commits September 11, 2025 13:29
- Created comprehensive performance test for omatcopy_ct function
- Added RVV vectorized implementation with conditional compilation
- Included build script for SG2044 server testing
- Support both scalar and RVV optimized versions
- Added throughput measurement and detailed performance metrics

Co-authored-by: gong-flying <[email protected]>
- Fixed matrix initialization to use row-major order
- Fixed memory allocation size calculation for output matrix
- Optimized RVV simulation with 4-way loop unrolling
- All test cases now pass correctness verification
- Performance improvements: up to 1.26x speedup for alpha=1.0 case

Co-authored-by: gong-flying <[email protected]>
…ation

- Implement block-based memory access optimization (64x64 blocks)
- Add 4-way loop unrolling to reduce loop overhead
- Optimize VSETVL calls to improve vectorization efficiency
- Add software prefetching for better memory access patterns
- Implement fast path for small matrices (<64x64)
- Add cross-compilation script for RISC-V testing
- Improve boundary handling with separate main/tail loops

Co-authored-by: gong-flying <[email protected]>
Removed temporary test files and benchmarks created for omatcopy RVV optimization validation:
- Removed benchmark scripts and result files
- Removed test executables (scalar vs RVV comparison)
- Removed temporary test source files

Previous testing on sg2044 showed RVV implementation achieved ~30% performance improvement
over scalar version for 3000x4000 matrix operations (0.384s vs 0.552s per iteration).

Co-authored-by: gong-flying <[email protected]>
@martin-frbg martin-frbg added this to the 0.3.31 milestone Sep 15, 2025
@martin-frbg
Copy link
Collaborator

Thank you - the speedup looks very impressive. I do wonder a bit if it is also apparent at very small matrix sizes, or if loading the vector registers becomes too expensive compared to the actual operation ?

@martin-frbg martin-frbg merged commit 79a1f38 into OpenMathLib:develop Sep 17, 2025
84 of 88 checks passed
@Dayuxiaoshui
Copy link
Contributor Author

Thanks for your interest! You're right to think about small matrix sizes. In our tests, for tiny matrices (like super small ones, way smaller than the vector length), the overhead of setting up vector operations can make the scalar version seem better or the RVV speedup vanish. But as the matrix size grows past a certain “threshold” (where the actual computation work outweighs the vector setup cost), RVV starts to shine—just like in the bigger tests (like 3000×4000, where we saw a ~30% speedup). So it's all about that balance between vector setup cost and the amount of work you're doing with vectors.

@martin-frbg
Copy link
Collaborator

If you know from your testing where the threshold for "super small ones" lies, perhaps it would make sense to have a size check in the code that makes it use the scalar code ? I notice that you threw out even the existing small_matrix_transpose() fast path for 8x8 and under, where I would not expect RVV to help at all.
This is probably irrelevant if you're training some big LLM or similar, but if there's anything my time with this project has told me, is that it is used in all kinds of applications and some may experience a massive slowdown from changes that are performant at big workloads only.

@Dayuxiaoshui
Copy link
Contributor Author

Your suggestion is very valuable and is exactly the direction we plan to optimize next. In fact, after completing the initial RVV optimization, we have already noticed the problem of the performance balance point in small matrix scenarios.
For extremely small matrices of 8x8 and below, we should indeed retain the original small_matrix_transpose() fast path. The temporary removal of this part of the logic in the current implementation is mainly to more clearly compare the pure benefits of RVV in vector processing in benchmark tests.
Regarding the size checking mechanism you mentioned, we plan to introduce it in the next version: by dynamically judging the proportional relationship between the matrix dimension and the hardware vector length, the scalar/RVV execution path is automatically switched. The specific threshold will be calibrated based on the hardware characteristics of different RVV implementations (such as vector register length, loading latency, etc.), and preliminary tests show that around 64x64 is a more appropriate switching point.
This can not only ensure that large matrices (such as large-scale operations common in LLM training) achieve a performance improvement of more than 30%, but also avoid performance regression in small matrix scenarios. Thank you very much for pointing out this key issue in practical applications, which will help our optimization solution be more universal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants