The current implementation iterates over matrix elements one at a time, which results in poor performance. To address this:

- Replace the element-by-element loops with whole-matrix operations.
- Add MPI or threading support, since most of the data can be processed independently.
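
As a rough illustration of both suggestions, here is a minimal sketch. It assumes the code is Python with NumPy available, which the note above does not state. The per-element computation `a * m + b` and all function names are hypothetical placeholders for whatever the real implementation computes.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def scale_and_shift_loop(m, a, b):
    """Element-by-element version, standing in for the current sequential code."""
    out = np.empty_like(m)
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            out[i, j] = a * m[i, j] + b  # placeholder computation (assumption)
    return out


def scale_and_shift_vectorized(m, a, b):
    """Same result expressed as a single whole-matrix operation."""
    return a * m + b


def scale_and_shift_threaded(m, a, b, workers=4):
    """Split the rows into independent blocks and process each block in a thread."""
    blocks = np.array_split(m, workers, axis=0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input blocks.
        results = list(
            pool.map(lambda blk: scale_and_shift_vectorized(blk, a, b), blocks)
        )
    return np.vstack(results)
```

Row blocks are used here because each block can be computed without reading any other block, which is the independence the second bullet relies on. Whether threads actually help depends on the real computation; an MPI version would distribute the same kind of blocks across processes instead.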