Performance upgrade needed

The current implementation iterates over matrix elements one at a time, which performs poorly. Two possible improvements:

- Replace the element-by-element calculation with whole-matrix operations.

- Add MPI or threading support, since most of the data can be processed independently.
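
To make the second point concrete, here is a rough sketch of row-level parallelism. It assumes Python and the standard library only, since the project's language is not stated here. `process_row` is a hypothetical stand-in for the real per-row computation, and the names `process_sequential` and `process_parallel` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_row(row):
    """Hypothetical per-row kernel; stands in for the real computation."""
    return sum(x * x for x in row)


def process_sequential(matrix):
    """Baseline: handle the rows one after another."""
    return [process_row(row) for row in matrix]


def process_parallel(matrix, max_workers=4):
    """Hand independent rows to a pool of worker threads.

    Executor.map returns results in input order, so the output
    lines up with the rows of the matrix.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_row, matrix))


if __name__ == "__main__":
    matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Both paths must agree, since rows do not depend on each other.
    assert process_parallel(matrix) == process_sequential(matrix)
    print(process_parallel(matrix))  # [14, 77, 194]
```

In CPython, threads only speed up work that releases the GIL (native array kernels, I/O). For pure-Python arithmetic like the placeholder above, swapping in `ProcessPoolExecutor` or distributing rows over MPI ranks would probably be the better fit.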