Earlier we submitted a related pull request: #859
In that pull request we used ArrayFire to write a preliminary GPU implementation of the RMSD optimalAlignment and posted the corresponding performance tests.
Because of the extra time spent transferring data between the host and the device, the speedup achieved is limited. We have discussed this issue with several developers.
As proposed by @carlocamilloni, we are opening this issue to discuss how to move the GPU implementation forward.
Any thoughts on the GPU implementation or on the pull request are welcome here.
Thanks!