Continuation of #132818. In #137545, an implementation was introduced that calls the hardware sqrt() to compute rsqrtf16, which turned out to be the fastest option (approximations were ~30% slower). However, some work remains: for targets that don't have fixed-precision floating points, an int-based approximation is still needed.
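To make the int-based direction concrete, here is a rough sketch of an integer-only Newton-Raphson iteration for the reciprocal square root. It is not the code from #137545 or anything that exists in the tree: the name `rsqrt_q16`, the Q16.16 fixed-point format, the linear initial guess, and the iteration count are all assumptions chosen for illustration. It only covers a mantissa already normalized to [1, 4); exponent handling, special values, subnormals, and final rounding to float16 are left out.

```cpp
#include <cstdint>

// Hypothetical helper (not part of LLVM libc): approximates 1/sqrt(x) using
// integer arithmetic only. Input and output are Q16.16 fixed point.
// Assumes x is in [1.0, 4.0), i.e. 0x10000 <= x < 0x40000.
inline uint32_t rsqrt_q16(uint32_t x) {
  const uint64_t ONE = 1u << 16; // 1.0 in Q16.16
  const uint64_t THREE = 3 * ONE; // 3.0 in Q16.16

  // Linear initial guess y0 = 1 - (x - 1) / 6. It matches 1/sqrt(x) at
  // x = 1 and x = 4 and is off by roughly 19% at worst in between.
  uint64_t y = ONE - (x - ONE) / 6;

  // Newton-Raphson step for f(y) = 1/y^2 - x:  y <- y * (3 - x*y^2) / 2.
  // Four steps were chosen here as an assumption; each step roughly
  // squares the relative error.
  for (int i = 0; i < 4; ++i) {
    uint64_t y2 = (y * y) >> 16;   // y^2 in Q16.16
    uint64_t xy2 = (x * y2) >> 16; // x * y^2 in Q16.16
    y = (y * (THREE - xy2)) >> 17; // extra shift by 1 is the division by 2
  }
  return static_cast<uint32_t>(y);
}
```

For example, `rsqrt_q16(0x20000)` (x = 2.0) should land within a few units in the last place of 46341, which is about 0.7071 in Q16.16. A real implementation would likely use a table or a better polynomial for the initial guess to cut the number of iterations.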
I would like to work on this. Please assign it to me. Thank you!