It seems this Java implementation needs an auxiliary array of the length of input. Is it really necessary? If I'm not mistaken, the paper actually says it shall sort in place.
Could you shed some light on that?
Btw. do you have any performance comparison to e.g. tuned quick sort written in Java? Could you post some preliminary benchmarks in the Readme of this repo?