I think the C# script is not well optimized. I have rewritten the code following the C++ implementation; my version is in this repository:
https://github.com/rabbitism/SkeletonTracingCSharp
From my rough testing, it is about 40% faster than the current C# implementation. However, your README says you excluded the thinning process, so may I know what code you actually ran for the 1000 replications?