This is very cool! Could you add a brief description of how the algorithm works to the README main page? So many alternatives in the source are enabled or disabled by macros that it's really hard to tell what's going on. Specifically, I'd really like to hear how you implement the 4x 8-bit (4-channel RGBA) histogram. Privatized counters in shared memory? Shared memory atomics? Hand-rolled exclusive read-modify-write in shared memory using naming collisions? Some other weird hashing variant?
Thanks!
Duane