triton-kernels

i write kernels when i'm bored and publish them here. some are efficient, some are not: native torch uses inline PTX on CUDA, so its built-in ops can be hard to beat.