For example, when sampling an array A of size (4, 4, 4) with count of size (4, 4) and prob of size (4,), an index i selects among CartesianIndices((4, 4)) and an index j selects among CartesianIndices((4,)); the kernel then iterates over the remaining indices (in this case, just the third axis of A). This avoids recomputing the constants needed by the BTRS algorithm for every element of A that shares the same count and prob.
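
The indexing scheme can be illustrated with a small CPU sketch. This is not the kernel from this PR: it is plain Python (0-based indices) rather than Julia, and the names btrs_constants, draw and sample are hypothetical. It assumes prob lines up with the first axis of A, so j is taken from the first component of i. The setup formulas are the BTRS constants as I recall them from Hörmann's description, and the draw itself is a naive stand-in rather than the BTRS rejection loop.

```python
import itertools
import math
import random


def btrs_constants(n, p):
    """Setup values that depend only on (n, p), not on the sample being drawn.

    Formulas recalled from Hörmann's BTRS description; treat them as illustrative.
    """
    spq = math.sqrt(n * p * (1.0 - p))
    b = 1.15 + 2.53 * spq
    return {
        "a": -0.0873 + 0.0248 * b + 0.01 * p,
        "b": b,
        "c": n * p + 0.5,
        "v_r": 0.92 - 4.2 / b,
        "alpha": (2.83 + 5.1 / b) * spq,
        "m": math.floor((n + 1) * p),
    }


def draw(n, p, consts, rng):
    # Stand-in for the BTRS rejection loop, which would consume `consts`.
    # Here we simply count successes in n Bernoulli trials.
    return sum(rng.random() < p for _ in range(n))


def sample(count, prob, depth, rng):
    rows, cols = len(count), len(count[0])
    A = [[[0] * depth for _ in range(cols)] for _ in range(rows)]
    setups = 0
    # i ranges over the indices of count, like CartesianIndices((4, 4)).
    for i in itertools.product(range(rows), range(cols)):
        j = i[0]  # assumption: prob is aligned with the first axis of A
        n, p = count[i[0]][i[1]], prob[j]
        consts = btrs_constants(n, p)  # computed once per (count, prob) pair
        setups += 1
        # Iterate over the remaining axis of A, reusing the same constants.
        for k in range(depth):
            A[i[0]][i[1]][k] = draw(n, p, consts, rng)
    return A, setups


rng = random.Random(0)
count = [[100 + 4 * r + c for c in range(4)] for r in range(4)]
prob = [0.2, 0.3, 0.4, 0.5]
A, setups = sample(count, prob, 4, rng)
print(setups, "setups for", 4 * 4 * 4, "samples")  # 16 setups for 64 samples
```

With one thread per element of A, the setup would instead run 64 times. The trade-off is that each thread now does more sequential work, which is why the benefit needs to be measured.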
This should probably only be merged if it is consistently faster; otherwise it could be offered as a user option, e.g. avoid_recomp.