The only need for the samples matrix is to build the B_j sub-blocks in the hessian. Should the samples matrix be stored on GPU, or could we fetch data from CPU by warp when necessary? This would require writing a custom kernel to do most efficiently, but would help with the GPU memory bottleneck we are seeing right now.