The FD shmem kernels launch thread blocks with too few threads per block when the vertical resolution is low, which can result in very poor performance. See https://github.com/CliMA/ClimaCore.jl/pull/2296#issuecomment-2815729495.
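
As a rough illustration of why small blocks hurt, here is a minimal sketch of the warp arithmetic. It assumes, hypothetically, that the block size equals the number of vertical levels (`nv`); the actual launch configuration is not stated here and may differ. The helper `warp_lane_utilization` is an illustrative name, not part of ClimaCore. The only hardware fact used is that NVIDIA GPUs schedule threads in warps of 32.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_lane_utilization(threads_per_block: int, warp_size: int = WARP_SIZE) -> float:
    """Fraction of warp lanes that hold an actual thread in one block.

    A block always occupies a whole number of warps, so a block with
    fewer threads than a multiple of `warp_size` leaves lanes idle.
    """
    if threads_per_block < 1:
        raise ValueError("threads_per_block must be positive")
    warps = math.ceil(threads_per_block / warp_size)
    return threads_per_block / (warps * warp_size)


if __name__ == "__main__":
    # Assumption: one thread per vertical level, one column per block.
    for nv in (4, 10, 32, 63, 64):
        print(f"nv={nv:3d}  lane utilization={warp_lane_utilization(nv):.2%}")
```

Under that assumption, a column with 10 levels fills only 10 of the 32 lanes in its single warp, so most of the hardware sits idle unless several columns are packed into each block.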