LLVM generates suboptimal code for llvm.ctlz on vectors of int64 when targeting x86-64 instruction sets older than AVX512 (SSE4 through AVX2). Performance measurements indicate that extracting each 64-bit value from the vector register (ymm on AVX2) and applying the scalar lzcnt instruction to it separately is 25% faster on AVX2 and 124% faster on SSE4 than the vectorized code LLVM currently emits for llvm.ctlz.
Please see the example here: https://ispc.godbolt.org/z/EEErrednx
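
For readers who do not want to open the link, here is a minimal scalar sketch of the per-lane approach described above. It is not the code from the Godbolt example: the helper names `ctlz64` and `ctlz64x4_per_lane` are hypothetical, the four-element array stands in for one ymm register, and `__builtin_clzll` is a GCC/Clang builtin that I assume the compiler lowers to lzcnt when that instruction is enabled for the target.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: leading-zero count of one 64-bit value.
// Returns 64 for a zero input, which is what the lzcnt instruction produces.
// __builtin_clzll is undefined for 0, so that case is handled explicitly.
inline std::uint64_t ctlz64(std::uint64_t x) {
    return x == 0 ? 64u : static_cast<std::uint64_t>(__builtin_clzll(x));
}

// Hypothetical helper: treat the array as the four 64-bit lanes of one
// 256-bit vector and count leading zeros lane by lane with scalar code,
// instead of relying on a vectorized ctlz lowering.
inline std::array<std::uint64_t, 4>
ctlz64x4_per_lane(const std::array<std::uint64_t, 4>& lanes) {
    std::array<std::uint64_t, 4> out{};
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        out[i] = ctlz64(lanes[i]);
    }
    return out;
}
```

The sketch only illustrates the semantics of the per-lane variant; the timings quoted above come from the linked example, not from this code.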