Skip to content

Another cmov_int16 proposal#78

Open
aguinetsb wants to merge 1 commit intopq-crystals:mainfrom
aguinetsb:cmov_proposal
Open

Another cmov_int16 proposal#78
aguinetsb wants to merge 1 commit intopq-crystals:mainfrom
aguinetsb:cmov_proposal

Conversation

@aguinetsb
Copy link
Copy Markdown

@aguinetsb aguinetsb commented Jun 8, 2024

This proposal does not conditionally load a value, which could be less future proof. Indeed, having only one "virtual branch" loading a value is what could cause a compiler to create a branch, as it can avoid a load from memory given a condition - which is usually considered as a profitable optimization.

It's not only a speculation as this is what seemed to have originally happened with clang 18. Let's indeed look at clang's optimization pipeline using compiler explorer's very nice viewer on the original poly_frommsg function (https://godbolt.org/z/eYrM3scbe):

  • before the X86 cmov conversion pass, we actually have "virtual" cmov instructions:
    image

  • after the X86 cmov conversion pass, we have the introduction of branches:
    image

Let's look at what this pass is doing: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L10-L31. I think we hit all the conditions in that case:

  1. we have cmovs in a loop
  2. it is considered as profitable to do the transformation, because I think it manages to save a mov to a register (see the comments in checkForProfitableCmovCandidates: https://github.com/llvm/llvm-project/blob/768118d1ad38bf13c545828f67bd6b474d61fc55/llvm/lib/Target/X86/X86CmovConversion.cpp#L421-L443 that give more details). It would be interesting to debug that analysis to really understand what is considered more profitable here, as the generated assembly doesn't look very efficient to me:

image

What we can note though is that if it figures out one of the "branch" of the cmov has a slower latency than the other, it will generate a branch. This is why I think the original patch is asking for troubles, and that we should avoid exposing such patterns.

As to when this has been introduced in Clang, for what I've checked clang 15 is already doing that optimization, so it's been "a while". Checking LLVM's git log could help answer that question more precisely if that's of interest.

This patch also defines the cmov_int16 function as noinline. Indeed, Link-Time Optimization (LTO) could inline that function: the compiler could then realize the select argument is just one bit, which could then generate a select from the LLVM IR point of view, and we could then have the x86 cmov conversion pass be tempted to generate a branch again.

As an idea for the future, we could verify codegen in a test by verifying that the code generated by the compiler does not contain any branch. This could be done by using LLVM's FileCheck utility - which is basically an enhanced grep that can reason with context. The test would basically check that in the general textual assembly, we don't have any test or branch instructions (depending on the targeted architecture). We could also check that on the LLVM IR, but:

  1. that would only work for clang-based compilers
  2. LLVM's codegen is the one generating a branch here (as stated above), so we wouldn't catch the issue.

This test could be run only if that tool is available in the environment, so that we don't make the build system more complex for everyone. I can make a first tentative if that's something of interest to the maintainers!

This proposal does not conditionally load a value, which could be less
future proof.

Indeed, having only one "branch" load a value is what could cause a
compiler to create a branch, as it can avoid a load from memory given a
condition - which is usually considered as an effective optimization.
danalec added a commit to danalec/kyber that referenced this pull request Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant