prov/efa, common: Fix collisions in QKEY generator #11874
Open
alekswn wants to merge 3 commits into ofiwg:main from
Replace time-based seeding with domain-scoped xorshift RNG state for generating RDM connection IDs (QKEY values). The RNG state is now:
- Initialized once per domain using ofi_generate_seed()
- Protected by domain lock to ensure thread-safe updates
- Constrained to 31-bit positive values to avoid privileged range

This eliminates repeated gettimeofday() calls and ensures unique connection IDs across the domain lifetime while maintaining the 0x7fffffff upper bound requirement.

The implementation rejects values with the high bit set (>0x7fffffff) to stay within the non-privileged QKEY range. While this rejection sampling has ~50% overhead, it preserves the full 31-bit output space and avoids birthday paradox collisions that would occur with simple masking of 32-bit state to 31-bit output.

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
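The rejection sampling described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names and the `CONNID_MAX_SKETCH` macro are hypothetical, the generator is Marsaglia's standard 32-bit xorshift with the (13, 17, 5) shift triple, and the domain lock is omitted.

```c
#include <stdint.h>

/* Assumed upper bound of the non-privileged QKEY range (from the commit text). */
#define CONNID_MAX_SKETCH 0x7fffffffu

/* One step of Marsaglia's 32-bit xorshift; the state must be nonzero. */
static uint32_t xorshift32_step_sketch(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Hypothetical connection ID generator: draw values until one has the
 * high bit clear. Roughly half of the draws are discarded.
 */
static uint32_t connid_by_rejection_sketch(uint32_t *state)
{
	uint32_t v;

	if (*state == 0)
		*state = 1; /* xorshift is stuck at zero, so avoid it */

	do {
		v = xorshift32_step_sketch(state);
	} while (v > CONNID_MAX_SKETCH);

	return v;
}
```

The full 32-bit state is kept between calls; only the returned value is filtered, which is why the loop is needed at all.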
Add ofi_lfsr31_r(), a 31-bit Linear Feedback Shift Register using primitive polynomial x^31 + x^3 + 1. This generator provides:
- Maximal period of 2^31-1 (2.1 billion unique values)
- No duplicates within the full period
- Guaranteed coverage of all values 1 to 0x7FFFFFFF
- Output naturally constrained to 31-bit positive range

Unlike xorshift with masking, LFSR31 ensures each value appears exactly once before the sequence repeats, avoiding birthday paradox collisions. The trinomial implementation (only 2 taps) provides efficient computation.

Primitive polynomial reference:
"Error Correction Coding" by Todd K. Moon (Wiley, 2005)
https://web.eecs.utk.edu/~jplank/plank/papers/CS-07-593/primitive-polynomial-table.txt

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
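A 31-bit LFSR over x^31 + x^3 + 1 could look roughly like the sketch below. It is not the ofi_lfsr31_r() from the PR, whose tap convention and zero-seed handling are not shown here. The name `lfsr31_step_sketch` is hypothetical, the Fibonacci form (new bit = bit 0 XOR bit 3, shifted in at bit 30) is one common way to realize this trinomial, and mapping a zero state to 1 is an assumption.

```c
#include <stdint.h>

#define LFSR31_MASK_SKETCH 0x7fffffffu

/*
 * Hypothetical Fibonacci LFSR step for the recurrence
 * s[n+31] = s[n+3] ^ s[n], i.e. polynomial x^31 + x^3 + 1.
 * Returns the new state, which is always in [1, 0x7fffffff].
 */
static uint32_t lfsr31_step_sketch(uint32_t *state)
{
	uint32_t s = *state & LFSR31_MASK_SKETCH;
	uint32_t feedback;

	if (s == 0)
		s = 1; /* assumption: the all-zero state is a fixed point, so replace it */

	feedback = (s ^ (s >> 3)) & 1u;   /* two taps: bit 0 and bit 3 */
	s = (s >> 1) | (feedback << 30);  /* shift right, insert at bit 30 */

	*state = s;
	return s;
}
```

Because the update is an invertible linear map on the 31-bit state, a nonzero state can never become zero, so no range check or retry loop is needed on the output.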
Replace xorshift with rejection sampling in efa_generate_rdm_connid() with ofi_lfsr31_r(). This eliminates the ~50% rejection overhead from discarding xorshift values >0x7FFFFFFF.

Benefits:
- No rejection sampling needed (LFSR31 output is always ≤0x7FFFFFFF)
- Guaranteed unique connection IDs within 2^31-1 period
- Simpler code without do-while loop

The LFSR31 state naturally stays within the non-privileged QKEY range [1, 0x7FFFFFFF], making it ideal for this use case.

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
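To illustrate how a per-domain state and lock might fit together, here is a self-contained sketch. The struct and function names are hypothetical and a plain pthread mutex stands in for the domain lock; only the field name `connid_random_state` is taken from the diff under review. The real efa_generate_rdm_connid() and the seed source (ofi_generate_seed()) are not reproduced.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the domain structure. */
struct domain_sketch {
	pthread_mutex_t lock;
	/* Random state to generate QKEY */
	uint32_t connid_random_state;
};

/* Seed the state once per domain; the caller supplies the seed. */
static void domain_sketch_init(struct domain_sketch *d, uint32_t seed)
{
	pthread_mutex_init(&d->lock, NULL);
	d->connid_random_state = seed & 0x7fffffffu;
	if (d->connid_random_state == 0)
		d->connid_random_state = 1; /* assumption: avoid the zero fixed point */
}

/* Advance the LFSR under the lock so concurrent callers never share a value. */
static uint32_t domain_sketch_next_connid(struct domain_sketch *d)
{
	uint32_t s, feedback;

	pthread_mutex_lock(&d->lock);
	s = d->connid_random_state;
	feedback = (s ^ (s >> 3)) & 1u;   /* taps for x^31 + x^3 + 1 */
	s = (s >> 1) | (feedback << 30);
	d->connid_random_state = s;
	pthread_mutex_unlock(&d->lock);

	return s;
}
```

Compared with the rejection-sampling version, the critical section is a fixed handful of instructions with no loop.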
Contributor
Author
bot:aws:retest
shijin-aws
approved these changes
Feb 9, 2026
Contributor
j-xiong
approved these changes
Feb 9, 2026
Contributor
Author
@j-xiong Could you restart Intel CI?
sunkuamzn
reviewed
Feb 10, 2026
	size_t mtu_size;
	size_t addrlen;
	/* Random state to generate QKEY */
	uint32_t connid_random_state;
Contributor
I think the random state needs to be stored at the IBV device level, because QPs in different PDs (Libfabric domains) can still receive packets meant for each other if the QKEY matches.
That's still local to each process.
This PR replaces time-based connection ID (QKEY) generation with a domain-scoped 31-bit LFSR random number generator in the EFA provider.
Changes
Before:
- Seeding from gettimeofday() on every connection ID generation
After:
- RNG state initialized once per domain using ofi_generate_seed()
- New ofi_lfsr31_r() function in common using primitive polynomial x^31 + x^3 + 1
Benefits
- No repeated gettimeofday() overhead