Skip to content

Comments

prov/efa, common: Fix collisions in QKEY generator#11874

Open
alekswn wants to merge 3 commits intoofiwg:mainfrom
alekswn:fix-efa_generate_rdm_connid
Open

prov/efa, common: Fix collisions in QKEY generator#11874
alekswn wants to merge 3 commits intoofiwg:mainfrom
alekswn:fix-efa_generate_rdm_connid

Conversation

@alekswn
Copy link
Contributor

@alekswn alekswn commented Feb 7, 2026

This PR replaces time-based connection ID (QKEY) generation with a domain-scoped 31-bit LFSR random number generator in the EFA provider.

Changes

Before:

  • Called gettimeofday() on every connection ID generation
  • Used time value as seed for one-off xorshift calculation
  • No state persistence between calls

After:

  • Domain-scoped LFSR31 state initialized once with ofi_generate_seed()
  • New ofi_lfsr31_r() function in common using primitive polynomial x^31 + x^3 + 1
  • Thread-safe state updates protected by domain lock

Benefits

  • No syscalls - Eliminates gettimeofday() overhead
  • Guaranteed uniqueness - Period of 2^31-1 with no duplicates
  • Thread-safe - Domain lock protects RNG state

Replace time-based seeding with domain-scoped xorshift RNG state for
generating RDM connection IDs (QKEY values). The RNG state is now:
- Initialized once per domain using ofi_generate_seed()
- Protected by domain lock to ensure thread-safe updates
- Constrained to 31-bit positive values to avoid privileged range

This eliminates repeated gettimeofday() calls and ensures unique
connection IDs across the domain lifetime while maintaining the
0x7fffffff upper bound requirement.

The implementation rejects values with the high bit set (>0x7fffffff)
to stay within the non-privileged QKEY range. While this rejection
sampling has ~50% overhead, it preserves the full 31-bit output space
and avoids birthday paradox collisions that would occur with simple
masking of 32-bit state to 31-bit output.

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
Add ofi_lfsr31_r(), a 31-bit Linear Feedback Shift Register using
primitive polynomial x^31 + x^3 + 1. This generator provides:
- Maximal period of 2^31-1 (2.1 billion unique values)
- No duplicates within the full period
- Guaranteed coverage of all values 1 to 0x7FFFFFFF
- Output naturally constrained to 31-bit positive range

Unlike xorshift with masking, LFSR31 ensures each value appears
exactly once before the sequence repeats, avoiding birthday paradox
collisions. The trinomial implementation (only 2 taps) provides
efficient computation.

Primitive polynomial reference:
"Error Correction Coding" by Todd K. Moon (Wiley, 2005)
https://web.eecs.utk.edu/~jplank/plank/papers/CS-07-593/primitive-polynomial-table.txt

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
Replace xorshift with rejection sampling in efa_generate_rdm_connid()
with ofi_lfsr31_r(). This eliminates the ~50% rejection overhead from
discarding xorshift values >0x7FFFFFFF.

Benefits:
- No rejection sampling needed (LFSR31 output is always ≤0x7FFFFFFF)
- Guaranteed unique connection IDs within 2^31-1 period
- Simpler code without do-while loop

The LFSR31 state naturally stays within the non-privileged QKEY range
[1, 0x7FFFFFFF], making it ideal for this use case.

Signed-off-by: Alexey Novikov <nalexey@amazon.com>
@alekswn alekswn changed the title prov/efa: Fix efa generate rdm connid prov/efa: Fix collisions in QKEY generator Feb 7, 2026
@alekswn
Copy link
Contributor Author

alekswn commented Feb 9, 2026

bot:aws:retest

3 similar comments
@alekswn
Copy link
Contributor Author

alekswn commented Feb 9, 2026

bot:aws:retest

@alekswn
Copy link
Contributor Author

alekswn commented Feb 9, 2026

bot:aws:retest

@alekswn
Copy link
Contributor Author

alekswn commented Feb 9, 2026

bot:aws:retest

@shijin-aws shijin-aws requested a review from j-xiong February 9, 2026 21:40
@shijin-aws shijin-aws changed the title prov/efa: Fix collisions in QKEY generator prov/efa, common: Fix collisions in QKEY generator Feb 9, 2026
@shijin-aws
Copy link
Contributor

need a review from @j-xiong on 82ca8ed

@alekswn
Copy link
Contributor Author

alekswn commented Feb 9, 2026

@j-xiong Could you restart Intel CI?

size_t mtu_size;
size_t addrlen;
/* Random state to generate QKEY */
uint32_t connid_random_state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the random state needs to be stored at the IBV device level because QPs in different PDs (Libfabric domains) can still get packets meant for each other if the QKEY matches

That's still local to each process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants