Skip to content
This repository was archived by the owner on Mar 16, 2023. It is now read-only.

Research notes, SIMD implementation, faster multiplication and GF(2^m) #1

@HarryR

Description

@HarryR

I also found this really useful too: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses a lot on binary curves, rather than prime)

However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf

This repository, in addition to https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s seems to be enough to adapt libff with the MULX + ADOX/ADCX optimisations, batching loads and then a batch of interleaved adox/adcx for best pipeline usage.

From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15) - The CLMUL instruction looks like it's unlikely to be able to outperform the above without being able to represent the G(p) field over G(2^n) (I can't find an appropriate irreducible polynomial,... I don't think that's possible to easily represent some arbitrary prime over a binary field)

You would assume that an optimising compiler could produce something similar to the following for fq_impl_int128:

Karatsuba Multiplication 128-bit x 128-bit = 256-bit

// c1c0 = a * b
void GF2m_mul_2x2_xmm(__m128i *c1, __m128i *c0, __m128i b,
__m128i a)
{
 __m128i t1, t2;
 *c0 = _mm_clmulepi64_si128(a, b, 0x00);
 *c1 = _mm_clmulepi64_si128(a, b, 0x11);
 t1 = _mm_shuffle_epi32(a, 0xEE);
 t1 = _mm_xor_si128(a, t1);
 t2 = _mm_shuffle_epi32(b, 0xEE);
 t2 = _mm_xor_si128(b, t2);
 t1 = _mm_clmulepi64_si128(t1, t2, 0x00);
 t1 = _mm_xor_si128(*c0, t1);
 t1 = _mm_xor_si128(*c1, t1);
 t2 = t1;
 t1 = _mm_slli_si128(t1, 8);
 t2 = _mm_srli_si128(t2, 8);
 *c0 = _mm_xor_si128(*c0, t1);
 *c1 = _mm_xor_si128(*c1, t2);
}

Karatsuba Multiplication 256-bit x 256-bit = 512-bit

// c3c2c1c0 = a1a0 * b1b0
{
 a1 = _mm_set_epi64x(a->d[3], a->d[2]);
 a0 = _mm_set_epi64x(a->d[1], a->d[0]);
 b1 = _mm_set_epi64x(b->d[3], b->d[2]);
 b0 = _mm_set_epi64x(b->d[1], b->d[0]);
 GF2m_mul_2x2_xmm(&c1, &c0, a0, b0);
 GF2m_mul_2x2_xmm(&c3, &c2, a1, b1);
 a0 = _mm_xor_si128(a0, a1);
 b0 = _mm_xor_si128(b0, b1);
 GF2m_mul_2x2_xmm(&c5, &c4, a0, b0);
 c4 = _mm_xor_si128(c4, c0);
 c4 = _mm_xor_si128(c4, c2);
 c5 = _mm_xor_si128(c5, c1);
 c5 = _mm_xor_si128(c5, c3);
 c1 = _mm_xor_si128(c1, c4);
 c2 = _mm_xor_si128(c2, c5);
}

Quick Squaring (256-bit)^2 = 512-bit

// c3c2c1c0 = a1a0 * a1a0
{
 c0 = _mm_clmulepi64_si128(a0, a0, 0x00);
 c1 = _mm_clmulepi64_si128(a0, a0, 0x11);
 c2 = _mm_clmulepi64_si128(a1, a1, 0x00);
 c3 = _mm_clmulepi64_si128(a1, a1, 0x11);
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions