Commit dc52fd2

Nicolas Pitre authored and gitster committed
block-sha1: split the different "hacks" to be individually selected

This is to make it easier for them to be selected individually depending
on the architecture instead of the other way around i.e. having each
architecture select a list of hacks up front. That makes for clearer
documentation as well.

Signed-off-by: Nicolas Pitre <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

1 parent 30ba0de commit dc52fd2

1 file changed: +18 -5 lines changed

block-sha1/sha1.c

Lines changed: 18 additions & 5 deletions
@@ -11,10 +11,16 @@
 
 #if defined(__i386__) || defined(__x86_64__)
 
+/*
+ * Force usage of rol or ror by selecting the one with the smaller constant.
+ * It _can_ generate slightly smaller code (a constant of 1 is special), but
+ * perhaps more importantly it's possibly faster on any uarch that does a
+ * rotate with a loop.
+ */
+
 #define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
 #define SHA_ROL(x,n) SHA_ASM("rol", x, n)
 #define SHA_ROR(x,n) SHA_ASM("ror", x, n)
-#define SMALL_REGISTER_SET
 
 #else
 
@@ -24,9 +30,6 @@
 
 #endif
 
-/* This "rolls" over the 512-bit array */
-#define W(x) (array[(x)&15])
-
 /*
  * If you have 32 registers or more, the compiler can (and should)
  * try to change the array[] accesses into registers. However, on
@@ -43,13 +46,23 @@
  * Ben Herrenschmidt reports that on PPC, the C version comes close
  * to the optimized asm with this (ie on PPC you don't want that
  * 'volatile', since there are lots of registers).
+ *
+ * On ARM we get the best code generation by forcing a full memory barrier
+ * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
+ * the stack frame size simply explode and performance goes down the drain.
  */
-#ifdef SMALL_REGISTER_SET
+
+#if defined(__i386__) || defined(__x86_64__)
 #define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#elif defined(__arm__)
+#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
 #else
 #define setW(x, val) (W(x) = (val))
 #endif
 
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])
+
 /*
  * Where do we get the source from? The first 16 iterations get it from
  * the input data, the next mix it from the 512-bit array.
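
For readers who want to poke at the two hacks outside of git, here is a minimal, portable sketch. It is not the code from block-sha1/sha1.c: `rol32`/`ror32` are plain C stand-ins for the inline-asm `SHA_ROL`/`SHA_ROR`, the names `setW_plain`, `setW_volatile`, `setW_barrier`, `schedule_rolling` and `schedule_full` are hypothetical and exist only for this illustration, and the mixing step is the standard SHA-1 message schedule recurrence rather than anything quoted from the patch. The point is only to show that all three store flavours write the same values into the rolling 16-word window; what differs between them is how the compiler is allowed to schedule and spill around the store.

```c
#include <assert.h>
#include <stdint.h>

/* Portable rotates; n must be 1..31 so neither shift is by 32 (undefined). */
uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* Same shape as the patch: a 16-word window addressed modulo 16. */
#define W(x) (array[(x) & 15])

/* The three store flavours from the diff; these names are illustrative only. */
#define setW_plain(x, val)    (W(x) = (val))
#define setW_volatile(x, val) (*(volatile uint32_t *)&W(x) = (val))
#if defined(__GNUC__)
/* Empty asm with a "memory" clobber: a compiler-level barrier (GCC extension). */
#define setW_barrier(x, val) \
    do { W(x) = (val); __asm__("" : : : "memory"); } while (0)
#else
#define setW_barrier(x, val)  setW_plain(x, val)
#endif

/* Expand 16 input words into 80 schedule words using the rolling window. */
void schedule_rolling(const uint32_t in[16], uint32_t out[80])
{
    uint32_t array[16];
    int t;

    for (t = 0; t < 16; t++) {
        setW_volatile(t, in[t]);
        out[t] = W(t);
    }
    for (t = 16; t < 80; t++) {
        /* W(t) still holds word t-16; t+13, t+8, t+2 are t-3, t-8, t-14 mod 16. */
        uint32_t mixed = rol32(W(t + 13) ^ W(t + 8) ^ W(t + 2) ^ W(t), 1);
        setW_barrier(t, mixed);
        out[t] = W(t);
    }
}

/* Reference version with a full 80-word array, for comparison. */
void schedule_full(const uint32_t in[16], uint32_t out[80])
{
    int t;

    for (t = 0; t < 16; t++)
        out[t] = in[t];
    for (t = 16; t < 80; t++)
        out[t] = rol32(out[t - 3] ^ out[t - 8] ^ out[t - 14] ^ out[t - 16], 1);
}
```

Calling `schedule_rolling` and `schedule_full` on the same 16 input words should fill both output arrays identically. Likewise `rol32(x, 30)` and `ror32(x, 2)` give the same result for any `x`, which is the equivalence the "smaller constant" comment in the first hunk relies on.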
