
Commit 926172c

torvalds authored and gitster committed
block-sha1: improve code on large-register-set machines
For x86 performance (especially in 32-bit mode) I added that hack to write the SHA1 internal temporary hash using a volatile pointer, in order to get gcc to not try to cache the array contents. Because gcc will do all the wrong things, and then spill things in insane random ways.

But on architectures like PPC, where you have 32 registers, it's actually perfectly reasonable to put the whole temporary array[] into the register set, and gcc can do so.

So make the 'volatile unsigned int *' cast be dependent on a SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just x86 and x86-64.

With that, the routine is fairly reasonable even when compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on a G5:

* Paulus asm version:        about 3.67s
* Yours with no change:      about 5.74s
* Yours without "volatile":  about 3.78s

so with this the C version is within about 3% of the asm one.

And add a lot of commentary on what the heck is going on.

Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 66c9c6c commit 926172c

File tree

1 file changed: +24 -1 lines changed

block-sha1/sha1.c

Lines changed: 24 additions & 1 deletion
@@ -82,6 +82,7 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)
 #define SHA_ASM(op, x, n) ({ unsigned int __res; __asm__(op " %1,%0":"=r" (__res):"i" (n), "0" (x)); __res; })
 #define SHA_ROL(x,n)	SHA_ASM("rol", x, n)
 #define SHA_ROR(x,n)	SHA_ASM("ror", x, n)
+#define SMALL_REGISTER_SET

 #else

@@ -93,7 +94,29 @@ void blk_SHA1_Final(unsigned char hashout[20], blk_SHA_CTX *ctx)

 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
-#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ */
+#ifdef SMALL_REGISTER_SET
+#define setW(x, val) (*(volatile unsigned int *)&W(x) = (val))
+#else
+#define setW(x, val) (W(x) = (val))
+#endif

 /*
  * Where do we get the source from? The first 16 iterations get it from
