Commit f915a3e

arm64: word-at-a-time: improve byte count calculations for LE
Do the same optimization as x86-64: do __ffs() on the intermediate value
that found whether there is a zero byte, before we've actually computed
the final byte mask.

The logic is:

 has_zero():
	Check if the word has a zero byte in it, which indicates the end
	of the loop, and prepare a value to be used for the rest of the
	sequence.

	The standard LE implementation just creates a word that has the
	high bit set in each byte of the word that was zero.

	Example: 0xaa00bbccdd00eeff -> 0x0080000000800000

 prep_zero_mask():
	Possibly do more prep to then clean up the initial fast result
	from has_zero, so that it can be combined with another zero mask
	with a simple logical "or" to create a final mask.

	This is only used on big-endian machines that use a different
	algorithm, and is a no-op here.

 create_zero_mask():
	This is "step 1" of creating the count and the mask, and is
	meant for any common operations between the two.

	In the old implementation, this actually created the zero mask,
	that was then used for masking and for counting the number of
	bits in the mask. In the new implementation, this is a no-op.

 count_zero():
	This takes the mask bits, and counts the number of bytes before
	the first zero byte.

	In the old implementation, it counted the number of bits in the
	final byte mask (which was the same as the C standard "find last
	set bit" that uses the silly "starts at one" counting) and
	shifted the value down by three.

	In the new implementation, we know the intermediate mask isn't
	zero, and it just does "find first set" with the sane semantics
	without any off-by-one issues, and again shifts by three (which
	also masks off the bit offset in the zero byte itself).

	Example: 0x0080000000800000 -> 2

 zero_bytemask():
	This takes the mask bits, and turns it into an actual byte mask
	of the bytes preceding the first zero byte.

	In the old implementation, this was a no-op, because the work
	had already been done by create_zero_mask(). In the new
	implementation, this does what create_zero_mask() used to do.

	Example: 0x0080000000800000 -> 0x000000000000ffff

The difference between the old and the new implementation is that
"count_zero()" ends up scheduling better because it is being done on a
value that is available earlier (before the final mask). But more
importantly, it can be implemented without the insane semantics of the
standard bit finding helpers that have the off-by-one issue and have
to special-case the zero mask situation.

On arm64, the new "count_zero()" ends up just "rbit + clz" plus the
shift right that then ends up being subsumed by the "add to final
length".

Signed-off-by: Linus Torvalds <[email protected]>
1 parent 4b8fa11 commit f915a3e

1 file changed (+3, -8 lines)

arch/arm64/include/asm/word-at-a-time.h

Lines changed: 3 additions & 8 deletions

@@ -27,20 +27,15 @@ static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
 }
 
 #define prep_zero_mask(a, bits, c) (bits)
+#define create_zero_mask(bits) (bits)
+#define find_zero(bits) (__ffs(bits) >> 3)
 
-static inline unsigned long create_zero_mask(unsigned long bits)
+static inline unsigned long zero_bytemask(unsigned long bits)
 {
 	bits = (bits - 1) & ~bits;
 	return bits >> 7;
 }
 
-static inline unsigned long find_zero(unsigned long mask)
-{
-	return fls64(mask) >> 3;
-}
-
-#define zero_bytemask(mask) (mask)
-
 #else /* __AARCH64EB__ */
 #include <asm-generic/word-at-a-time.h>
 #endif
