You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Optimize CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend by interleaving the CRC32_u64 calls at a lower level.
`CRC32_u64` generates `CRC32` x86 instruction which has 3 cycle latency. Because of that, the `crc` variable below causes a loop carried dependency of 3 cycles per iteration.
```
for (int i = 0; i < 8; i++) {
crc = CRC32_u64(static_cast<uint32_t>(crc), absl::little_endian::Load64(p));
p += 8;
}
```
Total latency for a 64-byte block is 29 cycles (codegen: https://godbolt.org/z/zxsrGMEPs, llvm-mca: https://godbolt.org/z/xrTMhhd1E).
So, it is more efficient to interleave (up to 3 calls because of the 3 cycle latency) the `CRC32_u64` calls at a lower level. Even if we interleave 3 streams, the total latency for (three) 64-byte blocks is 33 cycles (codegen: https://godbolt.org/z/5ojzPdj3h, llvm-mca: https://godbolt.org/z/5cEPxvddW). And this is without considering any inlining.
PiperOrigin-RevId: 799757460
Change-Id: I80118d5c1736ae31d69e5624c94cc0a6513ef28f
0 commit comments