Commit 88b11b4

sailorfragraggi authored and committed
tun: AMD64 optimized checksum
This adds AMD64 assembly implementations of IP checksum computation, one for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2).

All performance numbers reported are from a Ryzen 7 4750U but similar improvements are expected for a wide range of processors.

The generic IP checksum implementation has also been further improved to be significantly faster using bits.AddUint64 (for a 64KiB buffer the throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are also reported on ARM64 but I do not have specific numbers).

The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s and the AVX2 implementation is slightly over 107,000MiB/s.

Unfortunately, for very small sizes (e.g. the expected size for an IPv4 header) setting up SIMD computation involves some overhead that makes computing a checksum for small buffers slower than a non-SIMD implementation. Even more unfortunately, testing for this at runtime in Go and calling a func optimized for small buffers negates most of the improvement due to call overhead. The break-even point is around 256-byte buffers; IPv4 headers are no more than 60 bytes including extensions. IPv6 headers do not have a checksum but are a fixed size of 40 bytes. As a result, the generated assembly code uses an alternate approach for buffers of less than 256 bytes. Additionally, buffers of less than 32 bytes need to be handled specially because the strategy for reading buffers that are not a multiple of 8 bytes fails when the buffer is too small.

As suggested by additional benchmarking, pseudo header computation has been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).

Updates tailscale/corp#9755

Signed-off-by: Adrian Dewhurst <[email protected]>
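For readers unfamiliar with the carry-chained approach the message describes, here is a minimal sketch of a 64-bit-at-a-time IP checksum. It is not the code from tun/checksum.go (that diff is not rendered here). The name `checksumSketch` is hypothetical, and it uses `bits.Add64` from `math/bits`, which I assume is the carry-propagating add the message refers to. The sketch returns the folded ones' complement sum without inverting it.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// checksumSketch (hypothetical name) returns the 16-bit ones' complement sum
// of data, seeded with initial. The result is not inverted.
func checksumSketch(data []byte, initial uint16) uint16 {
	acc := uint64(initial)
	var carry uint64

	// Add 8 bytes per step, feeding each carry-out into the next add.
	for len(data) >= 8 {
		acc, carry = bits.Add64(acc, binary.BigEndian.Uint64(data), carry)
		data = data[8:]
	}
	// Handle the remaining 0..7 bytes in shrinking chunks.
	if len(data) >= 4 {
		acc, carry = bits.Add64(acc, uint64(binary.BigEndian.Uint32(data)), carry)
		data = data[4:]
	}
	if len(data) >= 2 {
		acc, carry = bits.Add64(acc, uint64(binary.BigEndian.Uint16(data)), carry)
		data = data[2:]
	}
	if len(data) == 1 {
		// An odd trailing byte is padded with a zero low byte.
		acc, carry = bits.Add64(acc, uint64(data[0])<<8, carry)
	}

	// End-around carry: fold the last carry back in (this can carry once more).
	acc, carry = bits.Add64(acc, 0, carry)
	acc += carry

	// Fold 64 bits down to 16; 2^16 is congruent to 1 modulo 0xffff.
	acc = (acc >> 32) + (acc & 0xffffffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	acc = (acc >> 16) + (acc & 0xffff)
	return uint16(acc)
}

func main() {
	// A 20-byte IPv4 header whose checksum field (0xb861) is already filled in.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
		0x40, 0x11, 0xb8, 0x61, 0xc0, 0xa8, 0x00, 0x01,
		0xc0, 0xa8, 0x00, 0xc7,
	}
	// A header with a correct checksum field sums to 0xffff.
	fmt.Printf("%#04x\n", checksumSketch(hdr, 0))
}
```

Summing wide big-endian words works because the 16-bit ones' complement sum depends only on the value modulo 0xffff, so the word width can grow as long as every carry is eventually folded back in.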
1 parent ec6f23b commit 88b11b4

10 files changed: +2266 -115 lines changed

tun/checksum.go

Lines changed: 678 additions & 86 deletions
Large diffs are not rendered by default.

tun/checksum_amd64.go

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+package tun
+
+import "golang.org/x/sys/cpu"
+
+// checksum computes an IP checksum starting with the provided initial value.
+// The length of data should be at least 128 bytes for best performance. Smaller
+// buffers will still compute a correct result. For best performance with
+// smaller buffers, use shortChecksum().
+var checksum = checksumAMD64
+
+func init() {
+	if cpu.X86.HasAVX && cpu.X86.HasAVX2 && cpu.X86.HasBMI2 {
+		checksum = checksumAVX2
+		return
+	}
+	if cpu.X86.HasSSE2 {
+		checksum = checksumSSE2
+		return
+	}
+}
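The selection order in `init` above can be exercised without real hardware flags. The sketch below is only an illustration: `cpuFeatures` and `pickChecksum` are hypothetical stand-ins (the real code reads `cpu.X86` from `golang.org/x/sys/cpu` and assigns a function value, not a name), and the returned strings simply name the implementation that would be chosen.

```go
package main

import "fmt"

// cpuFeatures is a hypothetical stand-in for the cpu.X86 flags used in init.
type cpuFeatures struct {
	HasAVX, HasAVX2, HasBMI2, HasSSE2 bool
}

// pickChecksum (hypothetical) mirrors the order of the checks in init:
// the AVX2 path needs AVX, AVX2 and BMI2 together, then SSE2 is tried,
// and the baseline implementation remains the default.
func pickChecksum(feat cpuFeatures) string {
	switch {
	case feat.HasAVX && feat.HasAVX2 && feat.HasBMI2:
		return "checksumAVX2"
	case feat.HasSSE2:
		return "checksumSSE2"
	default:
		return "checksumAMD64"
	}
}

func main() {
	v3 := cpuFeatures{HasAVX: true, HasAVX2: true, HasBMI2: true, HasSSE2: true}
	noBMI2 := cpuFeatures{HasAVX: true, HasAVX2: true, HasSSE2: true}
	fmt.Println(pickChecksum(v3))
	fmt.Println(pickChecksum(noBMI2))
	fmt.Println(pickChecksum(cpuFeatures{}))
}
```

Because the choice is made once at package initialization, callers go through the `checksum` variable and pay no per-call feature test.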

tun/checksum_amd64_test.go

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+//go:build amd64
+
+package tun
+
+import (
+	"golang.org/x/sys/cpu"
+)
+
+var archChecksumFuncs = []archChecksumDetails{
+	{
+		name:      "generic32",
+		available: true,
+		f:         checksumGeneric32,
+	},
+	{
+		name:      "generic64",
+		available: true,
+		f:         checksumGeneric64,
+	},
+	{
+		name:      "generic32Alternate",
+		available: true,
+		f:         checksumGeneric32Alternate,
+	},
+	{
+		name:      "generic64Alternate",
+		available: true,
+		f:         checksumGeneric64Alternate,
+	},
+	{
+		name:      "AMD64",
+		available: true,
+		f:         checksumAMD64,
+	},
+	{
+		name:      "SSE2",
+		available: cpu.X86.HasSSE2,
+		f:         checksumSSE2,
+	},
+	{
+		name:      "AVX2",
+		available: cpu.X86.HasAVX && cpu.X86.HasAVX2 && cpu.X86.HasBMI2,
+		f:         checksumAVX2,
+	},
+}
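The rendered diff does not show how this table is consumed. One plausible use, sketched here under assumptions, is to run every available implementation against a simple reference and report disagreements. The struct definition, `referenceChecksum`, and `mismatches` are all hypothetical; only the field names `name`, `available`, and `f` come from the diff above.

```go
package main

import "fmt"

// archChecksumDetails is assumed to look like this; the diff shows only
// the field names, not the type definition.
type archChecksumDetails struct {
	name      string
	available bool
	f         func(data []byte, initial uint16) uint16
}

// referenceChecksum (hypothetical) is a plain 16-bit-at-a-time ones'
// complement sum, used as the baseline for comparison.
func referenceChecksum(data []byte, initial uint16) uint16 {
	sum := uint32(initial)
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
		sum = (sum >> 16) + (sum & 0xffff) // fold the carry back in
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8 // pad the odd byte with zero
		sum = (sum >> 16) + (sum & 0xffff)
	}
	return uint16(sum)
}

// mismatches (hypothetical) returns the names of available implementations
// whose result differs from the reference for the given input.
func mismatches(impls []archChecksumDetails, data []byte, initial uint16) []string {
	want := referenceChecksum(data, initial)
	var bad []string
	for _, impl := range impls {
		if !impl.available {
			continue // skip implementations this CPU cannot run
		}
		if impl.f(data, initial) != want {
			bad = append(bad, impl.name)
		}
	}
	return bad
}

func main() {
	impls := []archChecksumDetails{
		{name: "reference", available: true, f: referenceChecksum},
		{name: "broken", available: true, f: func([]byte, uint16) uint16 { return 0 }},
		{name: "unavailable", available: false, f: func([]byte, uint16) uint16 { return 0 }},
	}
	data := []byte{0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7}
	fmt.Println(mismatches(impls, data, 0))
}
```

Gating on `available` lets the same table drive tests and benchmarks on machines that lack AVX2 or BMI2.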

tun/checksum_generated_amd64.go

Lines changed: 18 additions & 0 deletions
Some generated files are not rendered by default.
