Commit 0daaba8 (parent 441ef1f)

Add stratum fastpath benchmarks and benchmark summaries

2 files changed: +216 -0 lines

documentation/TESTING.md (123 additions, 0 deletions)

@@ -72,6 +72,129 @@ go test -race ./...
### Performance / Timing

- **`submit_timing_test.go`** - Measures latency from `handleBlockShare` entry to `submitblock` invocation
- Benchmark suites live alongside the code as `*_bench_test.go` files; run them with `go test -run '^$' -bench . -benchmem ./...`.
- **`miner_decode_bench_test.go`** - Stratum decode microbenchmarks comparing full JSON unmarshal vs fast/manual sniffing for `ping`, `subscribe`, `authorize`, and `submit`.
- **`stratum_fastpath_bench_test.go`** - Stratum encode microbenchmarks comparing normal vs fast-path response encoding (`true`, `pong`, subscribe response in CKPool and expanded modes).
### Stratum Fast-Path Benchmarks

Use these commands to compare normal vs fast decode/encode paths without running unit tests:

```bash
# Decode comparison (full JSON unmarshal vs fast/manual sniff path)
go test -run '^$' -bench 'BenchmarkStratumDecode(FastJSON|Manual)' -benchmem .

# Encode comparison (normal vs fast response encoding)
go test -run '^$' -bench 'BenchmarkStratumEncode' -benchmem .

# Run both together
go test -run '^$' -bench 'BenchmarkStratum(Decode(FastJSON|Manual)|Encode)' -benchmem .
```
For more stable comparisons across changes/machines, run multiple samples and (optionally) compare with `benchstat`:

```bash
# Baseline / candidate example
go test -run '^$' -bench 'BenchmarkStratum(Decode(FastJSON|Manual)|Encode)' -benchmem -count=5 . > before.txt
go test -run '^$' -bench 'BenchmarkStratum(Decode(FastJSON|Manual)|Encode)' -benchmem -count=5 . > after.txt

# Optional (if benchstat is installed)
benchstat before.txt after.txt
```
### Stratum Fast-Path Benchmark Snapshot (example)

Example local run command:

```bash
go test -run '^$' -bench 'BenchmarkStratum(Decode(FastJSON|Manual)|Encode)' -benchmem -benchtime=100ms .
```

Environment for the sample numbers below:

- `goos`: `linux`
- `goarch`: `amd64`
- `cpu`: `AMD Ryzen 9 7950X 16-Core Processor`
- `pkg`: `goPool`
Key results (microbenchmarks):

- **Decode (`mining.submit`)**
  - Full decode (`fastJSONUnmarshal`): `366.6 ns/op`, `461 B/op`, `11 allocs/op`
  - Fast/manual sniff path: `107.3 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **3.4x faster** with the fast path in this benchmark
- **Decode (`mining.ping`)**
  - Full decode: `129.8 ns/op`, `106 B/op`, `3 allocs/op`
  - Fast/manual sniff path: `39.22 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **3.3x faster**
- **Encode (`true` response)**
  - Normal encode: `157.6 ns/op`, `204 B/op`, `4 allocs/op`
  - Fast encode: `48.34 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **3.3x faster**
- **Encode (`pong` response)**
  - Normal encode: `168.9 ns/op`, `205 B/op`, `4 allocs/op`
  - Fast encode: `45.13 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **3.7x faster**
- **Encode (`mining.subscribe`, CKPool mode)**
  - Normal encode: `346.7 ns/op`, `501 B/op`, `11 allocs/op`
  - Fast encode: `62.73 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **5.5x faster**
- **Encode (`mining.subscribe`, expanded mode)**
  - Normal encode: `630.7 ns/op`, `1063 B/op`, `17 allocs/op`
  - Fast encode: `105.9 ns/op`, `0 B/op`, `0 allocs/op`
  - Roughly **6.0x faster**
Notes:

- These are **microbenchmarks** of parsing/encoding paths, not full end-to-end pool throughput benchmarks.
- Re-run on your target hardware and compare with `benchstat` before using the numbers for capacity planning.
### Hex Fast-Path Benchmarks

Hex encode/decode microbenchmarks live in `job_utils_hex_bench_test.go` and compare LUT-based helpers vs stdlib (`encoding/hex`) and alternate implementations.

Example focused command (decode + encode + uint32 hex parse):

```bash
go test -run '^$' -bench 'Benchmark(DecodeHexToFixedBytesBytes_(32_(PoolPairLUT|Std)|4_(PoolPairLUT|Std))|ParseUint32BEHexBytes_(LUT|Switch)|Encode(BytesToFixedHex_32_Std|32ToHex64Lower_(Unrolled|2ByteLUTLoop|LUTLoop)|ToString_32_(Std|StdStackBuf|Unrolled)))' -benchmem -benchtime=100ms .
```
Environment for the sample numbers below:

- `goos`: `linux`
- `goarch`: `amd64`
- `cpu`: `AMD Ryzen 9 7950X 16-Core Processor`
- `pkg`: `goPool`
Key results (microbenchmarks):

- **Decode 32-byte hex into fixed bytes**
  - stdlib `hex.Decode`: `20.64 ns/op`, `0 allocs/op`
  - goPool pair-LUT helper (`decodeHexToFixedBytesBytes`): `16.37 ns/op`, `0 allocs/op`
  - Roughly **1.26x faster** in this benchmark
- **Decode 4-byte hex into fixed bytes**
  - stdlib `hex.Decode`: `3.450 ns/op`, `0 allocs/op`
  - goPool pair-LUT helper (`decodeHexToFixedBytesBytes`): `3.360 ns/op`, `0 allocs/op`
  - Essentially **similar** performance in this benchmark
- **Parse 8-char uint32 hex (`parseUint32BEHexBytes`)**
  - LUT parser: `2.018 ns/op` (lower), `2.000 ns/op` (upper), `0 allocs/op`
  - switch parser: `4.042 ns/op` (lower), `4.489 ns/op` (upper), `0 allocs/op`
  - LUT path is roughly **2x faster**
- **Encode 32 bytes -> 64 hex bytes (byte buffer output)**
  - stdlib `hex.Encode`: `17.97 ns/op`, `0 allocs/op`
  - LUT loop: `15.03 ns/op`, `0 allocs/op`
  - 2-byte LUT loop: `18.73 ns/op`, `0 allocs/op`
  - Unrolled LUT encode: `8.139 ns/op`, `0 allocs/op`
  - Unrolled path is roughly **2.2x faster** than stdlib in this benchmark
- **Encode 32 bytes -> hex string**
  - `hex.EncodeToString`: `55.35 ns/op`, `128 B/op`, `2 allocs/op`
  - stdlib with stack buffer + `string(out[:])`: `33.65 ns/op`, `64 B/op`, `1 alloc/op`
  - unrolled encode + `string(out[:])`: `20.63 ns/op`, `64 B/op`, `1 alloc/op`
  - The fast path significantly reduces CPU time and cuts one allocation
Notes:

- These are **microbenchmarks** of helper functions (not end-to-end share processing).
- For change comparisons, use `-count` and `benchstat` as shown in the Stratum benchmark section above.

## CPU Profiling with Simulated Miners

stratum_fastpath_bench_test.go (93 additions, 0 deletions)

@@ -0,0 +1,93 @@
```go
package main

import (
	"net"
	"testing"
	"time"
)

// benchDiscardConn is a no-op net.Conn stub so the encode benchmarks
// measure serialization cost only, not network I/O.
type benchDiscardConn struct{}

func (benchDiscardConn) Read([]byte) (int, error)         { return 0, nil }
func (benchDiscardConn) Write(b []byte) (int, error)      { return len(b), nil }
func (benchDiscardConn) Close() error                     { return nil }
func (benchDiscardConn) LocalAddr() net.Addr              { return &net.IPAddr{} }
func (benchDiscardConn) RemoteAddr() net.Addr             { return &net.IPAddr{} }
func (benchDiscardConn) SetDeadline(time.Time) error      { return nil }
func (benchDiscardConn) SetReadDeadline(time.Time) error  { return nil }
func (benchDiscardConn) SetWriteDeadline(time.Time) error { return nil }

// benchmarkEncodeMinerConn builds a MinerConn wired to the discard conn,
// toggling the fast-encode path and CKPool emulation per benchmark.
func benchmarkEncodeMinerConn(fastEncode bool, ckpool bool) *MinerConn {
	return &MinerConn{
		id:   "bench-encode",
		conn: benchDiscardConn{},
		cfg: Config{
			StratumFastEncodeEnabled: fastEncode,
			CKPoolEmulate:            ckpool,
		},
	}
}

func BenchmarkStratumEncodeTrueResponse_Normal(b *testing.B) {
	mc := benchmarkEncodeMinerConn(false, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeTrueResponse(1)
	}
}

func BenchmarkStratumEncodeTrueResponse_Fast(b *testing.B) {
	mc := benchmarkEncodeMinerConn(true, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeTrueResponse(1)
	}
}

func BenchmarkStratumEncodePongResponse_Normal(b *testing.B) {
	mc := benchmarkEncodeMinerConn(false, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writePongResponse(7)
	}
}

func BenchmarkStratumEncodePongResponse_Fast(b *testing.B) {
	mc := benchmarkEncodeMinerConn(true, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writePongResponse(7)
	}
}

func BenchmarkStratumEncodeSubscribeResponse_CKPool_Normal(b *testing.B) {
	mc := benchmarkEncodeMinerConn(false, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeSubscribeResponse(2, "01020304", 4, "sid")
	}
}

func BenchmarkStratumEncodeSubscribeResponse_CKPool_Fast(b *testing.B) {
	mc := benchmarkEncodeMinerConn(true, true)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeSubscribeResponse(2, "01020304", 4, "sid")
	}
}

func BenchmarkStratumEncodeSubscribeResponse_Expanded_Normal(b *testing.B) {
	mc := benchmarkEncodeMinerConn(false, false)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeSubscribeResponse(2, "01020304", 4, "sid")
	}
}

func BenchmarkStratumEncodeSubscribeResponse_Expanded_Fast(b *testing.B) {
	mc := benchmarkEncodeMinerConn(true, false)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		mc.writeSubscribeResponse(2, "01020304", 4, "sid")
	}
}
```
