This applies to C++ generated code and core library functions (specifically coded_stream.h).
The current Varint encoding/decoding implementation relies on tight loops that can lead to high branch misprediction rates and suboptimal instruction pipelining, especially on modern AArch64 servers. In high-throughput scenarios, Varint processing becomes a bottleneck.
I have implemented a specialized version of UnsafeVarint that uses explicit length checks and helper functions designed for compiler loop unrolling.
- Loop Unrolling: By pre-calculating the required bytes and using a fixed-length loop in EncodeBytes, we allow the compiler to unroll the logic, reducing branch misses significantly.
- Performance Gains: Initial benchmarks on ARM64 servers show an average improvement of more than 30% in single-core encoding performance.
Proposed Code Snippet:
```cpp
// Within google/protobuf/io/coded_stream.h
PROTOBUF_ALWAYS_INLINE static uint8_t* EncodeBytes(uint32_t value, uint8_t* ptr,
                                                   size_t num) {
  // Fixed-trip-count loop: `num` is a compile-time constant at every call
  // site below, so the compiler can fully unroll it.
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = (uint8_t)((value >> (7 * i)) & 0x7F) | (1U << 7);
  }
  // Final byte carries the remaining bits with the continuation bit clear.
  ptr[num] = (uint8_t)(value >> (num * 7));
  return ptr + num + 1;
}

PROTOBUF_ALWAYS_INLINE static uint8_t* UnsafeVarint(uint32_t value, uint8_t* ptr) {
  // Explicit length classification: one comparison chain instead of a
  // data-dependent encode loop.
  if (value < (1U << 7)) {
    *ptr++ = (uint8_t)value;
  } else if (value < (1U << 14)) {
    ptr = EncodeBytes(value, ptr, 1);
  } else if (value < (1U << 21)) {
    ptr = EncodeBytes(value, ptr, 2);
  } else if (value < (1U << 28)) {
    ptr = EncodeBytes(value, ptr, 3);
  } else {
    ptr = EncodeBytes(value, ptr, 4);
  }
  return ptr;
}
// Similar logic for uint64...
```
Describe alternatives you've considered
I am also exploring ARM64 SVE2 (Scalable Vector Extension 2) instructions. Specifically, using BEXT (Bit Extract) and BDEP (Bit Deposit) equivalents in SVE2 could potentially:
- Increase encoding performance to roughly 2.5x that of the current implementation.
- Improve decoding performance by approximately 65%.
Question for Maintainers:
Does the Protobuf architecture currently support or welcome architecture-specific intrinsics (like SVE2) for such core operations? If so, I would like to submit a PR including both the general loop-unrolling optimization and the AArch64-specific SIMD enhancements.
Additional context
The benchmarks were conducted on a standard ARM64 cloud instance. I can provide detailed google/benchmark results if required.