
Performance: Optimize Varint Encoding/Decoding with Loop Unrolling (and potentially SVE2) #26931

@liuyang-664

Description


This applies to C++ generated code and core library functions (specifically coded_stream.h).

The current Varint encoding/decoding implementation relies on tight, data-dependent loops: the iteration count depends on the value being encoded, which can lead to high branch-misprediction rates and suboptimal instruction pipelining, especially on modern AArch64 servers. In high-throughput serialization scenarios, Varint processing becomes a bottleneck.
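For context, the baseline encoder is a data-dependent loop along these lines (a simplified sketch of the general pattern, not the exact upstream code); the per-byte `while` branch is what mispredicts when value sizes are unpredictable:

```cpp
#include <cassert>
#include <cstdint>

// Simplified sketch of a conventional varint encoder: one branch per byte,
// with the loop trip count depending on the value being encoded.
inline uint8_t* NaiveVarint(uint32_t value, uint8_t* ptr) {
  while (value >= 0x80) {
    *ptr++ = static_cast<uint8_t>(value) | 0x80;  // low 7 bits + continuation bit
    value >>= 7;
  }
  *ptr++ = static_cast<uint8_t>(value);  // final byte, continuation bit clear
  return ptr;
}
```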

I have implemented a specialized version of UnsafeVarint that uses explicit length checks and helper functions designed for compiler loop unrolling.

  1. Loop Unrolling: Pre-computing the required byte count and using a fixed-trip-count loop in EncodeBytes lets the compiler fully unroll the logic, significantly reducing branch misses.
  2. Performance Gains: Initial benchmarks on ARM64 servers show an average single-core encoding throughput improvement of over 30%.

Proposed Code Snippet:

// Within google/protobuf/io/coded_stream.h

PROTOBUF_ALWAYS_INLINE static uint8_t* EncodeBytes(uint32_t value, uint8_t* ptr, size_t num) {
  // Write `num` continuation bytes (low 7 bits each, MSB set), followed by
  // one final byte with the continuation bit clear. The fixed trip count
  // allows the compiler to fully unroll the loop.
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) & 0x7F) | 0x80;
  }
  ptr[num] = static_cast<uint8_t>(value >> (num * 7));
  return ptr + num + 1;
}

PROTOBUF_ALWAYS_INLINE static uint8_t* UnsafeVarint(uint32_t value, uint8_t* ptr) {
  if (value < (1U << 7)) {
    *ptr = value & 0x7F;
    ++ptr;
  } else if (value < (1U << 14)) {
    ptr = EncodeBytes(value, ptr, 1);
  } else if (value < (1U << 21)) {
    ptr = EncodeBytes(value, ptr, 2);
  } else if (value < (1U << 28)) {
    ptr = EncodeBytes(value, ptr, 3);
  } else {
    ptr = EncodeBytes(value, ptr, 4);
  }
  return ptr;
}
// Similar logic for uint64...
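As a sanity check, the snippet above compiles standalone (with PROTOBUF_ALWAYS_INLINE stubbed to plain `inline`, an assumption for out-of-tree testing only) and can be cross-checked against a conventional loop encoder at every 7-bit length boundary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the real protobuf macro, for out-of-tree testing only.
#define PROTOBUF_ALWAYS_INLINE inline

PROTOBUF_ALWAYS_INLINE static uint8_t* EncodeBytes(uint32_t value, uint8_t* ptr, size_t num) {
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) & 0x7F) | 0x80;
  }
  ptr[num] = static_cast<uint8_t>(value >> (num * 7));
  return ptr + num + 1;
}

PROTOBUF_ALWAYS_INLINE static uint8_t* UnsafeVarint(uint32_t value, uint8_t* ptr) {
  if (value < (1U << 7)) {
    *ptr++ = static_cast<uint8_t>(value);
  } else if (value < (1U << 14)) {
    ptr = EncodeBytes(value, ptr, 1);
  } else if (value < (1U << 21)) {
    ptr = EncodeBytes(value, ptr, 2);
  } else if (value < (1U << 28)) {
    ptr = EncodeBytes(value, ptr, 3);
  } else {
    ptr = EncodeBytes(value, ptr, 4);
  }
  return ptr;
}

// Reference encoder: the conventional data-dependent loop.
static uint8_t* RefVarint(uint32_t value, uint8_t* ptr) {
  while (value >= 0x80) {
    *ptr++ = static_cast<uint8_t>(value) | 0x80;
    value >>= 7;
  }
  *ptr++ = static_cast<uint8_t>(value);
  return ptr;
}

// Cross-check both encoders on values straddling each 7-bit boundary.
static void CrossCheck() {
  const uint32_t cases[] = {0, 1, 0x7Fu, 0x80u, 0x3FFFu, 0x4000u, 0x1FFFFFu,
                            0x200000u, 0xFFFFFFFu, 0x10000000u, 0xFFFFFFFFu};
  for (uint32_t v : cases) {
    uint8_t a[8], b[8];
    const size_t la = static_cast<size_t>(UnsafeVarint(v, a) - a);
    const size_t lb = static_cast<size_t>(RefVarint(v, b) - b);
    assert(la == lb);
    assert(std::memcmp(a, b, la) == 0);
  }
}
```

The boundary values exercise every branch of UnsafeVarint, so each EncodeBytes length is verified byte-for-byte against the loop encoder.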

Describe alternatives you've considered
I am also exploring the AArch64 SVE2 (Scalable Vector Extension 2) instruction set. Specifically, the BDEP (bit deposit) and BEXT (bit extract) instructions from the optional SVE2 bit-permute extension could potentially:

  • Increase encoding performance to 2.5x of the current implementation.
  • Improve decoding performance by approximately 65%.
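To make the BDEP idea concrete in scalar form (this is a conceptual model only, not SVE2 code; an actual implementation would use the SVE2 bit-permute intrinsics and amortize the work across a full vector of values): a single bit deposit into the mask 0x7F7F7F7F7F scatters all of a value's 7-bit groups into separate byte lanes at once, replacing the per-byte shift/mask chain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable scalar model of a bit-deposit (BDEP) operation: the i-th lowest
// bit of `src` moves to the position of the i-th lowest set bit of `mask`.
static uint64_t BitDeposit(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  uint64_t src_bit = 1;
  while (mask != 0) {
    const uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
    if (src & src_bit) result |= lowest;
    mask &= mask - 1;  // clear that mask bit
    src_bit <<= 1;
  }
  return result;
}

// Varint encoding via deposit: one deposit spreads all 7-bit groups into
// byte lanes; then OR continuation bits into every byte except the last.
static size_t DepositVarint(uint32_t value, uint8_t* out) {
  size_t len = 1;
  for (uint32_t v = value >> 7; v != 0; v >>= 7) ++len;  // bytes needed
  uint64_t spread = BitDeposit(value, 0x7F7F7F7F7Full);
  spread |= 0x8080808080ull & ((1ull << (8 * (len - 1))) - 1);
  for (size_t i = 0; i < len; ++i) {
    out[i] = static_cast<uint8_t>(spread >> (8 * i));
  }
  return len;
}
```

Only the length computation remains branchy here; a hardware implementation could derive the length from a count-leading-zeros result instead, leaving the hot path branch-free.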

Question for Maintainers:
Does the Protobuf architecture currently support or welcome architecture-specific intrinsics (like SVE2) for such core operations? If so, I would like to submit a PR including both the general loop-unrolling optimization and the AArch64-specific SIMD enhancements.

Additional context
The benchmarks were conducted on a standard ARM64 cloud instance. I can provide detailed google/benchmark results if required.

Metadata

Labels

feature request, untriaged
