Skip to content

Encoded Histogram format

Lee Campbell edited this page Sep 14, 2016 · 5 revisions

The Encoded Histogram format is a binary format of a histogram header and its count values. The format is used as part of the V2-Log-Encoding to allow histograms to be persisted or shared.

Format

[Cookie][LengthOfCompressedContents][CompressedHeader]

At a high level, the format is just the cookie to identify the type of histogram, the length of the compressed contents and then the compressed contents.

Cookie

The cookie is a 32bit integer used to identify the characteristics of the subsequent data. The cookie is used to identify the version of the encoding format.

The current format uses a constant value of 478450452 (or 0x1c849314 in hex) for the cookie to represent a v2 encoding.

As a side note cookie values have been chosen so that when they are base64Encoded the output string will start with "HIST"

LengthOfCompressedContents

This 32 bit integer value simply defines how long the [CompressedHeader] content is.

CompressedHeader format

DEFALTE(
	[Cookie][payloadLengthInBytes][normalizingIndexOffset][numberOfSignificantValueDigits][lowestTrackableUnitValue][highestTrackableValue][integerToDoubleValueConversionRatio][CompressedContents]		
)

The Zlib compressed payload of an HdrHistogram. The uncompressed payload is stored in binary in the following format

01234567890123456789012345678901
[A ][B ][C ][D ][E     ][F     ][G....>
  • A. Cookie value. 32bit integer (4bytes) that stores a code for the type of Histogram.
  • B. Payload length as a 32bit integer (4bytes).
  • C. NomalizingIndex offset as an 32bit integer (4bytes). Currently not implemented in HdrHistogram.NET
  • D. Lowest Trackable Unit value as a 64bit integer (8bytes).
  • E. Highest Trackable Unit value as a 64bit integer (8bytes).
  • F. Integer to double conversion ratio as a double floating point number (8bytes)
  • G. Compressed contents

CompressedContents

ZigZaggedEncoding(
	[count]*
)

A ZigZag encoded (LEB128-64b9B-variant format) compression of the counts for an HdrHistogram. The LEB128-64b9B-variant format is

  • Little Endian Base128 Encoding
  • 64bit values store as a maximum of 9Bytes. The standard LEB128 can use 10Bytes.
  • Only supports values up to 2^63 (not 2^64) i.e. HdrHistogram only support positive signed long values for counts.
  • Zig Zagged so that common/small positive and negative values use the least amount of space. As counts are only allowed to be 0 or a positive 64bit number, we can use the negative range of numbers to represent consecutive zero values. As it can be very common for there to be numerous consecutive buckets with counts of zero, this can provide significant space saving.

Consecutive 0 counts will be compressed into a single negative number indicating the span of the zero count sequence. Positive counts will be represented as the actual number. Counts are recorded as 64bit integers.

LEB128's variable length encoding provides for using a smaller number of bytes for smaller values, and the use of ZigZag encoding allows small(closer to zero) negative values to use fewer bytes.

The LEB128-64b9B-variant encoding used here diverges from the "original" LEB128 as it extends to 64 bit values. In the original LEB128, a 64 bit value can take up to 10 bytes in the stream, where this variant's encoding of a 64 bit values will max out at 9 bytes. As such, this encoder/decoder should NOT be used for encoding or decoding "standard" LEB128 formats (e.g.Google Protocol Buffers).

See also

  • ZLIB RCF-1950 compression - [https://en.wikipedia.org/wiki/Zlib], [https://www.ietf.org/rfc/rfc1950.txt]
  • Base64 Encoding - [https://en.wikipedia.org/wiki/Base64], [https://www.base64encode.org/]
  • ZigZag LEB128 compression - [https://en.wikipedia.org/wiki/LEB128], [https://gist.github.com/mfuerstenau/ba870a29e16536fdbaba]

Clone this wiki locally