Sequential Tabular Encoding Format (STEF)

Tigran Najaryan
December 2024

Status: [Development]

Table of Contents

Glossary
Conventions
Record Schema
STEF Data Format
Data Frame
- Columnar Representation
Codecs
Value Formats
Dictionaries
- Resetting Dictionaries
STEF/gRPC Protocol

Glossary

STEF Data Stream - a sequence of bytes that is compliant with this specification.
Writer - an implementation of STEF protocol that can create a STEF Stream.
Reader - an implementation of STEF protocol that can read and decode a STEF Stream.

Conventions

The data format described in this specification uses a custom notation. Unless otherwise specified all data is tightly packed. Fields may occupy a number of bits that is not an integer multiply of 8 and may cross the boundaries of bytes. There is no implied padding after such fields. When any padding between fields is used it is explicitly called out.

Fields are MSB-ordered, i.e. most-significant bits of the byte are used first. For example a 3-bit field A, followed by a 2-bit field B looks like this in a byte:

+-----------+-------+-----------+
|A2  A1  A0 |B1  B0 |  unused   |
+-----------+-------+-----------+
  7   6   5   4   3   2   1   0

Bit fields spanning more than one byte continue using the same MSB ordering, i.e. adding a 5-bit field C to the above example results in this:

            Byte 0                                Byte 1

+-----------+-------+-----------+    +-------+-----------------------+
|A2  A1  A0 |B1  B0 |C4  C3  C2 |    |C1  C0 |        unused         |
+-----------+-------+-----------+    +-------+-----------------------+
  7   6   5   4   3   2   1   0        7   6   5   4   3   2   1   0

Formats of data structures are described in the following notation:

<Name of the structure> {
    <Field Name 1>: <Field Size or Type>
    <Field Name 2>: <Field Size or Type>
    ... more fields
}

There is one or more fields in the structure. A field is named and has a size or type notation after the colon.

Here is the list of field size and type notations:

Field: N - Field is exactly N bits long.

Field: N..M - Field is between N and M bits long. The actual size is specified in the text that follows. If M is omitted then the upper bound is unspecified.

Field: U64 - Field holds an unsigned integer value using the Uvarint64 variable-length encoding.

Field: S64 - Field holds an signed integer value using the Varint64 variable-length encoding.

Field: UC - Field holds an unsigned integer value using the UvarintCompact variable-length encoding.

Field: SC - Field holds an signed integer value using the VarintCompact variable-length encoding.

Field: <ST> = C - Field has a constant value of C. Size/type can be any one of the notations above.

/Field: <ST>/ - Field is optional.

*Field: <ST> - Field is repeated zero or more times and each instance has a size/type .

Record Schema

STEF data stream logically represents a sequence of records. A record is a collection of fields. The structure of a record is called the "schema" of the particular STEF data stream, or simply STEF schema.

The schema of the STEF data stream is defined statically and all records in the data stream comply with the same schema. STEF schema describes fields in a record, field types, possible values for the fields. Supported field types are:

Primitive: either a bool, int64, uint64, float64, string or bytes.
Array: an array of elements of any type. The array may contain zero or more elements.
Oneof: can contain only one out of specified values at a time or the sentinel None value.
Multimap: an associative array of key-value pairs. The array can have zero or more pair. Keys and values can be of arbitrary type.
Struct: a collection of fields of any type. Fields may be marked "optional" in the struct, in which case their "presence" is tracked explicitly. The struct that represents the structure of the STEF data stream record is called the "root" struct.

Here is an example schema for a basic record that holds a single measurement of a metric:

struct Measurement root {
    MetricName string
    Attributes Attributes
    Timestamp uint64
    Value PointValue
}

oneof PointValue {
    Int64 int64
    Float64 float64
}

multimap Attributes {
    key string
    value string
}

This schema defines a root struct Measurement that represents the STEF stream records. Each record will contain MetricName string, zero or more Attributes (each a key-value pair), a Timestamp and a PointValue, which can be either an int64 or a float64 value.

A STEF stream of this schema may for example contain the following records:

MetricName	Attributes	Timestamp	Value
"cpu.usage"	Key="cpu" Value="1"	1783726193	Float64=0.4
"cpu.usage"	Key="cpu" Value="2"	1783726193	Float64=0.1
"memory.usage"	Key="memory" Value="virtual"	1783726194	Int64=100000
"system.healthy"		1783726194	Int64=1
"system.healthy"		1783726195	Int64=0

Recursive Type Definitions

Non-primitive types in STEF schema may recursively refer to each other or to themselves. For example let's look at a definition of ArrayishValue:

oneof ArrayishValue {
    Int64 int64
    Array []ArrayishValue
}

An ArrayishValue like this can for example represent values:

15
[1,2]
[]
[1,2,[32,-1,[14]],[]].

Loops in data defined via recursive data types are prohibited. STEF record is strictly a tree of fields.

Dictionary Encoding

Primitive values of string and bytes types and structs can be optionally dictionary-encoded. When dictionary encoding is enabled for a particular field, STEF encoders will add previously seen values of the field into a dictionary and when a repeat value is seen instead of encoding it again, a reference to an existing dictionary entry will be encoded instead. Here is an example with dictionary encoding of metric names:

struct Measurement root {
    MetricName string dict(Names)
    Attributes Attributes
    Timestamp uint64
    Value PointValue
}

Here the MetricName field is dictionary-encoded and the name of the dictionary is Names. Fields of the same type may share dictionaries, e.g:

struct Person root {
    First string dict(Names)
    Last string dict(Names)
}

In this example Names is a single dictionary that contains previously seen value both from field First and field Last.

A field of struct type can be also dictionary encoded, by declaring the struct itself dictionary-encoded. For example the Country struct is dictionary-encoded below:

struct Address root {
    Street string
    City string
    State string
    Country Country
}

struct Country dict(Countries) {
    Name string
    ISOCode string
}

A dictionary-encoded struct that is used as a field in multiple places in the schema will use one dictionary that is shared for all structs of that type.

Schema Tree

STEF schema definition forms a tree of fields. The root of the tree is the root struct, each field a children of the root and then recursively each field of non-primitive type containing children that represent the non-primitive type. We can represent the above Measurement example by the following tree:

root Measurement
 |- MetricName string
 |- Attributes Attributes
 |   |- key string
 |   |- value string
 |- Timestamp uint64
 |- Value PointValue
     |- Int64 int64
     |- Float64 float64

The tree is constructed by traversal of schema definition in the following order:

We start at the root struct and assign the root node to the root struct.
If the node type is struct or oneof we visit each field in the order of declaration and assign each field a child node, one child per field.
If the node type is array we add one child to the node. The child represents array's element ype.
If the node type is multimap we add two children to the node. The first child represents the key, the second child represents the value.
The process is repeated for all non-primitive node types or until a declaration loop is detected due to types declarations in the schema referring to each other or to themselves recursively. If a loop is detected the branch of the tree is considered constructed and the algorithm backtracks.

Leaf nodes of STEF schema tree are always primitive types or are non-primitive types where a type recursion is detected.

Let's illustrate a more complete example that includes recursive type definitions:

struct Measurement root {
    MetricName string
    Attributes Attributes
    Timestamp uint64
    Value PointValue
}

oneof PointValue {
    Int64 int64
    Float64 float64
}

multimap Attributes {
    key string
    value AnyValue
}

oneof AnyValue {
    String string
    Array []AnyValue
    KVList KVList
}

multimap KVList {
    key string
    value AnyValue
}

In the above schema AnyValue is a self-referential recursive type and the types AnyValue and KVList mutually refer to each other. The corresponding schema tree looks like this:

root Measurement
 |- MetricName string
 |- Attributes Attributes
 |   |- key string
 |   |- value AnyValue   
 |       |- String string
 |       |- Array []   
 |           |- AnyValue          <--- loop detected here, backtrack. Non-primitive leaf.
 |       |- KVList KVList
 |           |- key string 
 |           |- value AnyValue    <--- loop detected here, backtrack. Non-primitive leaf.
 |- Timestamp uint64
 |- Value PointValue
     |- Int64 int64
     |- Float64 float64

Each record in STEF data stream logically is a collection of values associated with leaf nodes in this tree.

It is important to note 2 things.

Firstly, for each individual record not every leaf node in a tree always has an associated value. This is because a oneof may only contain one value at a time (and thus only one of its children will have a value associated) and optional fields may have no value at all. Furthermore, entire subtrees may have no values associated with them if they are contained in a non-present optional field or are in a oneof that stores a different choice.

Secondly, because the schema allows recursive types a record may contain more than one value associated with the same node in the schema tree. Consider the following AnyValue:

AnyValue = { KVList = { key = "abc", value = { AnyValue = { String = "xyx" } } } }

Represented as a tree this AnyValue can be laid out as:

AnyValue
 |- KVList
     |- key = "abc"   
     |- Value
         |- AnyValue
             |- String = "xyz"

Note how there are 2 AnyValue elements, although the schema will only have one AnyValue node.

If we associate these elements with the schema tree, we will have essentially 2 AnyValue elements associated with the AnyValue node in the schema tree.

STEF Data Format

STEF data is represented as a byte stream and has the following format:

STEF {
    FixedHeader: ..
    VarHeader Frame: ..
    *Data Frame: ..
}

It starts with a FixedHeader, followed by VarHeader frame, followed by zero or more Data Frames.

Header

FixedHeader {
    Signature: 32 = 4 ASCII bytes: "S", "T", "E", "F"
    Version: 4
    Compression: 2
    Random: 2
}

Version specifies the major version number of STEF stream format:

0 - Current version number defined by this specification.
1-15 - Reserved. The readers compliant with this specification MUST refuse to decode STEF stream if these values are specified.

Compression is the compression method for VarHeader and Data frame content:

0 - no compression.
1 - Zstd compression.
2 and 3 - Reserved. The readers compliant with this specification MUST refuse to decode STEF stream if these values are specified.

Random - a random value beween 0 and 3.

Frame

Frame structure is used to represent either a VarHeader or a Data Frame.

Frame {
    RestartDictionaries: 1
    RestartCompression: 1
    RestartCodecs: 1
    Random: 5
    UncompressedSize: U64
    /CompressedSize: U64/
    Content: ..
}

RestartDictionaries indicates that all dictionaries are cleared and restarted when this frame starts. Can be used to limit the size of the dictionaries that the readers must keep in memory. Has no effect on VarHeader and SHOULD be used only for Data Frames.

RestartCompression indicates that the compression stream is started anew from this frame's Content. Reader's state of compression decoder MUST be reset when this bit is set. If this bit is unset the state of the compression encoder carries over through the Content field of frames. This bit has effect only if Flags field in the Header specifies a compression.

RestartCodecs indicates that the state of encoders/decoders is cleared and started anew from this frame's Content.

The UncompressedSize size field specifies the total size in bytes of the frame Content in uncompressed form. If no compression is used the UncompressedSize field is also the same as the size of the Content field in bytes.

The CompressedSize field is optional and is only present if Compression field in the Header specifies a compression. The CompressedSize field specifies the size of the Content field in bytes.

VarHeader Frame

VarHeader is a Frame with a Content field that contains the Schema of STEF data and a list of user-specified key/value pairs.

VarHeader {
    WireSchemaSize: U64
    WireSchema: 0..
    UserDataCount: U64
    *UserDataKey: LenPrefixedString
    *UserDataValue: LenPrefixedString
}

LenPrefixedString {
    BytesSize: U64
    Bytes: 0..
}

WireSchemaSize is the size of WireSchema field in bytes. WireSchemaSize may be 0, which indicates that the WireSchema field is empty and no schema is recorded for this STEF data stream.

UserDataCount is the number of user-specified key/value pairs in the VarHeader frame, i.e. the number of repetitions of UserDataKey and UserDataValue fields. Arbitrary information may be recorded in these key/value pairs.

WireSchema is the STEF schema of the data stream in the following format:

WireSchema {
    StructCounts: U64
    *StructFieldCount: U64 
}

StructCounts field is the number of repetitions of StructFieldCount fields. StructFieldCount is the number of fields in a struct.

WireSchema caries only the parts of the schema, which are necessary to be communicated between readers and writers that work with evolving versions of the same schema.

StructFieldCount values are a linearized list of the number of fields in each struct/oneof in the schema. The order of the structs in this list is the same as the order in which the structs are encountered by depth-first traversal of the Schema Tree, starting with the root struct for this particular STEF stream. The distinction is important with multi-root schemas: the WireSchema may look differently for different root structs defined in the same schema. Only the first encounter of a struct/oneof in the schema tree is recorded when linearizing the list of field counts, subsequent encounters of the same struct/oneof are not recorded in the WireSchema.

Let's illustrate using the example schema with recursive types in the Schema Tree section, with Measurement root struct. The struct/oneof field counts tree will look like this:

Schema part                      Struct Field Count
===================================================
root Measurement                        4
 |- MetricName string                   -
 |- Attributes Attributes               -
 |   |- key string                      -
 |   |- value AnyValue                  3
 |       |- String string               -
 |       |- Array []                    -
 |           |- AnyValue                3 <-- repeat encounter of AnyValue, skipped.
 |       |- KVList KVList               -
 |           |- key string              -
 |           |- value AnyValue          3
 |- Timestamp uint64                    -
 |- Value PointValue                    2
     |- Int64 int64                     -
     |- Float64 float64                 -

(a dash - is shown next to fields that are not structs or oneofs).

The linearized list of field counts for this schema will be:

4, 3, 2

This linearized list of field counts will be recorded in WireSchema.

Data Frame

Data Frame has the Content of the following structure:

Content {
    RecordCount: U64
    SizeOfSizes: U64
    *Size: UC
    *ColumnData: 0..
}

RecordCount is the number of record in this data frame.

SizeOfSizes is the size of the following Sizes field in bytes.

Size contains a sequence of UvarintCompact values, one value per column, each column representing the number of bytes of data in that column. 0 is valid value for Size and indicates that the column has no data. The Size field is repeated as many time as there are columns in the schema.

Note if a column has no data, all its subcolumns also have no data. In this case the Size field of value 0 is recorded only for the parent column and is omitted for all child columns.

ColumnData represents the data for one column. The ColumnData field is repeated as many time as there are columns in the schema.

Each column contains a sequence of data elements. The elements may of different sizes within the same column. Columns may contain a different number of elements. The next section describes the column data.

Columnar Representation

Above we described the logical data model, how the data STEF data is presented to readers and writers of records. Now we are going to look into how this data is encoded in bits and bytes in STEF data stream.

STEF format uses columnar representation. Each node in STEF schema tree is represented by one column, including nodes of primitive and non-primitive types. The number of columns in STEF data stream is fixed and is defined by its schema. Each column is assigned a number that corresponds to the depth-first traversal order of the constructed schema tree.

Note that some fields in a schema may contain no values at all for any of the records. In this case the column will be empty. Nevertheless the column is still numbered.

The encoded data for each of these columns is contained in one ColumnData field of a data frame. The ColumnData contains encoded data of the particular column for all records that the frame represents.

Here is an example column number assignment and codec types for the sample schema we were looking into earlier:

Tree                         Column     Codec Type
root Measurement                1        struct
 |- MetricName string           2        string
 |- Attributes Attributes       3        multimap
 |   |- key string              4        string
 |   |- value AnyValue          5        oneof    
 |       |- String string       6        string   
 |       |- Array []            7        array 
 |           |- AnyValue        5        oneof (note column 5 because recursion is detected)
 |       |- KVList KVList       8        multimap                       
 |           |- key string      9        string                         
 |           |- value AnyValue  5        oneof (note column 5 because recursion is detected)    
 |- Timestamp uint64            10       uint64
 |- Value PointValue            11       oneof
     |- Int64 int64             12       int64
     |- Float64 float64         13       float64

Now, let's see how the encoding of records into ColumnData fields work.

The input to the encoding process is a single record, laid out as a tree of values. The encoder also has the constructed schema tree and ColumnData that correspond to the tree nodes. The encoding algorithm is the following:

Start at the root of the record tree (remember, this is a struct).
Visit the record tree in depth-first order and perform encoding operation of the visited node into the column of the associated schema node, using the appropriate codec for the node type.

Codecs

This section describes codecs for various data types that can be used in the schema.

Some codecs encode data in the form of delta (or difference) from a previous instance of that same field. Note that choice of words: we say the "previous instance" instead of saying "the value in the previous record". This distinction is important.

Although, in the simplest cases the previous instance is indeed in the previous record, in more complex cases it may not be the case. This is for example possible if the previous record did not contain a value for this field at all (this is possible if the field is optional). This is also possible if the schema declares an array of structs and encoder is encoding the elements of that array. In that case "the previous instance" refers to the previous array element (provided that the element has a value for that field!).

Struct Codec

Struct codec has 2 encoding modes: full and dictionary.

Full Struct Encoding

Struct {
    /FullEncoding: 1 = 1/
    ModifiedMask: N
    /PresenceMask: M/
}

If the schema defines dictionary encoding for this struct then FullEncoding is written with a value of 1. This indicates that the following struct is fully-encoded as opposed to being a reference to an existing dictionary entry.

If the schema does not define dictionary encoding for this struct then FullEncoding is not present.

ModifiedMask is a field composed of N bits, where N equals the number of fields in the struct, with bits laid out in the order matching the declaration order of the fields. If the field's value is different in this instance of the struct compared to that same field's value in the previous record then the corresponding bit will be set to 1 otherwise the bit will be set to 0.

PresenceMask field is optional and is only present if the struct contains any fields that are declared optional in STEF schema. PresenceMask will contain M bits, where M equals the number of optional fields in the struct, with bits laid out in the order matching the declaration order of the fields. Note: non-optional fields do not have corresponding bits in the PresenceMask, so generally speaking M<=N. If the optional field is present in this instance of the struct then the corresponding bit will be set to 1 otherwise the bit will be set to 0.

After the ModifiedMask and PresenceMask fields are encoded the struct encoder will recursively call encoders for struct fields, in the order of field declaration. The encoders for fields will then encode field values in their corresponding columns.

If the schema defines dictionary encoding for this struct then this value of the struct will be added to the dictionary to make it available for future references.

Dictionary Struct Encoding

Struct {
    FullEncoding: 1 = 0
    RefNum: UC
}

If the schema defines dictionary encoding for this struct then the encoder will first lookup up the struct's value in the dictionary. If the value is not found in the dictionary then full encoding will be used for the struct.

If the struct is found in the dictionary, then a reference to the existing entry will be recorded instead.

The FullEncoding bit will be set to 0 to indicate the dictionary entry reference follows. RefNum will be set to the number of the found entry in the dictionary.

OneOf Codec

OneOf {
    Choice: 1..6
}

Fields declared in oneof are numbered, starting from number 1 and assigned incrementing numbers in the order of declaration. This assigned number is called the "oneof choice number". Choice number 0 corresponds to "None" choice.

OneOf codec encodes choice number in the column using as many bits as are necessary to encode the highest field number (at least one bit is always used, even if the oneof has no fields, so that a value of 0 can be encoded to represent the "None" choice).

The chosen field's value is encoded in the child column using field's codec. After the Choice is encoded the OneOf encoder will call the encoders for the currently chosen field, that will then encode the chosen field's values in its corresponding column. Note that all other fields of the oneof (other than the current chosen field) will have nothing appended to their columns.

Array Codec

Array {
    Length: UC
}

Array codec encodes in one column the length of the array.

The array elements are encoded one by one in the child column using child element's codec using the differential encoding applicable to the child element encoder, i.e. the array encoder will compare each element with the element of the previous root record located at the same array index and encode only fields that are modified in the current element compared to the previous element. For array elements that do not have a corresponding element in the previous record (i.e. the current array is longer than the previous array) the entire element is encoded with differential encoding from a "new" state of the element's type.

If the root record schema declares a complex schema with nested arrays, structs, oneofs, etc, the previous element with which the current element is compared is located by traversing the record tree starting from the root of the previous record, following the same path as the current element in the current record.

MultiMap Codec

MultiMaps have 2 different encodings possible: value-only and full.

Full MultiMap Encoding

The following is written to the MultiMap column:

MultiMap {
    LengthTransposed: U64
}

LengthTransposed is computed as (Length << 1) | 1, where Length is the number of key-value pairs in the MultiMap.

The multimap's keys are encoded in the key child column and values in the value child column, in the order they are listed in the multimap. Keys and values are encoded using key's and value's codec using the differential encoding applicable to the key's or value's codec. As always the codecs compare each key or value with the key or value of the previous root record located at the same index.

If the root record schema declares a complex schema with nested arrays, structs, oneofs, etc, the previous MultiMap with which the current MultiMap is compared is located by traversing the record tree starting from the root of the previous record, following the same path as the current MultiMap in the current record.

Value-Only MultiMap Encoding

This encoding is used when the current instance of the MultiMap has exactly the same keys as the instance of the MultiMap in the previous root record. In that case the keys are not encoded at all and only values that differ from the previous instance's corresponding value are encoded.

The encoder compares the values of this instance of the MultiMap with the values of the previous instance in the previous root record, key-by-key. For all values at index i bit number i is set in a value ChangedKeys. After all values are iterated ChangedKeysShifted is computed as ChangedKeys << 1.

The following is written to the MultiMap column:

MultiMap {
    ChangedKeysShifted: U64
}

All values that are different in this instance are then written to the value child column. Nothing is written to the key child column.

Value-only encoding can be used if the number of key-value pairs in the MultiMap is less than or equal to 62. If the MultiMap has more than 62 key-value pairs then Full MultiMap Encoding is always used even if the MultiMap has the same keys as the previous instance.

String Codec

String is represented using one of the 2 possible encodings: direct-encoded and by reference.

The first field of both encodings is a Varint64 number. After reading this number the sign of the number will indicate which of the encodings is used: direct string or string reference.

Direct String Encoding

Direct string encoding:

Field	Value	Notes
Len	Varint64	Length of the string in bytes. A non-negative Varint64 number.
Bytes	Bytes	Len bytes of the string.

If the length of the string in bytes is >=2, the encoded string is appended to the corresponding dictionary at the current RefNum and the current RefNum of the dictionary is incremented. To understand which dictionary is used see the description of the block or of the field contains this String value.

Reference String Encoding

By reference:

Field	Value	Notes
RefNum	Varint64=X	A negative Varint64 number

The RefNum is calculated as RefNum = -X-1. The value of the string is the one that is stored in the corresponding dictionary at RefNum index.

Uint64 and Int64 Codecs

Uint64 {
    DeltaOfDelta: s64
}

Uint64 and Int64 values are represented a delta of delta from the previous value. Given previous value PrevVal and current value CurVal, DeltaOfDelta is computed as:

Initial state:
PrevDelta = 0
PrevVal = 0

On each iteration:
Delta = CurVal - PrevVal
DeltaOfDelta = Delta - PrevDelta
PrevDelta = Delta
PrevVal = CurVal

Float64 Codec

64-bit IEEE numbers are encoded using Gorilla encoding (section 4.1.2) with the following state of the encoder/decoder:

The "previous" value at the start and at the codec reset is initialized to 0.0.
The "previous leading zeros" and "previous trailing zeros" values at the start and at the codec reset are initialized to 0.

We also modify the encoding with the following optimization for the case 3 of encoding scheme. The original algorithm says:

If the block of meaningful bits falls within the block of previous meaningful bits, i.e., there are at least as many leading zeros and as many trailing zeros as with the previous value, use that information for the block position and just store the meaningful XORed value.

We modify this clause in the following way:

If the block of meaningful bits falls within the block of previous meaningful bits, i.e., there are at least as many leading zeros and as many trailing zeros as with the previous value, calculate the total number of bits require for encoding via schema 3a (Control bit '0') and for encoding via schema 3b (Control bit '1') and choose the schema with fewer total number of bits encoding. The net result of this optimization is that if the condition 53-prevLeading-prevTrailing > sigbits is true it is beneficial to to switch to schema 3b with new values of leading and trailing zeros even though the XOR value fits in the previous values of leading and trailing zeros.

Bool Codec

Bool values are encoded as single bits.

Value Formats

Uvarint64

Unsigned LEB128 encoded 64-bit number, 1-10 bytes.

Varint64

Varint64 is a signed integer in [-2^63..2^63-1] range, encoded using zigzag encoding into Uvarint64.

UvarintCompact

UvarintCompact is an unsigned number in [0..2^48-1] range, written to BitStream in compact form, in continuous bits in big endian format, the number of total bits ranges from 1 to 56.

The format is the following:

Prefix Bits	Followed by big endian bits
1	No bits. Encodes value of 0.
01	2 bit value
001	5 bit value
0001	12 bit value
00001	19 bit value
000001	26 bit value
0000001	33 bit value
00000001	48 bit value

VarintCompact

TBD

Dictionaries

Writer and reader maintain multiple dictionaries that allow referencing previously seen values.

Each of the dictionaries maintains a set of elements and their associated RefNum. When STEF data stream is started the dictionaries are empty (contain no elements).

Dictionaries maintained by writer and reader of the STEF data are kept in sync as they advance through the STEF data stream. This allows the reader to correctly locate the dictionary element referenced by RefNum number in the STEF data stream written previously by the writer.

Resetting Dictionaries

Dictionaries can be reset by the writer. When the dictionaries are reset they are set to the empty state. Immediately after resetting dictionaries the writer MUST close the frame and start a new frame with RestartDictionaries bit set in the flags. The RestartDictionaries flag indicates to the reader that it MUST also reset its dictionaries.

STEF/gRPC Protocol

STEF data can be communicated over gRPC via sTEFDestination service.

STEFDestination service is used to deliver STEF data from a Sender to a Receiver. The Sender is responsible for producing a STEF Data Stream and uses a STEF Writer. The Receiver correspondingly uses a STEF Reader to decode the received STEF Data Stream.

The sequence diagram of the typical operations is the following:

Sender                               Receiver
    
  │             gRPC Connect              │  
  ├──────────────────────────────────────►│  
  │       STEFClientFirstMessage          │  
  ├──────────────────────────────────────►│  
  │     STEFDestinationCapabilities       │  
  │◄──────────────────────────────────────┤  
  │         STEFClientMessage             │  
  ├──────────────────────────────────────►│  
  .                 ...                   .  
  │         STEFDataResponse              │  
  │◄──────────────────────────────────────┤  
  │         STEFClientMessage             │  
  .                 ...                   .  
  ├──────────────────────────────────────►│  
  │         STEFDataResponse              │  
  .                 ...                   .

The Sender opens a gRPC connection to the Receiver and starts a Stream() to STEFDestination.
Once the stream is open the Sender MUST send a STEFClientMessage with first_message field set.
The Destination MUST reply with a STEFServerMessage with capabilities field set.
The Sender MUST examine received capabilities and if the client is able to operate as requested by capabilities the Sender MUST begin sending STEFClientMessage containing STEF data. The Destination MUST periodically respond with STEFServerMessage containing STEFDataResponse. It is not required that there is always a corresponding STEFDataResponse for each STEFClientMessage - the Receiver may choose to respond once to acknowledge several previously received STEFClientMessage messages.

first_message field

The Sender MUST set this field in the the first STEFClientMessage sent. All other fields MUST be unset when first_message is set. All subsequent messages MUST have first_message unset.

stef_bytes field

The field stef_bytes contains a sequence of bytes of the STEF stream.

The receiver of the stef_bytes field is responsible for assembling the STEF data stream from a sequence of STEFClientMessage messages by concatenating bytes from stef_bytes in the order the messages are received.

is_end_of_chunk field

The is_end_of_chunk field indicates that the last byte of stef_bytes field is an end of a chunk. A chunk is either the STEF header, or a STEF frame.

This flag normally is used by receivers to accumulates STEF bytes until the end of the chunk is encountered and only then start decoding the chunk. Senders MUST ensure they mark this field true at least once in a while otherwise receivers may never start decoding the data.

Record Ids

gRPC protocol responses allow acknowledging receipt or specifying that the decoder failed at a particular place in the STEF stream. This indication is done via record ids.

The encoder and decoder that operate on two ends of the gRPC stream maintain the record id and increment it precisely in lockstep with every record processed.

The receiver of the STEF stream MUST use the record id value of the decoder and reference that value in ack_record_id field of STEFDataResponse to indicate successful receipt of STEF stream up to that sequence id.

If for whatever reason the decoded stream contains invalid values that cannot be accepted by the receiver, although STEF format decoding itself worked correctly and the decoder can continue accepting and decoding the subsequence STEF stream bytes, the receiver MUST indicate such invalid values back to the sender by referencing the record id of invalid value in bad_data_record_ids of STEFDataResponse. The sequence of operations in this case looks like this:

Sender                               Receiver
    
  │        STEFClientMessage              │  
  ├──────────────────────────────────────►│  
  │ STEFDataResponse{bad_data_record_ids} │  
  │◄──────────────────────────────────────┤  
  │        STEFClientMessage              │  
  ├──────────────────────────────────────►│  
  │        STEFClientMessage              │  
  ├──────────────────────────────────────►│  
  .                 ...                   .

As opposed to the above scenario, if the receiver is unable to decode the STEF stream, such that it loses the synchronization and is unable to continue correctly decoding the stream, then in addition to reporting the failure via bad_data_record_ids the receiver MUST also close the gRPC stream and let the sender re-open the stream. The sender in this case MAY NOT re-send the data that was marked as bad via previously reported bad_data_record_ids field. This is important to make sure the receiver's decoder does not fail again on the same data, causing an infinite loop of failures and retries. The sequence of operations in this case looks like this:

Sender                               Receiver
    
  │        STEFClientMessage              │  
  ├──────────────────────────────────────►│  
  │ STEFDataResponse{bad_data_record_ids} │  
  │◄──────────────────────────────────────┤
  │            gRPC Disconnect            │  
  │◄──────────────────────────────────────┤
  │             gRPC Connect              │  
  ├──────────────────────────────────────►│  
  │       STEFClientFirstMessage          │  
  ├──────────────────────────────────────►│  
  │     STEFDestinationCapabilities       │  
  │◄──────────────────────────────────────┤  
  │         STEFClientMessage             │  
  ├──────────────────────────────────────►│  
  .                 ...                   .

Streaming

STEF gRPC protocols can keep gRPC stream open and stream new STEF data as it arrives and is ready to be encoded.

The concept of chunks and indication of chunk end allows STEF decoders to decode complete chunks that are already received and then yield the execution to the caller and indicate EOF, then when a new complete chunks arrive and become available the same decoder can easily resume decoding from where it left. This is particularly important for efficient implementation of STEF destinations that need to receive thousands of gRPC connection and process received data as it becomes available without blocking and without the need for decoder implementation to be able to pause and resume operation from an arbitrary byte in STEF stream.

Connection Management

Senders SHOULD periodically close the gRPC stream and open a new one. This is important to ensure gRPC streams remain balanced in load-balanced scenarios with rolling servers behind the load balancer.

Dictionary Limits

Receivers, which anticipate a large number of incoming gRPC streams should take into account memory usage required per stream. To limit the memory usage the receivers can indicate dictionary size limits in STEFDestinationCapabilities message.

Senders MUST honor dictionary limits returned in DictionaryLimits message by resetting dictionaries when their total byte size grows to the specified limit. See Resetting Dictionaries for explanation on how the writer (i.e. gRPC Sender) and reader (i.e. gRPC receiver) MUST handle dictionary resets.

The total byte size of dictionaries is calculated by Sender approximately as the total number of bytes all elements occupy in memory. This calculation is to certain degree platform and implementation dependent. Receivers MUST allow for a margin of error in this calculation and set the limits conservatively.

FilesExpand file tree

specification.md

Latest commit

History