Variant Binary Encoding

This directory contains binary artifacts encoded using the Parquet Variant binary encoding. These files are not valid Parquet files, but rather raw binary data.

Structure

  • data_dictionary.json -- contains the JSON representation of each example

Each example consists of 2 files:

  • .metadata -- the binary contents of the metadata field
  • .value -- the binary contents of the value field
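As a minimal sketch of how these files might be consumed, the Python below reads one example's two binary files and the JSON dictionary. The helper names `load_example` and `load_expected_json` are hypothetical, and the internal layout of data_dictionary.json is not described here, so it is returned as parsed without interpretation.

```python
import json
from pathlib import Path


def load_example(directory, name):
    """Return (metadata_bytes, value_bytes) for one example, e.g. 'primitive_null'.

    Assumes the files are named <name>.metadata and <name>.value.
    """
    base = Path(directory)
    metadata = (base / f"{name}.metadata").read_bytes()
    value = (base / f"{name}.value").read_bytes()
    return metadata, value


def load_expected_json(directory):
    """Parse data_dictionary.json and return it unchanged."""
    with open(Path(directory) / "data_dictionary.json", encoding="utf-8") as f:
        return json.load(f)
```

Both files are read as raw bytes because they are not Parquet files and carry no framing of their own.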

Descriptions

  1. primitive_<type> -- Examples of primitives (basic_type = 0), one for each of the primitive types listed in the spec
  2. short_string -- Example of a short string (basic_type = 1)
  3. object_empty -- Example of an object (basic_type = 2) with no fields
  4. object_primitive -- Example of an object with only primitive fields
  5. object_nested -- Example of an object with other objects in its fields
  6. array_empty -- Example of an array (basic_type = 3) with no elements
  7. array_primitive -- Example of an array with only primitive elements
  8. array_nested -- Example of an array with objects and other arrays in its elements
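To check which category a given .value file falls into, one option is to inspect its first byte. The sketch below assumes the Variant spec's layout, where basic_type occupies the low 2 bits of the value header and is numbered 0 = primitive, 1 = short string, 2 = object, 3 = array. The function name `basic_type_of` is hypothetical.

```python
# Assumed numbering from the Variant spec's 2-bit basic_type field.
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}


def basic_type_of(value: bytes) -> str:
    """Classify a Variant value by the low 2 bits of its first byte."""
    if not value:
        raise ValueError("a Variant value needs at least a header byte")
    return BASIC_TYPES[value[0] & 0b11]
```

For example, `basic_type_of(b"\x00")` yields `"primitive"`, which matches the single zero byte used for primitive_null.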

Regenerating these files

The files in this directory were initially generated by running the regen.py script, which used Apache Spark to produce them. They were subsequently modified where necessary to ensure that they conform to the Parquet spec.

Modification 1: Created metadata (3 bytes) and value (a single 0x00 byte) for primitive_null

Per #81, Spark did not generate any metadata for null and left primitive_null.metadata empty. The metadata for primitive_null should be the same 3 bytes as for the other primitive types:

  • header = 0x01
  • dictionary_size = 0x00
  • offsets: dictionary_size + 1 = 1 one-byte value: 0x00

Because primitive_int8.metadata already contains exactly these bytes, it was copied:

cp primitive_int8.metadata primitive_null.metadata
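The three bytes listed above can also be derived rather than copied. The sketch below assumes the Variant spec's metadata header layout (version in the low 4 bits, sorted_strings in bit 4, offset_size_minus_one in bits 6 and 7). Both function names are hypothetical.

```python
def empty_dictionary_metadata() -> bytes:
    """Build the metadata for a value that uses no dictionary strings."""
    header = 0x01                          # version 1, unsorted, 1-byte offsets
    dictionary_size = 0
    offsets = [0] * (dictionary_size + 1)  # dictionary_size + 1 offset entries
    return bytes([header, dictionary_size, *offsets])


def parse_metadata_header(metadata: bytes) -> dict:
    """Decode the first metadata byte into its fields (assumed bit layout)."""
    header = metadata[0]
    return {
        "version": header & 0x0F,
        "sorted_strings": bool((header >> 4) & 0x01),
        "offset_size": ((header >> 6) & 0x03) + 1,
    }
```

Comparing `empty_dictionary_metadata()` against the contents of primitive_null.metadata is one way to confirm that the copy produced the intended bytes.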

The value for primitive_null should be a value_header with no value_data, resulting in a single 0 byte:

echo -n 'a' | tr a '\0' > primitive_null.value
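For readers who prefer not to rely on shell quoting, here is a hedged Python equivalent. It assumes the spec's header layout for primitives, (type_id << 2) | basic_type, with both the null type ID and the primitive basic_type equal to 0. The names `primitive_value_header` and `write_null_value` are hypothetical.

```python
from pathlib import Path

PRIMITIVE_BASIC_TYPE = 0  # assumed: low 2 bits of the value header
NULL_TYPE_ID = 0          # assumed: primitive type ID for null


def primitive_value_header(type_id: int) -> bytes:
    """Return the one-byte value_header for a primitive of the given type ID."""
    return bytes([(type_id << 2) | PRIMITIVE_BASIC_TYPE])


def write_null_value(path) -> None:
    """Write the Variant null value: a header byte and no value_data."""
    Path(path).write_bytes(primitive_value_header(NULL_TYPE_ID))
```

Calling `write_null_value("primitive_null.value")` writes the same single zero byte as the shell command above.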

Modification 2: Created TimeNTZ/Timestamp with timezone nanos/Timestamp without timezone nanos/UUID with Iceberg test code

Currently, Spark does not support Variant values containing UUID, Time, or nanosecond-precision Timestamp. The primitive_time.[metadata/value], primitive_timestamp_nanos.[metadata/value], primitive_timestampntz_nanos.[metadata/value], and primitive_uuid.[metadata/value] files were generated by Iceberg test code.