This directory contains binary artifacts encoded using the Parquet Variant binary encoding. These files are not valid Parquet files, but rather raw binary data.
data_dictionary.json -- contains the JSON representation for each example
Each example consists of 2 files:
- .metadata -- the binary contents of the metadata field
- .value -- the binary contents of the value field
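As a minimal sketch of how a test might read one example, the helper below loads both files as raw bytes. The function name `load_example` and its arguments are hypothetical and are not part of this directory; it assumes only the `.metadata` / `.value` naming described above.

```python
from pathlib import Path


def load_example(directory, name):
    """Return (metadata_bytes, value_bytes) for one example.

    `directory` is wherever these files live and `name` is an example
    name such as "primitive_null" (both are caller-supplied assumptions).
    """
    base = Path(directory)
    # The files are raw binary, not Parquet, so read them as plain bytes.
    metadata = (base / f"{name}.metadata").read_bytes()
    value = (base / f"{name}.value").read_bytes()
    return metadata, value
```

For instance, `load_example(".", "primitive_null")` would return the two byte strings for the null example when run from inside this directory.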

The examples are:
- primitive_<type> -- Examples of primitives (basic_type = 0), one for each of the primitive types listed in the spec
- short_string -- Example of a short string (basic_type = 1)
- object_empty -- Example of an object (basic_type = 2) with no fields
- object_primitive -- Example of an object with only primitive fields
- object_nested -- Example of an object with other objects in its fields
- array_empty -- Example of an array (basic_type = 3) with no elements
- array_primitive -- Example of an array with only primitive elements
- array_nested -- Example of an array with objects and other arrays in its elements
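To check which kind of example a `.value` file holds, the basic type can be read from its first byte. The sketch below assumes the layout in the Variant encoding spec, where `basic_type` is a 2-bit field in the low-order bits of the value header (0 = primitive, 1 = short string, 2 = object, 3 = array). The helper name `basic_type_of` is hypothetical.

```python
# Mapping from the 2-bit basic_type field to a readable name
# (values taken from the Variant encoding spec).
BASIC_TYPES = {0: "primitive", 1: "short string", 2: "object", 3: "array"}


def basic_type_of(value: bytes) -> str:
    """Return the basic type name encoded in the first byte of a Variant value."""
    if not value:
        raise ValueError("empty Variant value")
    # The two low-order bits of the header byte hold basic_type;
    # the upper six bits hold type-specific information.
    return BASIC_TYPES[value[0] & 0b11]
```

A single zero byte, for example, decodes as a primitive: `basic_type_of(b"\x00")` returns `"primitive"`.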
The files in this directory were initially generated by running the regen.py
script, which used Apache Spark. They have subsequently been modified
where necessary to ensure that they conform to the Parquet spec.
Per #81, Spark did not generate
any metadata for null and left primitive_null.metadata empty.
The metadata for primitive_null should be the same 3 bytes as for the other primitive types:
- header = 0x01
- dictionary_size = 0x00
- dictionary_size + 1 = 1 one-byte offset value: 0x00
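The three-byte layout can be verified with a small parser. This is a sketch that assumes the metadata layout in the Variant encoding spec: a header byte with the version in bits 0-3, `sorted_strings` in bit 4, and `offset_size_minus_one` in bits 6-7, followed by `dictionary_size`, then `dictionary_size + 1` offsets, then the string bytes. The function name `parse_metadata` is hypothetical.

```python
def parse_metadata(buf: bytes) -> dict:
    """Decode a Variant metadata buffer into its header fields and strings."""
    header = buf[0]
    version = header & 0x0F                   # bits 0-3
    sorted_strings = bool((header >> 4) & 1)  # bit 4
    offset_size = ((header >> 6) & 0b11) + 1  # bits 6-7 store size - 1

    pos = 1
    dictionary_size = int.from_bytes(buf[pos:pos + offset_size], "little")
    pos += offset_size

    # There is always one more offset than there are dictionary entries.
    offsets = []
    for _ in range(dictionary_size + 1):
        offsets.append(int.from_bytes(buf[pos:pos + offset_size], "little"))
        pos += offset_size

    # Offsets are relative to the first byte after the offset list.
    strings = [
        buf[pos + offsets[i]:pos + offsets[i + 1]].decode("utf-8")
        for i in range(dictionary_size)
    ]
    return {
        "version": version,
        "sorted_strings": sorted_strings,
        "offset_size": offset_size,
        "dictionary_size": dictionary_size,
        "offsets": offsets,
        "strings": strings,
    }
```

Calling `parse_metadata(bytes([0x01, 0x00, 0x00]))` yields version 1, an empty dictionary, and a single offset of 0, which matches the bytes listed above.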

    cp primitive_int8.metadata primitive_null.metadata

The value for a primitive null should be a value_header and no value_data,
resulting in a single 0 byte:

    echo -n 'a' | tr a '\0' > primitive_null.value

Modification 2: Created TimeNTZ/Timestamp with timezone nanos/Timestamp without timezone nanos/UUID with Iceberg test code
Currently, Spark does not support Variant values containing UUID, Time, or nanosecond-precision Timestamp. The primitive_time.[metadata/value], primitive_timestamp_nanos.[metadata/value], primitive_timestampntz_nanos.[metadata/value], and primitive_uuid.[metadata/value] files were generated by Iceberg test code.