Commit 6c0c3cd

zeroshade, ianmcook, lidavidm, paleolimbot, wgtmac authored
GH-46908: [Docs][Format] Add variant extension type docs (#47456)
### Rationale for this change

To support the addition of the Parquet Variant data type and the Iceberg adoption of the variant type, we need a defined way to pass this data through Arrow-compatible systems. As such, we need a specification for a canonical Arrow extension type to represent Variant data.

### What changes are included in this PR?

Updates to the docs which define the Arrow Variant Extension type

* GitHub Issue: #46908

---------

Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: David Li <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Yan Tingwang <[email protected]>

1 parent 37dcfcc commit 6c0c3cd

3 files changed: +684 -1 lines changed

docs/source/format/CanonicalExtensions.rst

Lines changed: 127 additions & 1 deletion
@@ -417,7 +417,133 @@ better zero-copy compatibility with various systems that also store booleans usi
 
 Metadata is an empty string.
 
-=========================
+.. _parquet_variant_extension:
+
+Parquet Variant
+===============
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is stored as a
+`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. The canonical extension type allows systems to pass Variant encoded data around without special handling unless
+they want to directly interact with the encoded variant data. See the Parquet format specification for details on what the actual
+binary values look like.
+
+* Extension name: ``arrow.parquet.variant``.
+
+* The storage type of this extension is a ``Struct`` that obeys the following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+      (unshredded variants consist of just the ``metadata`` and ``value`` fields only)
+
+    * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
+        be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+      * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded
+        from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+* Extension type parameters:
+
+  This type does not have any parameters.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
+   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.
+
+.. note::
+
+   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.
+
+.. _variant_primitive_type_mapping:
+
+Primitive Type Mappings
+-----------------------
+
++----------------------+------------------------+
+| Arrow Primitive Type | Variant Primitive Type |
++======================+========================+
+| Null                 | Null                   |
++----------------------+------------------------+
+| Boolean              | Boolean (true/false)   |
++----------------------+------------------------+
+| Int8                 | Int8                   |
++----------------------+------------------------+
+| Uint8                | Int16                  |
++----------------------+------------------------+
+| Int16                | Int16                  |
++----------------------+------------------------+
+| Uint16               | Int32                  |
++----------------------+------------------------+
+| Int32                | Int32                  |
++----------------------+------------------------+
+| Uint32               | Int64                  |
++----------------------+------------------------+
+| Int64                | Int64                  |
++----------------------+------------------------+
+| Float                | Float                  |
++----------------------+------------------------+
+| Double               | Double                 |
++----------------------+------------------------+
+| Decimal32            | decimal4               |
++----------------------+------------------------+
+| Decimal64            | decimal8               |
++----------------------+------------------------+
+| Decimal128           | decimal16              |
++----------------------+------------------------+
+| Date32               | Date                   |
++----------------------+------------------------+
+| Time64               | TimeNTZ                |
++----------------------+------------------------+
+| Timestamp(us, UTC)   | Timestamp (micro)      |
++----------------------+------------------------+
+| Timestamp(us)        | TimestampNTZ (micro)   |
++----------------------+------------------------+
+| Timestamp(ns, UTC)   | Timestamp (nano)       |
++----------------------+------------------------+
+| Timestamp(ns)        | TimestampNTZ (nano)    |
++----------------------+------------------------+
+| Binary               | Binary                 |
++----------------------+------------------------+
+| LargeBinary          | Binary                 |
++----------------------+------------------------+
+| BinaryView           | Binary                 |
++----------------------+------------------------+
+| String               | String                 |
++----------------------+------------------------+
+| LargeString          | String                 |
++----------------------+------------------------+
+| StringView           | String                 |
++----------------------+------------------------+
+| UUID extension type  | UUID                   |
++----------------------+------------------------+
+
 Community Extension Types
 =========================
 

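The storage-type rules added in this diff are recursive, so a small checker may make them easier to follow. The following is only a sketch under stated assumptions: ``Field`` and ``check_variant_storage`` are hypothetical names, types are modelled as plain strings rather than through any Arrow library API, and the dictionary-encoded or run-end-encoded ``metadata`` variants permitted by the first note are not handled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Plain-string stand-ins for Arrow types (an assumption of this sketch).
BINARY_TYPES = {"binary", "large_binary", "binary_view"}
LIST_TYPES = {"list", "large_list", "list_view"}
# Arrow types listed in the "Primitive Type Mappings" table.
PRIMITIVE_TYPES = BINARY_TYPES | {
    "null", "bool", "int8", "uint8", "int16", "uint16", "int32", "uint32",
    "int64", "float", "double", "decimal32", "decimal64", "decimal128",
    "date32", "time64", "timestamp", "string", "large_string",
    "string_view", "uuid",
}


@dataclass(frozen=True)
class Field:
    """Hypothetical, minimal model of an Arrow field."""
    name: str
    type: str
    nullable: bool = True
    children: Tuple["Field", ...] = ()


def _check_value_group(fields, where) -> List[str]:
    """Require 'value' and/or 'typed_value', looked up by name."""
    named = {f.name: f for f in fields}  # names are case sensitive
    value, typed = named.get("value"), named.get("typed_value")
    errors = []
    if value is None and typed is None:
        errors.append(f"{where}: needs 'value' and/or 'typed_value'")
    if value is not None and value.type not in BINARY_TYPES:
        errors.append(f"{where}.value: must be a binary type")
    if typed is not None:
        errors += _check_typed_value(typed, f"{where}.typed_value")
    return errors


def _check_shredded_struct(f, where) -> List[str]:
    """List elements and shredded object fields share the same rules."""
    errors = []
    if f.nullable:
        errors.append(f"{where}: must be non-nullable")
    if f.type != "struct":
        return errors + [f"{where}: must be a struct"]
    return errors + _check_value_group(f.children, where)


def _check_typed_value(typed, where) -> List[str]:
    if typed.type in PRIMITIVE_TYPES:
        return []
    if typed.type in LIST_TYPES:
        if len(typed.children) != 1:
            return [f"{where}: list needs exactly one element field"]
        return _check_shredded_struct(typed.children[0], f"{where}.element")
    if typed.type == "struct":
        errors = []
        for child in typed.children:
            errors += _check_shredded_struct(child, f"{where}.{child.name}")
        return errors
    return [f"{where}: unsupported type {typed.type!r}"]


def check_variant_storage(storage) -> List[str]:
    """Return rule violations for a candidate storage type (empty if none)."""
    if storage.type != "struct":
        return ["storage: must be a struct"]
    errors = []
    meta = {f.name: f for f in storage.children}.get("metadata")
    if meta is None:
        errors.append("storage: missing 'metadata'")
    else:
        if meta.nullable:
            errors.append("storage.metadata: must be non-nullable")
        if meta.type not in BINARY_TYPES:
            errors.append("storage.metadata: must be a binary type")
    return errors + _check_value_group(storage.children, "storage")


# An unshredded variant: just 'metadata' and 'value'.
unshredded = Field("v", "struct", children=(
    Field("metadata", "binary", nullable=False),
    Field("value", "binary"),
))
# A shredded variant whose objects have an 'id' field shredded as int64.
shredded = Field("v", "struct", children=(
    Field("metadata", "binary", nullable=False),
    Field("value", "binary"),
    Field("typed_value", "struct", children=(
        Field("id", "struct", nullable=False, children=(
            Field("value", "binary"),
            Field("typed_value", "int64"),
        )),
    )),
))
print(check_variant_storage(unshredded))
print(check_variant_storage(shredded))
```

Fields are looked up through a name-keyed dictionary rather than by index, mirroring the second note in the diff: field order is not fixed, so access must be by name.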