GH-46908: [Docs][Format] Add variant extension type docs (#47456)
### Rationale for this change
To support the addition of the Parquet Variant data type and the Iceberg
adoption of the variant type, we need a defined way to pass this data
through Arrow-compatible systems. As such, we need a specification for a
canonical Arrow extension type to represent Variant data.
### What changes are included in this PR?
Updates to the docs which define the Arrow Variant Extension type
* GitHub Issue: #46908
---------
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: David Li <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Yan Tingwang <[email protected]>
File: docs/source/format/CanonicalExtensions.rst (127 additions, 1 deletion)
@@ -417,7 +417,133 @@ better zero-copy compatibility with various systems that also store booleans usi
 
 Metadata is an empty string.
 
-=========================
+.. _parquet_variant_extension:
+
+Parquet Variant
+===============
+
+Variant represents a value that may be one of:
+
+* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)
+
+* Array: An ordered list of Variant values
+
+* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys
+
+Particularly, this provides a way to represent semi-structured data which is stored as a
+`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in
+a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
+variant values. The canonical extension type allows systems to pass Variant encoded data around without special handling unless
+they want to directly interact with the encoded variant data. See the Parquet format specification for details on what the actual
+binary values look like.
+
+* Extension name: ``arrow.parquet.variant``.
+
+* The storage type of this extension is a ``Struct`` that obeys the following rules:
+
+  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+  * At least one (or both) of the following:
+
+    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+      (unshredded variants consist of just the ``metadata`` and ``value`` fields only)
+
+    * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``
+
+      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
+        be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+      * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded
+        from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following:
+
+        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
+
+        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).
+
+* Extension type parameters:
+
+  This type does not have any parameters.
+
+* Description of the serialization:
+
+  Extension metadata is an empty string.
+
+.. note::
+
+   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
+   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.
+
+.. note::
+
+   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.
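
The storage rules in the diff above can be read as a small recursive check. The following is a minimal sketch in plain Python, not part of the commit and not an Arrow API: the ``Field``, ``Struct`` and ``ListType`` classes and the lowercase type-name strings are hypothetical stand-ins for a real Arrow schema. It assumes any plain string given as a ``typed_value`` type is an allowed primitive (the primitive mapping table is not checked), and it ignores the dictionary-encoded and run-end-encoded ``metadata`` variants that the note permits.

```python
from dataclasses import dataclass

# Hypothetical string names standing in for Arrow's Binary, LargeBinary, BinaryView.
BINARY_TYPES = {"binary", "large_binary", "binary_view"}
# Hypothetical string names standing in for Arrow's List, LargeList, ListView.
LIST_KINDS = {"list", "large_list", "list_view"}


@dataclass
class Field:
    type: object          # a type-name string, a Struct, or a ListType
    nullable: bool = True


@dataclass
class Struct:
    fields: dict          # field name -> Field; looked up by name, never by position


@dataclass
class ListType:
    kind: str             # one of LIST_KINDS
    element: Field


def _is_binary(t):
    return isinstance(t, str) and t in BINARY_TYPES


def _valid_value_group(t):
    """A Struct with a binary ``value``, a valid ``typed_value``, or both."""
    if not isinstance(t, Struct):
        return False
    value = t.fields.get("value")            # exact, case-sensitive names
    typed = t.fields.get("typed_value")
    if value is None and typed is None:
        return False
    if value is not None and not _is_binary(value.type):
        return False
    return typed is None or _valid_typed_value(typed.type)


def _valid_typed_value(t):
    if isinstance(t, ListType):
        # List elements must be non-nullable structs that follow the same rules.
        return (t.kind in LIST_KINDS
                and not t.element.nullable
                and _valid_value_group(t.element.type))
    if isinstance(t, Struct):
        # Each shredded object field must be a non-nullable struct following the same rules.
        return all(not f.nullable and _valid_value_group(f.type)
                   for f in t.fields.values())
    # Assumption: any other type name is an allowed primitive; not verified here.
    return isinstance(t, str)


def is_valid_variant_storage(t):
    """Check the top-level storage struct: value group plus non-nullable binary metadata."""
    if not _valid_value_group(t):
        return False
    meta = t.fields.get("metadata")
    return meta is not None and not meta.nullable and _is_binary(meta.type)


# Unshredded variant: just ``metadata`` and ``value``.
unshredded = Struct({
    "metadata": Field("binary", nullable=False),
    "value": Field("binary"),
})

# Shredded object with one field "a" shredded as int64; field order is irrelevant.
shredded_object = Struct({
    "typed_value": Field(Struct({
        "a": Field(Struct({"value": Field("binary"),
                           "typed_value": Field("int64")}), nullable=False),
    })),
    "value": Field("binary"),
    "metadata": Field("binary_view", nullable=False),
})

# Invalid: "Metadata" does not match because field names are case sensitive.
wrong_case = Struct({
    "Metadata": Field("binary", nullable=False),
    "value": Field("binary"),
})
```

Calling ``is_valid_variant_storage`` on ``unshredded`` and ``shredded_object`` accepts both, while ``wrong_case`` is rejected. Looking fields up through a name-keyed mapping mirrors the note that fields may appear in any order.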