|
104 | 104 | - [Last Checkpoint File Schema](#last-checkpoint-file-schema)
|
105 | 105 | - [JSON checksum](#json-checksum)
|
106 | 106 | - [How to URL encode keys and string values](#how-to-url-encode-keys-and-string-values)
|
| 107 | + - [Delta Data Type to Parquet Type Mappings](#delta-data-type-to-parquet-type-mappings) |
107 | 108 |
|
108 | 109 | <!-- END doctoc generated TOC please keep comment here to allow auto update -->
|
109 | 110 |
|
@@ -1813,7 +1814,7 @@ An array stores a variable length collection of items of some type.
|
1813 | 1814 | Field Name | Description
|
1814 | 1815 | -|-
|
1815 | 1816 | type| Always the string "array"
|
1816 |
| -elementType| The type of element stored in this array represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
| 1817 | +elementType| The type of element stored in this array, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition |
1817 | 1818 | containsNull| Boolean denoting whether this array can contain one or more null values
|
1818 | 1819 |
|
1819 | 1820 | ### Map Type
|
@@ -2114,3 +2115,27 @@ uppercase and lowercase as part of percent-encoding. Thus, we require a stricter
|
2114 | 2115 | 3. Always [percent-encode](https://datatracker.ietf.org/doc/html/rfc3986#section-2) reserved octets
|
2115 | 2116 | 4. Never percent-encode non-reserved octets
|
2116 | 2117 | 5. A percent-encoded octet consists of three characters: `%` followed by its 2-digit hexadecimal value in uppercase letters, e.g. `>` encodes to `%3E`
|
| 2118 | + |
| 2119 | +## Delta Data Type to Parquet Type Mappings |
| 2120 | +The table below shows how each Delta data type is physically stored in Parquet files. Parquet files store both table data and metadata ([checkpoints](#checkpoints)). Parquet has a limited number of [physical types](https://parquet.apache.org/docs/file-format/types/). Parquet [logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) extend them by specifying how the values of a physical type should be interpreted. |
| 2121 | + |
| 2122 | +For some Delta data types, there are multiple ways to store the values physically in a Parquet file. For example, `timestamp` can be stored as either `int96` or `int64`. The exact physical type depends on the engine writing the Parquet file and/or engine-specific configuration options. It is recommended that the Parquet file reader used by a Delta Lake table reader support at least the Parquet physical and logical types listed in the table below. |
| 2123 | + |
| 2124 | +Delta Type Name | Parquet Physical Type | Parquet Logical Type |
| 2125 | +-|-|- |
| 2126 | +boolean| `boolean` | |
| 2127 | +byte| `int32` | `INT(bitwidth = 8, signed = true)` |
| 2128 | +short| `int32` | `INT(bitwidth = 16, signed = true)` |
| 2129 | +int| `int32` | `INT(bitwidth = 32, signed = true)` |
| 2130 | +long| `int64` | `INT(bitwidth = 64, signed = true)` |
| 2131 | +date| `int32` | `DATE` |
| 2132 | +timestamp| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = true, units = microseconds)` |
| 2133 | +timestamp without time zone| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = false, units = microseconds)` |
| 2134 | +float| `float` | |
| 2135 | +double| `double` | |
| 2136 | +decimal| `int32`, `int64` or `fixed_length_binary` | `DECIMAL(scale, precision)` |
| 2137 | +string| `binary` | `string (UTF-8)` |
| 2138 | +binary| `binary` | |
| 2139 | +array| Either a `2-level` or a `3-level` representation. Refer to the [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST` |
| 2140 | +map| Either a `2-level` or a `3-level` representation. Refer to the [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) for further details | `MAP` |
| 2141 | +struct| `group` | |
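As a reading aid for the table added in this hunk, the mapping can be sketched as a lookup that a reader might consult when checking which Parquet types it needs to handle. This is only an illustrative sketch and not part of the protocol: the names `ParquetMapping`, `DELTA_TO_PARQUET` and `parquet_mapping` are hypothetical, the strings are copied verbatim from the table, and stripping a parenthesised suffix such as `decimal(10,2)` is an assumption about how the type name is spelled by the caller.

```python
from typing import NamedTuple, Optional, Tuple


class ParquetMapping(NamedTuple):
    """Hypothetical container for one row of the mapping table."""
    physical: Tuple[str, ...]   # allowed physical types (or representations for array/map)
    logical: Optional[str]      # logical type annotation, None where the table leaves it empty


# Values copied from the table in this diff; nothing here is normative.
DELTA_TO_PARQUET = {
    "boolean": ParquetMapping(("boolean",), None),
    "byte": ParquetMapping(("int32",), "INT(bitwidth = 8, signed = true)"),
    "short": ParquetMapping(("int32",), "INT(bitwidth = 16, signed = true)"),
    "int": ParquetMapping(("int32",), "INT(bitwidth = 32, signed = true)"),
    "long": ParquetMapping(("int64",), "INT(bitwidth = 64, signed = true)"),
    "date": ParquetMapping(("int32",), "DATE"),
    "timestamp": ParquetMapping(
        ("int96", "int64"),
        "TIMESTAMP(isAdjustedToUTC = true, units = microseconds)"),
    "timestamp without time zone": ParquetMapping(
        ("int96", "int64"),
        "TIMESTAMP(isAdjustedToUTC = false, units = microseconds)"),
    "float": ParquetMapping(("float",), None),
    "double": ParquetMapping(("double",), None),
    "decimal": ParquetMapping(
        ("int32", "int64", "fixed_length_binary"), "DECIMAL(scale, precision)"),
    "string": ParquetMapping(("binary",), "string (UTF-8)"),
    "binary": ParquetMapping(("binary",), None),
    # For array and map the table lists representations rather than a single physical type.
    "array": ParquetMapping(("2-level", "3-level"), "LIST"),
    "map": ParquetMapping(("2-level", "3-level"), "MAP"),
    "struct": ParquetMapping(("group",), None),
}


def parquet_mapping(delta_type: str) -> ParquetMapping:
    """Look up the table row for a Delta type name.

    Assumption: a parameterised name such as "decimal(10,2)" is reduced
    to its base name before the lookup. Unknown names raise KeyError.
    """
    base_name = delta_type.split("(")[0].strip().lower()
    return DELTA_TO_PARQUET[base_name]


if __name__ == "__main__":
    row = parquet_mapping("timestamp")
    print(row.physical, row.logical)
```

A reader that only handles `int64` timestamps, for example, could use `parquet_mapping("timestamp").physical` to see that `int96` is also a storage form it should expect to encounter.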