Commit 45ad641

Add Delta data type to Parquet physical type mappings in PROTOCOL.md (delta-io#2048)
## Description
Currently, the Delta protocol doesn't specify how a Delta data type is stored physically in Parquet files. This PR documents the Delta data type to Parquet physical/logical type mappings.

## How was this patch tested?
NA

## Does this PR introduce _any_ user-facing changes?
No
1 parent 3bf9704 commit 45ad641

1 file changed: +26 -1 lines changed

PROTOCOL.md

Lines changed: 26 additions & 1 deletion
@@ -104,6 +104,7 @@
 - [Last Checkpoint File Schema](#last-checkpoint-file-schema)
 - [JSON checksum](#json-checksum)
 - [How to URL encode keys and string values](#how-to-url-encode-keys-and-string-values)
+- [Delta Data Type to Parquet Type Mappings](#delta-data-type-to-parquet-type-mappings)

 <!-- END doctoc generated TOC please keep comment here to allow auto update -->

@@ -1813,7 +1814,7 @@ An array stores a variable length collection of items of some type.
 Field Name | Description
 -|-
 type| Always the string "array"
-elementType| The type of element stored in this array represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
+elementType| The type of element stored in this array, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
 containsNull| Boolean denoting whether this array can contain one or more null values

 ### Map Type
@@ -2114,3 +2115,27 @@ uppercase and lowercase as part of percent-encoding. Thus, we require a stricter
 3. Always [percent-encode](https://datatracker.ietf.org/doc/html/rfc3986#section-2) reserved octets
 4. Never percent-encode non-reserved octets
 5. A percent-encoded octet consists of three characters: `%` followed by its 2-digit hexadecimal value in uppercase letters, e.g. `>` encodes to `%3E`
+
+## Delta Data Type to Parquet Type Mappings
+The table below captures how each Delta data type is stored physically in Parquet files. Parquet files are used for storing the table data or metadata ([checkpoints](#checkpoints)). Parquet has a limited number of [physical types](https://parquet.apache.org/docs/file-format/types/). Parquet [logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) are used to extend the types by specifying how the physical types should be interpreted.
+
+For some of the Delta data types, there are multiple ways to store the values physically in a Parquet file. For example, `timestamp` can be stored either as `int96` or `int64`. The exact physical type depends on the engine that is writing the Parquet file and/or engine-specific configuration options. For a Delta Lake table reader, it is recommended that the Parquet file reader support at least the Parquet physical and logical types listed in the table below.
+
+Delta Type Name | Parquet Physical Type | Parquet Logical Type
+-|-|-
+boolean| `boolean` |
+byte| `int32` | `INT(bitwidth = 8, signed = true)`
+short| `int32` | `INT(bitwidth = 16, signed = true)`
+int| `int32` | `INT(bitwidth = 32, signed = true)`
+long| `int64` | `INT(bitwidth = 64, signed = true)`
+date| `int32` | `DATE`
+timestamp| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = true, units = microseconds)`
+timestamp without time zone| `int96` or `int64` | `TIMESTAMP(isAdjustedToUTC = false, units = microseconds)`
+float| `float` |
+double| `double` |
+decimal| `int32`, `int64` or `fixed_length_binary` | `DECIMAL(scale, precision)`
+string| `binary` | `string (UTF-8)`
+binary| `binary` |
+array| either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for further details | `LIST`
+map| either as `2-level` or `3-level` representation. Refer to [Parquet documentation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) for further details | `MAP`
+struct| `group` |
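
As a rough illustration of how a reader might consult the mapping table added in this commit, the sketch below transcribes the primitive-type rows into a Python dictionary and checks whether a physical type found in a Parquet file is one the table lists for a given Delta type. The names `DELTA_TO_PARQUET` and `is_listed_physical_type` are hypothetical and are not part of Delta Lake or any Parquet library. The nested types (`array`, `map`, `struct`) are left out because the table describes them as group representations, not single physical types.

```python
# Hypothetical lookup table transcribed from the primitive-type rows of the
# mapping table in this commit. Each entry is:
#   delta type -> (set of Parquet physical types, Parquet logical type or None)
DELTA_TO_PARQUET = {
    "boolean": ({"boolean"}, None),
    "byte": ({"int32"}, "INT(bitwidth = 8, signed = true)"),
    "short": ({"int32"}, "INT(bitwidth = 16, signed = true)"),
    "int": ({"int32"}, "INT(bitwidth = 32, signed = true)"),
    "long": ({"int64"}, "INT(bitwidth = 64, signed = true)"),
    "date": ({"int32"}, "DATE"),
    "timestamp": (
        {"int96", "int64"},
        "TIMESTAMP(isAdjustedToUTC = true, units = microseconds)",
    ),
    "timestamp without time zone": (
        {"int96", "int64"},
        "TIMESTAMP(isAdjustedToUTC = false, units = microseconds)",
    ),
    "float": ({"float"}, None),
    "double": ({"double"}, None),
    "decimal": (
        {"int32", "int64", "fixed_length_binary"},
        "DECIMAL(scale, precision)",
    ),
    "string": ({"binary"}, "string (UTF-8)"),
    "binary": ({"binary"}, None),
}


def is_listed_physical_type(delta_type: str, physical_type: str) -> bool:
    """Return True if the table lists physical_type for delta_type.

    Raises KeyError for Delta types not covered by this sketch
    (for example the nested types array, map and struct).
    """
    physical_types, _logical_type = DELTA_TO_PARQUET[delta_type]
    return physical_type in physical_types


if __name__ == "__main__":
    # A timestamp column may arrive as int96 or int64 depending on the writer.
    print(is_listed_physical_type("timestamp", "int96"))
    print(is_listed_physical_type("long", "int32"))
```

A reader could run a check like this while validating a file schema, then fall back to engine-specific handling for physical types the table does not list.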
