diff --git a/LogicalTypes.md b/LogicalTypes.md index e7a0ce04..d9fd6a29 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`. -The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`. +The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros), +as for `FLOAT` and `DOUBLE`. It is recommended that writers use IEEE754TotalOrder when writing columns +of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the +[Thrift definition](src/main/thrift/parquet.thrift) for details. ## Temporal Types diff --git a/README.md b/README.md index ae7272fb..afecf332 100644 --- a/README.md +++ b/README.md @@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types]. Parquet stores min/max statistics at several levels (such as Column Chunk, Column Index, and Data Page). These statistics are according to a sort order, which is defined for each column in the file footer. Parquet supports common -sort orders for logical and primitve types. The details are documented in the +sort orders for logical and primitive types and also special orders for types +where the common sort order is not unambiguously defined (e.g., NaN ordering +for floating point types). The details are documented in the [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union. 
## Nested Encoding diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 59ec5f17..0f90f200 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -309,6 +309,13 @@ struct Statistics { 7: optional bool is_max_value_exact; /** If true, min_value is the actual minimum value for a column */ 8: optional bool is_min_value_exact; + /** + * count of NaN values in the column; only present if physical type is FLOAT + * or DOUBLE, or logical type is FLOAT16. + * Readers MUST distinguish between nan_count not being present and nan_count == 0. + * If nan_count is not present, readers MUST NOT assume nan_count == 0. + */ + 9: optional i64 nan_count; } /** Empty structs to use as logical type annotations */ @@ -670,7 +677,7 @@ enum BoundaryOrder { /** Data page header */ struct DataPageHeader { /** - * Number of values, including NULLs, in this data page. + * Number of values, including nulls, in this data page. * * If a OffsetIndex is present, a page must begin at a row * boundary (repetition_level = 0). Otherwise, pages may begin @@ -717,9 +724,9 @@ struct DictionaryPageHeader { * The remaining section containing the data is compressed if is_compressed is true **/ struct DataPageHeaderV2 { - /** Number of values, including NULLs, in this data page. **/ + /** Number of values, including nulls, in this data page. **/ 1: required i32 num_values - /** Number of NULL values, in this data page. + /** Number of null values, in this data page. Number of non-null = num_values - num_nulls which is also the number of values in the data section **/ 2: required i32 num_nulls /** @@ -1030,6 +1037,9 @@ struct RowGroup { /** Empty struct to signal the order defined by the physical or logical type */ struct TypeDefinedOrder {} +/** Empty struct to signal IEEE 754 total order for floating point types */ +struct IEEE754TotalOrder {} + /** * Union to specify the order used for the min_value and max_value fields for a * column. 
This union takes the role of an enhanced enum that allows rich @@ -1038,6 +1048,7 @@ struct TypeDefinedOrder {} * Possible values are: * * TypeDefinedOrder - the column uses the order defined by its logical or * physical type (if there is no logical type). + * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order. * * If the reader does not support the value of this union, min and max stats * for this column should be ignored. @@ -1082,23 +1093,105 @@ union ColumnOrder { * BYTE_ARRAY - unsigned byte-wise comparison * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison * - * (*) Because the sorting order is not specified properly for floating - * point values (relations vs. total ordering) the following + * (*) Because the precise sorting order is ambiguous for floating + * point types due to underspecified handling of NaN and -0/+0, + * it is recommended that writers use IEEE_754_TOTAL_ORDER + * for these types. + * + * If TYPE_ORDER is used for floating point types, then the following * compatibility rules should be applied when reading statistics: * - If the min is a NaN, it should be ignored. * - If the max is a NaN, it should be ignored. + * - If the nan_count field is set, a reader can compute + * nan_count + null_count == num_values to deduce whether all non-null + * values are NaN. + * - When looking for NaN values, min and max should be ignored. + * If the nan_count field is set, it can be used to check whether + * NaNs are present. * - If the min is +0, the row group may contain -0 values as well. * - If the max is -0, the row group may contain +0 values as well. * - When looking for NaN values, min and max should be ignored. * * When writing statistics the following rules should be followed: - * - NaNs should not be written to min or max statistics fields. + * - It is suggested to always set the nan_count field for floating + * point types, even if it is zero. 
+ * - NaNs should not be written to min or max statistics fields except + * in the column index, where min_values and max_values are not optional + * so a NaN value must be written if all non-null values in a page + * are NaN. * - If the computed max value is zero (whether negative or positive), * `+0.0` should be written into the max statistics field. * - If the computed min value is zero (whether negative or positive), * `-0.0` should be written into the min statistics field. */ 1: TypeDefinedOrder TYPE_ORDER; + + /* + * The floating point type is ordered according to the totalOrder predicate, + * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of + * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering. + * + * Intuitively, this orders floats mathematically, but defines -0 to be less + * than +0, -NaN to be less than anything else, and +NaN to be greater than + * anything else. It also defines an order between different bit representations + * of the same value. + * + * The formal definition is as follows: + * a) If x<y, totalOrder(x, y) is true. + * b) If x>y, totalOrder(x, y) is false. + * c) If x=y: + * 1) totalOrder(−0, +0) is true. + * 2) totalOrder(+0, −0) is false. + * 3) If x and y represent the same floating-point datum: + * i) If x and y have negative sign, totalOrder(x, y) is true if and + * only if the exponent of x ≥ the exponent of y + * ii) otherwise totalOrder(x, y) is true if and only if the exponent + * of x ≤ the exponent of y. + * d) If x and y are unordered numerically because x or y is NaN: + * 1) totalOrder(−NaN, y) is true where −NaN represents a NaN with + * negative sign bit and y is a non-NaN floating-point number. + * 2) totalOrder(x, +NaN) is true where +NaN represents a NaN with + * positive sign bit and x is a non-NaN floating-point number. 
+ * 3) If x and y are both NaNs, then totalOrder reflects a total ordering + * based on: + * i) negative sign orders below positive sign + * ii) signaling orders below quiet for +NaN, reverse for −NaN + * iii) lesser payload, when regarded as an integer, orders below + * greater payload for +NaN, reverse for −NaN. + * + * Note that this ordering can be implemented efficiently in software by bit-wise + * operations on the integer representation of the floating point values. + * E.g., this is a possible implementation for DOUBLE in Rust: + * + * pub fn totalOrder(x: f64, y: f64) -> bool { + * let mut x_int = x.to_bits() as i64; + * let mut y_int = y.to_bits() as i64; + * x_int ^= (((x_int >> 63) as u64) >> 1) as i64; + * y_int ^= (((y_int >> 63) as u64) >> 1) as i64; + * return x_int <= y_int; + * } + * + * When writing statistics for columns with this order, the following rules + * must be followed: + * - Writing the nan_count field is mandatory when using this ordering, + * even if it is zero. + * - NaNs should not be written to min or max statistics fields except + * in the column index, where min_values and max_values are not optional + * so a NaN value must be written if all non-null values in a page + * are NaN. In this case, the min_values[i] and max_values[i] fields + * should be set to the smallest and largest NaN values contained + * in the page, as defined by the IEEE 754 total order. + * + * When reading statistics for columns with this order, the following rules + * should be followed: + * - Readers should consult the nan_count field to determine whether NaNs + * are present. + * - A reader can compute nan_count + null_count == num_values to deduce + * whether all non-null values are NaN. In the page index, which does not + * have a num_values field, the presence of a NaN value in min_values + * or max_values indicates that all non-null values are NaN. 
+ */ + 2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER; } struct PageLocation { @@ -1170,6 +1263,17 @@ struct ColumnIndex { * Such more compact values must still be valid values within the column's * logical type. Readers must make sure that list entries are populated before * using them by inspecting null_pages. + * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16, + * NaN values are not to be included in these bounds. If all non-null values + * of a page are NaN, then a writer must do the following: + * - If the order of this column is TypeDefinedOrder, then no column index + * must be written for this column chunk. While this is unfortunate for + * performance, it is necessary to avoid conflict with legacy files that + * still included NaN in min_values and max_values even if the page had + * non-NaN values. To mitigate this, IEEE_754_TOTAL_ORDER is recommended. + * - If the order of this column is IEEE_754_TOTAL_ORDER, then min_values[i] + * and max_values[i] must be set to the smallest and largest NaN value + * contained in the page, as defined by the IEEE 754 total order. */ 2: required list<binary> min_values 3: required list<binary> max_values @@ -1193,7 +1297,6 @@ struct ColumnIndex { * null counts are 0. */ 5: optional list<i64> null_counts - /** * Contains repetition level histograms for each page * concatenated together. The repetition_level_histogram field on @@ -1211,6 +1314,16 @@ struct ColumnIndex { * Same as repetition_level_histograms except for definitions levels. **/ 7: optional list<i64> definition_level_histograms; + + /** + * A list containing the number of NaN values for each page. Only present + * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16. 
+ * If this field is not present, readers MUST assume that there might or + * might not be NaN values in any page, as NaNs should not be included + * in min_values or max_values. + */ + 8: optional list<i64> nan_counts + } struct AesGcmV1 {
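
Reviewer note: the writer rules for `IEEE_754_TOTAL_ORDER` in the column index can be sketched as follows. This is a minimal illustration for `DOUBLE` only, not part of the patch. The names `total_order_key`, `PageStats`, and `page_stats` are hypothetical and do not appear in `parquet.thrift`. The sketch reuses the bit trick from the Thrift comment to build a sort key, excludes NaNs from the bounds unless every value in the page is NaN, and always records a NaN count. Nulls are assumed to have been filtered out already.

```rust
/// Maps an f64 to an i64 key whose integer order matches IEEE 754 totalOrder.
/// Same bit manipulation as the example in the ColumnOrder comment.
fn total_order_key(x: f64) -> i64 {
    let bits = x.to_bits() as i64;
    // For values with the sign bit set, flip all non-sign bits so that
    // larger magnitudes sort lower; positive values are left unchanged.
    bits ^ ((((bits >> 63) as u64) >> 1) as i64)
}

/// Hypothetical per-page bounds (illustrative only, not a Parquet struct).
#[derive(Debug)]
struct PageStats {
    min: Option<f64>,
    max: Option<f64>,
    nan_count: i64,
}

/// Computes bounds over the non-null values of one page.
fn page_stats(values: &[f64]) -> PageStats {
    let nan_count = values.iter().filter(|v| v.is_nan()).count();
    let all_nan = nan_count == values.len();
    // NaNs take part in the bounds only when the page holds nothing else.
    let candidates: Vec<f64> = values
        .iter()
        .copied()
        .filter(|v| all_nan || !v.is_nan())
        .collect();
    PageStats {
        min: candidates.iter().copied().min_by_key(|v| total_order_key(*v)),
        max: candidates.iter().copied().max_by_key(|v| total_order_key(*v)),
        nan_count: nan_count as i64,
    }
}
```

For a page containing `1.0`, a NaN, `-0.0`, and `0.0`, this sketch yields `-0.0` as the minimum and `1.0` as the maximum with a NaN count of 1. A page holding only NaNs gets the smallest and largest NaN under the total order as its bounds, which is what lets a page index reader infer that all non-null values are NaN.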