Commit 391facc
PARQUET-2249: Introduce IEEE 754 total order
This commit adds a new column order `IEEE754TotalOrder`, which can be
used for floating point types (`FLOAT`, `DOUBLE`, `FLOAT16`).
The advantage of the new order is a well-defined ordering between -0,+0
and the various possible bit patterns of NaNs. Thus, every single
possible bit pattern of a floating point value has a well-defined order
now, so there are no possibilities where two implementations might
apply different orders when the new column order is used.
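For reference, the total order over all bit patterns can be sketched with the well-known bit-flipping trick; the helper below is a hypothetical illustration, not part of this commit:

```python
import struct

def total_order_key(x: float) -> int:
    # Reinterpret the double's bits as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Negative values (sign bit set): flip all bits; non-negative
    # values: flip only the sign bit. Plain integer comparison of the
    # resulting keys then matches IEEE 754 total ordering.
    if bits & (1 << 63):
        return bits ^ 0xFFFF_FFFF_FFFF_FFFF
    return bits ^ (1 << 63)

# -NaN < -Inf < -1.0 < -0.0 < +0.0 < 1.0 < +Inf < +NaN
neg_nan = struct.unpack("<d", struct.pack("<Q", 0xFFF8_0000_0000_0000))[0]
values = [neg_nan, float("-inf"), -1.0, -0.0, 0.0, 1.0,
          float("inf"), float("nan")]
keys = [total_order_key(v) for v in values]
assert keys == sorted(keys)
```

Note how -0 sorts strictly before +0 and how every NaN bit pattern gets a fixed place at one end of the order.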
With the default column order, there were many problems w.r.t. NaN
values which led to reading engines not being able to use statistics
of floating point columns for scan pruning even in the case where no
NaNs were in the data set. The problems are discussed in detail in the
next section.
This solution to the problem is the result of the extended discussion
in PR apache#196 [1], which ended with the consensus that IEEE 754 total
ordering is the best approach to solve the problem in a simple manner
without introducing special fields for floating point columns (such as
`nan_counts`, which was proposed in that PR). Please refer to the
discussion in that PR for all the details why this solution was chosen
over various design alternatives.
Note that this solution is fully backward compatible and should break
neither old readers nor old writers, as it only adds a new column
order. Legacy writers can simply continue writing the default
type-defined order instead of the new one. Legacy readers should avoid
using statistics on columns whose column order they do not understand,
and should therefore simply ignore the statistics of columns ordered
using the new order.
The remainder of this message explains in detail what the problems are
and how the proposed solution fixes them.
Problem Description
===================
Currently, the way NaN values are to be handled in statistics inhibits
most scan pruning once NaN values are present in `DOUBLE` or `FLOAT`
columns. Concretely the following problems exist:
Statistics don't tell whether NaNs are present
----------------------------------------------
As NaN values are not to be incorporated in min/max bounds, a reader
cannot know whether NaN values are present. This might not seem too
problematic, as most queries will not filter for NaNs. However, NaN
is ordered in most database systems. For example, Postgres, DB2, and
Oracle treat NaN as greater than any other value, while MSSQL and MySQL
treat it as less than any other value. An overview of how different
systems handle NaN can be found in [2]. The gist of it is that
different systems have different semantics w.r.t. NaNs, and most of
them do order NaNs, either less than or greater than all other values.
For example, if the semantics of the reading query engine mandate that
NaN is to be treated greater than all other values, the predicate
`x > 1.0` should include NaN values. If a page has max = 0.0 now, the
engine would not be able to skip the page, as the page might contain
NaNs which would need to be included in the query result.
Likewise, the predicate `x < 1.0` should include NaN if NaN is treated
to be less than all other values by the reading engine. Again, a page
with min = 2.0 couldn't be skipped in this case by the reader.
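To make this concrete, here is a minimal sketch of the skip decision for `x > threshold` under engine semantics where NaN is greater than all other values; the function name and shape are hypothetical, not from the spec:

```python
def can_skip_page_gt(page_max, threshold, may_contain_nan):
    # Predicate: x > threshold; engine semantics: NaN > all other values.
    # Under the default order, NaNs are excluded from min/max, so the
    # reader never knows whether a page holds NaNs and must assume it may.
    if may_contain_nan:
        return False  # a NaN in the page would satisfy x > threshold
    return page_max <= threshold

# With the default order, may_contain_nan is effectively always True,
# so even a page with max = 0.0 cannot be skipped for x > 1.0:
assert can_skip_page_gt(0.0, 1.0, may_contain_nan=True) is False
```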
Thus, even if a user doesn't query for NaN explicitly, they might use
other predicates that need to filter or retain NaNs under the semantics
of the reading engine, so the fact that we currently can't know whether
a page or row group contains NaNs is a bigger problem than it might
seem at first sight.
Currently, any predicate that needs to retain NaNs cannot use min and
max bounds in Parquet and therefore cannot be used for scan pruning at
all. And as stated, that can be many seemingly innocuous greater-than
or less-than predicates in most database systems. Conversely, it would be
nice if Parquet would enable scan pruning in these cases, regardless of
whether the reader and writer agree upon whether NaN is smaller,
greater, or incomparable to all other values.
Note that the problem exists even if the Parquet file doesn't include
any NaNs, so this is not only a problem in the edge case where NaNs are
present; it is a problem in the far more common case of NaNs not being
present.
Handling NaNs in a ColumnIndex
------------------------------
There is currently no well-defined way to write a spec-conforming
`ColumnIndex` once a page has only NaN (and possibly null) values. NaN
values should not be included in min/max bounds, but if a page contains
only NaN values, then there is no other value to put into the min/max
bounds. However, bounds in a ColumnIndex are non-optional, so we have
to put something in here. The spec does not describe what engines
should do in this case. Parquet-mr takes the safe route and does not
write a column index once NaNs are present. But this is a huge
pessimization, as a single page containing NaNs will prevent writing a
column index for the column chunk containing that page, so even pages
in that chunk that don't contain NaNs will not be indexed.
It would be nice if there were a defined way of writing the ColumnIndex
when NaNs (and especially only-NaN pages) are present.
Handling only-NaN pages & column chunks
---------------------------------------
Note: Hereinafter, whenever the term only-NaN is used, it refers to a
page or column chunk whose only non-null values are NaNs. E.g., an
only-NaN page is allowed to have a mixture of null values and NaNs or
only NaNs, but no non-NaN non-null values.
The `Statistics` objects stored in page headers and in the file footer
have a similar, albeit smaller problem: `min_value` and `max_value` are
optional here, so it is easier to not include NaNs in the min/max in
case of an only-NaN page or column chunk: Simply omit these optional
fields. However, this brings a semantic ambiguity with it, as it is
now unclear whether the min/max value wasn't written because there were
only NaNs, or simply because the writing engine decided to omit them
for whatever other reason, which the spec allows as the field is
optional.
Consequently, a reader cannot know whether missing `min_value` and
`max_value` means "only NaNs, I can skip this page if I am looking for
only non-NaN values" or "no stats written, you have to read this page
as it is undefined what values it contains".
It would be nice if we could handle NaNs in a way that would allow scan
pruning for these only-NaN pages.
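The ambiguity can be made concrete with a small sketch; the `Statistics` class here is a hypothetical stand-in for the relevant Thrift struct, not the actual definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Statistics:
    # Hypothetical stand-in for the relevant Thrift `Statistics` fields.
    min_value: Optional[float] = None
    max_value: Optional[float] = None

# Writer A omits min/max because the page contains only NaNs; writer B
# omits them because it chose not to compute statistics at all. The
# resulting statistics are identical, so a reader cannot distinguish
# the skippable only-NaN page from the must-read unknown page.
only_nan_page_stats = Statistics()
no_stats_at_all = Statistics()
assert only_nan_page_stats == no_stats_at_all
```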
Solution
========
IEEE 754 total order solves all the mentioned problems. As NaNs now
have a defined place in the ordering, they can be incorporated into min
and max bounds. In fact, in contrast to the default ordering, they do
not need any special casing anymore, so all the remarks on how readers
and writers should special-case NaNs and -0/+0 no longer apply to the
new ordering.
As NaNs are incorporated into min and max, a reader can now see whether
NaNs are contained through the statistics. Thus, a reading engine just
has to map its NaN semantics to the NaN semantics of total ordering.
For example, if the semantics of the reading engine treat all NaNs
(also -NaNs) as greater than all other values, a reading engine having
a predicate `x > 5.0` (which should include NaNs) may not filter any
pages / row groups if either min or max are (+/-)NaN.
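As a sketch, the same `x > threshold` skip decision under the new order might look like this (hypothetical helper, not part of the spec):

```python
import math

def can_skip_page_gt(page_min, page_max, threshold):
    # Predicate: x > threshold; engine semantics: NaN > all other values.
    # Under IEEE 754 total order, NaNs are included in min/max, so a
    # NaN-free page is recognizable: neither bound is NaN.
    if math.isnan(page_min) or math.isnan(page_max):
        return False  # page may contain NaNs, which satisfy x > threshold
    return page_max <= threshold

# A NaN-free page with max = 0.0 can now be skipped for x > 1.0:
assert can_skip_page_gt(-1.0, 0.0, 1.0) is True
# A page whose max is NaN cannot:
assert can_skip_page_gt(-1.0, float("nan"), 1.0) is False
```

The point of the sketch is that pruning now works whenever the page provably contains no NaNs, which the old order could never establish.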
Only-NaN pages can now also be included in the column index, as they
are no longer a special case.
[1] apache#196
[2] apache/arrow-rs#264 (comment)
3 files changed (+64 −4 lines)