feat: metrics for parquet writer #651
wgtmac left a comment
A couple of metrics edge cases to fix before this lands.
ICEBERG_ASSIGN_OR_RAISE(auto truncated_lower,
                        TruncateUtils::TruncateLowerBound(
                            *iceberg_type, lower_bound.value(), truncate_length));
ICEBERG_ASSIGN_OR_RAISE(auto truncated_upper,
This turns a missing representable upper bound into a hard metrics failure. For truncate(N), Java returns null from BinaryUtil.truncateBinaryMax / UnicodeUtil.truncateStringMax when values like 0xff... cannot produce a safe upper bound, and ParquetMetrics just omits the upper bound. Here TruncateUpperBound bubbles an InvalidArgument out of writer->metrics(), so a valid file can be written but fail while building DataFile metrics. Can we treat this case as no upper bound instead?
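One way to read this suggestion is to have the upper-bound truncation return an empty optional instead of an error, and let the caller skip the bound. The sketch below is only an illustration under stated assumptions: `TruncateBinaryMax`, `Bounds`, and `MakeBounds` are hypothetical names, not the PR's `TruncateUtils` API, and the byte-increment logic follows the Java behavior only as described in the comment above.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not the PR's TruncateUtils API).
// Returns an upper bound of at most `length` bytes that compares >= `value`,
// or std::nullopt when no such bound can be represented.
std::optional<std::string> TruncateBinaryMax(const std::string& value,
                                             std::size_t length) {
  if (value.size() <= length) {
    return value;  // short enough already: the value is its own upper bound
  }
  std::string truncated = value.substr(0, length);
  // Walk backwards looking for a byte that can be incremented without overflow.
  for (std::size_t i = truncated.size(); i > 0; --i) {
    auto byte = static_cast<unsigned char>(truncated[i - 1]);
    if (byte != 0xFF) {
      truncated[i - 1] = static_cast<char>(byte + 1);
      truncated.resize(i);  // drop the trailing 0xFF bytes
      return truncated;
    }
  }
  return std::nullopt;  // every kept byte is 0xFF: no safe upper bound exists
}

// Hypothetical container for one column's bounds.
struct Bounds {
  std::optional<std::string> lower;
  std::optional<std::string> upper;
};

// Builds truncated bounds; a missing upper bound is omitted, not an error.
Bounds MakeBounds(const std::string& min, const std::string& max,
                  std::size_t length) {
  Bounds bounds;
  bounds.lower = min.substr(0, length);  // a prefix always compares <= min
  bounds.upper = TruncateBinaryMax(max, length);  // may stay empty
  return bounds;
}
```

With this shape, a column whose maximum is all `0xFF` bytes still gets a lower bound, and the metrics call succeeds with the upper bound left unset.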
  return Metrics();
}
return ParquetMetrics::GetMetrics(*schema_, *parquet_schema_, *metrics_config_,
                                  *metadata_, {});
This drops the write-side field metrics. Java passes model.metrics() into ParquetMetrics.metrics(...), which is where float/double NaN counts come from. With {} here, nan_value_counts stays empty even when the file has NaNs, and the tests currently skip that assertion. We should either collect/pass FieldMetrics here or leave NaN metrics unsupported explicitly.
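For the first option, the write path would need to count NaNs as values are written, since the Parquet footer statistics do not carry that count. A minimal sketch follows, assuming a hypothetical `NanCounter` class keyed by field ID; it is not the PR's `FieldMetrics` type, and recording a zero count for every observed float field is an assumption made here for illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <map>

// Hypothetical write-side collector (not the PR's FieldMetrics type).
class NanCounter {
 public:
  // Call once per float/double value written for the given field ID.
  void Observe(int32_t field_id, double value) {
    int64_t& count = counts_[field_id];  // registers the field with 0 if new
    if (std::isnan(value)) {
      ++count;
    }
  }

  // Per-field NaN counts, suitable for passing to the metrics builder
  // in place of the empty {} argument.
  const std::map<int32_t, int64_t>& counts() const { return counts_; }

 private:
  std::map<int32_t, int64_t> counts_;
};
```

The writer would own one such collector, feed it from the float and double column writers, and hand `counts()` to the metrics call when the file is closed.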