Skip to content

feat: metrics for parquet writer#651

Open
WZhuo wants to merge 2 commits into
apache:mainfrom
WZhuo:parquet_metrics
Open

feat: metrics for parquet writer#651
WZhuo wants to merge 2 commits into
apache:mainfrom
WZhuo:parquet_metrics

Conversation

@WZhuo
Copy link
Copy Markdown
Contributor

@WZhuo WZhuo commented May 11, 2026

No description provided.

@WZhuo WZhuo force-pushed the parquet_metrics branch from 63f7e38 to 4ec4aae Compare May 21, 2026 04:11
@WZhuo WZhuo force-pushed the parquet_metrics branch from 4ec4aae to 01f825d Compare May 21, 2026 09:05
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of metrics edge cases to fix before this lands.

ICEBERG_ASSIGN_OR_RAISE(auto truncated_lower,
TruncateUtils::TruncateLowerBound(
*iceberg_type, lower_bound.value(), truncate_length));
ICEBERG_ASSIGN_OR_RAISE(auto truncated_upper,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns a missing representable upper bound into a hard metrics failure. For truncate(N), Java returns null from BinaryUtil.truncateBinaryMax / UnicodeUtil.truncateStringMax when values like 0xff... cannot produce a safe upper bound, and ParquetMetrics just omits the upper bound. Here TruncateUpperBound bubbles an InvalidArgument out of writer->metrics(), so a valid file can be written but fail while building DataFile metrics. Can we treat this case as no upper bound instead?

return Metrics();
}
return ParquetMetrics::GetMetrics(*schema_, *parquet_schema_, *metrics_config_,
*metadata_, {});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This drops the write-side field metrics. Java passes model.metrics() into ParquetMetrics.metrics(...), which is where float/double NaN counts come from. With {} here, nan_value_counts stays empty even when the file has NaNs, and the tests currently skip that assertion. We should either collect/pass FieldMetrics here or leave NaN metrics unsupported explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants