Description
Feature Request / Improvement
Iceberg's Metrics Reporting API
We've started a discussion about the Metrics Reporting API in apache/iceberg-rust#1466. The API is part of the catalog spec and covers monitoring of Iceberg clients' access to files in object storage, e.g. the number of files considered, scanned and skipped during scan planning (with similar metrics for commits). These metrics are otherwise not visible to the catalog, and the Metrics Reporting API provides a standard interface for aggregating them across clients.
Currently, only Iceberg Java ships with an implementation for metrics reporting. While providing a pluggable interface, it comes with the default implementations LoggingMetricsReporter and RestMetricsReporter. The latter is used in combination with REST catalogs and sends recorded metrics over for server-side processing.
Existing Telemetry APIs
On the draft implementation, @sdd raised a good point: we now have other, often more idiomatic interfaces available (apache/iceberg-rust#1496 (comment)). In Rust, for example, we've decided on using the metrics facade, which users can back with any exporter they like, offering simple integration with existing observability systems. In Go, opentelemetry offers similar functionality.
With existing telemetry APIs, reporting code could be much simpler and integrating with a backend would be easier (no custom code needed).
Metric Names
Emitting metrics straight from the library means we also need to standardize metric names; otherwise implementations could diverge, defeating the idea of a unified way of monitoring Iceberg clients.
I would like to propose a naming scheme similar to the one in @sdd's PoC, of the form iceberg.<operation>.<resource>.<count-type>, for example iceberg.scan.data_files.scanned, iceberg.scan.delete_manifests.skipped or iceberg.commit.delete_files.added. Existing metrics can be taken from ScanMetricsResult.java and CommitMetricsResult.java.
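To make the proposed scheme concrete, here is a minimal Go sketch of a helper that assembles and validates such names. The function MetricName and the rule that each segment is lowercase snake_case are assumptions made for illustration; neither exists in iceberg-go or in @sdd's PoC.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segment matches one lowercase snake_case name component, e.g. "data_files".
// This validation rule is an assumption, not something the proposal mandates.
var segment = regexp.MustCompile(`^[a-z][a-z0-9_]*$`)

// MetricName builds "iceberg.<operation>.<resource>.<count-type>".
// It is a hypothetical helper, not part of any Iceberg library.
func MetricName(operation, resource, countType string) (string, error) {
	parts := []string{operation, resource, countType}
	for _, p := range parts {
		if !segment.MatchString(p) {
			return "", fmt.Errorf("invalid metric name segment %q", p)
		}
	}
	return "iceberg." + strings.Join(parts, "."), nil
}

func main() {
	// The three example names from the proposal above.
	examples := [][3]string{
		{"scan", "data_files", "scanned"},
		{"scan", "delete_manifests", "skipped"},
		{"commit", "delete_files", "added"},
	}
	for _, e := range examples {
		name, err := MetricName(e[0], e[1], e[2])
		if err != nil {
			panic(err)
		}
		fmt.Println(name)
	}
}
```

Building names in one place like this would make it harder for individual call sites to drift from the agreed scheme, whichever telemetry API ends up emitting them.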
Catalog Spec
The Metrics Reporting API is part of the catalog spec which suggests that we should consider implementing it anyway. If we can prove with an experiment that (for example) an opentelemetry exporter can consume a spec-compliant reporter interface, we should be good. If we can't, we need to take this into consideration.
With the spec's API, multiple metrics are bundled together into a single report. This doesn't seem natural for other metrics APIs and could become an implementation burden.
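One way to picture the experiment is a small adapter that takes a bundled report and emits each value as an individual counter. The sketch below is only an illustration under assumptions: ScanReport, Recorder, MemoryRecorder and EmitScanReport are hypothetical names, the report is heavily trimmed compared to the spec, and the in-memory recorder stands in for a real backend such as an opentelemetry exporter.

```go
package main

import (
	"fmt"
	"sort"
)

// ScanReport is a hypothetical, trimmed-down stand-in for a bundled scan
// report. Counters are keyed by "<resource>.<count-type>".
type ScanReport struct {
	TableName string
	Counters  map[string]int64
}

// Recorder is a hypothetical facade over a metrics backend that accepts
// one counter increment at a time.
type Recorder interface {
	Add(name, table string, delta int64)
}

// MemoryRecorder accumulates totals in memory so the sketch runs
// without any external dependency.
type MemoryRecorder struct {
	Totals map[string]int64
}

// Add increments the total for the given metric name and table.
func (m *MemoryRecorder) Add(name, table string, delta int64) {
	m.Totals[name+"{table="+table+"}"] += delta
}

// EmitScanReport unbundles a report into individual counter increments,
// prefixing each key with "iceberg.scan.".
func EmitScanReport(r Recorder, report ScanReport) {
	keys := make([]string, 0, len(report.Counters))
	for k := range report.Counters {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic emission order
	for _, k := range keys {
		r.Add("iceberg.scan."+k, report.TableName, report.Counters[k])
	}
}

func main() {
	rec := &MemoryRecorder{Totals: map[string]int64{}}
	report := ScanReport{
		TableName: "db.events",
		Counters: map[string]int64{
			"data_files.scanned":       12,
			"delete_manifests.skipped": 3,
		},
	}
	EmitScanReport(rec, report)
	for name, total := range rec.Totals {
		fmt.Println(name, total)
	}
}
```

The unbundling direction shown here is straightforward. Going the other way, i.e. reassembling a single spec-style report from independently emitted counters, is where the implementation burden mentioned above would likely show up.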
I want to use this issue to:
- start a general discussion about metrics reporting in Go, because I find it tremendously useful when working with many clients and would like to contribute such functionality
- extend the discussion about following the Java implementation vs. using more idiomatic approaches, because I would like to see different implementations moving in a similar direction
- find agreement on metric names if we choose this path
See also apache/iceberg-python#474 (comment) for a similar discussion in Python.