[RFC] Promote TelemetryPlugin and TelemetryAwarePlugin from @ExperimentalApi to @PublicApi #20692

@karenyrx

Description

Is your feature request related to a problem? Please describe

Summary

We're interested in promoting the TelemetryPlugin and TelemetryAwarePlugin interfaces from @ExperimentalApi to @PublicApi status to provide stability guarantees for the growing number of plugins that depend on this extension point.

We're seeking community feedback on:

  • What gaps or technical challenges currently prevent stabilization?
  • Whether scalability, performance, or API design concerns exist that need addressing
  • Community interest in helping assess the readiness of these APIs for stabilization

Current Status

Both interfaces are currently marked as @ExperimentalApi.

This means they can change or be removed at any time (major, minor, or patch releases) without backwards compatibility guarantees.

Use Cases

1. External Plugins Adopting Telemetry

Multiple plugins are beginning to implement TelemetryAwarePlugin to emit metrics and traces:

  • ml-commons: ML plugin implementing TelemetryAwarePlugin to receive MetricsRegistry for instrumenting ML operations
  • time-series-db: Time-series database plugin implementing TelemetryAwarePlugin to emit telemetry metrics for database operations

These plugins need API stability to avoid breaking changes in production environments.
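As a rough illustration of this use case, the sketch below shows a plugin component that receives a metrics registry and counts operations. The types declared here are hypothetical, simplified stand-ins written for this example (they only mirror the names used in this RFC), and `InferenceService`, `InMemoryRegistry`, and the metric name are invented for illustration; the real interfaces live in OpenSearch core and may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

// Hypothetical stand-ins for the telemetry types named in this RFC.
// They are NOT the OpenSearch classes; they exist only to make the sketch runnable.
class MetricsPluginSketch {

    /** Stand-in for Tags: ordered key/value dimensions attached to a measurement. */
    static final class Tags {
        private final Map<String, String> values = new LinkedHashMap<>();

        static Tags create() {
            return new Tags();
        }

        Tags addTag(String key, String value) {
            values.put(key, value);
            return this;
        }
    }

    /** Stand-in for Counter. */
    interface Counter {
        void add(double value);

        void add(double value, Tags tags);
    }

    /** Stand-in for MetricsRegistry, reduced to counter creation. */
    interface MetricsRegistry {
        Counter createCounter(String name, String description, String unit);
    }

    /** In-memory registry (assumption): sums counter values per metric name and ignores tags. */
    static final class InMemoryRegistry implements MetricsRegistry {
        private final Map<String, DoubleAdder> totals = new ConcurrentHashMap<>();

        @Override
        public Counter createCounter(String name, String description, String unit) {
            DoubleAdder total = totals.computeIfAbsent(name, key -> new DoubleAdder());
            return new Counter() {
                @Override
                public void add(double value) {
                    total.add(value);
                }

                @Override
                public void add(double value, Tags tags) {
                    total.add(value);
                }
            };
        }

        double total(String name) {
            DoubleAdder adder = totals.get(name);
            return adder == null ? 0.0 : adder.sum();
        }
    }

    /** Hypothetical plugin component that is handed the registry when it is created. */
    static final class InferenceService {
        private final Counter requests;

        InferenceService(MetricsRegistry registry) {
            // Create the instrument once and reuse it for every request.
            this.requests = registry.createCounter("ml.inference.requests", "Number of inference requests", "1");
        }

        void infer(String modelId) {
            requests.add(1.0, Tags.create().addTag("model_id", modelId));
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        InferenceService service = new InferenceService(registry);
        service.infer("model-a");
        service.infer("model-b");
        System.out.println(registry.total("ml.inference.requests"));
    }
}
```

Every call the component makes goes through the registry and counter types, so a signature change in any of them would break the plugin at compile time or load time. That is the exposure this RFC aims to remove.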

2. Core OpenSearch Already Relies on These Interfaces

The core OpenSearch codebase has deeply integrated these interfaces:

  • Node initialization (Node.java:687-700): Filters and validates TelemetryPlugin implementations during node startup
  • TelemetryModule (TelemetryModule.java): Core module that loads and registers telemetry plugins
  • Plugin component creation (Node.java:1099): Injects Tracer and MetricsRegistry into TelemetryAwarePlugin components
  • Official telemetry-otel plugin: Ships with OpenSearch as the reference TelemetryPlugin implementation

If the interface changes, it would break not just external plugins but also core OpenSearch telemetry infrastructure.

Maturity & Stability

  1. Age: Introduced in OpenSearch 2.9.0 (July 2023) and in production for ~2.5 years
  2. Interface Stability: The API surface is minimal and well-designed:
    public interface TelemetryPlugin {
        Optional<Telemetry> getTelemetry(TelemetrySettings telemetrySettings);
        String getName();
    }
  3. No Breaking Changes: The interface has remained stable since its introduction
  4. Production Usage: Actively used by the bundled telemetry-otel plugin and test frameworks
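To show how small the contract in item 2 is, here is a minimal sketch of an implementation. The `TelemetryPlugin` declaration repeats the two methods quoted above; everything else (`SketchTelemetryPlugin`, `NoopTelemetry`, and the reduced `Telemetry` and `TelemetrySettings` stand-ins) is hypothetical and written only for this example, not taken from OpenSearch core.

```java
import java.util.Optional;

// Hypothetical stand-ins; only the two TelemetryPlugin methods come from this RFC.
class TelemetryPluginSketch {

    interface MetricsTelemetry {}

    interface TracingTelemetry {}

    /** Stand-in for Telemetry, exposing the two accessors named in this RFC. */
    interface Telemetry {
        TracingTelemetry getTracingTelemetry();

        MetricsTelemetry getMetricsTelemetry();
    }

    /** Reduced stand-in (assumption): the real settings type exposes more than one flag. */
    interface TelemetrySettings {
        boolean isMetricsFeatureEnabled();
    }

    /** The interface as quoted in this RFC. */
    interface TelemetryPlugin {
        Optional<Telemetry> getTelemetry(TelemetrySettings telemetrySettings);

        String getName();
    }

    /** Telemetry that hands out empty marker objects. */
    static final class NoopTelemetry implements Telemetry {
        @Override
        public TracingTelemetry getTracingTelemetry() {
            return new TracingTelemetry() {};
        }

        @Override
        public MetricsTelemetry getMetricsTelemetry() {
            return new MetricsTelemetry() {};
        }
    }

    /** Hypothetical plugin: returns telemetry only when the metrics feature is enabled. */
    static final class SketchTelemetryPlugin implements TelemetryPlugin {
        @Override
        public Optional<Telemetry> getTelemetry(TelemetrySettings telemetrySettings) {
            return telemetrySettings.isMetricsFeatureEnabled()
                ? Optional.of(new NoopTelemetry())
                : Optional.empty();
        }

        @Override
        public String getName() {
            return "sketch-telemetry";
        }
    }

    public static void main(String[] args) {
        TelemetryPlugin plugin = new SketchTelemetryPlugin();
        System.out.println(plugin.getName() + " enabled=" + plugin.getTelemetry(() -> true).isPresent());
        System.out.println(plugin.getName() + " enabled=" + plugin.getTelemetry(() -> false).isPresent());
    }
}
```

Although the plugin interface itself has only two methods, its signature pulls in `Telemetry` and `TelemetrySettings`, so the stability commitment extends to those types as well.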

Risks of Not Promoting

  1. Plugin Ecosystem Fragmentation: External plugins, including ml-commons and time-series-db, hesitate to adopt telemetry because the API could break at any time
  2. Breaking Changes in Minor Releases: Since it's @ExperimentalApi, the interface could change in any minor or even patch release, breaking all dependent plugins
  3. Signals Telemetry is Not Production-Ready: Despite being shipped with OpenSearch for 2.5 years, the experimental tag suggests it's not ready for production use

Proposal: Phased Stabilization Starting with Metrics

We propose a phased approach to stabilization, starting with metrics-related components, followed by tracing in a later release.

Phase 1: Metrics Stabilization

@PublicApi(since = "TBD")
public interface TelemetryPlugin { ... }

@PublicApi(since = "TBD")
public interface TelemetryAwarePlugin { ... }

// Metrics-related types
@PublicApi(since = "TBD")
public interface MetricsRegistry { ... }

@PublicApi(since = "TBD")
public interface Counter { ... }

@PublicApi(since = "TBD")
public interface Histogram { ... }

@PublicApi(since = "TBD")
public interface Tags { ... }

// Core dependencies
@PublicApi(since = "TBD")
public interface Telemetry { ... }

@PublicApi(since = "TBD")
public interface TelemetrySettings { ... }

@PublicApi(since = "TBD")
public interface MetricsTelemetry { ... }

Phase 2: Tracing Stabilization (Target: Later Release)

Stabilize tracing-related types after metrics prove stable:

  • Tracer, Span, SpanScope, ScopedSpan, SpanContext, SpanCreationContext, Attributes, etc.

Rationale for Phased Approach

  • Smaller scope: ~8 types for metrics vs. ~20 total
  • Clear use case: Both ml-commons and time-series-db primarily need metrics
  • Lower risk: Metrics API is simpler and more stable than distributed tracing
  • Learn from adoption: Gather feedback from Phase 1 before committing to tracing APIs

Expected benefits:

  • Backwards compatibility guarantees for metrics within major releases
  • Enables production use of metrics in external plugins
  • Smaller initial commitment while validating stability approach

Open questions for the community:

  • Is metrics-only stabilization useful, or do plugins need both metrics and tracing together?
  • Are there known gaps in the metrics API that should be addressed first?
  • Do scalability or performance concerns exist at high metrics volumes?
  • Are there planned changes to OpenTelemetry integration that would require breaking changes?

Related Dependencies: Transitive API Stability

A critical challenge in promoting TelemetryPlugin to @PublicApi is that all types referenced in its method signatures must also be stable.

For TelemetryPlugin:

@PublicApi  // Stable interface
public interface TelemetryPlugin {
    Optional<Telemetry> getTelemetry(TelemetrySettings settings);
    //      ^^^^^^^^^^                ^^^^^^^^^^^^^^^^^^
    //      Both currently @ExperimentalApi - must also stabilize
}

For TelemetryAwarePlugin:

@PublicApi  // Stable interface
public interface TelemetryAwarePlugin {
    Collection<Object> createComponents(..., Tracer tracer, MetricsRegistry metricsRegistry);
    //                                   ^^^^^^           ^^^^^^^^^^^^^^^^
    //                                   Both @ExperimentalApi - must also stabilize
}
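One way to surface these transitive dependencies mechanically is to walk an interface's method signatures with reflection and report any referenced type that is still marked experimental. The sketch below does that for a toy copy of `TelemetryPlugin`. The annotations and interfaces declared here are hypothetical stand-ins with runtime retention, defined only so the example runs on its own. Whether the real OpenSearch annotations are visible at runtime is not something this sketch assumes. The check covers only directly referenced types and generic type arguments; it does not recurse into those types or handle wildcards.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

class TransitiveStabilityCheck {

    // Hypothetical stand-in annotations, retained at runtime so reflection can see them.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface PublicApi {
        String since();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ExperimentalApi {}

    // Toy copies of the types discussed above, annotated as the RFC describes them.
    @ExperimentalApi
    interface Telemetry {}

    @ExperimentalApi
    interface TelemetrySettings {}

    @PublicApi(since = "TBD")
    interface TelemetryPlugin {
        Optional<Telemetry> getTelemetry(TelemetrySettings settings);

        String getName();
    }

    /** Returns the simple names of experimental types referenced by the methods of {@code api}. */
    static Set<String> unstableTypes(Class<?> api) {
        Set<String> found = new TreeSet<>();
        for (Method method : api.getMethods()) {
            collect(method.getGenericReturnType(), found);
            for (Type parameter : method.getGenericParameterTypes()) {
                collect(parameter, found);
            }
        }
        return found;
    }

    /** Inspects a type and, for parameterized types, each of its type arguments. */
    private static void collect(Type type, Set<String> found) {
        if (type instanceof ParameterizedType) {
            ParameterizedType parameterized = (ParameterizedType) type;
            collect(parameterized.getRawType(), found);
            for (Type argument : parameterized.getActualTypeArguments()) {
                collect(argument, found);
            }
        } else if (type instanceof Class) {
            Class<?> clazz = (Class<?>) type;
            if (clazz.isAnnotationPresent(ExperimentalApi.class)) {
                found.add(clazz.getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        // prints [Telemetry, TelemetrySettings]
        System.out.println(unstableTypes(TelemetryPlugin.class));
    }
}
```

A check of this kind could run as a unit test, so that promoting an interface fails the build until every type in its signatures has been promoted too.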

Complete List of Dependent Types

Metrics-related (proposed for Phase 1):

  • MetricsRegistry - parameter to TelemetryAwarePlugin.createComponents()
  • Counter - returned by MetricsRegistry.createCounter()
  • Histogram - returned by MetricsRegistry.createHistogram()
  • Tags - parameter to Counter.add() and Histogram.record() methods
  • TaggedMeasurement - used in MetricsRegistry.createGauge()
  • MetricsTelemetry - exposed by Telemetry.getMetricsTelemetry()

Core types (needed for both metrics and tracing):

  • Telemetry - returned by TelemetryPlugin.getTelemetry()
  • TelemetrySettings - parameter to TelemetryPlugin.getTelemetry()

Tracing-related (proposed for Phase 2 - later release):

  • TracingTelemetry - exposed by Telemetry.getTracingTelemetry()
  • Tracer - parameter to TelemetryAwarePlugin.createComponents()
  • Span - returned by Tracer.startSpan()
  • SpanScope - returned by Tracer.withSpanInScope()
  • ScopedSpan - returned by Tracer.startScopedSpan()
  • SpanContext - returned by Tracer.getCurrentSpan()
  • SpanCreationContext - parameter to Tracer.startSpan()
  • SpanKind - already @PublicApi(since = "2.11.0")
  • Attributes - used by Span.addAttribute()
  • TransportTracer - parent interface of Tracer

Questions for the Community

We believe the metrics APIs are ready for stabilization, but want to identify any gaps or concerns first.

API Completeness & Design

  • Are there known gaps in the metrics API that plugins commonly encounter?
  • Have you needed workarounds or hacks for metrics collection?
  • Are there OpenTelemetry metrics features we should expose before stabilizing?

Performance & Scalability

  • Are there performance bottlenecks at high metrics volumes?
  • Does the current API design impose limitations on optimization?
  • Do you have concerns about the plugin lifecycle or metrics registry initialization?

OpenTelemetry Compatibility

  • Are there planned OTel upgrades that would require breaking changes to metrics APIs?
  • Do the current abstractions adequately insulate plugins from OTel evolution?

Maintainer & Testing Capacity

  • What's the maintainer capacity for supporting stable metrics APIs long-term?
  • Are there edge cases in the implementation that need addressing first?
  • Do we have adequate test coverage to guarantee metrics API stability?

Plugin Developer Needs

  • Is metrics-only stabilization useful, or do you need tracing APIs stabilized simultaneously?
  • Are there breaking changes you'd want to make before stabilization?
  • Would you be willing to test pre-release stable APIs and provide feedback?

Potential Path Forward

Phase 1: Gap Assessment (2-4 weeks)

  • Survey plugin developers (ml-commons, time-series-db, telemetry-otel) about metrics API pain points
  • Document known gaps or missing functionality
  • Identify planned OTel upgrades affecting metrics
  • Assess maintainer capacity for long-term support

Phase 2: Metrics API Audit (3-4 weeks)

  • Review metrics-related types for completeness (~8 types)
  • Document compatibility guarantees and expected behavior
  • Identify breaking changes needed before stabilization
  • Set up japicmp compatibility tests for metrics APIs

Phase 3: RFC & Review (2-3 weeks)

  • Publish formal RFC with findings from Phases 1-2
  • Gather community feedback on metrics stabilization
  • Get maintainer buy-in on long-term support commitment

Phase 4: Metrics Stabilization

  • Promote metrics interfaces to @PublicApi
  • Update documentation and migration guides
  • Add comprehensive compatibility tests
  • Monitor early adoption

Related component

Libraries

Describe alternatives you've considered

No response

Additional context

No response
