Description
Is your feature request related to a problem? Please describe
Summary
We're interested in promoting the TelemetryPlugin and TelemetryAwarePlugin interfaces from @ExperimentalApi to @PublicApi status to provide stability guarantees for the growing number of plugins that depend on this extension point.
We're seeking community feedback on:
- What gaps or technical challenges currently prevent stabilization?
- Are there scalability, performance, or API design concerns that need addressing?
- Is there community interest in helping assess the readiness of these APIs for stabilization?
Current Status
Both interfaces are currently marked as @ExperimentalApi.
This means they can change or be removed at any time (major, minor, or patch releases) without backwards compatibility guarantees.
Use Cases
1. External Plugins Adopting Telemetry
Multiple plugins are beginning to implement TelemetryAwarePlugin to emit metrics and traces:
- ml-commons: ML plugin implementing TelemetryAwarePlugin to receive MetricsRegistry for instrumenting ML operations
- time-series-db: Time-series database plugin implementing TelemetryAwarePlugin to emit telemetry metrics for database operations
These plugins need API stability to avoid breaking changes in production environments.
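As a rough illustration of the usage pattern described above, the sketch below shows a plugin component that receives a MetricsRegistry and creates a counter from it. The nested Counter and MetricsRegistry interfaces are simplified stand-ins written for this example, not the real OpenSearch types, and their signatures are assumptions. InMemoryRegistry, PredictionService, and the metric name ml.predictions are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class MetricsPluginSketch {
    // Simplified stand-ins for the types named in this issue; real signatures may differ.
    interface Counter { void add(double value); }
    interface MetricsRegistry { Counter createCounter(String name, String description, String unit); }

    // Hypothetical in-memory registry so the sketch runs without OpenSearch on the classpath.
    static final class InMemoryRegistry implements MetricsRegistry {
        final Map<String, Double> totals = new HashMap<>();

        @Override
        public Counter createCounter(String name, String description, String unit) {
            totals.put(name, 0.0);
            // Each add() accumulates into the running total for this counter.
            return value -> totals.merge(name, value, Double::sum);
        }
    }

    // Hypothetical plugin component: receives the registry once and creates its instruments up front.
    static final class PredictionService {
        private final Counter predictions;

        PredictionService(MetricsRegistry registry) {
            this.predictions = registry.createCounter("ml.predictions", "Number of predictions served", "1");
        }

        void predict() {
            predictions.add(1.0);
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        PredictionService service = new PredictionService(registry);
        service.predict();
        service.predict();
        System.out.println(registry.totals); // {ml.predictions=2.0}
    }
}
```

If the shape of createCounter or Counter.add changed in a minor release, every component written like PredictionService would stop compiling, which is the breakage these plugins are trying to avoid.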
2. Core OpenSearch Already Relies on These Interfaces
The core OpenSearch codebase has deeply integrated these interfaces:
- Node initialization (Node.java:687-700): Filters and validates TelemetryPlugin implementations during node startup
- TelemetryModule (TelemetryModule.java): Core module that loads and registers telemetry plugins
- Plugin component creation (Node.java:1099): Injects Tracer and MetricsRegistry into TelemetryAwarePlugin components
- Official telemetry-otel plugin: Ships with OpenSearch as the reference TelemetryPlugin implementation
If the interface changes, it would break not just external plugins but also core OpenSearch telemetry infrastructure.
Maturity & Stability
- Age: Introduced in OpenSearch 2.9.0 (July 2023); has been in production for ~2.5 years
- Interface Stability: The API surface is minimal and well-designed:
public interface TelemetryPlugin {
    Optional<Telemetry> getTelemetry(TelemetrySettings telemetrySettings);
    String getName();
}
- No Breaking Changes: The interface has remained stable since its introduction
- Production Usage: Actively used by the bundled telemetry-otel plugin and test frameworks
Risks of Not Promoting
- Plugin Ecosystem Fragmentation: External plugins hesitate to adopt telemetry because the API could break at any time - this includes ml-commons and time-series-db
- Breaking Changes in Minor Releases: Since it's @ExperimentalApi, the interface could theoretically change in a patch release, breaking all dependent plugins
- Signals Telemetry is Not Production-Ready: Despite being shipped with OpenSearch for 2.5 years, the experimental tag suggests it's not ready for production use
Proposal: Phased Stabilization Starting with Metrics
We propose a phased approach to stabilization, starting with metrics-related components, followed by tracing in a later release.
Phase 1: Metrics Stabilization
@PublicApi(since = "TBD")
public interface TelemetryPlugin { ... }
@PublicApi(since = "TBD")
public interface TelemetryAwarePlugin { ... }
// Metrics-related types
@PublicApi(since = "TBD")
public interface MetricsRegistry { ... }
@PublicApi(since = "TBD")
public interface Counter { ... }
@PublicApi(since = "TBD")
public interface Histogram { ... }
@PublicApi(since = "TBD")
public interface Tags { ... }
// Core dependencies
@PublicApi(since = "TBD")
public interface Telemetry { ... }
@PublicApi(since = "TBD")
public interface TelemetrySettings { ... }
@PublicApi(since = "TBD")
public interface MetricsTelemetry { ... }
Phase 2: Tracing Stabilization (Target: Later Release)
Stabilize tracing-related types after metrics prove stable:
Tracer, Span, SpanScope, ScopedSpan, SpanContext, SpanCreationContext, Attributes, etc.
Rationale for Phased Approach
- Smaller scope: ~8 types for metrics vs. ~20 total
- Clear use case: Both ml-commons and time-series-db primarily need metrics
- Lower risk: Metrics API is simpler and more stable than distributed tracing
- Learn from adoption: Gather feedback from Phase 1 before committing to tracing APIs
Expected benefits:
- Backwards compatibility guarantees for metrics within major releases
- Enables production use of metrics in external plugins
- Smaller initial commitment while validating stability approach
Open questions for the community:
- Is metrics-only stabilization useful, or do plugins need both metrics and tracing together?
- Are there known gaps in the metrics API that should be addressed first?
- Do scalability or performance concerns exist at high metrics volumes?
- Are there planned changes to OpenTelemetry integration that would require breaking changes?
Related Dependencies: Transitive API Stability
A critical challenge in promoting TelemetryPlugin to @PublicApi is that all types referenced in its method signatures must also be stable.
For TelemetryPlugin:
@PublicApi // Stable interface
public interface TelemetryPlugin {
    Optional<Telemetry> getTelemetry(TelemetrySettings settings);
    //       ^^^^^^^^^               ^^^^^^^^^^^^^^^^^
    // Both currently @ExperimentalApi - must also stabilize
}
For TelemetryAwarePlugin:
@PublicApi // Stable interface
public interface TelemetryAwarePlugin {
    Collection<Object> createComponents(..., Tracer tracer, MetricsRegistry metricsRegistry);
    //                                       ^^^^^^         ^^^^^^^^^^^^^^^
    // Both @ExperimentalApi - must also stabilize
}
Complete List of Dependent Types
Metrics-related (proposed for Phase 1):
- MetricsRegistry - parameter to TelemetryAwarePlugin.createComponents()
- Counter - returned by MetricsRegistry.createCounter()
- Histogram - returned by MetricsRegistry.createHistogram()
- Tags - parameter to Counter.add() and Histogram.record() methods
- TaggedMeasurement - used in MetricsRegistry.createGauge()
- MetricsTelemetry - exposed by Telemetry.getMetricsTelemetry()
Core types (needed for both metrics and tracing):
- Telemetry - returned by TelemetryPlugin.getTelemetry()
- TelemetrySettings - parameter to TelemetryPlugin.getTelemetry()
Tracing-related (proposed for Phase 2 - later release):
- TracingTelemetry - exposed by Telemetry.getTracingTelemetry()
- Tracer - parameter to TelemetryAwarePlugin.createComponents()
- Span - returned by Tracer.startSpan()
- SpanScope - returned by Tracer.withSpanInScope()
- ScopedSpan - returned by Tracer.startScopedSpan()
- SpanContext - returned by Tracer.getCurrentSpan()
- SpanCreationContext - parameter to Tracer.startSpan()
- SpanKind - already @PublicApi(since = "2.11.0")
- Attributes - used by Span.addAttribute()
- TransportTracer - parent interface of Tracer
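The transitive requirement described in this section can be checked mechanically. The following is a minimal reflection-based sketch, not how OpenSearch itself enforces the rule (the project may do this differently, for example at compile time). The nested annotations and interfaces are stand-ins declared for this example, with runtime retention assumed so reflection can see them. TransitiveStabilityCheck and unstableTypes are hypothetical names.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

final class TransitiveStabilityCheck {
    // Stand-in annotations; runtime retention is an assumption made for this sketch.
    @Retention(RetentionPolicy.RUNTIME) @interface PublicApi { String since(); }
    @Retention(RetentionPolicy.RUNTIME) @interface ExperimentalApi {}

    // Stand-in types mirroring the current annotation state described in this issue.
    @ExperimentalApi interface Telemetry {}
    @ExperimentalApi interface TelemetrySettings {}

    @PublicApi(since = "TBD")
    interface TelemetryPlugin {
        Optional<Telemetry> getTelemetry(TelemetrySettings settings);
        String getName();
    }

    // Returns the simple names of @ExperimentalApi types reachable from the method signatures of api.
    static Set<String> unstableTypes(Class<?> api) {
        Set<String> found = new TreeSet<>();
        for (Method m : api.getMethods()) {
            collect(m.getGenericReturnType(), found);
            for (Type p : m.getGenericParameterTypes()) {
                collect(p, found);
            }
        }
        return found;
    }

    // Walks into generic arguments so that Optional<Telemetry> also reports Telemetry.
    private static void collect(Type type, Set<String> found) {
        if (type instanceof ParameterizedType) {
            ParameterizedType pt = (ParameterizedType) type;
            collect(pt.getRawType(), found);
            for (Type arg : pt.getActualTypeArguments()) {
                collect(arg, found);
            }
        } else if (type instanceof Class) {
            Class<?> c = (Class<?>) type;
            if (c.isAnnotationPresent(ExperimentalApi.class)) {
                found.add(c.getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(unstableTypes(TelemetryPlugin.class)); // [Telemetry, TelemetrySettings]
    }
}
```

The sketch inspects only one level of signatures. A complete check would recurse into each reported type, which is how a list like the one above grows from two interfaces to roughly twenty types.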
Questions for the Community
We believe the metrics APIs are ready for stabilization, but want to identify any gaps or concerns first.
API Completeness & Design
- Are there known gaps in the metrics API that plugins commonly encounter?
- Have you needed workarounds or hacks for metrics collection?
- Are there OpenTelemetry metrics features we should expose before stabilizing?
Performance & Scalability
- Are there performance bottlenecks at high metrics volumes?
- Does the current API design impose limitations on optimization?
- Do you have concerns about the plugin lifecycle or metrics registry initialization?
OpenTelemetry Compatibility
- Are there planned OTel upgrades that would require breaking changes to metrics APIs?
- Do the current abstractions adequately insulate plugins from OTel evolution?
Maintainer & Testing Capacity
- What's the maintainer capacity for supporting stable metrics APIs long-term?
- Are there edge cases in the implementation that need addressing first?
- Do we have adequate test coverage to guarantee metrics API stability?
Plugin Developer Needs
- Is metrics-only stabilization useful, or do you need tracing APIs stabilized simultaneously?
- Are there breaking changes you'd want to make before stabilization?
- Would you be willing to test pre-release stable APIs and provide feedback?
Potential Path Forward
Phase 1: Gap Assessment (2-4 weeks)
- Survey plugin developers (ml-commons, time-series-db, telemetry-otel) about metrics API pain points
- Document known gaps or missing functionality
- Identify planned OTel upgrades affecting metrics
- Assess maintainer capacity for long-term support
Phase 2: Metrics API Audit (3-4 weeks)
- Review metrics-related types for completeness (~8 types)
- Document compatibility guarantees and expected behavior
- Identify breaking changes needed before stabilization
- Set up japicmp compatibility tests for metrics APIs
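To make concrete what a compatibility test would look for, here is a small reflection-based sketch of the idea behind such a check. It is not japicmp and does not use its API or configuration. The three CounterV* interfaces are hypothetical versions invented for illustration.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class ApiDiffSketch {
    // Hypothetical versions of one interface, used only to exercise the check.
    interface CounterV1 { void add(double value); }
    interface CounterV2 { void add(double value); void add(double value, String tag); } // additive change
    interface CounterV3 { void add(long value); }                                       // signature change

    // Describes each public method as "returnType name[paramTypes]".
    static Set<String> signatures(Class<?> api) {
        Set<String> out = new TreeSet<>();
        for (Method m : api.getMethods()) {
            out.add(m.getReturnType().getName() + " " + m.getName() + Arrays.toString(m.getParameterTypes()));
        }
        return out;
    }

    // Signatures present in oldApi but missing from newApi: existing callers would break.
    static Set<String> removed(Class<?> oldApi, Class<?> newApi) {
        Set<String> missing = signatures(oldApi);
        missing.removeAll(signatures(newApi));
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(removed(CounterV1.class, CounterV2.class)); // []
        System.out.println(removed(CounterV1.class, CounterV3.class)); // [void add[double]]
    }
}
```

This detects removals only. For interfaces that plugins implement rather than call, such as TelemetryPlugin, adding an abstract method is also a breaking change for implementers, so a real audit would need to treat the two kinds of interface differently.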
Phase 3: RFC & Review (2-3 weeks)
- Publish formal RFC with findings from Phases 1-2
- Gather community feedback on metrics stabilization
- Get maintainer buy-in on long-term support commitment
Phase 4: Metrics Stabilization
- Promote metrics interfaces to @PublicApi
- Update documentation and migration guides
- Add comprehensive compatibility tests
- Monitor early adoption
Related component
Libraries
Describe alternatives you've considered
No response
Additional context
No response