Visibility Challenge - Long-Running Processes and Ungraceful Shutdowns #4646

mladjan-gadzic · 2025-09-01T12:06:52Z

mladjan-gadzic
Sep 1, 2025

TL;DR:

The Partial Trace Connector proposal addresses the loss of trace data when processes crash or are terminated unexpectedly. The connector would periodically export incomplete spans from long-running operations, so that observability and debugging remain possible even when processes do not finish properly. Key benefits include real-time monitoring and complete traces despite process failures. Open questions remain about implementation details such as export intervals and how span completion is detected.

Overview

This proposal introduces a new component called a "partial trace connector" to handle incomplete spans from processes that may be long-running, crash or terminate unexpectedly, ensuring trace completeness in distributed systems.

Current Problem

When a process crashes or terminates unexpectedly, spans that were in progress are lost, leading to incomplete traces and making it difficult to debug issues in distributed systems.

Visibility Challenges for Long-Running Processes

Long-running processes with active spans present significant visibility challenges in observability systems:

Lack of Real-Time Monitoring:

Operations that run for minutes, hours, or days provide no observability data until completion
System operators cannot see current progress, resource utilization, or performance metrics
Debugging active issues becomes impossible without visibility into ongoing operations

Performance Bottleneck Detection:

Slow operations only become apparent after completion, making proactive optimization difficult
Resource contention and deadlocks cannot be detected until spans end
SLA violations may occur without early warning systems

Operational Blindness:

Teams lose visibility into system state during critical long-running operations
Capacity planning becomes challenging without understanding active workload characteristics
Incident response is hampered by lack of real-time operational context

Business Impact:

Long-running batch jobs, data processing pipelines, and background tasks operate as "black boxes"
Progress tracking requires separate monitoring solutions outside the tracing ecosystem
User-facing operations with long processing times appear to be "stuck" from an observability perspective

Proposed Solution

Partial Trace Connector

Add a new component called partial trace connector that:

Keeps unfinished spans locally
Forwards completed spans to the rest of the pipeline
Completes partial traces when processes die
Periodically exports partial spans for long-running operations visibility
Ensures the rest of the exporters receive complete traces

How It Works

SDK Behavior: The SDK periodically exports spans, but instead of going directly to Jaeger (or other backends), spans first go through the partial trace connector
Connector Logic:
- No end time specified: Treat as partial trace, store locally
- End time specified: Treat as complete trace, remove any previous partial instance of that span/trace and forward the complete span
Long-Running Process Visibility: Periodically export partial spans (with special markers) to provide visibility into ongoing operations without waiting for completion
Process Failure Handling: When a process dies, the connector completes any remaining partial traces and pushes them through the pipeline
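
The connector logic above can be sketched as follows. This is a minimal illustration, not a Collector implementation: the `Span` type, the `PartialTraceConnector` class, and its `consume`/`pending` methods are hypothetical names chosen for this example, and the downstream pipeline is modeled as a plain callback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

SpanKey = Tuple[str, str]  # (trace_id, span_id)


@dataclass
class Span:
    """Hypothetical, simplified span record used only for this sketch."""
    trace_id: str
    span_id: str
    name: str
    start_time: float
    end_time: Optional[float] = None  # None means the span is still open


class PartialTraceConnector:
    """Holds open spans locally and forwards finished ones downstream."""

    def __init__(self, forward: Callable[[Span], None]) -> None:
        self._forward = forward
        self._partial: Dict[SpanKey, Span] = {}

    def consume(self, span: Span) -> None:
        key = (span.trace_id, span.span_id)
        if span.end_time is None:
            # No end time: treat as partial and keep the latest copy locally.
            self._partial[key] = span
        else:
            # End time set: drop any stored partial copy, then forward.
            self._partial.pop(key, None)
            self._forward(span)

    def pending(self) -> List[Span]:
        """Spans that have been seen but not yet completed."""
        return list(self._partial.values())
```

Keying the store on `(trace_id, span_id)` means that repeated exports of the same open span overwrite each other rather than accumulate.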

Specification Details

Span Classification

Partial Span: No end time specified, represents ongoing operation
Complete Span: End time specified, represents finished operation
Intermediate Span: Partial span exported for visibility purposes with special markers indicating it's still active
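
As a sketch of the three categories above, the classification could be expressed as a small function. The attribute name `partial.span.active` is an assumption made for this example; the proposal does not yet define what the "special markers" are.

```python
from enum import Enum
from typing import Mapping, Optional

# Hypothetical marker attribute; the proposal leaves the actual name open.
PARTIAL_MARKER = "partial.span.active"


class SpanState(Enum):
    PARTIAL = "partial"
    INTERMEDIATE = "intermediate"
    COMPLETE = "complete"


def classify(end_time: Optional[int], attributes: Mapping[str, object]) -> SpanState:
    """Classify a span from its end time and marker attribute."""
    if end_time is not None:
        # An end time always wins: the operation has finished.
        return SpanState.COMPLETE
    if attributes.get(PARTIAL_MARKER) is True:
        # Still open, and explicitly exported for visibility.
        return SpanState.INTERMEDIATE
    return SpanState.PARTIAL
```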

Connector Behavior

Store partial spans locally until completion
Forward complete spans to downstream components
Periodically export intermediate spans for long-running operation visibility
Handle span completion on process termination
Mark intermediate spans with appropriate attributes/flags to distinguish from completed spans
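
One way the "span completion on process termination" step might look is sketched below. It assumes the connector already knows the process is gone and knows when it last heard from it; how that is detected is left open by the proposal. The `Span` type, the `complete_on_termination` function, and the `partial.span.terminated` attribute are hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional

# Hypothetical marker showing the end time was synthesized by the connector.
TERMINATED_MARKER = "partial.span.terminated"


@dataclass(frozen=True)
class Span:
    span_id: str
    start_time: float
    end_time: Optional[float] = None
    attributes: Dict[str, object] = field(default_factory=dict)


def complete_on_termination(partials: List[Span], last_seen: float) -> List[Span]:
    """Close every open span, using the last time the process was seen alive."""
    completed = []
    for span in partials:
        if span.end_time is not None:
            completed.append(span)  # already finished, pass through unchanged
            continue
        attrs = dict(span.attributes)
        attrs[TERMINATED_MARKER] = True
        # Never produce an end time earlier than the start time.
        end = max(last_seen, span.start_time)
        completed.append(replace(span, end_time=end, attributes=attrs))
    return completed
```

Marking synthesized end times lets backends distinguish spans that actually finished from spans that were closed on behalf of a dead process.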

Open Questions

Heartbeat Interval Decision Point

Key Question: Should the decision of when a partial span is considered "complete" be made by:

The Connector - Based on configured timeouts/heartbeat intervals
The SDK - SDK decides when to mark spans as complete

Considerations:

Connector-based: More centralized control, consistent behavior across SDKs
SDK-based: More flexibility per application, better performance characteristics
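
To make the connector-based option concrete, here is a minimal sketch in which the connector treats a span as finished once several heartbeats in a row have been missed. The function name, the `missed_allowed` parameter, and its default of 3 are assumptions for illustration only.

```python
from typing import Dict, List


def find_expired(
    last_heartbeat: Dict[str, float],
    now: float,
    heartbeat_interval: float,
    missed_allowed: int = 3,
) -> List[str]:
    """Return span IDs whose last heartbeat is older than the allowed window."""
    deadline = heartbeat_interval * missed_allowed
    return sorted(
        span_id
        for span_id, seen in last_heartbeat.items()
        if now - seen > deadline
    )
```

With a 30-second interval and three allowed misses, a span is only considered finished after more than 90 seconds of silence. In the SDK-based option this logic would not exist in the connector at all.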

Long-Running Operation Visibility

Key Question: How should intermediate span exports for long-running operations be configured?

Options:

Fixed Interval: Export partial spans at regular intervals (e.g., every 30 seconds)
- Simple to implement and understand
- Consistent visibility across all operations
- May be inefficient for very short or very long operations
Progressive Backoff: Start with frequent exports, then reduce frequency for very long operations
- Initial rapid visibility for debugging urgent issues
- Reduced overhead for stable long-running processes
- Complexity in determining backoff schedule
Configurable per Span: Allow spans to specify their own export interval preferences
- Maximum flexibility for different operation types
- Applications can optimize based on business requirements
- Potential for configuration sprawl and inconsistency
Hybrid Approach: Combine fixed intervals with span-specific configuration
- Default behavior with override capabilities
- Balance between simplicity and flexibility
- Best of both worlds but more complex to implement
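
The four options above can be compared in a single small function. This is a sketch under assumed defaults (30-second base interval, doubling backoff, 10-minute cap); none of these values or parameter names come from the proposal.

```python
from typing import Optional


def next_export_delay(
    exports_so_far: int,
    base: float = 30.0,
    factor: float = 2.0,
    cap: float = 600.0,
    span_override: Optional[float] = None,
) -> float:
    """Seconds to wait before the next intermediate export of a span.

    factor == 1.0 gives a fixed interval; factor > 1.0 gives progressive
    backoff; span_override models a per-span setting that takes priority,
    which together with the defaults forms the hybrid approach.
    """
    if span_override is not None:
        return span_override
    return min(base * factor ** exports_so_far, cap)
```

For example, `next_export_delay(0)` is 30 seconds and `next_export_delay(1)` is 60 seconds, while passing `factor=1.0` keeps every delay at the base interval.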

Visibility Benefits:

Progress Monitoring: Track completion percentage and processing rates for batch operations
Resource Utilization: Monitor memory, CPU, and I/O patterns during active operations
Performance Profiling: Identify hotspots and optimization opportunities in real-time
Alerting Capability: Set up alerts for operations that exceed expected duration or resource usage
Dependency Tracking: Understand which services are actively waiting on long-running operations
Capacity Planning: Analyze concurrent operation patterns to optimize resource allocation

Implementation Considerations:

Performance Impact: Balance visibility needs with export overhead
Storage Costs: Frequent intermediate exports increase data volume
User Experience: Faster visibility vs. system efficiency trade-offs
Data Consistency: Ensure intermediate spans don't interfere with final trace analysis
Query Performance: Design intermediate span markers to enable efficient filtering in observability backends

Benefits

Pipeline Compatibility: Rest of the pipeline doesn't need to change - all components receive complete traces
Reliability: No lost spans due to process crashes
Debugging: Complete traces even when processes terminate unexpectedly
Real-time Visibility: Monitor long-running operations in progress without waiting for completion
Performance Monitoring: Detect slow operations early and identify bottlenecks in active processes
Operational Awareness: Understand current system state and active workflows
Transparency: Existing exporters work without modification

Implementation Considerations

Storage mechanism for partial spans
Cleanup strategies for orphaned spans
Performance impact of local storage
Configuration options for heartbeat intervals
Intermediate span export intervals and backoff strategies
Span attribute marking for intermediate vs. completed spans
Memory management for long-running partial spans
Deduplication logic to handle multiple intermediate exports of the same span
Backward compatibility with existing setups
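
As an illustration of the deduplication item above, the following sketch keeps one record per span: the completed copy if one has been seen, otherwise the most recent intermediate export. The `Export` record and its `observed_at` field are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Export(NamedTuple):
    trace_id: str
    span_id: str
    observed_at: float          # when the connector received this copy
    end_time: Optional[float]   # None for intermediate exports


def deduplicate(exports: Iterable[Export]) -> List[Export]:
    """Collapse repeated exports of the same span into a single record."""
    latest: Dict[Tuple[str, str], Export] = {}
    for export in exports:
        key = (export.trace_id, export.span_id)
        current = latest.get(key)
        if current is None:
            latest[key] = export
        elif current.end_time is not None:
            continue  # a completed copy is final; ignore late intermediates
        elif export.end_time is not None or export.observed_at >= current.observed_at:
            latest[key] = export
    return list(latest.values())
```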

Additional Context: Common Visibility Gaps

Process State

No indication of current operation being performed
Lack of progress indicators or completion estimates
Missing status reporting for multi-stage operations

Performance Metrics

No visibility into resource utilization
Missing timing information for different phases
Lack of throughput or rate measurements

Error Handling

Silent failures with no logging or alerts
Poor error context when failures do occur
No retry or recovery status information

Operational Awareness

Difficulty distinguishing between healthy delays and hung processes
No way to safely interrupt or restart stuck operations
Missing health checks or heartbeat mechanisms

Impact of Poor Visibility

The lack of visibility into long-running processes leads to:

Increased debugging time and operational overhead
Poor user experience due to uncertainty about process status
Difficulty in capacity planning and performance optimization
Reduced confidence in system reliability
Challenges in meeting SLA requirements

kamphaus · 2025-09-01T17:14:22Z

kamphaus
Sep 1, 2025

Hi @mladjan-gadzic,
Long-Running traces are of interest for CICD: open-telemetry/semantic-conventions#1648
We in the CICD SIG have been following previous discussions on the topic, but we have not yet had the capacity to delve deeper into it.
There are some existing issues here in the opentelemetry-specification repo related to long-running spans:

Did you have the opportunity to go over these previous issues?

Regarding when intermediate spans are exported, how would this interact with timeout detection in the collector/backend system?
Would there be a need to configure a maximum timeout in the SDK before which an intermediate span must be exported to prevent the timeout in the collector from being reached? Or would later intermediate spans reopen a timed out span?
Could this lead to a span flapping between timed out and ongoing?

Could a long running span send an intermediate span as soon as it is started?

Would it be possible to have durable long running spans in the SDK?
I.e., could a process recover from a crash and resume operation where it left off under the same long-running span?

1 reply

mladjan-gadzic Sep 2, 2025
Author

Did you have the opportunity to go over these previous issues?

I have. As a matter of fact, I've proposed two PRs, for .NET and Python, but it seems that the community is not interested in them.

Regarding when intermediate spans are exported, how would this interact with timeout detection in the collector/backend system?

It should not interfere with timeout detection in the collector/backend system, because the connector sits before those components and forwards finished spans like any other connector.

Would there be a need to configure a maximum timeout in the SDK before which an intermediate span must be exported to prevent the timeout in the collector from being reached? Or would later intermediate spans reopen a timed out span?

Ideally there would be a configurable heartbeat interval at which the span is exported.

Could this lead to a span flapping between timed out and ongoing?

It should not, because from the SDK's perspective an unfinished span is just being exported, and the collector/backend will not receive it because it is not finished, so it will be kept on the connector side.

Could a long running span send an intermediate span as soon as it is started?

This is going to be handled by the heartbeat interval. To make the SDK less chatty, a delay should be introduced so that short-lived spans do not produce multiple exports during their lifetime when not necessary.
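
A rough sketch of that delay, assuming an initial delay followed by a fixed heartbeat interval (the function and parameter names are made up for illustration):

```python
from typing import List


def heartbeat_times(duration: float, initial_delay: float, interval: float) -> List[float]:
    """Offsets (seconds after span start) at which an open span is exported."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    times = []
    t = initial_delay
    # Only export while the span is still running.
    while t < duration:
        times.append(t)
        t += interval
    return times
```

With a 5-second initial delay and a 30-second interval, a 2-second span produces no intermediate exports, while a 70-second span is exported at 5, 35, and 65 seconds.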

Would it be possible to have durable long running spans in the SDK?

This seems to be out of scope ATM.

pellared · 2025-09-01T17:45:59Z

pellared
Sep 1, 2025
Collaborator

Related issues that are supposed to tackle most of the described problems:

2 replies

mladjan-gadzic Sep 2, 2025
Author

Is there any movement on this? How can this be pushed forward?

pellared Sep 2, 2025
Collaborator

How can this be pushed forward?

I’d suggest focusing on open-telemetry/semantic-conventions#2133. A good next step could be:

Propose some concrete semantic conventions in the issue (e.g. define event names, attributes).
Raise it during the Semantic Conventions and/or Specification SIG meetings to collect early feedback.
Open a PR in open-telemetry/semantic-conventions with the proposed events semantic conventions for span lifecycle.

The key part is to build interest and get reviewers engaged in the discussion.

jsuereth · 2025-09-02T15:01:47Z

jsuereth
Sep 2, 2025
Maintainer

I've been toying with an SDK model to export data out-of-process as quickly as possible (http://github.com/jsuereth/otlp-mmap). As part of that, the question of how to get traces out-of-process quickly led me down a similar design: a (mostly hidden in the SDK) mode of writing span-start, span-update and span-end events to a circular buffer that would be used on the other side of the otlp-mmap implementation to reconstruct traces and export Spans via OTLP.

This would be more resilient to failure (events get put into process-shared memory quickly, so on process death the collector process can still read them), and may also improve hot-path performance (at the cost of some memory overhead between processes).

I do think some kind of underlying event model for Span may make sense. Whether or not we expose this via protocol or public API, I'm not sure yet, but I'm amenable to this direction.
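
To illustrate the idea of reconstructing spans from start/update/end events, here is a minimal sketch. It is not taken from otlp-mmap: the event tuple layout, the `ReconstructedSpan` type, and the `reconstruct` function are assumptions, and the shared-memory circular buffer is replaced by a plain iterable.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape: (kind, span_id, timestamp, attributes),
# where kind is "start", "update", or "end".
Event = Tuple[str, str, float, Dict[str, object]]


@dataclass
class ReconstructedSpan:
    span_id: str
    start_time: float
    end_time: Optional[float] = None
    attributes: Dict[str, object] = field(default_factory=dict)


def reconstruct(events: Iterable[Event]) -> Dict[str, ReconstructedSpan]:
    """Fold a stream of span events back into span records."""
    spans: Dict[str, ReconstructedSpan] = {}
    for kind, span_id, timestamp, attrs in events:
        if kind == "start":
            spans[span_id] = ReconstructedSpan(span_id, timestamp, attributes=dict(attrs))
        elif span_id in spans:
            span = spans[span_id]
            span.attributes.update(attrs)
            if kind == "end":
                span.end_time = timestamp
        # Events for unknown spans (e.g. start already overwritten) are skipped.
    return spans
```

Spans whose `end_time` is still `None` after the fold are the ones that were in progress when the writer stopped, which is the case that matters for ungraceful shutdown.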

2 replies

kamphaus Sep 2, 2025

If we want real-time monitoring of long-running traces in any backend, we would have to expose incomplete spans to those backends without waiting for intermediate processing to export the completed spans.

jsuereth Sep 2, 2025
Maintainer

@kamphaus exactly. For previous examples of this, I know OpenTelemetry C++ supported a "/tracez" endpoint that would let you scrape active running traces via a span processor that basically updated an in-memory storage of live spans, allowing at least a process-local capability of understanding long-running spans. It did not handle ungraceful shutdown though, and for ungraceful shutdown I think we may need to look into an alternative SDK design space.

When you bundle ungraceful shutdown and long-running processes, it makes the idea of exposing a raw-event model that can be reconstituted into traces more interesting.
