Visibility Challenge - Long-Running Processes and Ungraceful Shutdowns #4646
Replies: 3 comments 5 replies
-
Hi @mladjan-gadzic,
Did you have the opportunity to go over these previous issues? Regarding when intermediate spans are exported, how would this interact with timeout detection in the collector/backend system? Could a long-running span send an intermediate span as soon as it is started? Would it be possible to have durable long-running spans in the SDK?
-
Related issues that are supposed to tackle most of the described problems:
-
I've been toying with an SDK model to export data out-of-process as quickly as possible (http://github.com/jsuereth/otlp-mmap). As part of that, the question of how to get traces out-of-process quickly led me down a similar design path: a (mostly hidden in the SDK) mode of writing span-start, span-update and span-end events to a circular buffer, which the other side of the otlp-mmap implementation would use to reconstruct traces and export Spans via OTLP. This would be more resilient to failure (events get put into process-shared memory quickly, so on process death the collector process can still read them), and may also improve hot-path performance (at the cost of some memory overhead between processes). I do think some kind of underlying event model for Span may make sense. Whether we expose this via protocol or public API, I'm not sure yet, but I'm amenable to this direction.
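To make the event model above concrete, here is a minimal Python sketch of rebuilding spans from start/update/end events. It is not the otlp-mmap format or API: `SpanEvent`, `ReconstructedSpan`, and `reconstruct` are hypothetical names, and an in-process `collections.deque` with `maxlen` stands in for the process-shared circular buffer.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class SpanEvent:
    kind: str  # "start", "update", or "end" (hypothetical event kinds)
    span_id: str
    time_ns: int
    attributes: dict = field(default_factory=dict)


@dataclass
class ReconstructedSpan:
    span_id: str
    start_ns: int
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)

    @property
    def partial(self) -> bool:
        # A span with no end event yet is still in flight (or its process died).
        return self.end_ns is None


def reconstruct(events: Iterable[SpanEvent]) -> Dict[str, ReconstructedSpan]:
    """Replay events in order and rebuild the latest known state of each span."""
    spans: Dict[str, ReconstructedSpan] = {}
    for ev in events:
        if ev.kind == "start":
            spans[ev.span_id] = ReconstructedSpan(
                ev.span_id, ev.time_ns, attributes=dict(ev.attributes)
            )
            continue
        span = spans.get(ev.span_id)
        if span is None:
            # The start event was already overwritten in the ring; skip.
            continue
        span.attributes.update(ev.attributes)
        if ev.kind == "end":
            span.end_ns = ev.time_ns
    return spans


# Stand-in for the shared circular buffer: old events fall off the front.
ring = deque(maxlen=1024)
ring.append(SpanEvent("start", "a", 100, {"job": "import"}))
ring.append(SpanEvent("update", "a", 150, {"rows": 10}))
spans = reconstruct(ring)
```

In this sketch, span `a` has no end event, so the reader side still sees it as a partial span with its latest attributes, which is the property that would survive a crash of the writing process.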
-
TL;DR: The Partial Trace Connector solution proposal fixes the problem of losing trace data when processes crash or are terminated unexpectedly. The connector would periodically export incomplete spans from long-running operations, ensuring better observability and debugging capabilities even when processes don't finish properly. Key benefits include real-time monitoring and complete traces despite system failures, though questions remain about implementation details like timing intervals and completion detection methods.
Overview
This proposal introduces a new component called a "partial trace connector" to handle incomplete spans from processes that may be long-running, crash or terminate unexpectedly, ensuring trace completeness in distributed systems.
Current Problem
When a process crashes or terminates unexpectedly, spans that were in progress are lost, leading to incomplete traces and making it difficult to debug issues in distributed systems.
Visibility Challenges for Long-Running Processes
Long-running processes with active spans present significant visibility challenges in observability systems:
Lack of Real-Time Monitoring:
Performance Bottleneck Detection:
Operational Blindness:
Business Impact:
Proposed Solution
Partial Trace Connector
Add a new component called partial trace connector that:
How It Works
SDK Behavior: The SDK periodically exports spans, but instead of going directly to Jaeger (or other backends), spans first go through the partial trace connector
Connector Logic:
Long-Running Process Visibility: Periodically export partial spans (with special markers) to provide visibility into ongoing operations without waiting for completion
Process Failure Handling: When a process dies, the connector completes any remaining partial traces and pushes them through the pipeline
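The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed connector's implementation: `Span`, `PartialTraceConnector`, `heartbeat_timeout_s`, and the attribute keys `partial.incomplete` and `partial.timed_out` are all hypothetical, and a plain list stands in for the next pipeline stage.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    span_id: str
    start_s: float
    end_s: Optional[float] = None  # None means the span is still running
    attributes: dict = field(default_factory=dict)


class PartialTraceConnector:
    def __init__(self, heartbeat_timeout_s: float):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        # span_id -> (latest partial snapshot, time it was last seen)
        self._pending: Dict[str, Tuple[Span, float]] = {}
        # Stand-in for the downstream pipeline (e.g. an exporter).
        self.forwarded: List[Span] = []

    def consume(self, span: Span, now_s: float) -> None:
        """Handle one span snapshot exported by the SDK."""
        if span.end_s is not None:
            # Finished normally: stop tracking and pass it on unchanged.
            self._pending.pop(span.span_id, None)
            self.forwarded.append(span)
            return
        # Still running: remember it and forward a marked copy for visibility.
        self._pending[span.span_id] = (span, now_s)
        marked = {**span.attributes, "partial.incomplete": True}
        self.forwarded.append(replace(span, attributes=marked))

    def sweep(self, now_s: float) -> None:
        """Complete spans whose process stopped sending snapshots."""
        for span_id, (span, last_seen) in list(self._pending.items()):
            if now_s - last_seen >= self.heartbeat_timeout_s:
                attrs = {**span.attributes, "partial.timed_out": True}
                # Assumption: use the last heartbeat as the best-known end time.
                self.forwarded.append(
                    replace(span, end_s=last_seen, attributes=attrs)
                )
                del self._pending[span_id]
```

Calling `sweep` on a timer is one possible way to detect a dead process; whether that decision belongs in the connector or elsewhere is exactly the open question discussed below.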
Specification Details
Span Classification
Connector Behavior
Open Questions
Heartbeat Interval Decision Point
Key Question: Should the decision of when a partial span is considered "complete" be made by:
Considerations:
Long-Running Operation Visibility
Key Question: How should intermediate span exports for long-running operations be configured?
Options:
Fixed Interval: Export partial spans at regular intervals (e.g., every 30 seconds)
Progressive Backoff: Start with frequent exports, then reduce frequency for very long operations
Configurable per Span: Allow spans to specify their own export interval preferences
Hybrid Approach: Combine fixed intervals with span-specific configuration
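To compare these options, the following Python sketch generates the export schedule each one would produce. The function `export_offsets` and its parameters (`base_s`, `factor`, `max_s`, `span_override_s`) are hypothetical names chosen for illustration, not an existing SDK or collector setting.

```python
from itertools import islice
from typing import Iterator, Optional


def export_offsets(
    base_s: float,
    factor: float = 1.0,
    max_s: float = float("inf"),
    span_override_s: Optional[float] = None,
) -> Iterator[float]:
    """Yield seconds-since-span-start at which a partial span is exported.

    factor == 1.0 gives a fixed interval; factor > 1.0 gives progressive
    backoff capped at max_s; span_override_s lets a span pick its own fixed
    interval (the hybrid option).
    """
    elapsed = 0.0
    interval = span_override_s if span_override_s is not None else base_s
    while True:
        elapsed += interval
        yield elapsed
        if span_override_s is None:
            # Grow the gap between exports, but never beyond the cap.
            interval = min(interval * factor, max_s)


fixed = list(islice(export_offsets(30.0), 3))
backoff = list(islice(export_offsets(30.0, factor=2.0, max_s=120.0), 4))
per_span = list(islice(export_offsets(30.0, span_override_s=5.0), 3))
```

With these inputs, the fixed schedule exports at 30, 60 and 90 seconds, the backoff schedule at 30, 90, 210 and 330 seconds, and the per-span override at 5, 10 and 15 seconds. Backoff trades early detail for fewer exports on very long operations.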
Visibility Benefits:
Implementation Considerations:
Benefits
Implementation Considerations
Additional Context: Common Visibility Gaps
Process State
Performance Metrics
Error Handling
Operational Awareness
Impact of Poor Visibility
The lack of visibility in long-running processes leads to: