Description
Area(s)
area:cicd
Is your change request related to a problem? Please describe.
This is not strictly a semantic-convention discussion, but it's come about because of trying to produce traces for CI/CD and the sem-conv CI/CD WG is the closest place for a home for this discussion.
The problem I am trying to solve is producing spans of large durations, from a kubernetes controller which may restart.
I'm specifically trying to solve this in argo-workflows, and I gave a talk at argocon (slides) which gives a lot of context to this. Argo Workflows can be, and is, used as a CI/CD tool, but also for many other batch and machine learning scenarios where traces of the flow of a workflow would be useful. I'm treating the trace as the lifespan of a workflow, which may be several hours (some run for days). There are child spans representing the phases of the workflow and the individual nodes within the DAG that is being executed.
Argo Workflows
The workflow controller is best thought of here as a kubernetes operator, in this case running something not that far removed from a kubernetes job, just that the job is a DAG rather than a single pod. In the usual kubernetes controller manner, this controller is stateless, and can therefore restart at any time. Any necessary state is stored in the workflow Custom Resource. This is how the controller currently works.
I've used the Go SDK to add OTLP tracing to Argo Workflows, which works unless the controller restarts.
Ideas
Delayed span transmission.
My initial thought
My problem seemed to be that the OpenTelemetry SDK does not allow me to resume a Span. I'm using the Go SDK, but this is fundamental to the operation of spans. Once a Tracer is shut down it will end all spans, and that's the end of my trace and spans. The spans will get transmitted as ended.
I therefore supposed you could put the SDK or a set of traces/spans into a mode where the end of the span didn't get transmitted at shutdown, and instead the span could be stored outside of the SDK to be transmitted later by the resumed controller once it started up. The SDK could then also facilitate "storing a span to a file".
This is possibly implementable right now with enough hacking, but I haven't managed to find time for it.
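As a rough sketch of the "store the span outside the SDK" idea, assuming the controller keeps the span's identity and start time somewhere that survives a restart (the workflow custom resource would be the natural place): `PendingSpan`, `FinishedSpan`, `Save`, `Resume` and `Finish` are hypothetical names for illustration, not OpenTelemetry SDK APIs, and the actual export step is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// PendingSpan is a hypothetical record of a span that has started but not
// yet ended. It is NOT an OpenTelemetry SDK type.
type PendingSpan struct {
	TraceID string    `json:"traceId"`
	SpanID  string    `json:"spanId"`
	Name    string    `json:"name"`
	Start   time.Time `json:"start"`
}

// FinishedSpan is a PendingSpan plus its end time, ready to be exported once.
type FinishedSpan struct {
	PendingSpan
	End time.Time `json:"end"`
}

// Save serialises the record so it can live outside the process, for example
// in the workflow custom resource (where exactly is an assumption).
func Save(p PendingSpan) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// Resume is what a restarted controller would call to recover the record.
func Resume(stored string) (PendingSpan, error) {
	var p PendingSpan
	err := json.Unmarshal([]byte(stored), &p)
	return p, err
}

// Finish closes the span using the original start time.
func (p PendingSpan) Finish(end time.Time) FinishedSpan {
	return FinishedSpan{PendingSpan: p, End: end}
}

// Duration covers the whole span, including time the controller was down.
func (f FinishedSpan) Duration() time.Duration {
	return f.End.Sub(f.Start)
}

func main() {
	start := time.Date(2024, 1, 1, 9, 0, 0, 0, time.UTC)
	stored, err := Save(PendingSpan{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Name:    "workflow",
		Start:   start,
	})
	if err != nil {
		panic(err)
	}

	// ... the controller restarts here and only `stored` survives ...

	resumed, err := Resume(stored)
	if err != nil {
		panic(err)
	}
	done := resumed.Finish(start.Add(3 * time.Hour))
	fmt.Println(done.Name, done.SpanID, done.Duration())
}
```

The point of the sketch is that nothing is transmitted until the span has really ended, so the backend only ever sees one complete span with the original IDs and start time.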
Links
I could use span links and create multiple traces for a single workflow. This would crudely work, but I'd argue against it unless the presentation layer can effectively hide this from users.
The current UI for argo-workflows already has a basic view showing a timeline of a workflow for
This won't display or concern the user with controller restarts.
The target audience for these spans is probably somewhat less technical than the existing audience for HTTP microservice tracing. Having to explain why their trace is in multiple parts and that they'll just have to deal with it isn't ideal. Span Metrics are a valuable tool here, and with a workflow split across several linked traces they'll be much more complicated, or in some cases impossible.
It may be that the presentation layer can hide this - I have limited exposure to the variety of these in the market and how they deal with links.
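To make concrete what "hiding this" would ask of a presentation layer, here is a minimal sketch, assuming each controller lifetime produces one trace whose root span links back to the previous trace. `Segment` and `Stitch` are hypothetical names; no real backend or UI is known to work this way.

```go
package main

import (
	"errors"
	"fmt"
)

// Segment is a hypothetical summary of one trace produced during a single
// controller lifetime. PrevTraceID stands in for a span link to the trace
// from before the restart; it is empty for the first segment.
type Segment struct {
	TraceID       string
	PrevTraceID   string
	StartUnixNano int64
	EndUnixNano   int64
}

// Stitch follows the links backwards from the last segment and returns the
// trace IDs in chronological order plus the overall start and end, which is
// what a user would expect to see as "the workflow".
func Stitch(byTrace map[string]Segment, last string) ([]string, int64, int64, error) {
	var chain []string
	var start, end int64
	seen := map[string]bool{}
	for id := last; id != ""; {
		if seen[id] {
			return nil, 0, 0, fmt.Errorf("link cycle at trace %q", id)
		}
		seen[id] = true
		seg, ok := byTrace[id]
		if !ok {
			return nil, 0, 0, fmt.Errorf("linked trace %q not found", id)
		}
		if len(chain) == 0 {
			end = seg.EndUnixNano // the newest segment ends the workflow
		}
		start = seg.StartUnixNano // the oldest segment seen so far
		chain = append([]string{id}, chain...)
		id = seg.PrevTraceID
	}
	if len(chain) == 0 {
		return nil, 0, 0, errors.New("no trace given")
	}
	return chain, start, end, nil
}

func main() {
	segments := map[string]Segment{
		"trace-a": {TraceID: "trace-a", StartUnixNano: 100, EndUnixNano: 200},
		"trace-b": {TraceID: "trace-b", PrevTraceID: "trace-a", StartUnixNano: 250, EndUnixNano: 900},
	}
	ids, start, end, err := Stitch(segments, "trace-b")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids, start, end)
}
```

Even in this toy form the backend has to fetch every linked trace before it can report a single duration, which is the kind of extra work that makes span metrics harder.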
Events
We could emit events for span start and end, and make it the collector's problem to correlate these and create spans. This is how GitHub Actions tracing works - thanks to @adrielp for telling me about this.
For this to work something has to retain state - either the "End Span" event contains enough to construct the whole span (e.g. Start time) or the collector has to correlate a start event and end event, so the start event needs storage. The collector storing state would be wrong - in a cloud native environment I'd not even expect the same collector to receive the end as received the start. I'm trying not to be opinionated on how you configure your collector, but maybe we have to be.
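A minimal sketch of the first option, where the end event carries everything needed and the collector stays stateless. `SpanEndEvent`, `Span` and `SpanFromEndEvent` are hypothetical names, and the sketch assumes the emitter can recover the start time from state it already keeps, such as the workflow custom resource. The timestamps are Unix nanoseconds, as in OTLP.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// SpanEndEvent is a hypothetical "End Span" event that is self-contained:
// it repeats the start time so no start event has to be stored anywhere.
type SpanEndEvent struct {
	TraceID       string
	SpanID        string
	Name          string
	StartUnixNano int64
	EndUnixNano   int64
}

// Span is the complete span a collector would build from the event.
type Span struct {
	TraceID       string
	SpanID        string
	Name          string
	StartUnixNano int64
	EndUnixNano   int64
}

// Duration is the full span length derived from the two timestamps.
func (s Span) Duration() time.Duration {
	return time.Duration(s.EndUnixNano - s.StartUnixNano)
}

// SpanFromEndEvent builds a span from the end event alone. Any collector
// instance can do this, because it needs no memory of a start event.
func SpanFromEndEvent(e SpanEndEvent) (Span, error) {
	if e.StartUnixNano == 0 {
		return Span{}, errors.New("end event has no start time; a stored start event would be needed")
	}
	if e.EndUnixNano < e.StartUnixNano {
		return Span{}, errors.New("end time is before start time")
	}
	return Span{
		TraceID:       e.TraceID,
		SpanID:        e.SpanID,
		Name:          e.Name,
		StartUnixNano: e.StartUnixNano,
		EndUnixNano:   e.EndUnixNano,
	}, nil
}

func main() {
	start := time.Date(2024, 1, 1, 9, 0, 0, 0, time.UTC)
	span, err := SpanFromEndEvent(SpanEndEvent{
		TraceID:       "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:        "00f067aa0ba902b7",
		Name:          "workflow",
		StartUnixNano: start.UnixNano(),
		EndUnixNano:   start.Add(26 * time.Hour).UnixNano(),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(span.Name, span.Duration())
}
```

The state has not gone away in this version; it has moved to the emitter, which must still know the start time after a restart.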
Protocol changes
A different approach to delaying span transmission, but with a similar goal.
We could change the protocol to allow span end to be "ended due to shutdown", and then allow a future span end for the same span_id to end it properly. This probably just pushes the problem onwards to the collector or the storage backend to do correlation in a similar way to events, so isn't an improvement.
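To illustrate the correlation this would push downstream, here is a sketch of what a storage backend might have to do. `EndReason`, `SpanEnd` and `Store` are hypothetical; nothing like an "ended due to shutdown" marker exists in OTLP today, and the precedence rules below are one possible choice.

```go
package main

import "fmt"

// EndReason is a hypothetical marker for why a span end was sent.
type EndReason int

const (
	EndedNormally   EndReason = iota // the span really finished
	EndedByShutdown                  // provisional end, sent because the process stopped
)

// SpanEnd is one span-end message as the backend might receive it.
type SpanEnd struct {
	SpanID      string
	EndUnixNano int64
	Reason      EndReason
}

// Store keeps a single end per span_id, which is the correlation the
// backend would have to perform.
type Store struct {
	ends map[string]SpanEnd
}

func NewStore() *Store {
	return &Store{ends: map[string]SpanEnd{}}
}

// Record applies the assumed rules: a normal end is final, a normal end
// replaces any provisional end, and among provisional ends the latest wins.
func (s *Store) Record(e SpanEnd) {
	prev, ok := s.ends[e.SpanID]
	switch {
	case !ok:
		s.ends[e.SpanID] = e
	case prev.Reason == EndedNormally:
		// A proper end is final; ignore anything that arrives later.
	case e.Reason == EndedNormally || e.EndUnixNano > prev.EndUnixNano:
		s.ends[e.SpanID] = e
	}
}

// End returns the currently known end for a span, if any.
func (s *Store) End(spanID string) (SpanEnd, bool) {
	e, ok := s.ends[spanID]
	return e, ok
}

func main() {
	store := NewStore()
	store.Record(SpanEnd{SpanID: "00f067aa0ba902b7", EndUnixNano: 100, Reason: EndedByShutdown})
	store.Record(SpanEnd{SpanID: "00f067aa0ba902b7", EndUnixNano: 900, Reason: EndedNormally})
	end, _ := store.End("00f067aa0ba902b7")
	fmt.Println(end.EndUnixNano, end.Reason == EndedNormally)
}
```

Until the normal end arrives, any duration the backend reports for the span is provisional, so queries and span metrics would need to know about the marker too.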
Describe the solution you'd like
I don't work in the telemetry business, and so I'm sure I'm missing other prior art.
I'm open to any mechanism to solve this, and would prefer we came up with a common pattern for this and other longer span/restartable binary problems. I believe these problems will also be there in some of the android/mobile and browser implementations, to which I have little visibility or understanding. Some of these proposed solutions may not work there, so coming up with a common solution which works for my use case and these would be ideal.
I hope this sparks a useful discussion.