WIP - Enable OpenTelemetry integration with existing span and hook architecture #2755
This is a first draft of enabling OpenTelemetry integration with inspect spans, using inspect hooks to propagate those to model calls (or across other system boundaries). OpenTelemetry is the dominant standard for distributed tracing - see the screenshots below for examples of how it can be useful. We actually use AWS X-Ray within AISI, but the two are compatible and otel is an open standard.
This creates one otel 'trace' per sample, and each inspect span within that becomes an otel 'span'. Looking at the output in Jaeger, you get something like the following:
These are all traces.
Clicking into a trace gives you more detailed info:
The solver which produced this has a few custom inspect spans which you can see above translated into otel spans:
Trace visualisation is also useful for surfacing errors:
I've validated that this does successfully insert otel headers in httpx requests. These look like `traceparent=00-6e94d6fe73078499fe0ab315cb7ea7d0-95d43b4b053d0c23-01`, which breaks down (per the W3C Trace Context format) as:

- `00` - the version
- `6e94d6fe73078499fe0ab315cb7ea7d0` - the 128-bit trace ID, shared by every span in the trace
- `95d43b4b053d0c23` - the 64-bit ID of the parent span that made the call
- `01` - the trace flags (here: sampled)

This allows propagating trace info across system boundaries, so, assuming I'm collecting the emitted data from both systems, I can view correlated activity in a single place.
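For anyone unfamiliar with how that header ends up on the request, here's a minimal sketch of the standard propagation mechanism (not this PR's actual code; the span name and URL are just illustrative, and it assumes a tracer provider has been configured):

```python
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("example")

# Inside an active span, the configured propagator serialises the current
# trace/span IDs into a `traceparent` header; the receiving system can pick
# this up and continue the same trace.
with tracer.start_as_current_span("model-call"):
    headers: dict[str, str] = {}
    inject(headers)  # adds e.g. traceparent=00-<trace-id>-<span-id>-01
    httpx.get("https://example.com/v1/models", headers=headers)
```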
I'm quite excited about this because it allows linking span-level inspect info with network-level activity from a platform point of view. For example, it will make it far easier to separate and group network activity from different agents. In the future it would be interesting to think about propagating this trace info into sandboxes.
I'm sure there are plenty of things I've missed with this implementation, but please let me know if you think this is a direction worth pursuing. All feedback very welcome.
Setup Notes
A common pattern with tracing is to have a local process (or sidecar, etc.) acting as a collector, which ships trace data elsewhere. For example, to create the above I ran a docker compose along these lines:
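A minimal sketch, assuming the Jaeger all-in-one image with its OTLP receiver enabled (image tag and port mappings are just what I'd use for local testing):

```yaml
# docker-compose.yml - single-node Jaeger acting as both collector and UI
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # accept OTLP directly from the SDK
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
```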
and configured inspect to export to that collector before running the eval.
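The exporter side of that configuration can be done with the standard OpenTelemetry SDK environment variables - a sketch only; whatever inspect-specific setting this PR adds to turn recording on isn't shown here:

```bash
# Point the OTLP exporter at the local collector and name the service.
export OTEL_SERVICE_NAME=inspect
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318   # OTLP over HTTP (4317 for gRPC)
```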
These requests are on localhost (very normal for a trace collector), so they should be fast. You can see we're also using `BatchSpanProcessor`, so I think the performance implications of this should be minimal.

'Recording' (sending to a collector) is actually independent of trace ID generation and propagation; I fully expect most users will never care about this and will not enable recording. However, within AISI (and, I think, probably other places with model proxies and centralised platforms like METR hawk) it would still be super valuable to have a trace ID injected into our platform systems (where we are recording), which can be correlated with eval logs.
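For reference (not this PR's code), the wiring that `BatchSpanProcessor` refers to is the standard SDK setup below - spans are queued in-process and exported from a background thread, so the export to the collector stays off the request path:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# The exporter honours OTEL_EXPORTER_OTLP_ENDPOINT if set; BatchSpanProcessor
# buffers finished spans and flushes them in batches on its own worker thread.
provider = TracerProvider(resource=Resource.create({"service.name": "inspect"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
```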