@tcatling tcatling commented Nov 11, 2025

This is a first draft of enabling OpenTelemetry integration with Inspect spans, using Inspect hooks to propagate those to model calls (or across other system boundaries). OpenTelemetry is the dominant standard for distributed tracing; see the screenshots below for examples of how it can be useful. We actually use AWS X-Ray within AISI, but the two are compatible and OTel is an open standard.

This creates one OTel 'trace' per sample, and each Inspect span within that becomes an OTel 'span'. Looking at the output in Jaeger, you get something like the following:

[screenshot: Jaeger UI listing one trace per sample]

These are all traces.

Clicking into a trace gives you more detailed info:

[screenshot: Jaeger trace detail showing nested spans for a single sample]

The solver that produced this has a few custom Inspect spans, which you can see translated into OTel spans above:

    from inspect_ai.solver import Generate, TaskState, solver
    from inspect_ai.util import span

    # Define a custom solver that creates nested spans
    @solver
    def custom_solver():
        async def solve(state: TaskState, generate: Generate) -> TaskState:
            # Create a custom span to demonstrate nesting
            async with span("custom_processing", type="processing"):
                # Add some metadata
                state.metadata["processed"] = True

                # Create another nested span
                async with span("validation", type="validation"):
                    state.metadata["validated"] = True

            return state

        return solve

Trace visualisation is also useful for surfacing errors:

[screenshot: Jaeger trace detail with an exception recorded on a span]

From:

    # Define a solver that intentionally raises an exception
    @solver
    def error_solver():
        async def solve(state: TaskState, generate: Generate) -> TaskState:
            async with span("before_error", type="processing"):
                state.metadata["before"] = True

            # This span will have an exception recorded
            async with span("error_span", type="processing"):
                state.metadata["about_to_fail"] = True
                raise ValueError("Intentional test error!")

            # This won't be reached
            async with span("after_error", type="processing"):
                state.metadata["after"] = True

            return state

        return solve

I've validated that this does successfully inject OTel headers into httpx requests. These look like traceparent=00-6e94d6fe73078499fe0ab315cb7ea7d0-95d43b4b053d0c23-01, which breaks down as:

  00-6e94d6fe73078499fe0ab315cb7ea7d0-95d43b4b053d0c23-01
  │  │                                │                │
  │  │                                │                └─ flags: 01 (sampled/recorded)
  │  │                                └─ parent_span_id: 95d43b4b053d0c23
  │  └─ trace_id: 6e94d6fe73078499fe0ab315cb7ea7d0
  └─ version: 00 (W3C Trace Context v1.0)
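For illustration, here is a minimal sketch (not this PR's implementation) of how a traceparent header gets injected into an outgoing httpx request using the standard OpenTelemetry propagation API; the function name, tracer name, and URL are placeholders:

    import httpx
    from opentelemetry import trace
    from opentelemetry.propagate import inject

    tracer = trace.get_tracer("example")

    async def call_model(client: httpx.AsyncClient) -> httpx.Response:
        # Whichever span is current when inject() runs becomes the
        # parent_span_id in the emitted traceparent header.
        with tracer.start_as_current_span("model_call"):
            headers: dict[str, str] = {}
            inject(headers)  # writes 'traceparent' (and 'tracestate' if set)
            return await client.post("https://example.com/v1/chat", headers=headers)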

This allows trace info to be propagated across system boundaries, so, assuming I'm collecting the emitted data from both systems, I can view correlated activity in a single place.

I'm quite excited about this because it links span-level Inspect info with network-level activity from a platform point of view. For example, it will make it far easier to separate and group network activity from different agents. In the future it would be interesting to think about propagating this trace info into sandboxes.

I'm sure there are loads of things I've missed in this implementation, but please let me know if you think this is a direction worth pursuing. All feedback very welcome.

Setup Notes

A common pattern with tracing is to have a local process (or sidecar, etc.) acting as a collector, which ships trace data elsewhere. For example, to produce the screenshots above, I ran the following Docker Compose file:

version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    ports:
      - "16686:16686"  # Jaeger UI
      - "14250:14250"  # Jaeger gRPC
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    restart: unless-stopped

  otel-collector:
    image: otel/opentelemetry-collector:latest
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "13133:13133" # health_check extension
    depends_on:
      - jaeger
    restart: unless-stopped
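The compose file mounts an otel-collector-config.yml (not shown above). A minimal sketch of what that config could look like, assuming a recent collector release that forwards traces to Jaeger over OTLP, is:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]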

I then configured Inspect (before running the eval) via:

    configure_opentelemetry(
        enabled=True,
        service_name="inspect_ai_test",
        exporter="otlp",
        endpoint="http://localhost:4317",
    )


These requests go to localhost (very normal for a trace collector), so they should be fast. You can see we're also using a BatchSpanProcessor, so I think the performance implications of this should be minimal.
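For reference, the configuration above corresponds roughly to the following standard OpenTelemetry SDK setup (a sketch, not the actual code in this PR):

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Export spans in batches over OTLP/gRPC to the local collector.
    provider = TracerProvider(resource=Resource.create({"service.name": "inspect_ai_test"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)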

'Recording' (sending to a collector) is actually independent of trace ID generation and propagation; I fully expect most users will never care about this and will not enable recording. However, within AISI (and I think probably other places with model proxies and centralised platforms like METR hawk) it would still be super valuable to have a trace ID injected into our platform systems (where we are recording) which can be correlated with eval logs.
