Tracing with OpenTelemetry #598
Conversation
🐋 This PR was built and pushed to the following Docker images:
Image Names: ghcr.io/graphql-hive/router
Image Tags: pr-598, sha-ecaa441
Docker metadata:
{
  "buildx.build.ref": "builder-da6e5150-edcb-4fd3-a096-a2e287411dc4/builder-da6e5150-edcb-4fd3-a096-a2e287411dc40/j98wsrj2j2ovzux3mmp3mx360",
  "containerimage.descriptor": {
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "digest": "sha256:addd9dd42c3df1e28e933374b8bc928c88c0777acc6b0c8013a8c1db06f30479",
    "size": 1609
  },
  "containerimage.digest": "sha256:addd9dd42c3df1e28e933374b8bc928c88c0777acc6b0c8013a8c1db06f30479",
  "image.name": "ghcr.io/graphql-hive/router:pr-598,ghcr.io/graphql-hive/router:sha-ecaa441"
}
Use `CARGO_PKG_VERSION` for the instrumentation scope version instead of a hardcoded "v0". Also, rename the scope to "graphql-hive.router" for better clarity.
This change adds the `#[inline]` attribute to several functions within the tracing module. This is a micro-optimization that may improve performance by allowing the compiler to reduce function call overhead.
/gemini review
Code Review
This is an impressive and comprehensive pull request that introduces OpenTelemetry-based tracing to the router. The changes are well-structured, with clear separation of concerns in the new telemetry module. The addition of custom spans for GraphQL pipeline stages, HTTP requests, and the thoughtful implementation of context propagation and semantic convention compatibility are excellent.
My main feedback is a high-severity issue regarding the use of info_span! for instrumentation on hot paths, which goes against the repository's performance-first style guide. I've left a detailed comment with a suggestion to address this. Overall, this is a fantastic contribution that will significantly improve the observability of the router.
This pull request adds comprehensive OpenTelemetry-based tracing.
It adds a set of GraphQL-specific spans and connects the Router to Hive Console's Tracing, as well as to other OpenTelemetry-compatible backends (via OTLP).
Goals of this Pull Request:
Things out of scope for this Pull Request:
Code changes
Almost all code responsible for tracing and setting up OpenTelemetry is in the internal crate's telemetry directory. There you will find span definitions, exporter setup, the Hive Console Tracing integration, and things like that.
Every span created by Hive Router has a hive.kind attribute that describes what kind of action the span represents. This is useful for filtering spans without relying on span names, which can change over time.
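As a minimal illustration (the span name and value here are illustrative, not taken from the PR):

```rust
use tracing::{info_span, Span};

// Every router span carries hive.kind, so tools can filter by action kind
// even if span names change over time.
fn parse_span() -> Span {
    info_span!("graphql.parse", hive.kind = "graphql.parse")
}
```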
Performance costs?
To minimize the performance impact of tracing, spans are only created when tracing is enabled. All the computation needed to create attributes is likewise only done when tracing is enabled. I spent some time making sure the overhead is zero or near-zero when tracing is disabled.
With tracing enabled, the overhead is still as low as I could make it: avoiding allocations, using static strings for attribute names, and avoiding complex computations when creating spans and attributes. There's of course some cost (TODO: measure it once again and get the numbers).
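A sketch of the disabled path described above (the flag and function names are hypothetical):

```rust
use tracing::{info_span, Span};

// When tracing is disabled, return a no-op span and skip all attribute
// computation, so the hot path pays (near-)zero cost.
fn maybe_parse_span(tracing_enabled: bool) -> Span {
    if !tracing_enabled {
        return Span::none(); // entering or recording on this span is a no-op
    }
    info_span!("graphql.parse", hive.kind = "graphql.parse")
}
```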
How are spans created and used?
Spans are created using tracing::info_span!. Every "phase" has its own span and a struct that helps with creating and managing it. This is intentional: it keeps all span-related logic in one place, makes it easy to understand what attributes are set on each span and what methods are available to manipulate it, and avoids polluting the rest of the codebase with span-related logic. Imagine looking for the parsing span and its attributes if they were not centralized in one place.
For example, the parsing phase has a GraphQLParseSpan struct that creates a span and has a record_cache_hit method to record whether the parsing was a cache hit or miss (see the sketch below).
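A minimal sketch of that pattern, with the struct and method names from the text and an illustrative attribute set:

```rust
use tracing::{field::Empty, info_span, Span};

// Per-phase span struct: all span-related logic for parsing lives here.
pub struct GraphQLParseSpan {
    span: Span,
}

impl GraphQLParseSpan {
    pub fn new() -> Self {
        // All attribute names are declared up front as static strings;
        // values marked Empty are filled in later.
        let span = info_span!(
            "graphql.parse",
            hive.kind = "graphql.parse",
            cache.hit = Empty,
        );
        Self { span }
    }

    pub fn record_cache_hit(&self, hit: bool) {
        self.span.record("cache.hit", hit);
    }
}
```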
The tracing crate "requires" attribute names to be static strings, predefined at span creation. This is for performance reasons. That's why all the standard attributes are there from the beginning, and we later fill them in with data, like the previously mentioned record_cache_hit method that fills in the cache.hit attribute.
You will notice that attribute names are defined as constants in the telemetry::attributes module, but not used when the spans are created. The reason is that the tracing crate requires attribute names to be static string literals, so we can't use variables there. I stored them as constants to avoid typos and to have a single source of truth for attribute names when we need to refer to them later (for example, when filling them in with data, or when transforming them for various reasons). This gives us some kind of type safety.
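A sketch of that split (these constant names are hypothetical; the real ones live in telemetry::attributes):

```rust
// Single source of truth for attribute names, used everywhere except span
// creation itself (where the tracing macros require literals).
pub mod attributes {
    pub const HIVE_KIND: &str = "hive.kind";
    pub const CACHE_HIT: &str = "cache.hit";
}
```

Later recording can use the constants, since Span::record accepts the field name as a &str, e.g. span.record(attributes::CACHE_HIT, true).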
To fill the gap between the static strings used at span creation and the constants defined in the attributes module, I added a bunch of tests that verify that all attributes defined in the attributes module are actually used when creating spans. This way we can be sure that there are no typos and that all attributes are defined and match.
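One plausible shape for such a test, building on the sketches above (assuming tracing-subscriber as a dev-dependency; the real tests may differ):

```rust
#[test]
fn parse_span_declares_all_parse_attributes() {
    // Install a subscriber so the span is enabled and can be introspected.
    let subscriber = tracing_subscriber::fmt().finish();
    tracing::subscriber::with_default(subscriber, || {
        let span = tracing::info_span!("graphql.parse", cache.hit = tracing::field::Empty);
        // Span::field returns None for names not declared at creation, so a
        // typo in either the literal or the constant fails the test.
        assert!(span.field(attributes::CACHE_HIT).is_some());
    });
}
```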
Spans are started and stopped automatically, either via the guard pattern (calling span.enter() returns a guard that exits the span when dropped) or by instrumenting futures with the span (using the instrument method); both are sketched below.
Spans are created by calling the ::new() method on the span struct, which returns the struct instance. The span itself is stored inside the struct.
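Both lifecycles in a small sketch (function names and bodies are placeholders):

```rust
use tracing::{info_span, Instrument};

fn parse_document() {
    let span = info_span!("graphql.parse");
    // Guard pattern: the span is entered now and exited when `_guard` drops.
    let _guard = span.enter();
    // ... synchronous parsing work runs inside the span ...
}

async fn execute_plan() {
    let span = info_span!("graphql.execute");
    async {
        // ... asynchronous execution work ...
    }
    // The future is polled inside the span for its entire lifetime.
    .instrument(span)
    .await;
}
```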
What exporters are supported?
Currently, two exporters are supported:
The OTLP exporter can be configured to send traces to any OTLP-compatible backend, like Jaeger, Zipkin, Honeycomb, New Relic, Datadog, and many others.
I tried to make everything configurable, so you can enable or disable specific exporters and set endpoints, headers, and other options via environment variables or expressions.
Even the collection of spans can be configured: the batching options can be set, so users can tune the performance of tracing to their needs.
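For reference, this is roughly what a plain OTLP batch pipeline looks like with the opentelemetry-otlp crate (0.17-era API; the router wires this up from its own configuration instead):

```rust
use opentelemetry_otlp::WithExportConfig;

fn init_otlp_tracing() {
    let _tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                // Any OTLP-compatible backend: Jaeger, Honeycomb, Datadog, ...
                .with_endpoint("http://localhost:4317"),
        )
        // Batch export keeps span delivery off the request hot path.
        .install_batch(opentelemetry_sdk::runtime::Tokio)
        .expect("failed to install OTLP pipeline");
}
```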
Semantic conventions for HTTP
There's a standard for HTTP semantic conventions defined by OpenTelemetry for attributes, but not every tool supports it yet.
In Hive Router, we support both deprecated attributes and the stable ones, so that we can interoperate with more tools.
This is also configurable, so users can choose which set of attributes they want to use. We have three modes (a sketch follows the list):
- spec_compliant: use only stable attributes
- deprecated: use only deprecated attributes
- spec_and_deprecated: use both sets of attributes
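A sketch of how the switch could look (the enum and function are hypothetical; http.request.method is the stable OpenTelemetry attribute, http.method the deprecated one):

```rust
// Hypothetical illustration of the three compatibility modes.
enum HttpSemconv {
    SpecCompliant,
    Deprecated,
    SpecAndDeprecated,
}

// Assumes both fields were declared (as Empty) when the span was created.
fn record_http_method(span: &tracing::Span, mode: &HttpSemconv, method: &str) {
    match mode {
        HttpSemconv::SpecCompliant => {
            span.record("http.request.method", method);
        }
        HttpSemconv::Deprecated => {
            span.record("http.method", method);
        }
        HttpSemconv::SpecAndDeprecated => {
            span.record("http.request.method", method);
            span.record("http.method", method);
        }
    }
}
```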
How are deduplicated requests traced?
The http.inflight span is created for each outgoing HTTP request to subgraphs when request deduplication is enabled. Multiple concurrent identical requests will create separate http.inflight spans in different traces, but only the leader executes the HTTP call.
Leader role (hive.inflight.role="leader"):
- creates http.client as a child span to execute the actual HTTP request

Joiner role (hive.inflight.role="joiner"):
- does not create http.client (that span exists in the leader's trace)
- carries hive.inflight.key (the same fingerprint value)

Cross-trace correlation: all inflight spans (leader + joiners) for the same deduplicated request share the same hive.inflight.key value, allowing observability tools to correlate them across different traces. In the future we may introduce a span link.
The leader trace (first/winning request) and the joiner trace (deduplicated request) are sketched below.
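A sketch of how the two roles could be recorded (the function shape is hypothetical; the attribute names are the ones described above):

```rust
use tracing::{field::Empty, info_span, Span};

// Leader trace: http.inflight (role="leader") gets an http.client child that
// performs the request. Joiner trace: http.inflight (role="joiner") is a
// leaf that only carries the shared key.
fn inflight_span(is_leader: bool, fingerprint: &str) -> Span {
    let span = info_span!(
        "http.inflight",
        hive.kind = "http.inflight",
        hive.inflight.role = Empty,
        hive.inflight.key = Empty,
    );
    span.record(
        "hive.inflight.role",
        if is_leader { "leader" } else { "joiner" },
    );
    // The same fingerprint is recorded for the leader and all joiners,
    // which is what lets tools correlate the separate traces.
    span.record("hive.inflight.key", fingerprint);
    span
}
```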
Changes in Hive SDK
Hive Console's Tracing requires the operation hash to be sent as an attribute, and it must be the same hash as in Hive SDK (Usage Reporting). I made changes to Hive SDK, exposed a function to compute the operation hash, and used it in Hive Router to set the hive.operation.hash attribute on the graphql.operation span.
I noticed that Hive SDK relies on graphql_parser to minify and normalize the query before hashing it, but this is super inefficient (we do Document-to-String conversions, etc.). I hope to reimplement it more efficiently in the near future, but that requires forking and owning the graphql_parser crate.