feat(tracing): support 128-bit trace ids [does not include support for b3 and w3c] (#5326)
## Description
Currently, Datadog spans are correlated using 64-bit trace ids. However,
as industry standards have sprung up around distributed tracing
([OpenTracing](https://github.com/opentracing/specification/blob/master/rfc/trace_identifiers.md#trace-context-http-headers),
[OpenCensus](https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Span.md#traceid),
and now
[OpenTelemetry](https://opentelemetry.io/docs/reference/specification/trace/api/#spancontext))
the accepted standard length for trace IDs has settled at 128 bits. This
PR introduces a configuration option that enables the generation and
propagation of 128-bit trace ids
(`DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED`).
With this change, 128-bit trace ids are opt-in. They will become the
default in a future PR.
## Components
### Format
Current format: trace ids are 64-bit integers with the following binary
representation: `<64 random bits>`
Proposed format: trace ids are 128bit integers with the following binary
representation: `<32-bit unix seconds><32 bits of zero><64 random bits>`
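The proposed layout can be illustrated with a minimal sketch. The helper name `generate_128bit_trace_id` and its `now` parameter are hypothetical and exist only for this example; this is not the tracer's actual generator.

```python
import os
import time


def generate_128bit_trace_id(now=None):
    """Build an id laid out as <32-bit unix seconds><32 zero bits><64 random bits>.

    Hypothetical helper for illustration; `now` lets callers pin the timestamp.
    """
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    random_low = int.from_bytes(os.urandom(8), "big")  # 64 random bits
    # Shifting by 96 places the seconds in the top 32 bits and leaves
    # bits 64..95 as zero.
    return (seconds << 96) | random_low


trace_id = generate_128bit_trace_id()
print("{:032x}".format(trace_id))
```

Because the timestamp occupies the most significant bits, ids generated later compare greater than ids generated earlier once the clock has advanced by at least a second.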
### Encoding
128-bit integers are not compatible with ddtrace trace encoders and agent
endpoints. As a workaround, 128-bit trace ids are encoded in two
fields.
- The 64 lowest order bits are stored in the `trace_id` field in
the JSON and MsgPack encoders.
- The 64 highest order bits are encoded as hex and stored in a
span tag with the key `_dd.p.tid`.
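The two-field split described above can be sketched as follows. The function names `split_trace_id` and `join_trace_id` are hypothetical, and the 16-character zero-padded hex form is an assumption consistent with the `6414bafd00000000` value used in this description.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def split_trace_id(trace_id):
    """Return (low 64 bits as int, high 64 bits as 16-char lowercase hex)."""
    low = trace_id & MASK_64
    high_hex = "{:016x}".format(trace_id >> 64)
    return low, high_hex


def join_trace_id(low, high_hex):
    """Rebuild the 128-bit id from the `trace_id` field and the hex tag value."""
    return (int(high_hex, 16) << 64) | low


low, high_hex = split_trace_id(133030438088573679178230931306392465947)
print(low)       # 15703349996414972443
print(high_hex)  # 6414bafd00000000
```

The low value fits in an unsigned 64-bit integer, so existing encoders can serialize it unchanged, while the hex tag carries the remaining bits.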
#### Example
Span:
`<Span(id=3822442170818112150,trace_id=133030438088573679178230931306392465947,parent_id=None,name=bit128)>`
trace_id as binary:
`01100100000101001011101011111101000000000000000000000000000000001101100111101101100000011001111100101111111100100011001000011011`
trace_id lower order bits:
`1101100111101101100000011001111100101111111100100011001000011011`
trace_id lower order bits as integer: `15703349996414972443`
trace_id higher order bits:
`0110010000010100101110101111110100000000000000000000000000000000`
trace_id higher order bits as hex: `6414bafd00000000`
Encoded Span: `span_id=3822442170818112150`,
`trace_id=15703349996414972443`, `name=bit128`,
`_meta={"_dd.p.tid": "6414bafd00000000"}`
### Distributed Tracing
This PR only adds support for propagating 128-bit trace ids using the
Datadog propagation mode. Although the b3 and w3c trace header formats
support 128-bit trace ids, the Datadog tracer truncates all trace ids to
64 bits. This truncation is not acceptable going forward. Support for
128-bit trace ids in b3 and w3c will be added in a future PR.
#### Datadog Distributed Tracing Headers
Similar to the encoding example above, the `x-datadog-trace-id` header
propagates the 64 lower order bits as an integer and the
`x-datadog-tags` header propagates the higher order bits as hex
(using the `_dd.p.tid` tag).
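A rough sketch of how injection and extraction could work under this scheme is shown here. The function names are hypothetical, and the sketch assumes `x-datadog-tags` carries comma-separated `key=value` pairs with the high bits under `_dd.p.tid`; the tracer's real propagator handles more cases (validation, other tags, size limits).

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF
TID_TAG = "_dd.p.tid"  # assumed tag key for the high-order 64 bits


def inject_headers(trace_id, span_id):
    """Build Datadog-style propagation headers for a 64-bit or 128-bit trace id."""
    headers = {
        "x-datadog-trace-id": str(trace_id & MASK_64),
        "x-datadog-parent-id": str(span_id),
    }
    high = trace_id >> 64
    if high:
        headers["x-datadog-tags"] = "{}={:016x}".format(TID_TAG, high)
    return headers


def extract_trace_id(headers):
    """Rebuild the full trace id; falls back to 64 bits if the tag is absent."""
    low = int(headers["x-datadog-trace-id"])
    high = 0
    for pair in headers.get("x-datadog-tags", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == TID_TAG:
            high = int(value.strip(), 16)
    return (high << 64) | low


headers = inject_headers(133030438088573679178230931306392465947, 3822442170818112150)
print(headers)
print(extract_trace_id(headers))
```

A downstream service that only understands 64-bit trace ids can ignore the tag and still correlate on `x-datadog-trace-id`, which is why the low-order bits travel in the existing header.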
##### Example
Span:
`<Span(id=3822442170818112150,trace_id=133030438088573679178230931306392465947,parent_id=None,name=bit128)>`
distributed tracing headers: `{"x-datadog-tags":
"_dd.p.tid=6414bafd00000000", "x-datadog-trace-id": 15703349996414972443,
"x-datadog-parent-id": 3822442170818112150}`
### Sampling
The 64 lowest order bits are random, but the 64 highest order bits
correspond to the unix time and are not random. When 128-bit trace ids
are generated, we should use only the lowest order 64 bits (the random
component) to determine whether a span should be sampled. This ensures
that when trace ids are mapped to values from 0 to 1, we get a uniform
random distribution.
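The idea can be illustrated with a simplified rate check. This is only a sketch: `should_sample` is a hypothetical function, and the plain threshold comparison stands in for whatever hashing the tracer's real samplers apply.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def should_sample(trace_id, sample_rate):
    """Decide sampling from the random low 64 bits only.

    The timestamp in the high bits is masked off so it cannot skew the
    decision; integer comparison avoids float rounding at the boundaries.
    """
    low = trace_id & MASK_64
    threshold = int(sample_rate * 2**64)  # maps the rate in [0, 1] onto [0, 2**64]
    return low < threshold


trace_id = 133030438088573679178230931306392465947
print(should_sample(trace_id, 0.5))  # False: the low bits map to roughly 0.85
print(should_sample(trace_id, 0.9))  # True
```

Two ids that share the same low 64 bits get the same decision regardless of when they were generated, so a 64-bit-only service and a 128-bit service agree on whether a trace is kept.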
## Testing Strategy
- Run the tracer test suite with
`DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED=true`. This will ensure all
tracing operations work as expected when 128bit trace ids are generated
(ex: sampling, distributed tracing, encoding).
- Ensure all 128bit trace id system tests pass.
- Add integration tests for the following scenarios:
- 128-bit trace ids are encoded without data loss and without raising an
`OverflowError`.
- The `trace_id` field contains only a 64-bit integer and the remaining
bits are stored in the `_dd.p.tid` tag.
- 128-bit trace ids are propagated by Datadog distributed tracing
headers.
- Ensure the full 128-bit trace id is propagated and reconstructed by
downstream services.
- Ensure spans with 128-bit trace ids are sampled at the expected rate.
### Performance Testing
1. There is no performance regression when 128bit trace id generation is
disabled. This is the default mode.
2. When 128bit trace id generation is enabled there is a ~60ns increase
(578ns -> 640ns, 10%) to span creation (`Span.__init__(name, ......)`).
This performance regression does not appear avoidable.
- This overhead was measured on an M1 using Python 3.8. Results vary across
platforms and Python versions.
## Next Steps
1. Support 128bit trace id propagation in b3 and w3c headers
- The code change is straightforward (i.e., avoid truncating incoming
trace ids), but it will require a significant refactor of
existing tests.
2. Add support for the `DD_TRACE_128_BIT_TRACEID_LOGGING_ENABLED`
environment variable
- This supports logging 64bit trace ids even when
`DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED=true`
3. Enable 128-bit trace ids on a sample application (e.g., internal services
using the ddtrace library)
- Ensure sampling rates are respected
- Ensure logs correlation works for traces with 128-bit trace ids (this
is a concern raised in the RFC)
- Ensure distributed tracing works across tracers that only support
64-bit trace ids
- Ensure the full 128-bit trace id is reconstructed and is viewable in
the Datadog product
4. Enable support for 128-bit trace ids by default and add public
documentation.
## Checklist
- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/contributing.html#Release-Note-Guidelines)
are followed.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Author is aware of the performance implications of this PR as
reported in the benchmarks PR comment.
## Reviewer Checklist
- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer is aware of, and discussed the performance implications
of this PR as reported in the benchmarks PR comment.
---------
Co-authored-by: Kyle Verhoog <[email protected]>