
Conversation

@nhulston
Contributor

@nhulston nhulston commented Feb 4, 2025

What does this PR do?

How sampling works normally (for host-based apps, not Lambda):
The tracer decides whether to sample a trace based on the DD_TRACE_SAMPLING_RULES (new) or DD_TRACE_SAMPLING_RATE (deprecated) env vars. It then sets the _sampling_priority_v1 metric on some or all spans to the chosen sampling priority. If the priority is <= 0, the trace should be dropped; otherwise it should be kept.

The agent then reads the sampling priority from this metric and decides whether to drop or keep the trace based on that priority and other factors, e.g. whether an error occurred or other special rules apply.
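For illustration, here's a rough Rust sketch of that decision; the function names are hypothetical, and only the `_sampling_priority_v1` metric key comes from the description above:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the rule described above; only the metric name
// `_sampling_priority_v1` is taken from the PR description.
fn sampling_priority(span_metrics: &HashMap<String, f64>) -> Option<i64> {
    span_metrics.get("_sampling_priority_v1").map(|p| *p as i64)
}

fn tracer_requested_drop(span_metrics: &HashMap<String, f64>) -> bool {
    // A priority <= 0 means the tracer decided the trace should be dropped.
    matches!(sampling_priority(span_metrics), Some(p) if p <= 0)
}
```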

How sampling works in Lambda
Since the serverless agent drops the lambda span received from the tracer and creates a new lambda span, we instead read the sampling priority from the headers of the /lambda/end-invocation request.
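A minimal sketch of that extraction, assuming the priority arrives in the standard `x-datadog-sampling-priority` propagation header; the header name and function are illustrative, not taken from this PR's code:

```rust
use std::collections::HashMap;

// Illustrative only: parse the sampling priority from the end-invocation
// request headers; returns None if the header is absent or malformed.
fn priority_from_end_invocation_headers(headers: &HashMap<String, String>) -> Option<i64> {
    headers
        .get("x-datadog-sampling-priority")
        .and_then(|value| value.trim().parse::<i64>().ok())
}
```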

The complex sampling logic exists in the main datadog-agent and in the backend, but not in the serverless Go agent. Therefore, in Lambda (Python, Node, and the serverless Go agent), we have historically just sent all traces to the backend and let it decide whether to sample or drop them.

This allows env variables like DD_TRACE_SAMPLING_RULES and DD_TRACE_SAMPLING_RATE to work with Bottlecap for universally instrumented runtimes.

Describe how you validated your changes

Manual testing. Prior to this PR, sampling rules worked in Node+Python, but not in Java/.NET/Golang with Bottlecap.

After these changes, sampling rules now work in Java and Golang, and Node and Python continue to work as expected. As for .NET, this PR is part of the fix, but it turns out the .NET tracer does not send the correct sampling priority header on the /lambda/end-invocation request; that needs to be fixed separately.

I also added unit tests:

  • cargo test test_update_span_context_with_sampling_priority
  • cargo test test_update_span_context_with_invalid_priority
  • cargo test test_update_span_context_no_sampling_priority

Additional Notes

In the future, we could implement the complex sampling logic that exists in the main agent to take some load off the backend (this would be done in libdatadog). This consists of rules like dropping traces with negative priority while keeping traces that contain errors, etc.
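As a rough sketch of what that future logic might look like (illustrative only, not part of this PR or of libdatadog):

```rust
// Illustrative only: keep error traces regardless of priority; otherwise
// honor the tracer's drop decision (priority <= 0 means drop).
fn should_drop(sampling_priority: i64, trace_has_error: bool) -> bool {
    if trace_has_error {
        return false;
    }
    sampling_priority <= 0
}
```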

There are many types of sampling rules; here's the list in order of priority. This PR covers cases 1-4 and 6, but not 5 (see the sketch after this list for how the precedence could be resolved):

  1. remote sampling rules
  2. local sampling rules
  3. remote global sampling rate
  4. local global sampling rate
  5. sampling rates from the agent (max traces per second)
  6. if nothing else, rate defaults to 100% (keep all traces)
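To make the precedence concrete, a hypothetical sketch of how the first configured source could win; all names here are illustrative and not taken from the agent or libdatadog code:

```rust
// Illustrative only: take the first configured rate in the precedence order above.
fn effective_sample_rate(
    remote_rule_rate: Option<f64>,
    local_rule_rate: Option<f64>,
    remote_global_rate: Option<f64>,
    local_global_rate: Option<f64>,
    agent_rate: Option<f64>,
) -> f64 {
    remote_rule_rate
        .or(local_rule_rate)
        .or(remote_global_rate)
        .or(local_global_rate)
        .or(agent_rate)
        // 6. if nothing else is set, keep all traces
        .unwrap_or(1.0)
}
```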

@nhulston nhulston changed the title [feat] Extract sampling priority from tracer and drop when sampling priority <= 0 feat: Extract sampling priority from tracer and drop when sampling priority <= 0 Feb 4, 2025
@nhulston nhulston force-pushed the nicholas.hulston/implement-sampling-priority branch from 1e7d896 to c0e8de6 on February 4, 2025 18:29
@nhulston nhulston marked this pull request as ready for review February 4, 2025 18:49
@nhulston nhulston requested a review from a team as a code owner February 4, 2025 18:49
@nhulston nhulston marked this pull request as draft February 5, 2025 18:22
@nhulston nhulston changed the title feat: Extract sampling priority from tracer and drop when sampling priority <= 0 feat: Extract sampling priority from tracer and apply to new lambda span Feb 7, 2025
@nhulston nhulston marked this pull request as ready for review February 7, 2025 20:13
@nhulston nhulston requested a review from duncanista February 7, 2025 20:14
Contributor

@duncanista duncanista left a comment


LGTM – make sure to manually test in a Lambda with an inferred span, double hop inferred span, and cold start. Check in APM that those traces also get dropped.

@nhulston
Contributor Author

Tested and works with:

  • Cold start
  • Child span
  • Inferred span with no trace context

Sampling priority is not propagated through single/double hop trace propagation cases; we'd need to implement that in a separate PR. The Go agent logic that handles that is in this file: https://github.com/DataDog/datadog-agent/blob/4b5c8b9270fe4626702db6d66298a060176251d0/pkg/serverless/trace/propagation/extractor.go

They're also dropped in APM.

@nhulston nhulston closed this Feb 10, 2025
@nhulston nhulston reopened this Feb 10, 2025
@nhulston nhulston merged commit e02939e into main Feb 10, 2025
23 checks passed
@nhulston nhulston deleted the nicholas.hulston/implement-sampling-priority branch February 10, 2025 17:38