
AWS X-Ray Adaptive Sampling Support #1141


Draft
wants to merge 1 commit into
base: main


majanjua-amzn
Contributor

AWS X-Ray Adaptive Sampling Support

Description

AWS X-Ray will soon support adaptive sampling through the sampling rule APIs, allowing customers to configure sampling rate boosts based on anomaly rate. GetSamplingTargets and GetSamplingRules will be updated to support new inputs and outputs relevant to this feature, so the SDK must be updated to match.

Sampling boost

The SDK now supports sending "boost statistics" to the GetSamplingTargets API. These statistics include the total number of requests (traces), the number of traces with anomalies detected (more on this later), and the number of anomalies sampled. The server then responds with the sampling rate to set and how long to keep it, and the SDK adjusts accordingly.
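
As a rough illustration of the three counters described above, the sketch below keeps them in memory and resets them each time they are read for a report. The class name `BoostStatistics` and its methods are hypothetical and are not the names used in this PR; the actual request fields sent to GetSamplingTargets are not shown.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for the three boost statistics; names are illustrative
// and are not taken from the classes in this PR.
final class BoostStatistics {
  private final AtomicLong totalCount = new AtomicLong();
  private final AtomicLong anomalyCount = new AtomicLong();
  private final AtomicLong sampledAnomalyCount = new AtomicLong();

  // Record one trace. Every trace counts toward the total; an anomalous trace
  // also counts as an anomaly and, if it was sampled, as a sampled anomaly.
  void record(boolean isAnomaly, boolean isSampled) {
    totalCount.incrementAndGet();
    if (isAnomaly) {
      anomalyCount.incrementAndGet();
      if (isSampled) {
        sampledAnomalyCount.incrementAndGet();
      }
    }
  }

  // Read and reset the counters so that each report covers one interval.
  // Returns {total, anomalies, sampledAnomalies}.
  long[] snapshotAndReset() {
    return new long[] {
      totalCount.getAndSet(0), anomalyCount.getAndSet(0), sampledAnomalyCount.getAndSet(0)
    };
  }
}
```

Resetting on read is an assumption made here so that consecutive reports do not double count; the PR may aggregate differently.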

Local configuration

Customers also need to be able to define what an anomaly is for their applications in order to provide meaningful boost statistics. By default, any 5XX error (falling back to the ERROR attribute) is treated as an anomaly. If provided, a local configuration can define specific criteria, including an error code regex, an operation, and a high latency threshold, that determine which spans are counted as anomalies.
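
The matching described above could look roughly like the following sketch. The class `AnomalyCondition`, its fields, and the way the criteria combine (the operation acts as a filter, then either the error code regex or the latency threshold may match) are assumptions for illustration only; the fallback to the ERROR attribute is not modeled, and the real `AwsXrayAdaptiveSamplingConfig` may be structured differently.

```java
import java.util.regex.Pattern;

// Hypothetical anomaly condition; not the PR's AwsXrayAdaptiveSamplingConfig.
final class AnomalyCondition {
  private final Pattern errorCodePattern; // null means no error code check
  private final String operation;         // null means any operation
  private final Long highLatencyMs;       // null means no latency check

  AnomalyCondition(String errorCodeRegex, String operation, Long highLatencyMs) {
    this.errorCodePattern = errorCodeRegex == null ? null : Pattern.compile(errorCodeRegex);
    this.operation = operation;
    this.highLatencyMs = highLatencyMs;
  }

  // Default behavior described above: any 5XX status code is an anomaly.
  static AnomalyCondition defaults() {
    return new AnomalyCondition("^5\\d\\d$", null, null);
  }

  // Assumed semantics: the operation must match if one is configured, and then
  // either the error code regex or the latency threshold must be satisfied.
  boolean matches(String spanOperation, int statusCode, long latencyMs) {
    if (operation != null && !operation.equals(spanOperation)) {
      return false;
    }
    boolean errorMatch =
        errorCodePattern != null
            && errorCodePattern.matcher(Integer.toString(statusCode)).matches();
    boolean latencyMatch = highLatencyMs != null && latencyMs >= highLatencyMs;
    return errorMatch || latencyMatch;
  }
}
```

For example, `AnomalyCondition.defaults()` flags a 503 response on any operation, while a condition built with an operation of `"GET /cart"` ignores spans from other operations entirely.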

Anomaly Capture (disabled by default)

Anomalous spans can also be captured directly when their trace is left unsampled. When an anomalous span is detected, a reservoir-style span capturing mechanism, configured through the local configuration above, sends the span directly to the spanExporter. These spans appear in the console as partial traces and ensure the customer can see anomalies even if the boosted sampling rate was unable to capture them.
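
One simple way to bound how many anomalous spans are exported is a per-second quota, sketched below. This is a fixed-window limiter chosen for illustration, not necessarily the reservoir used in this PR; the class name `AnomalyCaptureReservoir` and the `capturesPerSecond` parameter are hypothetical.

```java
// Hypothetical per-second quota for capturing unsampled anomalous spans.
// A fixed one-second window is an assumption made for this sketch.
final class AnomalyCaptureReservoir {
  private final int capturesPerSecond;
  private long currentSecond = Long.MIN_VALUE;
  private int usedThisSecond;

  AnomalyCaptureReservoir(int capturesPerSecond) {
    this.capturesPerSecond = capturesPerSecond;
  }

  // Returns true if the span may be exported, consuming one unit of quota.
  // The caller passes the clock value (for example System.nanoTime()).
  synchronized boolean tryTake(long nowNanos) {
    long second = Math.floorDiv(nowNanos, 1_000_000_000L);
    if (second != currentSecond) {
      // A new one-second window has started, so the quota is refilled.
      currentSecond = second;
      usedThisSecond = 0;
    }
    if (usedThisSecond < capturesPerSecond) {
      usedThisSecond++;
      return true;
    }
    return false;
  }
}
```

Passing the timestamp in, rather than reading the clock inside the method, keeps the sketch deterministic and easy to test.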

Changes

AWS X-Ray component patch update for OTel Java Contrib - see the following diff between the changes here and the release of the contrib we currently consume: link (includes diff from previous patch on the sampler)

  • Added an import for OTel semantic conventions
  • Created a class called AwsSamplingResult that includes the matched sampling rule in the trace state or propagates the received sampling rule from an upstream call using the trace state
  • Added a class called AwsXrayAdaptiveSamplingConfig for the local SDK configuration option
  • Added the config object and a batch span exporter to the AwsXrayRemoteSampler to allow identification and export of anomalies
  • Added adaptSampling function that is called on each span and acts if and only if adaptive sampling configurations are present - this is where the core logic of the feature is
  • Updated calls to GetSamplingTargets to include boost statistics
  • Updated GetSamplingRules and GetSamplingTargets request and response classes where relevant
  • In SamplingRuleApplier:
    • Able to receive boost information
    • Able to receive boost related statistics from the XrayRulesSampler
    • Fixed bug where sampling rules are scheduled to call GetSamplingTargets at slightly different times
  • In XrayRulesSampler:
    • Added attribute to spans when boost config/anomaly capturing is enabled to be able to identify boost-enabled systems in X-Ray backend
    • Accept AwsXrayAdaptiveSamplingConfig and apply it in adaptSampling to change anomaly capturing/boost logic
    • Get and propagate upstream sampling rule in shouldSample using AwsSamplingResult
    • Core implementation of adaptSampling
      • If account has no boost, return
      • [1] Get the rule to report to, based on the upstream matched rule propagated through trace state; [2] identify anomalies based on the local config or the default 5XX check; [3] capture the anomaly if error capture is enabled; [4] count boost statistics
      • Maintain anomalyTracesSet that holds trace IDs for anomaly spans to ensure we don't double count anomalies in one trace. When the local root span for this trace is encountered, it is removed from the set
    • Add generateIngressOperation based on ADOT SDK logic for getting operation - used for matching with operations provided in local configuration
  • Add unit tests
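
The once-per-trace anomaly counting described for anomalyTracesSet can be sketched as follows. The class `AnomalyTraceTracker` and its method names are hypothetical, and spans are reduced to a trace ID plus two flags; the real adaptSampling works on OpenTelemetry span data.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of counting at most one anomaly per trace.
final class AnomalyTraceTracker {
  private final Set<String> anomalyTraces = ConcurrentHashMap.newKeySet();
  private final AtomicLong anomalyCount = new AtomicLong();

  // Called for each ended span. Set.add returns false when the trace ID is
  // already present, so further anomalies in that trace are not counted again.
  void onSpanEnd(String traceId, boolean isAnomaly, boolean isLocalRoot) {
    if (isAnomaly && anomalyTraces.add(traceId)) {
      anomalyCount.incrementAndGet();
    }
    if (isLocalRoot) {
      // The local root span marks the end of this trace in this service,
      // so the entry is dropped to keep the set from growing without bound.
      anomalyTraces.remove(traceId);
    }
  }

  long anomalyCount() {
    return anomalyCount.get();
  }

  int trackedTraces() {
    return anomalyTraces.size();
  }
}
```

This sketch assumes the local root span ends after its child spans in the same service; a child span that ends later would be counted again.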

ADOT SDK Changes

  • Add YAML import for reading local configuration
  • Remove B3 and multi propagators as they remove/override the sampling rule propagated through trace state and are no longer needed
  • Update customizeSampler to provide the sampler the span exporter and the local adaptive sampling configuration and pass the sampler to the span metrics processor
  • Call adaptSampling on each span from the span metrics processor
  • Add parsing function and associated test

Testing

  • Automated release test passing and going through PR here: aws-application-signals-test-framework#442
  • Manual testing done using 3 services, A -> B -> C, where A has a boosted sampling rule and B and C produce anomalies that are sent to the backend, boosting the sampling rate in A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@majanjua-amzn majanjua-amzn self-assigned this Aug 12, 2025
@majanjua-amzn majanjua-amzn added enhancement New feature or request X-Ray AWS X-Ray components traces Tracing related issues java Pull requests that update Java code labels Aug 12, 2025
@majanjua-amzn majanjua-amzn force-pushed the adaptive-sampling branch 2 times, most recently from 4e263f5 to 2ae32ca Compare August 12, 2025 22:57