@majanjua-amzn majanjua-amzn commented Aug 26, 2025

Request

  1. Customers should be able to see not only the span that the error capture logic flags as an anomaly, but also the rest of the spans in that trace where possible
  2. Statistics should be accurate even if a service is called multiple times as part of the same trace. The current segment/operation closing and returning is not enough to signal the end of the trace, so we need to keep track of the trace ID and ensure any repeated calls to a service as part of the same trace are not counted again

Changes

Main changes

  • Replaced anomalyTracesSet with traceUsageCache: this cache records, for each trace seen in the last 10 minutes, the most recent adaptive-sampling operation performed on it. It is implemented using Caffeine, a high-performance cache with built-in entry expiration. Traces therefore drop out of the cache automatically as time passes, and the cache keeps a maximum size while still being used like a HashMap
    • A non-anomaly span that is matched by a non-adaptive-sampling rule with no local anomaly capture configuration will be completely ignored
    • A non-anomaly span that is matched by a rule with adaptive sampling enabled with no local anomaly capture configuration will be added to the cache with traceID->NEITHER to signify that the trace has been seen but no anomaly has been detected
    • An anomaly span that is matched by a rule with adaptive sampling enabled with no local anomaly capture configuration will be added to the cache with traceID->SAMPLING_BOOST to signify the trace has been counted towards anomaly statistics
    • etc
  • Replaced the logic that relied on whether the span was a local root span to identify the end of the trace. It now checks the cache to see whether the trace has already been seen, so the trace is not counted again
  • Added logic to keep the cache up to date with the latest decision made for the current span
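
To make the cache behaviour above easier to picture, here is a minimal sketch. It is not the code in this PR: the real cache is built with Caffeine (which offers expireAfterWrite and maximumSize on its builder), while this stand-in uses only the JDK so it can be run on its own. The class name TraceUsageCacheSketch, the recordIfAbsent method, the injected clock, and the size of 1000 are all assumptions made for illustration. Only the NEITHER and SAMPLING_BOOST values and the 10-minute window come from the description.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.function.LongSupplier;

/** Illustrative stand-in for a Caffeine-backed traceUsageCache (hypothetical names). */
final class TraceUsageCacheSketch {
  // Only these two values are named in the PR description.
  enum UsageType { NEITHER, SAMPLING_BOOST }

  private static final class Entry {
    final UsageType usage;
    final long writtenAtMillis;

    Entry(UsageType usage, long writtenAtMillis) {
      this.usage = usage;
      this.writtenAtMillis = writtenAtMillis;
    }
  }

  private final long ttlMillis;
  private final int maxSize;
  private final LongSupplier clock; // injected so expiry can be exercised without sleeping
  private final LinkedHashMap<String, Entry> entries = new LinkedHashMap<>();

  TraceUsageCacheSketch(long ttlMillis, int maxSize, LongSupplier clock) {
    if (maxSize < 1) {
      throw new IllegalArgumentException("maxSize must be at least 1");
    }
    this.ttlMillis = ttlMillis;
    this.maxSize = maxSize;
    this.clock = clock;
  }

  /** Returns the last recorded usage for the trace, or null if unseen or expired. */
  synchronized UsageType get(String traceId) {
    Entry entry = entries.get(traceId);
    if (entry == null) {
      return null;
    }
    if (clock.getAsLong() - entry.writtenAtMillis >= ttlMillis) {
      entries.remove(traceId); // expired: behave as if the trace was never seen
      return null;
    }
    return entry.usage;
  }

  /** Records the latest usage for the trace, evicting the oldest write when full. */
  synchronized void put(String traceId, UsageType usage) {
    entries.remove(traceId); // re-insert so this trace becomes the newest entry
    if (entries.size() >= maxSize) {
      Iterator<String> oldest = entries.keySet().iterator();
      oldest.next();
      oldest.remove();
    }
    entries.put(traceId, new Entry(usage, clock.getAsLong()));
  }

  /** Returns true only on the first sighting of a trace, so it can be counted once. */
  synchronized boolean recordIfAbsent(String traceId, UsageType usage) {
    if (get(traceId) != null) {
      return false;
    }
    put(traceId, usage);
    return true;
  }

  public static void main(String[] args) {
    long[] now = {0L};
    TraceUsageCacheSketch cache =
        new TraceUsageCacheSketch(10 * 60 * 1000L, 1000, () -> now[0]);
    System.out.println(cache.recordIfAbsent("trace-1", UsageType.SAMPLING_BOOST));
    System.out.println(cache.recordIfAbsent("trace-1", UsageType.NEITHER));
    now[0] += 10 * 60 * 1000L; // advance the fake clock past the 10-minute window
    System.out.println(cache.get("trace-1"));
  }
}
```

The point of the sketch is that a repeated call within the same trace finds an existing entry and is therefore not counted a second time, which is the role the local-root-span check used to play.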

Misc other changes

  • Changed fast-exit condition from !adaptiveSamplingRuleExists to !adaptiveSamplingRuleExists && this.adaptiveSamplingConfig == null to ensure local configuration alone can still function without server side logic for anomaly capturing
  • Renamed all mentions of error to anomaly and span to trace where necessary
  • Defaulted the anomaly capture rate limiter to 1 trace per second when a local configuration is provided without an anomaly capture rate
  • Added static functions to AwsXrayAdaptiveSampling.UsageType to quickly tell whether a usage is for boost or for anomaly trace capturing
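
A rough sketch of how these smaller changes fit together, for illustration only. The fast-exit expression and the default of 1 trace per second are taken from the bullets above. Everything else is an assumption: the enum values ANOMALY_TRACE_CAPTURE and BOTH, the helper names isUsedForBoost and isUsedForAnomalyTraceCapture, and the wrapper class itself are hypothetical and may not match the names in AwsXrayAdaptiveSampling.UsageType.

```java
/** Hypothetical helpers mirroring the miscellaneous changes; not the SDK's actual code. */
final class AdaptiveSamplingHelpersSketch {
  enum UsageType {
    NEITHER,
    SAMPLING_BOOST,
    ANOMALY_TRACE_CAPTURE, // assumed name
    BOTH; // assumed name

    /** True if the usage counted towards sampling-boost statistics. */
    static boolean isUsedForBoost(UsageType usage) {
      return usage == SAMPLING_BOOST || usage == BOTH;
    }

    /** True if the usage marked the trace for anomaly trace capture. */
    static boolean isUsedForAnomalyTraceCapture(UsageType usage) {
      return usage == ANOMALY_TRACE_CAPTURE || usage == BOTH;
    }
  }

  // Default from the PR description: 1 trace per second when no rate is configured.
  static final int DEFAULT_ANOMALY_TRACES_PER_SECOND = 1;

  /** Skip adaptive sampling work only when neither a rule nor a local config exists. */
  static boolean shouldFastExit(boolean adaptiveSamplingRuleExists, Object adaptiveSamplingConfig) {
    return !adaptiveSamplingRuleExists && adaptiveSamplingConfig == null;
  }

  /** Falls back to the default rate when the local config leaves the rate unset. */
  static int effectiveAnomalyCaptureRate(Integer configuredTracesPerSecond) {
    return configuredTracesPerSecond != null
        ? configuredTracesPerSecond
        : DEFAULT_ANOMALY_TRACES_PER_SECOND;
  }

  public static void main(String[] args) {
    Object localConfig = new Object(); // stands in for a local adaptive sampling config
    System.out.println(shouldFastExit(false, null));
    System.out.println(shouldFastExit(false, localConfig));
    System.out.println(effectiveAnomalyCaptureRate(null));
  }
}
```

With the old condition, the second call in main would also have exited early, which is why a local configuration alone could not drive anomaly capture.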

Testing

  • Unit tests updated
  • Deployed to test environment:
    • Verified that, for an application with multiple spans after an anomaly occurs, the later spans are exported once the anomaly has been identified and exported, even if they are not anomalies themselves
    • Verified no change in statistics provided to backend

Example of relevant test cases explored during testing:

  • A -> B -> C, B and C configured with sampling-boost condition on ^500$ error code RegEx and anomaly-trace-capture on ^501$
    B calls C 5 times and receives the following responses:
    a. 500
    b. 501
    c. 500
    d. 200
    e. 200
  • B returns 500 to A

Results:

  • B sends the following boost documents: [SamplingBoostStatisticsDocument{ruleName=RuleA, serviceName=ServiceB, timestamp=Wed Aug 27 21:14:21 UTC 2025, anomalyCount=1, totalCount=1, sampledAnomalyCount=0}]
  • C sends the following boost documents: [SamplingBoostStatisticsDocument{ruleName=RuleA, serviceName=ServiceC, timestamp=Wed Aug 27 21:23:00 UTC 2025, anomalyCount=1, totalCount=1, sampledAnomalyCount=0}]
  • Trace only shows spans for [b-e] as [a] was only used for boost condition, so partial trace shows 4/5 of the spans of interest as expected based on current design
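
The results for service C can be reasoned through with a small simulation. This is a simplified model written for this explanation, not the SDK logic: it handles a single trace, uses boolean flags in place of the trace cache, and assumes that the total and anomaly counts are each incremented at most once per trace and that every span ending after a capture match is exported. The class and method names are hypothetical. The two regular expressions and the response sequence are the ones from the test case above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified single-trace model of service C in the test case; names are hypothetical. */
final class PartialTraceWalkthroughSketch {
  static final Pattern BOOST_CONDITION = Pattern.compile("^500$");
  static final Pattern CAPTURE_CONDITION = Pattern.compile("^501$");

  int totalCount;
  int anomalyCount;
  final List<String> exportedSpans = new ArrayList<>();

  // These flags stand in for what the trace cache remembers about the trace.
  private boolean traceSeen;
  private boolean anomalyCounted;
  private boolean traceCaptured;

  void onSpanEnd(String label, String statusCode) {
    if (!traceSeen) {
      traceSeen = true;
      totalCount++; // the trace is counted once, however many spans it has
    }
    if (!anomalyCounted && BOOST_CONDITION.matcher(statusCode).find()) {
      anomalyCounted = true;
      anomalyCount++; // repeated 500s in the same trace are not counted again
    }
    if (CAPTURE_CONDITION.matcher(statusCode).find()) {
      traceCaptured = true;
    }
    if (traceCaptured) {
      exportedSpans.add(label); // this span and all later spans are exported
    }
  }

  /** Replays the five responses that C returns to B in the test case. */
  static PartialTraceWalkthroughSketch runServiceC() {
    PartialTraceWalkthroughSketch serviceC = new PartialTraceWalkthroughSketch();
    String[] labels = {"a", "b", "c", "d", "e"};
    String[] statusCodes = {"500", "501", "500", "200", "200"};
    for (int i = 0; i < labels.length; i++) {
      serviceC.onSpanEnd(labels[i], statusCodes[i]);
    }
    return serviceC;
  }

  public static void main(String[] args) {
    PartialTraceWalkthroughSketch serviceC = runServiceC();
    System.out.println("totalCount=" + serviceC.totalCount
        + ", anomalyCount=" + serviceC.anomalyCount
        + ", exported=" + serviceC.exportedSpans);
    // prints "totalCount=1, anomalyCount=1, exported=[b, c, d, e]"
  }
}
```

Under these assumptions the model reproduces the figures reported above: one total and one anomaly for the trace, and spans b through e exported while a is used only for the boost statistics.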

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@majanjua-amzn majanjua-amzn requested a review from wangzlei August 26, 2025 19:14
@majanjua-amzn majanjua-amzn requested a review from a team as a code owner August 26, 2025 19:14
@majanjua-amzn majanjua-amzn added enhancement New feature or request X-Ray AWS X-Ray components traces Tracing related issues java Pull requests that update Java code labels Aug 26, 2025
@majanjua-amzn majanjua-amzn self-assigned this Aug 27, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/v2.11.x@5ecad33). Learn more about missing BASE report.

Additional details and impacted files
@@                Coverage Diff                 @@
##             release/v2.11.x    #1165   +/-   ##
==================================================
  Coverage                   ?   66.92%           
  Complexity                 ?      519           
==================================================
  Files                      ?       54           
  Lines                      ?     2676           
  Branches                   ?      372           
==================================================
  Hits                       ?     1791           
  Misses                     ?      749           
  Partials                   ?      136           


@majanjua-amzn majanjua-amzn merged commit fb1d39b into release/v2.11.x Aug 28, 2025
6 of 8 checks passed
@majanjua-amzn majanjua-amzn deleted the anomaly-capture branch August 28, 2025 16:44