| 1 | +# Trace Loss Tracking with Antithesis Assertions |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document describes the simplified Antithesis assertion strategy implemented to track trace loss in dd-trace-java. |
| 6 | + |
| 7 | +## Implementation |
| 8 | + |
| 9 | +Assertions were added at three points in the trace pipeline to show where and why traces are lost: |
| 10 | + |
| 11 | +### 1. CoreTracer.write() - Sampling Decision Point |
| 12 | + |
| 13 | +**Location:** `dd-trace-core/src/main/java/datadog/trace/core/CoreTracer.java` |
| 14 | + |
| 15 | +**Purpose:** Track traces at the sampling decision point |
| 16 | + |
| 17 | +**Assertions:** |
| 18 | +- `trace_accepted_by_sampling` - Traces that passed sampling and will be sent |
| 19 | +- `trace_dropped_by_sampling` - Traces dropped due to sampling decision |
| 20 | + |
| 21 | +**Data Captured:** |
| 22 | +- `decision`: "accepted" or "dropped_sampling" |
| 23 | +- `trace_id`: Unique trace identifier |
| 24 | +- `span_count`: Number of spans in the trace |
| 25 | +- `sampling_priority`: Sampling priority value |
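
The fields above can be pictured as a small key/value payload attached to whichever assertion fires. The sketch below is illustrative only: `SamplingAssertionDetails` is a hypothetical helper, not a dd-trace-java class, and a plain `Map` stands in for whatever details object the Antithesis SDK call actually receives.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of dd-trace-java. */
final class SamplingAssertionDetails {
  private SamplingAssertionDetails() {}

  /** Picks the assertion name documented for each sampling outcome. */
  static String assertionName(boolean published) {
    return published ? "trace_accepted_by_sampling" : "trace_dropped_by_sampling";
  }

  /** Builds the four documented fields, keeping insertion order stable. */
  static Map<String, Object> details(
      boolean published, String traceId, int spanCount, int samplingPriority) {
    Map<String, Object> details = new LinkedHashMap<>();
    details.put("decision", published ? "accepted" : "dropped_sampling");
    details.put("trace_id", traceId);
    details.put("span_count", spanCount);
    details.put("sampling_priority", samplingPriority);
    return details;
  }
}
```

Keeping the name and the `decision` value derived from the same boolean avoids the two drifting apart if the branch logic changes.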
| 26 | + |
| 27 | +### 2. RemoteWriter.write() - Buffer Acceptance Point |
| 28 | + |
| 29 | +**Location:** `dd-trace-core/src/main/java/datadog/trace/common/writer/RemoteWriter.java` |
| 30 | + |
| 31 | +**Purpose:** Track traces at buffer acceptance and detect drops due to overflow or policy |
| 32 | + |
| 33 | +**Assertions:** |
| 34 | +- `trace_enqueued_for_send` - Traces successfully enqueued for serialization |
| 35 | +- `trace_dropped_buffer_overflow` - Traces dropped due to full buffer |
| 36 | +- `trace_dropped_by_policy` - Traces dropped by policy rules |
| 37 | +- `trace_dropped_writer_closed` - Traces dropped during shutdown |
| 38 | + |
| 39 | +**Data Captured:** |
| 40 | +- `decision`: "enqueued", "dropped_buffer_overflow", "dropped_policy", or "dropped_shutdown" |
| 41 | +- `trace_id`: Unique trace identifier (when available) |
| 42 | +- `span_count`: Number of spans in the trace |
| 43 | +- `sampling_priority`: Sampling priority value (when available) |
| 44 | + |
| 45 | +### 3. PayloadDispatcherImpl.accept() - HTTP Send Point |
| 46 | + |
| 47 | +**Location:** `dd-trace-core/src/main/java/datadog/trace/common/writer/PayloadDispatcherImpl.java` |
| 48 | + |
| 49 | +**Purpose:** Track actual HTTP sends to the agent and detect failures |
| 50 | + |
| 51 | +**Assertions:** |
| 52 | +- `trace_payloads_being_sent` - All send attempts (before HTTP call) |
| 53 | +- `traces_sent_successfully` - Traces successfully sent to agent |
| 54 | +- `traces_failed_to_send` - Traces that failed to send via HTTP |
| 55 | + |
| 56 | +**Data Captured:** |
| 57 | +- `decision`: "sent_success" or "dropped_send_failed" |
| 58 | +- `trace_count`: Number of traces in the payload |
| 59 | +- `payload_size_bytes`: Size of the payload in bytes |
| 60 | +- `http_status`: HTTP response status code |
| 61 | +- `dropped_traces_in_payload`: Count of traces already dropped before this send |
| 62 | +- `dropped_spans_in_payload`: Count of spans already dropped before this send |
| 63 | +- `has_exception`: Whether an exception occurred (for failures) |
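
Because these fields describe a whole payload, per-trace totals for this stage likely need to sum `trace_count` rather than count assertion hits. A minimal sketch, assuming the captured details have been exported into a list of events; `PayloadEvent` and `PayloadTotals` are hypothetical names, not dd-trace-java or Antithesis types:

```java
import java.util.List;

/** Hypothetical record of one payload-level assertion hit. */
final class PayloadEvent {
  final String decision; // "sent_success" or "dropped_send_failed"
  final int traceCount;  // value of the trace_count field

  PayloadEvent(String decision, int traceCount) {
    this.decision = decision;
    this.traceCount = traceCount;
  }
}

final class PayloadTotals {
  private PayloadTotals() {}

  /** Sums trace_count over the events whose decision matches. */
  static long tracesWithDecision(List<PayloadEvent> events, String decision) {
    long total = 0;
    for (PayloadEvent event : events) {
      if (decision.equals(event.decision)) {
        total += event.traceCount;
      }
    }
    return total;
  }
}
```

For example, two successful payloads carrying 40 and 5 traces contribute 45 to the trace-level success total, even though the assertion fired only twice.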
| 64 | + |
| 65 | +## Complete Trace Flow |
| 66 | + |
| 67 | +``` |
| 68 | +Application → CoreTracer.write() |
| 69 | + ↓ |
| 70 | + [ASSERTION POINT 1: Sampling] |
| 71 | + ↓ ↓ |
| 72 | + published=true published=false |
| 73 | + ↓ ↓ |
| 74 | + ✅ trace_accepted_by_sampling ❌ trace_dropped_by_sampling |
| 75 | + ↓ |
| 76 | + RemoteWriter.write() |
| 77 | + ↓ |
| 78 | + [ASSERTION POINT 2: Buffer Acceptance] |
| 79 | + ↓ |
| 80 | + traceProcessingWorker.publish() |
| 81 | + ↓ |
| 82 | + ✅ trace_enqueued_for_send |
| 83 | + OR |
| 84 | + ❌ trace_dropped_buffer_overflow |
| 85 | + ❌ trace_dropped_by_policy |
| 86 | + ❌ trace_dropped_writer_closed |
| 87 | + ↓ |
| 88 | + TraceProcessingWorker (batching) |
| 89 | + ↓ |
| 90 | + PayloadDispatcherImpl.accept() |
| 91 | + ↓ |
| 92 | + [ASSERTION POINT 3: HTTP Send] |
| 93 | + ↓ |
| 94 | + 🔵 trace_payloads_being_sent |
| 95 | + ↓ |
| 96 | + api.sendSerializedTraces() |
| 97 | + ↓ ↓ |
| 98 | + response.success() !response.success() |
| 99 | + ↓ ↓ |
| 100 | + ✅ traces_sent_successfully ❌ traces_failed_to_send |
| 101 | +``` |
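
The flow implies two bookkeeping identities that can be checked after a run: every trace accepted by sampling should end up either enqueued or dropped at the buffer, and every enqueued trace should end up either sent or failed. The sketch below assumes all inputs are trace counts (not payload counts) and that each accepted trace reaches `RemoteWriter.write()` exactly once; `StageReconciliation` is a hypothetical name.

```java
/** Hypothetical helper that reconciles counts between pipeline stages. */
final class StageReconciliation {
  private StageReconciliation() {}

  /** Accepted traces not explained by any buffer-stage outcome. */
  static long unaccountedAtBuffer(
      long accepted, long enqueued, long overflow, long policy, long writerClosed) {
    return accepted - (enqueued + overflow + policy + writerClosed);
  }

  /** Enqueued traces not explained by any send-stage outcome. */
  static long unaccountedAtSend(long enqueued, long sent, long failed) {
    return enqueued - (sent + failed);
  }
}
```

A non-zero result is not necessarily a bug: traces still batched in `TraceProcessingWorker` when the test ends would show up as unaccounted at the send stage.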
| 102 | + |
| 103 | +## Metrics Available After Antithesis Testing |
| 104 | + |
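| 105 | +After running Antithesis tests, you will be able to calculate the metrics below. The formulas are in units of traces; the point 3 assertions carry a per-payload `trace_count`, so their trace-level totals should sum `trace_count` rather than count occurrences: |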
| 106 | + |
| 107 | +### Total Traces Processed |
| 108 | +``` |
| 109 | +Total = trace_accepted_by_sampling + trace_dropped_by_sampling |
| 110 | +``` |
| 111 | + |
| 112 | +### Total Traces Lost |
| 113 | +``` |
| 114 | +Lost = trace_dropped_by_sampling |
| 115 | + + trace_dropped_buffer_overflow |
| 116 | + + trace_dropped_by_policy |
| 117 | + + trace_dropped_writer_closed |
| 118 | + + traces_failed_to_send |
| 119 | +``` |
| 120 | + |
| 121 | +### Total Traces Successfully Sent |
| 122 | +``` |
| 123 | +Success = traces_sent_successfully |
| 124 | +``` |
| 125 | + |
| 126 | +### Loss Rate |
| 127 | +``` |
| 128 | +Loss Rate = (Total Traces Lost / Total Traces Processed) * 100% |
| 129 | +``` |
| 130 | + |
| 131 | +### Loss Breakdown by Cause |
| 132 | +- **Sampling Loss:** `trace_dropped_by_sampling / Total Traces Processed` |
| 133 | +- **Buffer Overflow Loss:** `trace_dropped_buffer_overflow / Total Traces Processed` |
| 134 | +- **Policy Loss:** `trace_dropped_by_policy / Total Traces Processed` |
| 135 | +- **Shutdown Loss:** `trace_dropped_writer_closed / Total Traces Processed` |
| 136 | +- **Send Failure Loss:** `traces_failed_to_send / Total Traces Processed` |
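
The formulas in this section translate directly into code. The following is a minimal sketch, assuming the six counts have already been extracted from the test report as trace totals; `TraceLossReport` is a hypothetical class, not part of dd-trace-java.

```java
/** Hypothetical container for the per-assertion trace totals. */
final class TraceLossReport {
  final long acceptedBySampling;
  final long droppedBySampling;
  final long droppedBufferOverflow;
  final long droppedByPolicy;
  final long droppedWriterClosed;
  final long failedToSend;

  TraceLossReport(long acceptedBySampling, long droppedBySampling,
      long droppedBufferOverflow, long droppedByPolicy,
      long droppedWriterClosed, long failedToSend) {
    this.acceptedBySampling = acceptedBySampling;
    this.droppedBySampling = droppedBySampling;
    this.droppedBufferOverflow = droppedBufferOverflow;
    this.droppedByPolicy = droppedByPolicy;
    this.droppedWriterClosed = droppedWriterClosed;
    this.failedToSend = failedToSend;
  }

  /** Total = accepted + dropped by sampling. */
  long totalProcessed() {
    return acceptedBySampling + droppedBySampling;
  }

  /** Lost = sum of every drop and failure counter. */
  long totalLost() {
    return droppedBySampling + droppedBufferOverflow + droppedByPolicy
        + droppedWriterClosed + failedToSend;
  }

  /** Share of total processed traces, as a percentage; 0 when nothing was processed. */
  double percentOfTotal(long count) {
    long total = totalProcessed();
    return total == 0 ? 0.0 : 100.0 * count / total;
  }

  /** Loss Rate = (Total Traces Lost / Total Traces Processed) * 100. */
  double lossRatePercent() {
    return percentOfTotal(totalLost());
  }
}
```

With 10,000 accepted, 90,000 dropped by sampling, and 50 dropped by buffer overflow, `totalProcessed()` is 100,000, `totalLost()` is 90,050, and `lossRatePercent()` is about 90.05. The guard against a zero total avoids a `NaN` result on an empty run.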
| 137 | + |
| 138 | +## Assertion Properties |
| 139 | + |
| 140 | +All assertions use `Assert.sometimes()` which means: |
| 141 | +- They track that the condition occurred at least once during testing |
| 142 | +- They provide detailed context about each occurrence |
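| 143 | +- A false evaluation does not fail the test; a `sometimes` property is reported as failing only if its condition is never true during the run (they're for tracking, not validation) |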
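
To make the "at least once" semantics concrete, here is a toy model of a sometimes-style property. It is not the Antithesis SDK and does not call it; `SometimesProperty` is a hypothetical class that only mimics the pass/fail rule.

```java
/** Toy model of a "sometimes" property; it is not the Antithesis SDK. */
final class SometimesProperty {
  private final String name;
  private long evaluations;
  private long timesTrue;

  SometimesProperty(String name) {
    this.name = name;
  }

  /** Records one evaluation; a false condition is counted but never fails anything. */
  void evaluate(boolean condition) {
    evaluations++;
    if (condition) {
      timesTrue++;
    }
  }

  /** The property holds once the condition has been true at least once. */
  boolean passed() {
    return timesTrue > 0;
  }

  String summary() {
    return name + ": " + timesTrue + "/" + evaluations + " true, "
        + (passed() ? "passed" : "failed");
  }
}
```

Under this model, a drop assertion such as `trace_dropped_buffer_overflow` that never fires with a true condition would be reported as unsatisfied, which signals that the test run never exercised that drop path.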
| 144 | + |
| 145 | +## Benefits of This Approach |
| 146 | + |
| 147 | +1. **Clear Tracking:** Each assertion has a unique, descriptive name |
| 148 | +2. **Complete Coverage:** Tracks the entire pipeline from sampling to agent |
| 149 | +3. **Detailed Context:** Captures relevant metadata at each point |
| 150 | +4. **Easy Analysis:** Simple math to calculate loss rates and breakdown |
| 151 | +5. **Actionable Data:** Identifies exactly where and why traces are lost |
| 152 | + |
| 153 | +## Example Analysis |
| 154 | + |
| 155 | +After an Antithesis test run, you might see: |
| 156 | + |
| 157 | +``` |
| 158 | +trace_accepted_by_sampling: 10,000 occurrences |
| 159 | +trace_dropped_by_sampling: 90,000 occurrences |
| 160 | +trace_enqueued_for_send: 9,950 occurrences |
| 161 | +trace_dropped_buffer_overflow: 50 occurrences |
| 162 | +traces_sent_successfully: 9,950 occurrences |
| 163 | +traces_failed_to_send: 0 occurrences |
| 164 | +``` |
| 165 | + |
| 166 | +**Analysis:** |
| 167 | +- Total traces: 100,000 |
| 168 | +- Sampling rate: 10% (10,000 accepted / 100,000 total) |
| 169 | +- Buffer overflow: 0.05% (50 / 100,000) |
| 170 | +- Delivery rate for accepted traces: 99.5% (9,950 / 10,000 accepted) |
| 171 | +- Overall success rate: 9.95% (9,950 / 100,000 total) |
| 172 | + |
| 173 | +**Conclusion:** |
| 174 | +- Sampling is working as expected (90% drop rate) |
| 175 | +- Very low buffer overflow (0.05%) |
| 176 | +- 99.5% of accepted traces reached the agent; the remaining 0.5% were lost to buffer overflow |
| 177 | +- No HTTP failures |
| 178 | + |
| 179 | +## Dependencies |
| 180 | + |
| 181 | +- **Antithesis SDK:** `com.antithesis:sdk:1.4.5` (already configured in `dd-trace-core/build.gradle`) |
| 182 | +- The SDK is bundled in the tracer JAR and has minimal performance impact in production |
| 183 | + |
| 184 | +## Running Antithesis Tests |
| 185 | + |
| 186 | +Contact the Antithesis team or refer to their documentation for running tests with these assertions enabled. |
| 187 | + |
| 188 | +## Future Enhancements |
| 189 | + |
| 190 | +Potential improvements: |
| 191 | +1. Add `Assert.always()` for critical paths that should never fail |
| 192 | +2. Add `Assert.unreachable()` for error paths that should never occur |
| 193 | +3. Track additional metadata (e.g., service names, operation names) |
| 194 | +4. Add time-based metrics (latency, throughput) |
| 195 | + |