Commit 136a766

Add documentation for trace loss tracking implementation
1 parent 8d81f30 commit 136a766

TRACE_LOSS_TRACKING.md

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# Trace Loss Tracking with Antithesis Assertions

## Overview

This document describes the simplified Antithesis assertion strategy used to track trace loss in dd-trace-java.

## Implementation

Assertions were added at three points in the trace pipeline to show where and why traces are lost:

### 1. CoreTracer.write() - Sampling Decision Point

**Location:** `dd-trace-core/src/main/java/datadog/trace/core/CoreTracer.java`

**Purpose:** Track traces at the sampling decision point

**Assertions:**
- `trace_accepted_by_sampling` - Traces that passed sampling and will be sent
- `trace_dropped_by_sampling` - Traces dropped due to sampling decision

**Data Captured:**
- `decision`: "accepted" or "dropped_sampling"
- `trace_id`: Unique trace identifier
- `span_count`: Number of spans in the trace
- `sampling_priority`: Sampling priority value

### 2. RemoteWriter.write() - Buffer Acceptance Point

**Location:** `dd-trace-core/src/main/java/datadog/trace/common/writer/RemoteWriter.java`

**Purpose:** Track traces at buffer acceptance and detect drops due to overflow or policy

**Assertions:**
- `trace_enqueued_for_send` - Traces successfully enqueued for serialization
- `trace_dropped_buffer_overflow` - Traces dropped due to full buffer
- `trace_dropped_by_policy` - Traces dropped by policy rules
- `trace_dropped_writer_closed` - Traces dropped during shutdown

**Data Captured:**
- `decision`: "enqueued", "dropped_buffer_overflow", "dropped_policy", or "dropped_shutdown"
- `trace_id`: Unique trace identifier (when available)
- `span_count`: Number of spans in the trace
- `sampling_priority`: Sampling priority value (when available)
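
The pairing of assertion names, `decision` values, and optional fields listed for this point can be sketched as below. This is a minimal illustration, not the code in `RemoteWriter.java`: `WriteOutcome` and `WriterDetails` are hypothetical names, and a plain `Map` stands in for whatever detail object the assertion call receives.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical outcomes at Point 2; names and strings follow the lists above. */
enum WriteOutcome {
  ENQUEUED("trace_enqueued_for_send", "enqueued"),
  BUFFER_OVERFLOW("trace_dropped_buffer_overflow", "dropped_buffer_overflow"),
  POLICY("trace_dropped_by_policy", "dropped_policy"),
  WRITER_CLOSED("trace_dropped_writer_closed", "dropped_shutdown");

  final String assertionName; // name of the assertion to record
  final String decision;      // value of the `decision` detail

  WriteOutcome(String assertionName, String decision) {
    this.assertionName = assertionName;
    this.decision = decision;
  }
}

final class WriterDetails {
  private WriterDetails() {}

  /**
   * Builds the detail map for one write() call. traceId and samplingPriority
   * may be null, matching the "(when available)" fields, and are then omitted.
   */
  static Map<String, Object> build(
      WriteOutcome outcome, String traceId, int spanCount, Integer samplingPriority) {
    Map<String, Object> details = new LinkedHashMap<>();
    details.put("decision", outcome.decision);
    if (traceId != null) {
      details.put("trace_id", traceId);
    }
    details.put("span_count", spanCount);
    if (samplingPriority != null) {
      details.put("sampling_priority", samplingPriority);
    }
    return details;
  }
}
```

Keeping the name and the `decision` string in one enum means the two cannot drift apart, which matters when the counts are later joined by name.
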

### 3. PayloadDispatcherImpl.accept() - HTTP Send Point

**Location:** `dd-trace-core/src/main/java/datadog/trace/common/writer/PayloadDispatcherImpl.java`

**Purpose:** Track actual HTTP sends to the agent and detect failures

**Assertions:**
- `trace_payloads_being_sent` - All send attempts (before HTTP call)
- `traces_sent_successfully` - Traces successfully sent to agent
- `traces_failed_to_send` - Traces that failed to send via HTTP

**Data Captured:**
- `decision`: "sent_success" or "dropped_send_failed"
- `trace_count`: Number of traces in the payload
- `payload_size_bytes`: Size of the payload in bytes
- `http_status`: HTTP response status code
- `dropped_traces_in_payload`: Count of traces already dropped before this send
- `dropped_spans_in_payload`: Count of spans already dropped before this send
- `has_exception`: Whether an exception occurred (for failures)
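
Because each event at this point describes a payload that carries `trace_count` traces, turning these events into trace totals means summing `trace_count` rather than counting events. A minimal sketch, assuming the recorded details have been exported into simple objects; `PayloadEvent` and `PayloadTotals` are hypothetical names used only for illustration:

```java
import java.util.List;

/** One recorded Point 3 event; fields mirror the `decision` and `trace_count` details. */
final class PayloadEvent {
  final String decision; // "sent_success" or "dropped_send_failed"
  final int traceCount;  // number of traces in the payload

  PayloadEvent(String decision, int traceCount) {
    this.decision = decision;
    this.traceCount = traceCount;
  }
}

final class PayloadTotals {
  private PayloadTotals() {}

  /** Sums trace_count over all events whose decision matches. */
  static long tracesWithDecision(List<PayloadEvent> events, String decision) {
    long total = 0;
    for (PayloadEvent event : events) {
      if (decision.equals(event.decision)) {
        total += event.traceCount;
      }
    }
    return total;
  }
}
```

For example, two successful payloads of 40 and 60 traces count as 100 traces sent, not 2.
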

## Complete Trace Flow

```
Application → CoreTracer.write()
    ↓
[ASSERTION POINT 1: Sampling]
    ↓                              ↓
published=true                 published=false
    ↓                              ↓
✅ trace_accepted_by_sampling   ❌ trace_dropped_by_sampling
    ↓
RemoteWriter.write()
    ↓
[ASSERTION POINT 2: Buffer Acceptance]
    ↓
traceProcessingWorker.publish()
    ↓
✅ trace_enqueued_for_send
   OR
❌ trace_dropped_buffer_overflow
❌ trace_dropped_by_policy
❌ trace_dropped_writer_closed
    ↓
TraceProcessingWorker (batching)
    ↓
PayloadDispatcherImpl.accept()
    ↓
[ASSERTION POINT 3: HTTP Send]
    ↓
🔵 trace_payloads_being_sent
    ↓
api.sendSerializedTraces()
    ↓                              ↓
response.success()             !response.success()
    ↓                              ↓
✅ traces_sent_successfully     ❌ traces_failed_to_send
```
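
If the flow above is complete, every trace accepted at Point 1 should end in exactly one Point 2 outcome, and every enqueued trace should eventually be sent or fail at Point 3. The sketch below checks that bookkeeping. It is an illustration under that assumption only: `PipelineReconciliation` is a hypothetical helper, all inputs are trace counts (not payload counts), and traces still in flight when a run ends will show up as a positive remainder.

```java
/** Hypothetical consistency check across the three assertion points. */
final class PipelineReconciliation {
  private PipelineReconciliation() {}

  /** Accepted traces that no Point 2 assertion accounts for (0 when the stages balance). */
  static long unaccountedAtWriter(
      long accepted, long enqueued, long droppedOverflow, long droppedPolicy, long droppedClosed) {
    return accepted - (enqueued + droppedOverflow + droppedPolicy + droppedClosed);
  }

  /** Enqueued traces that Point 3 reported neither as sent nor as failed. */
  static long unaccountedAtDispatch(long enqueued, long sent, long failed) {
    return enqueued - (sent + failed);
  }
}
```

A non-zero result does not say which counter is wrong; it only signals that some path drops or duplicates traces without passing through one of the listed assertions.
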

## Metrics Available After Antithesis Testing

After running Antithesis tests, you can calculate the following from the assertion counts. The Point 3 assertions are recorded per payload rather than per trace, so wherever `traces_sent_successfully` or `traces_failed_to_send` is combined with per-trace counts, use the sum of their `trace_count` values.

### Total Traces Processed
```
Total = trace_accepted_by_sampling + trace_dropped_by_sampling
```

### Total Traces Lost
```
Lost = trace_dropped_by_sampling
     + trace_dropped_buffer_overflow
     + trace_dropped_by_policy
     + trace_dropped_writer_closed
     + traces_failed_to_send
```

### Total Traces Successfully Sent
```
Success = traces_sent_successfully
```

### Loss Rate
```
Loss Rate = (Total Traces Lost / Total Traces Processed) * 100%
```

### Loss Breakdown by Cause
- **Sampling Loss:** `trace_dropped_by_sampling / Total Traces Processed`
- **Buffer Overflow Loss:** `trace_dropped_buffer_overflow / Total Traces Processed`
- **Policy Loss:** `trace_dropped_by_policy / Total Traces Processed`
- **Shutdown Loss:** `trace_dropped_writer_closed / Total Traces Processed`
- **Send Failure Loss:** `traces_failed_to_send / Total Traces Processed`
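
The formulas in this section can be applied mechanically once the counts are collected into a map keyed by assertion name. The following is a sketch under that assumption; `LossMetrics` is a hypothetical helper, missing keys are treated as zero, and all values are expected to be trace counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper applying the loss formulas to counts keyed by assertion name. */
final class LossMetrics {
  private static final String[] LOSS_KEYS = {
    "trace_dropped_by_sampling",
    "trace_dropped_buffer_overflow",
    "trace_dropped_by_policy",
    "trace_dropped_writer_closed",
    "traces_failed_to_send"
  };

  private LossMetrics() {}

  private static long count(Map<String, Long> counts, String key) {
    Long value = counts.get(key);
    return value == null ? 0L : value;
  }

  /** Total = accepted + dropped by sampling. */
  static long total(Map<String, Long> counts) {
    return count(counts, "trace_accepted_by_sampling")
        + count(counts, "trace_dropped_by_sampling");
  }

  /** Lost = sum of the five drop counters. */
  static long lost(Map<String, Long> counts) {
    long sum = 0;
    for (String key : LOSS_KEYS) {
      sum += count(counts, key);
    }
    return sum;
  }

  /** Loss rate in percent; 0 when nothing was processed. */
  static double lossRatePercent(Map<String, Long> counts) {
    long total = total(counts);
    return total == 0 ? 0.0 : 100.0 * lost(counts) / total;
  }

  /** Per-cause loss in percent of total traces processed. */
  static Map<String, Double> breakdownPercent(Map<String, Long> counts) {
    long total = total(counts);
    Map<String, Double> result = new LinkedHashMap<>();
    for (String key : LOSS_KEYS) {
      result.put(key, total == 0 ? 0.0 : 100.0 * count(counts, key) / total);
    }
    return result;
  }
}
```

Guarding the zero-total case avoids a `NaN` loss rate for runs in which no trace reached the sampling point.
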

## Assertion Properties

All assertions use `Assert.sometimes()`, which means:
- They record whether the condition held at least once during testing
- They attach detailed context to each occurrence
- They don't fail the test when the condition is false on a given call (they're for tracking, not validation)
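
The "at least once" behaviour described above can be illustrated with a small stand-in. This is not the Antithesis SDK and makes no claim about its internals; `SometimesTracker` is a hypothetical class that only models the semantics: false evaluations are harmless, and a name counts as satisfied once its condition has been true a single time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in modelling "sometimes" semantics; hypothetical, not the Antithesis SDK. */
final class SometimesTracker {
  private final Map<String, Boolean> everTrue = new LinkedHashMap<>();

  /** Records one evaluation. Never throws, whatever the condition is. */
  void sometimes(boolean condition, String name) {
    boolean seenBefore = Boolean.TRUE.equals(everTrue.get(name));
    everTrue.put(name, seenBefore || condition);
  }

  /** True once the named condition has held at least once. */
  boolean satisfied(String name) {
    return Boolean.TRUE.equals(everTrue.get(name));
  }
}
```

Under this model, a drop assertion such as `trace_dropped_buffer_overflow` that is never satisfied simply means the run never exercised that path.
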

## Benefits of This Approach

1. **Clear Tracking:** Each assertion has a unique, descriptive name
2. **Complete Coverage:** Tracks the entire pipeline from sampling to agent
3. **Detailed Context:** Captures relevant metadata at each point
4. **Easy Analysis:** Simple math to calculate loss rates and breakdown
5. **Actionable Data:** Identifies exactly where and why traces are lost

## Example Analysis

After an Antithesis test run, you might see:

```
trace_accepted_by_sampling: 10,000 occurrences
trace_dropped_by_sampling: 90,000 occurrences
trace_enqueued_for_send: 9,950 occurrences
trace_dropped_buffer_overflow: 50 occurrences
traces_sent_successfully: 9,950 occurrences
traces_failed_to_send: 0 occurrences
```

**Analysis:**
- Total traces: 100,000
- Sampling rate: 10% (10,000 accepted / 100,000 total)
- Buffer overflow: 0.05% (50 / 100,000)
- Delivery rate for accepted traces: 99.5% (9,950 / 10,000 accepted)
- Overall success rate: 9.95% (9,950 / 100,000 total)

**Conclusion:**
- Sampling accounts for almost all loss (90% drop rate), consistent with a 10% sampling rate
- Very low buffer overflow (0.05% of all traces, 0.5% of accepted traces)
- 99.5% of accepted traces reach the agent
- No HTTP failures

## Dependencies

- **Antithesis SDK:** `com.antithesis:sdk:1.4.5` (already configured in `dd-trace-core/build.gradle`)
- The SDK is bundled in the tracer JAR and has minimal performance impact in production

## Running Antithesis Tests

Contact the Antithesis team or refer to their documentation for running tests with these assertions enabled.

## Future Enhancements

Potential improvements:
1. Add `Assert.always()` for critical paths that should never fail
2. Add `Assert.unreachable()` for error paths that should never occur
3. Track additional metadata (e.g., service names, operation names)
4. Add time-based metrics (latency, throughput)
