Commit e528576

felixbarny, AlexanderWert, basepi, trentm, and eyalkoren authored
Handling huge tracing specs (#453)
* First draft of handling huge tracing specs
* Apply suggestions from code review
  Co-authored-by: Alexander Wert <[email protected]>
* Implement suggestions
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: Alexander Wert <[email protected]>
* Pseudo code for how the strategies work in combination
* Add composite.exact_match flag
* Apply suggestions from code review
  Co-authored-by: Colton Myers <[email protected]>
* Add breadcrumbs
* Add missing table of contents link to AWS tracing spec file
* Some clarifications for the destination APIs (#452)
* Add limit to dropped_spans_stats
* Add implementation section to transaction_max_spans
* Move exit span definition from destination spec to span spec
* Add exit_span_min_duration spec
* Apply suggestions from code review
  Co-authored-by: Sergey Kleyman <[email protected]>
* Fix links, add clarification to max duration
* Dropping fast spans requires stats
* Rework transaction_max_spans implementation logic
* Improve transaction_max_spans: no CAS
* Apply suggestions from code review
  Co-authored-by: Sergey Kleyman <[email protected]>
* Update specs/agents/tracing-spans-compress.md
* Update specs/agents/tracing-spans-compress.md
* Update specs/agents/tracing-spans-handling-huge-traces.md
  Co-authored-by: Trent Mick <[email protected]>
* Renamed same_kind_compression_max_duration config option to span_compression_same_kind_max_duration
* Added span_compression_same_kind_max_duration config option
* Added span_compression_enabled config option
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: eyalkoren <[email protected]>
* Changed end to sum.us in composite sub-object
* Replaced exact_match bool with compression_strategy enum
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: eyalkoren <[email protected]>
* Added outcome requirement to eligible for compression
* Added outcome requirement to eligible for compression PART 2
  Updated isCompressionEligible() pseudo-code
* Added links from tracing-spans.md to tracing-spans-compress.md
* Fixed missing isSameKind check in tryToCompressComposite()
* Update specs/agents/tracing-spans-drop-fast-exit.md
  Co-authored-by: Alexander Wert <[email protected]>
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: Alexander Wert <[email protected]>
* Update specs/agents/tracing-spans-drop-fast-exit.md
  Co-authored-by: Alexander Wert <[email protected]>
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: Alexander Wert <[email protected]>
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: Alexander Wert <[email protected]>
* Update specs/agents/tracing-spans-compress.md
  Co-authored-by: Alexander Wert <[email protected]>
* Removed "Exit span API" requirement from tracing-spans.md
* Update specs/agents/tracing-spans-drop-fast-exit.md
  Co-authored-by: eyalkoren <[email protected]>
* Refactored file structure for handling huge traces
* Update specs/agents/tracing-spans-destination.md
* Update specs/agents/tracing-spans.md
* Update specs/agents/tracing-spans.md

Co-authored-by: Alexander Wert <[email protected]>
Co-authored-by: Colton Myers <[email protected]>
Co-authored-by: Trent Mick <[email protected]>
Co-authored-by: eyalkoren <[email protected]>
Co-authored-by: Sergey Kleyman <[email protected]>
Co-authored-by: Trent Mick <[email protected]>
Co-authored-by: Alexander Wert <[email protected]>
1 parent b338fe9 commit e528576

8 files changed: +551 -28 lines changed

specs/agents/README.md

Lines changed: 5 additions & 0 deletions
@@ -40,6 +40,11 @@ You can find details about each of these in the [APM Data Model](https://www.ela
 - [Transactions](tracing-transactions.md)
 - [Spans](tracing-spans.md)
 - [Span destination](tracing-spans-destination.md)
+- [Handling huge traces](handling-huge-traces/tracing-spans-handling-huge-traces.md)
+- [Hard limit on number of spans to collect](handling-huge-traces/tracing-spans-limit.md)
+- [Collecting statistics about dropped spans](handling-huge-traces/tracing-spans-dropped-stats.md)
+- [Dropping fast exit spans](handling-huge-traces/tracing-spans-drop-fast-exit.md)
+- [Compressing spans](handling-huge-traces/tracing-spans-compress.md)
 - [Sampling](tracing-sampling.md)
 - [Distributed tracing](tracing-distributed-tracing.md)
 - [Tracer API](tracing-api.md)
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Handling huge traces

Instrumenting applications that make a large number of requests (for example, 10k+) to backends like caches or databases can lead to several issues:
- A significant performance impact on the target application.
  For example, due to a high allocation rate, network traffic, garbage collection, and additional CPU cycles for serializing, compressing, and sending spans.
- Dropping of events in agents or APM Server due to exhausted queues.
- High load on the APM Server.
- High storage costs.
- Decreased performance of the Elastic APM UI due to slow searches and rendering of huge traces.
- Loss of clarity and overview in the UI when analyzing the traces, which degrades the user experience.

Agents can implement several strategies to mitigate these issues.
These strategies are designed to capture significant information about relevant spans while at the same time limiting the trace to a manageable size.
Applying any of these strategies inevitably leads to a loss of information.
However, they aim to provide a better tradeoff between cost and insight by not capturing, or by summarizing, less relevant data.

- [Hard limit on number of spans to collect](tracing-spans-limit.md) \
  Even after applying the most advanced strategies, there must always be a hard limit on the number of spans we collect.
  This is the last line of defense and comes with the highest amount of data loss.
- [Collecting statistics about dropped spans](tracing-spans-dropped-stats.md) \
  Ensures that even when spans are dropped, we at least have statistics about them.
- [Dropping fast exit spans](tracing-spans-drop-fast-exit.md) \
  If a span was extremely fast, it's probably not worth the cost to send and store it.
- [Compressing spans](tracing-spans-compress.md) \
  If there are many very similar spans, we can represent them in a single document: a composite span.

In a nutshell, this is how the different settings work in combination:

```java
if (span.transaction.spanCount > transaction_max_spans) {
    // drop span
    // collect statistics for dropped spans
} else if (compression possible) {
    // apply compression
} else if (span.duration < exit_span_min_duration) {
    // drop span
    // collect statistics for dropped spans
} else {
    // report span
}
```
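
The decision order above can also be written as a small runnable sketch. This is only an illustration under stated assumptions: the `SpanDecisionSketch` class, the `Decision` enum, and the `decide` parameters are hypothetical names, and `compressionPossible` stands in for the eligibility and strategy checks described in the span compression spec.

```java
import java.time.Duration;

final class SpanDecisionSketch {
    // Hypothetical outcome labels; the spec only describes the actions in comments.
    enum Decision { DROP_AND_COUNT, COMPRESS, REPORT }

    // All parameters are illustrative; real agents read these values from
    // their own span, transaction, and configuration objects.
    static Decision decide(int spanCount, int transactionMaxSpans,
                           boolean compressionPossible,
                           Duration spanDuration, Duration exitSpanMinDuration) {
        if (spanCount > transactionMaxSpans) {
            // Hard limit reached: drop the span, but keep statistics about it.
            return Decision.DROP_AND_COUNT;
        } else if (compressionPossible) {
            return Decision.COMPRESS;
        } else if (spanDuration.compareTo(exitSpanMinDuration) < 0) {
            // Too fast to be worth reporting: drop it, but keep statistics.
            return Decision.DROP_AND_COUNT;
        } else {
            return Decision.REPORT;
        }
    }
}
```

For instance, with `transaction_max_spans` at 500, the 501st span yields `DROP_AND_COUNT` even when it could have been compressed, because the hard limit is checked first.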
Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
# Compressing spans

To mitigate the potential flood of spans to a backend,
agents SHOULD implement the strategies laid out in this section to avoid sending almost identical and very similar spans.

While compressing multiple similar spans into a single composite span can't fully eliminate the collection overhead,
it can significantly reduce the impact on the following areas,
with very little loss of information:
- Agent reporter queue utilization
- Capturing stack traces, serialization, compression, and sending events to APM Server
- Potential to re-use span objects, significantly reducing allocations
- Downstream effects like reducing impact on APM Server, ES storage, and UI performance

### Configuration option `span_compression_enabled`

Setting this option to `true` enables the span compression feature.
Span compression reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that some information, such as the DB statements of all the compressed spans, will not be collected.

|                |           |
|----------------|-----------|
| Type           | `boolean` |
| Default        | `false`   |
| Dynamic        | `true`    |


## Consecutive-Exact-Match compression strategy

One of the biggest sources of excessive data collection is n+1 type queries and repetitive requests to a cache server.
This strategy detects consecutive spans that hold the same information (except for the duration)
and creates a single [composite span](#composite-span).

```
[                              ]
GET /users
 [] [] [] [] [] [] [] [] [] []
 10x SELECT FROM users
```

Two spans are considered to be an exact match if they are of the [same kind](#consecutive-same-kind-compression-strategy) and their span names are equal, that is, if all of the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`
- `name`

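
The property list above can be expressed as a pair of predicates. The following is a minimal sketch, assuming a hypothetical `SpanInfo` holder that models only the four compared fields (with `resource` standing in for `destination.service.resource`); it is not taken from any agent's code base.

```java
import java.util.Objects;

final class ExactMatchSketch {
    // Hypothetical stand-in for a span; only the compared properties are modelled.
    static final class SpanInfo {
        final String type;
        final String subtype;
        final String resource; // destination.service.resource
        final String name;

        SpanInfo(String type, String subtype, String resource, String name) {
            this.type = type;
            this.subtype = subtype;
            this.resource = resource;
            this.name = name;
        }
    }

    // Same kind: type, subtype, and destination resource are equal (null-safe).
    static boolean isSameKind(SpanInfo a, SpanInfo b) {
        return Objects.equals(a.type, b.type)
            && Objects.equals(a.subtype, b.subtype)
            && Objects.equals(a.resource, b.resource);
    }

    // Exact match: same kind and, in addition, equal span names.
    static boolean isExactMatch(SpanInfo a, SpanInfo b) {
        return isSameKind(a, b) && Objects.equals(a.name, b.name);
    }
}
```

Under these assumptions, two `SELECT FROM users` spans against the same MySQL resource are an exact match, whereas `SELECT FROM users` and `SELECT FROM orders` are only of the same kind.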

### Configuration option `span_compression_exact_match_max_duration`

Consecutive spans that are an exact match and whose duration does not exceed this threshold will be compressed into a single composite span.
This option does not apply to [composite spans](#composite-span).
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the DB statements of all the compressed spans will not be collected.

|                |            |
|----------------|------------|
| Type           | `duration` |
| Default        | `5ms`      |
| Dynamic        | `true`     |

## Consecutive-Same-Kind compression strategy

Another pattern that often occurs is a large number of alternating queries to the same backend.
Especially if the individual spans are quite fast, recording every single query is unlikely to be worth the overhead.

```
[                              ]
GET /users
 [] [] [] [] [] [] [] [] [] []
 10x Calls to mysql
```

Two spans are considered to be of the same kind if the following properties are equal:
- `type`
- `subtype`
- `destination.service.resource`

```java
// Null-safe comparison via java.util.Objects.equals; `==` would compare object references in Java.
boolean isSameKind(Span other) {
    return Objects.equals(type, other.type)
        && Objects.equals(subtype, other.subtype)
        && Objects.equals(destination.service.resource, other.destination.service.resource);
}
```

When applying this compression strategy, the `span.name` is set to `Calls to $span.destination.service.resource`.
The rest of the context, such as the `db.statement`, will be determined by the first compressed span, which is turned into a composite span.

### Configuration option `span_compression_same_kind_max_duration`

Consecutive spans to the same destination whose duration does not exceed this threshold will be compressed into a single composite span.
This option does not apply to [composite spans](#composite-span).
This reduces the collection, processing, and storage overhead, and removes clutter from the UI.
The tradeoff is that the DB statements of all the compressed spans will not be collected.

|                |            |
|----------------|------------|
| Type           | `duration` |
| Default        | `5ms`      |
| Dynamic        | `true`     |


## Composite span

Compressed spans don't have a physical span document.
Instead, multiple compressed spans are represented by a composite span.

### Data model

For composite spans, `timestamp` and `duration` have slightly different semantics than for regular spans,
and additional properties are defined under the `composite` context.

- `timestamp`: The start timestamp of the first span.
- `duration`: The gross duration (i.e., _<last compressed span's end timestamp>_ - _<first compressed span's start timestamp>_).
- `composite`
    - `count`: The number of compressed spans this composite span represents.
      The minimum count is 2, as a composite span represents at least two spans.
    - `sum.us`: The sum of the durations of all compressed spans this composite span represents, in microseconds.
      Thus `sum.us` is the net duration of all the compressed spans, while `duration` is the gross duration (including "whitespace" between the spans).
    - `compression_strategy`: A string value indicating which compression strategy was used. The valid values are:
        - `exact_match` - [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy)
        - `same_kind` - [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy)

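
To make the difference between the gross `duration` and the net `sum.us` concrete, here is a small worked sketch. The `CompositeSketch` class, its `Composite` holder, and the `{startUs, endUs}` input shape are assumptions made for illustration only.

```java
final class CompositeSketch {
    // Hypothetical holder for the composite-related values described above.
    static final class Composite {
        long timestampUs; // start timestamp of the first span
        long durationUs;  // gross duration
        int count;        // number of compressed spans
        long sumUs;       // net duration (composite.sum.us)
    }

    // spans[i] = {startUs, endUs}; the array is assumed to be ordered by start time.
    static Composite summarize(long[][] spans) {
        if (spans.length < 2) {
            throw new IllegalArgumentException("a composite span represents at least two spans");
        }
        Composite c = new Composite();
        c.timestampUs = spans[0][0];
        // Gross duration: end of the last span minus start of the first span.
        c.durationUs = spans[spans.length - 1][1] - spans[0][0];
        c.count = spans.length;
        // Net duration: the sum of the individual span durations.
        for (long[] span : spans) {
            c.sumUs += span[1] - span[0];
        }
        return c;
    }
}
```

For three spans covering 0 to 1000, 1500 to 2500, and 3000 to 3800 microseconds, this gives a `count` of 3, a `sum.us` of 2800, and a `duration` of 3800. The difference of 1000 microseconds is the "whitespace" between the spans.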

### Effects on metric processing

As laid out in the [span destination spec](tracing-spans-destination.md#contextdestinationserviceresource),
APM Server tracks span destination metrics.
To prevent compressed spans from skewing latency metrics and causing throughput metrics to be under-counted,
APM Server will take `composite.count` into account when tracking span destination metrics.

## Compression algorithm

### Eligibility for compression

A span is eligible for compression if all of the following conditions are met:
1. It's an [exit span](tracing-spans.md#exit-spans).
2. The trace context of this span has not been propagated to a downstream service.
3. If the span has an `outcome` (i.e., `outcome` is present and it's not `null`), then it should be `success`.
   This means that spans whose outcome indicates an issue of potential interest should not be compressed.

The second condition is important so that we don't remove (compress) a span that may be the parent of a downstream service.
This would orphan the sub-graph started by the downstream service and cause it to not appear in the waterfall view.

```java
boolean isCompressionEligible() {
    // "success".equals(...) compares string contents; `==` would compare references in Java.
    return exit && !context.hasPropagated && (outcome == null || "success".equals(outcome));
}
```

### Span buffering

Non-compression-eligible spans may be reported immediately after they have ended.
When a compression-eligible span ends, it does not immediately get reported.
Instead, the span is buffered within its parent.
A span/transaction can buffer at most one child span.

Span buffering makes it possible to "look back" one span when determining whether a given span should be compressed.

A buffered span gets reported when either
1. its parent ends, or
2. a non-compressible sibling ends.

```java
void onEnd() {
    // The parent ends: flush whatever is still buffered.
    if (buffered != null) {
        report(buffered);
    }
}

void onChildEnd(Span child) {
    if (!child.isCompressionEligible()) {
        // A non-compressible sibling ends the current sequence.
        if (buffered != null) {
            report(buffered);
            buffered = null;
        }
        report(child);
        return;
    }

    if (buffered == null) {
        buffered = child;
        return;
    }

    if (!buffered.tryToCompress(child)) {
        report(buffered);
        buffered = child;
    }
}
```

### Turning compressed spans into a composite span

Spans have a `tryToCompress` method that is called on a span buffered by its parent.
On the first call, the span checks whether it can be compressed with the given sibling and selects the best compression strategy.
Note that the compression strategy is selected only once, based on the first two spans of the sequence.
The compression strategy cannot be changed by the rest of the spans in the sequence.
So when the current sibling span cannot be added to the ongoing sequence under the selected compression strategy,
the ongoing sequence is terminated, it is sent out as a composite span, and the current sibling span is buffered.

If the spans are of the same kind, have the same name, and both spans' `duration` <= `span_compression_exact_match_max_duration`,
we apply the [Consecutive-Exact-Match compression strategy](tracing-spans-compress.md#consecutive-exact-match-compression-strategy).
Note that if the spans are an _exact match_
but the duration threshold requirement is not satisfied, we just stop the compression sequence.
In particular, this means that the implementation should not proceed to try the _same kind_ strategy.
Otherwise, users would have to lower both `span_compression_exact_match_max_duration` and `span_compression_same_kind_max_duration`
to prevent longer _exact match_ spans from being compressed.

If the spans are of the same kind but have different span names, and both spans' `duration` <= `span_compression_same_kind_max_duration`,
we compress them using the [Consecutive-Same-Kind compression strategy](tracing-spans-compress.md#consecutive-same-kind-compression-strategy).

```java
boolean tryToCompress(Span sibling) {
    boolean isAlreadyComposite = composite != null;
    boolean canBeCompressed = isAlreadyComposite ? tryToCompressComposite(sibling) : tryToCompressRegular(sibling);
    if (!canBeCompressed) {
        return false;
    }

    if (!isAlreadyComposite) {
        // First compression: this span itself counts as the first compressed span.
        composite.count = 1;
        composite.sumUs = duration;
    }

    ++composite.count;
    composite.sumUs += sibling.duration;
    return true;
}

boolean tryToCompressRegular(Span sibling) {
    if (!isSameKind(sibling)) {
        return false;
    }

    if (name.equals(sibling.name)) {
        if (duration <= span_compression_exact_match_max_duration && sibling.duration <= span_compression_exact_match_max_duration) {
            composite.compressionStrategy = "exact_match";
            return true;
        }
        // Exact match but too slow: stop here, do not fall through to same_kind.
        return false;
    }

    if (duration <= span_compression_same_kind_max_duration && sibling.duration <= span_compression_same_kind_max_duration) {
        composite.compressionStrategy = "same_kind";
        name = "Calls to " + destination.service.resource;
        return true;
    }

    return false;
}

boolean tryToCompressComposite(Span sibling) {
    switch (composite.compressionStrategy) {
        case "exact_match":
            return isSameKind(sibling) && name.equals(sibling.name) && sibling.duration <= span_compression_exact_match_max_duration;

        case "same_kind":
            return isSameKind(sibling) && sibling.duration <= span_compression_same_kind_max_duration;

        default:
            return false;
    }
}
```
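
The buffering and compression pseudo-code can be combined into a self-contained sketch that compiles and runs. It rests on several assumptions that are not part of the spec: the `CompressionSketch`, `Span`, and `Parent` classes are hypothetical, `eligible` replaces the call to `isCompressionEligible()`, the `reported` list replaces `report(span)`, the thresholds are hard-coded to the 5ms defaults, and the composite's gross `duration` is updated on every compression as implied by the data model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class CompressionSketch {
    // Thresholds in microseconds; both mirror the 5ms defaults from this spec.
    static final long EXACT_MATCH_MAX_US = 5_000;
    static final long SAME_KIND_MAX_US = 5_000;

    static final class Span {
        String name;
        final String type, subtype, resource;
        final long startUs;
        long durationUs;
        final boolean eligible; // stands in for isCompressionEligible()
        String strategy;        // null until this span becomes a composite
        int count;
        long sumUs;

        Span(String name, String type, String subtype, String resource,
             long startUs, long durationUs, boolean eligible) {
            this.name = name; this.type = type; this.subtype = subtype; this.resource = resource;
            this.startUs = startUs; this.durationUs = durationUs; this.eligible = eligible;
        }

        boolean isSameKind(Span o) {
            return Objects.equals(type, o.type) && Objects.equals(subtype, o.subtype)
                && Objects.equals(resource, o.resource);
        }

        boolean tryToCompress(Span sibling) {
            if (strategy == null) {
                // First call: the strategy is chosen once, from the first two spans.
                String selected = selectStrategy(sibling);
                if (selected == null) return false;
                strategy = selected;
                count = 1;
                sumUs = durationUs;
                if (selected.equals("same_kind")) name = "Calls to " + resource;
            } else if (!fitsComposite(sibling)) {
                return false;
            }
            count++;
            sumUs += sibling.durationUs;
            // Assumption: keep the gross duration up to date, following the data model.
            durationUs = sibling.startUs + sibling.durationUs - startUs;
            return true;
        }

        private String selectStrategy(Span s) {
            if (!isSameKind(s)) return null;
            if (Objects.equals(name, s.name)) {
                // A too-slow exact match ends the sequence; same_kind is deliberately not tried.
                return durationUs <= EXACT_MATCH_MAX_US && s.durationUs <= EXACT_MATCH_MAX_US
                    ? "exact_match" : null;
            }
            return durationUs <= SAME_KIND_MAX_US && s.durationUs <= SAME_KIND_MAX_US
                ? "same_kind" : null;
        }

        private boolean fitsComposite(Span s) {
            if (!isSameKind(s)) return false;
            return strategy.equals("exact_match")
                ? Objects.equals(name, s.name) && s.durationUs <= EXACT_MATCH_MAX_US
                : s.durationUs <= SAME_KIND_MAX_US;
        }
    }

    static final class Parent {
        Span buffered;
        final List<Span> reported = new ArrayList<>(); // stands in for report(span)

        void onChildEnd(Span child) {
            if (!child.eligible) {
                flush();
                reported.add(child);
            } else if (buffered == null) {
                buffered = child;
            } else if (!buffered.tryToCompress(child)) {
                reported.add(buffered);
                buffered = child;
            }
        }

        void onEnd() { flush(); }

        private void flush() {
            if (buffered != null) {
                reported.add(buffered);
                buffered = null;
            }
        }
    }
}
```

Feeding a `Parent` three eligible `SELECT FROM users` spans of 1000 microseconds each, followed by one non-eligible span, results in two reported spans: one `exact_match` composite with a `count` of 3, and the non-eligible span itself. Like the pseudo-code, this sketch is single-threaded.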

### Concurrency

The pseudo-code in this spec is intentionally not written in a thread-safe manner to make it more concise.
Also, thread safety is highly platform/runtime dependent, and some platforms don't support parallelism or concurrency.

However, if multiple spans may end concurrently, agents MUST guard against race conditions.
To do that, agents should prefer [lock-free algorithms](https://en.wikipedia.org/wiki/Non-blocking_algorithm)
paired with retry loops over blocking algorithms that use mutexes or locks.

In particular, operations that work with the buffer require special attention:
- Setting a span into the buffer must be handled atomically.
- Retrieving a span from the buffer must be handled atomically.
  Retrieving includes atomically getting and clearing the buffer.
  This makes sure that only one thread at a time can compare span properties and call mutating methods, such as `tryToCompress`.

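
On the JVM, one way to meet the two buffer requirements above is `java.util.concurrent.atomic.AtomicReference`. The following is only a sketch of the buffer primitive: the `AtomicBufferSketch` class and its `take` and `offer` methods are hypothetical names, and how an agent retries when `offer` fails is left open here.

```java
import java.util.concurrent.atomic.AtomicReference;

final class AtomicBufferSketch<S> {
    private final AtomicReference<S> buffered = new AtomicReference<>();

    // Atomically gets the buffered span and clears the buffer.
    // The caller that receives a non-null span owns it exclusively.
    S take() {
        return buffered.getAndSet(null);
    }

    // Atomically stores a span only if the buffer is currently empty.
    // Returns false if another thread buffered a span in the meantime.
    boolean offer(S span) {
        return buffered.compareAndSet(null, span);
    }
}
```

Because `take` hands the buffered span to exactly one caller, only that thread can compare its properties or compress a sibling into it. When `offer` returns `false`, the caller would typically loop: take the span that is now buffered and decide again whether to compress or report.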
