
Commit 78f4ffb

simplify and improve language
1 parent edd9143 commit 78f4ffb

1 file changed

+14
-13
lines changed

solutions/observability/apm/transaction-sampling.md

Lines changed: 14 additions & 13 deletions
@@ -292,46 +292,47 @@ This example defines three tail-based sampling polices:
292292

293293
### Example configuration B [_example_configuration_b]
294294

295-
For a trace that originates in Service A and ends in Service B without error, what would the sampling be?
295+
When a trace originates in Service A and then calls Service B (without errors), the sampling rate is determined by the service where the trace starts:
296296

297297
```yaml
298298
- sample_rate: 0.3
299299
service.name: B
300300
- sample_rate: 0.5
301301
service.name: A
302-
- sample_rate: 1.0 # Always set a default
302+
- sample_rate: 1.0 # Fallback: always set a default
303303
```
304304

305-
In the example, only 50% of traces will be sampled. The service that start the trace (Service A) has precedence over child services (Service B). The order of services does not matter, what matters, is in what service the trace event start. Service A, is were the trace starts, and therefore will always have precedence over "child" services that only create spans (Service B). If we start at Service B instead, pass on the context to Service A, which then adds a child span, then, the policy of `service.name: B` will take precedence over that of `service.name: A`. This is because we are working on the *trace level* rather than the *service level*.
305+
- Because Service A is the root of the trace, its policy (0.5) takes precedence over Service B's policy (0.3).
306+
- If instead the trace began in Service B (and then passed to Service A), the policy for Service B would apply.
307+
308+
> **Key point**: Tail‑based sampling rules are evaluated at the *trace level* based on where the trace was initiated, not on downstream spans (*service level*).
306309

307310
### Example configuration C [_example_configuration_c]
308311

309-
For a trace that originates in A and has an error in B, what would the sampling be?
312+
When you need to combine service‑specific policies with outcomes (e.g. failures), policy order defines specificity:
310313

311314
```yaml
312-
# Example A
315+
# Example A: prioritize service origin, then failures
313316
- sample_rate: 0.2
314317
service.name: A
315318
- sample_rate: 0.5
316319
trace.outcome: failure
317-
- sample_rate: 1.0 # Always set a default
320+
- sample_rate: 1.0 # Default
318321
319-
# Example B
322+
# Example B: prioritize failures, then a specific service
320323
- sample_rate: 0.2
321324
trace.outcome: failure
322325
- sample_rate: 0.5
323326
service.name: alice
324327
- sample_rate: 1.0
325328
```
326329

327-
- In Example A, we are stating that we want a 20% sample rate for trace events originating from Service A, but for all other failed traces we want a sample rate of 50%.
328-
- However, in Example B, we want all failed traces to sample at 20%, including Service A.
329-
330-
The order matters for `trace` policies relative to `service` policies. This has to do with how to define *specificity* in a distributed system. A *trace*, is an abstract concept that "spans" over a range of distributed services. This is by definition, since we want to be able to "trace" an event across multiple service. So then, when we define a policy on the trace level (such as `trace.outcome: failure`), we are implicitly defining this policy for a range of services.
330+
- In Example A, traces from Service A are sampled at 20%, and all other failed traces (regardless of service) are sampled at 50%.
331+
- In Example B, every failed trace is sampled at 20%, including those originating from Service A.
331332

332-
If you want to always capture all failed traces, you should define it at the top of your policy list with a value of 1.0. And then define more specific policies for specific services to capture edge cases. If an error happens in a child (Service B), ensure to propagate this error back up to the parent (Service A), which then makes the decision as to whether you want to trace as a whole to fail or not. This logic has to happen on the application layer.
333+
Policies targeting the trace (e.g. `trace.outcome: failure`) apply across all services and should appear before more specific, service‑level rules if you want them to take precedence.
333334

334-
A child failing doesn't imply a distributed trace should fail. It is possible that the child call is just a nice-to-have and there are backup plans when that fails. For example, an application can fail to call a cache, but it can still read/write to a database directly. The trace shouldn't fail just because the cache isn't available.
335+
> **Key point**: Define failure policy at the top to ensure capturing all failed traces, then define more specific policies for specific services to capture edge cases.
335336

336337
### Configuration reference [_configuration_reference]
337338
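
The precedence rules described in Example configurations B and C can be checked with a small simulation. This is only a sketch under stated assumptions, not APM Server's implementation: it assumes policies are evaluated top to bottom and the first match wins, that `service.name` is compared against the service where the trace starts, and that a policy with no conditions matches everything. The names `CONFIG_B`, `CONFIG_C_EXAMPLE_A`, `CONFIG_C_EXAMPLE_B`, and `pick_sample_rate` are hypothetical and exist only for this illustration.

```python
from typing import Optional

# Hypothetical in-memory copies of the YAML policy lists shown in the diff.
CONFIG_B = [
    {"sample_rate": 0.3, "service.name": "B"},
    {"sample_rate": 0.5, "service.name": "A"},
    {"sample_rate": 1.0},  # default: no conditions, matches everything
]
CONFIG_C_EXAMPLE_A = [
    {"sample_rate": 0.2, "service.name": "A"},
    {"sample_rate": 0.5, "trace.outcome": "failure"},
    {"sample_rate": 1.0},
]
CONFIG_C_EXAMPLE_B = [
    {"sample_rate": 0.2, "trace.outcome": "failure"},
    {"sample_rate": 0.5, "service.name": "alice"},
    {"sample_rate": 1.0},
]


def pick_sample_rate(policies, root_service: str, outcome: str) -> Optional[float]:
    """Return the sample rate of the first policy whose conditions all match.

    Assumption: the trace is described only by the service that started it
    and by its overall outcome; downstream services are ignored.
    """
    trace = {"service.name": root_service, "trace.outcome": outcome}
    for policy in policies:
        conditions = {k: v for k, v in policy.items() if k != "sample_rate"}
        if all(trace.get(key) == value for key, value in conditions.items()):
            return policy["sample_rate"]
    return None  # no policy matched (cannot happen when a default is present)


# Configuration B: the root service decides, not the order of the policies.
print(pick_sample_rate(CONFIG_B, "A", "success"))
print(pick_sample_rate(CONFIG_B, "B", "success"))

# Configuration C: the same failed trace from Service A under both orderings.
print(pick_sample_rate(CONFIG_C_EXAMPLE_A, "A", "failure"))
print(pick_sample_rate(CONFIG_C_EXAMPLE_B, "A", "failure"))
```

Under these assumptions, a failed trace that starts in Service A matches the `service.name: A` policy in Example A and the `trace.outcome: failure` policy in Example B, which is why the position of the failure policy matters.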
