
Commit 63a0950

alexmojaki and Kludex authored
Tail sampling (#407)
Co-authored-by: Marcelo Trylesinski <[email protected]>
1 parent fb0a162 commit 63a0950

File tree

22 files changed: +965 -291 lines changed


docs/api/sampling.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
::: logfire.sampling

docs/guides/advanced/sampling.md

Lines changed: 293 additions & 19 deletions
@@ -1,38 +1,312 @@

# Sampling

Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be stored and analyzed. Sampling is a trade-off between cost and completeness of data.

_Head sampling_ means the decision to sample is made at the beginning of a trace. This is simpler and more common.

_Tail sampling_ means the decision to sample is delayed, possibly until the end of a trace. This means there is more information available to make the decision, but this adds complexity.

Sampling usually happens at the trace level, meaning entire traces are kept or discarded. This way the remaining traces are generally complete.
## Random head sampling
Here's an example of randomly sampling 50% of traces:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions(head=0.5))

for x in range(10):
    with logfire.span(f'span {x}'):
        logfire.info(f'log {x}')
```
This outputs something like:

```
11:09:29.041 span 0
11:09:29.041 log 0
11:09:29.041 span 1
11:09:29.042 log 1
11:09:29.042 span 4
11:09:29.042 log 4
11:09:29.042 span 5
11:09:29.042 log 5
11:09:29.042 span 7
11:09:29.042 log 7
```

Note that 5 out of 10 traces are kept, and that a child log is kept if and only if its parent span is kept.
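Conceptually, random head sampling makes one keep-or-discard decision per trace, and every span in the trace inherits it. Here's a minimal pure-Python sketch of the idea (an illustration only, not the SDK's internals):

```python
import random

random.seed(0)  # fixed seed so the outcome is reproducible

def sample_trace(head_rate: float) -> bool:
    # One random decision per trace; all spans in the trace inherit it.
    return random.random() < head_rate

kept = []
for x in range(10):
    if sample_trace(0.5):
        # The span and its child log share the trace's fate.
        kept.append((f'span {x}', f'log {x}'))

print(len(kept))  # with this seed, 5 of the 10 traces are kept
```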
## Tail sampling by level and duration

Random head sampling often works well, but you may not want to lose any traces which indicate problems. In this case, you can use tail sampling. Here's a simple example:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

for x in range(3):
    # None of these are logged
    with logfire.span('excluded span'):
        logfire.info(f'info {x}')

    # All of these are logged
    with logfire.span('included span'):
        logfire.error(f'error {x}')

for t in range(1, 10, 2):
    with logfire.span(f'span with duration {t}'):
        time.sleep(t)
```
This outputs something like:

```
11:37:45.484 included span
11:37:45.484 error 0
11:37:45.485 included span
11:37:45.485 error 1
11:37:45.485 included span
11:37:45.485 error 2
11:37:49.493 span with duration 5
11:37:54.499 span with duration 7
11:38:01.505 span with duration 9
```

[`logfire.SamplingOptions.level_or_duration()`][logfire.sampling.SamplingOptions.level_or_duration] creates an instance of [`logfire.SamplingOptions`][logfire.sampling.SamplingOptions] with simple tail sampling. With no arguments, it means that a trace will be included if and only if it has at least one span/log that:

1. has a log level greater than `info` (the default of any span), or
2. has a duration greater than 5 seconds.

This way you won't lose information about warnings/errors or long-running operations. You can customize what to keep with the `level_threshold` and `duration_threshold` arguments.
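For example, a configuration along these lines (the argument names are from above; the specific values are purely illustrative) would keep only traces containing an error-level span/log or taking at least 10 seconds:

```python
import logfire

# Illustrative values for the `level_threshold` and `duration_threshold`
# arguments mentioned above.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        level_threshold='error',
        duration_threshold=10,
    )
)
```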
## Combining head and tail sampling

You can combine head and tail sampling. For example:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(head=0.1))
```

This will keep at most 10% of traces, even those with a high log level or duration. Traces that don't meet the tail sampling criteria are always discarded.
## Keeping a fraction of all traces

To keep some traces even if they don't meet the tail sampling criteria, you can use the `background_rate` argument. For example, this script:

```python
import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(background_rate=0.3))

for x in range(10):
    logfire.info(f'info {x}')
for x in range(5):
    logfire.error(f'error {x}')
```

will output something like:

```
12:24:40.293 info 2
12:24:40.293 info 3
12:24:40.293 info 7
12:24:40.294 error 0
12:24:40.294 error 1
12:24:40.294 error 2
12:24:40.294 error 3
12:24:40.295 error 4
```

i.e. about 30% of the info logs and 100% of the error logs are kept.

(Technical note: the trace ID is compared against the head and background rates to determine inclusion, so the probabilities don't depend on the number of spans in the trace, and the rates give the probabilities directly without needing any further calculations. For example, with a head sample rate of `0.6` and a background rate of `0.3`, the chance of a non-notable trace being included is `0.3`, not `0.6 * 0.3`.)
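The technical note above can be illustrated with a self-contained sketch (an illustration of the idea, not the SDK's actual implementation):

```python
import random

HEAD_RATE = 0.6        # fraction of traces kept by head sampling
BACKGROUND_RATE = 0.3  # fraction of non-notable traces kept overall

def trace_fraction(trace_id: int) -> float:
    # Map a 128-bit trace ID to a uniform number in [0, 1).
    return trace_id / 2**128

def included(trace_id: int, notable: bool) -> bool:
    fraction = trace_fraction(trace_id)
    if fraction >= HEAD_RATE:  # discarded at the head
        return False
    # Notable traces (high level or duration) are always kept;
    # others only if they also fall under the background rate.
    return notable or fraction < BACKGROUND_RATE

# Because the *same* fraction is compared to both rates, a non-notable
# trace is included with probability 0.3, not 0.6 * 0.3 = 0.18:
ids = [random.getrandbits(128) for _ in range(100_000)]
rate = sum(included(i, notable=False) for i in ids) / len(ids)
print(round(rate, 2))
```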
## Caveats of tail sampling

### Memory usage

For tail sampling to work, all the spans in a trace must be kept in memory until either the trace is included by sampling or the trace is completed and discarded. In the above example, the spans named `included span` don't have a high enough level to be included, so they are kept in memory until the error logs cause the entire trace to be included. This means that traces with a large number of spans can consume a lot of memory, whereas without tail sampling the spans would be regularly exported and freed from memory without waiting for the rest of the trace.

In practice this is usually OK, because such large traces will usually exceed the duration threshold, at which point the trace will be included and the spans will be exported and freed. This works because the duration is measured as the time between the start of the trace and the start/end of the most recent span, so the tail sampler can know that a trace will exceed the duration threshold even before it's complete. For example, running this script:

```python
import time

import logfire

logfire.configure(sampling=logfire.SamplingOptions.level_or_duration())

with logfire.span('span'):
    for x in range(1, 10):
        time.sleep(1)
        logfire.info(f'info {x}')
```

will do nothing for the first 5 seconds, before suddenly logging all this at once:

```
12:29:43.063 span
12:29:44.065 info 1
12:29:45.066 info 2
12:29:46.072 info 3
12:29:47.076 info 4
12:29:48.082 info 5
```

followed by additional logs once per second. This is despite the fact that at this stage the outer span hasn't completed yet and the inner logs each have 0 duration.

However, memory usage can still be a problem in any of the following cases:

- The duration threshold is set to a high value
- Spans are produced extremely rapidly
- Spans contain large attributes

### Distributed tracing

Logfire's tail sampling is implemented in the SDK and only works for traces within one process. If you need tail sampling with distributed tracing, consider deploying the [Tail Sampling Processor in the OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md).

If a trace was started in another process and its context was propagated to a process using the Logfire SDK with tail sampling, the whole trace will be included.

If you start a trace with the Logfire SDK with tail sampling, and then propagate the context to another process, the spans generated by the SDK may be discarded, while the spans generated by the other process may be included, leading to an incomplete trace.
## Custom head sampling

If you need more control than random sampling, you can pass an [OpenTelemetry `Sampler`](https://opentelemetry-python.readthedocs.io/en/latest/sdk/trace.sampling.html). For example:

```python
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_OFF,
    ALWAYS_ON,
    ParentBased,
    Sampler,
)

import logfire


class MySampler(Sampler):
    def should_sample(
        self,
        parent_context,
        trace_id,
        name,
        *args,
        **kwargs,
    ):
        if name == 'exclude me':
            sampler = ALWAYS_OFF
        else:
            sampler = ALWAYS_ON
        return sampler.should_sample(
            parent_context,
            trace_id,
            name,
            *args,
            **kwargs,
        )

    def get_description(self):
        return 'MySampler'


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=ParentBased(
            MySampler(),
        )
    )
)

with logfire.span('keep me'):
    logfire.info('kept child')

with logfire.span('exclude me'):
    logfire.info('excluded child')
```

This will output something like:

```
10:37:30.897 keep me
10:37:30.898 kept child
```

Note that the sampler explicitly excluded only the span named `exclude me`. The reason that the `excluded child` log is not included is that `MySampler` was wrapped in a `ParentBased` sampler, which excludes spans whose parents are excluded. If you remove that and simply pass `head=MySampler()`, the `excluded child` log will be included, resulting in an incomplete trace.

You can also pass a `Sampler` to the `head` argument of `SamplingOptions.level_or_duration` to combine tail sampling with custom head sampling.
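For example, a sketch of that combination, assuming the `MySampler` class from the example above is defined in scope:

```python
from opentelemetry.sdk.trace.sampling import ParentBased

import logfire

# Assumes MySampler from the previous example; combines its head
# sampling decision with level/duration tail sampling.
logfire.configure(
    sampling=logfire.SamplingOptions.level_or_duration(
        head=ParentBased(MySampler()),
    )
)
```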
## Custom tail sampling

If you want tail sampling with more control than `level_or_duration`, you can pass a function to [`tail`][logfire.sampling.SamplingOptions.tail] which will accept an instance of [`TailSamplingSpanInfo`][logfire.sampling.TailSamplingSpanInfo] and return a float between 0 and 1 representing the probability that the trace should be included. For example:

```python
import logfire


def get_tail_sample_rate(span_info):
    if span_info.duration >= 1:
        return 0.5  # (1)!

    if span_info.level > 'warn':  # (2)!
        return 0.3  # (3)!

    return 0.1  # (4)!


logfire.configure(
    sampling=logfire.SamplingOptions(
        head=0.5,  # (5)!
        tail=get_tail_sample_rate,
    ),
)
```

1. Keep 50% of traces with duration >= 1 second
2. `span_info.level` is a [special object][logfire.sampling.SpanLevel] that can be compared to log level names
3. Keep 30% of traces with a warning or error and with duration < 1 second
4. Keep 10% of other traces
5. Discard 50% of traces at the beginning to reduce the overhead of generating spans. This is optional, but improves performance, and we know that `get_tail_sample_rate` will always return at most 0.5, so the other 50% of traces would be discarded anyway. The probabilities are not independent - this will not discard traces that would otherwise have been kept by tail sampling.

logfire-api/logfire_api/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -173,7 +173,7 @@ def suppress_instrumentation():
 class ConsoleOptions:
     def __init__(self, *args, **kwargs) -> None: ...
 
-class TailSamplingOptions:
+class SamplingOptions:
     def __init__(self, *args, **kwargs) -> None: ...
 
 class ScrubbingOptions:
```

logfire-api/logfire_api/__init__.pyi

Lines changed: 2 additions & 2 deletions
```diff
@@ -3,15 +3,15 @@ from ._internal.auto_trace.rewrite_ast import no_auto_trace as no_auto_trace
 from ._internal.config import ConsoleOptions as ConsoleOptions, METRICS_PREFERRED_TEMPORALITY as METRICS_PREFERRED_TEMPORALITY, PydanticPlugin as PydanticPlugin, configure as configure
 from ._internal.constants import LevelName as LevelName
 from ._internal.exporters.file import load_file as load_spans_from_file
-from ._internal.exporters.tail_sampling import TailSamplingOptions as TailSamplingOptions
 from ._internal.main import Logfire as Logfire, LogfireSpan as LogfireSpan
 from ._internal.scrubbing import ScrubMatch as ScrubMatch, ScrubbingOptions as ScrubbingOptions
 from ._internal.utils import suppress_instrumentation as suppress_instrumentation
 from .integrations.logging import LogfireLoggingHandler as LogfireLoggingHandler
 from .integrations.structlog import LogfireProcessor as StructlogProcessor
 from .version import VERSION as VERSION
+from logfire.sampling import SamplingOptions as SamplingOptions
 
-__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'instrument_system_metrics', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'TailSamplingOptions']
+__all__ = ['Logfire', 'LogfireSpan', 'LevelName', 'ConsoleOptions', 'PydanticPlugin', 'configure', 'span', 'instrument', 'log', 'trace', 'debug', 'notice', 'info', 'warn', 'error', 'exception', 'fatal', 'force_flush', 'log_slow_async_callbacks', 'install_auto_tracing', 'instrument_fastapi', 'instrument_openai', 'instrument_anthropic', 'instrument_asyncpg', 'instrument_httpx', 'instrument_celery', 'instrument_requests', 'instrument_psycopg', 'instrument_django', 'instrument_flask', 'instrument_starlette', 'instrument_aiohttp_client', 'instrument_sqlalchemy', 'instrument_redis', 'instrument_pymongo', 'instrument_mysql', 'instrument_system_metrics', 'AutoTraceModule', 'with_tags', 'with_settings', 'shutdown', 'load_spans_from_file', 'no_auto_trace', 'METRICS_PREFERRED_TEMPORALITY', 'ScrubMatch', 'ScrubbingOptions', 'VERSION', 'suppress_instrumentation', 'StructlogProcessor', 'LogfireLoggingHandler', 'SamplingOptions']
 
 DEFAULT_LOGFIRE_INSTANCE = Logfire()
 span = DEFAULT_LOGFIRE_INSTANCE.span
```

0 commit comments