|
1 | 1 | # Sampling |
2 | 2 |
|
3 | | -Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be stored and analyzed. Sampling is a trade-off between cost and completeness of data. |
| 3 | +Sampling is the practice of discarding some traces or spans in order to reduce the amount of data that needs to be |
| 4 | +stored and analyzed. Sampling is a trade-off between cost and completeness of data. |
4 | 5 |
|
5 | | -To configure sampling for the SDK: |
| 6 | +_Head sampling_ means the decision to sample is made at the beginning of a trace. This is simpler and more common. |
6 | 7 |
|
7 | | -- Set the [`trace_sample_rate`][logfire.configure(trace_sample_rate)] option of [`logfire.configure()`][logfire.configure] to a number between 0 and 1, or |
8 | | -- Set the `LOGFIRE_TRACE_SAMPLE_RATE` environment variable, or |
9 | | -- Set the `trace_sample_rate` config file option. |
| 8 | +_Tail sampling_ means the decision to sample is delayed, possibly until the end of a trace. This means there is more |
| 9 | +information available to make the decision, but this adds complexity. |
10 | 10 |
|
11 | | -See [Configuration](../../reference/configuration.md) for more information. |
| 11 | +Sampling usually happens at the trace level, meaning entire traces are kept or discarded. This way the remaining traces |
| 12 | +are generally complete. |
| 13 | + |
| 14 | +## Random head sampling |
| 15 | + |
| 16 | +Here's an example of randomly sampling 50% of traces: |
| 17 | + |
| 18 | +```python |
| 19 | +import logfire |
| 20 | + |
| 21 | +logfire.configure(sampling=logfire.SamplingOptions(head=0.5)) |
| 22 | + |
| 23 | +for x in range(10): |
| 24 | +    with logfire.span(f'span {x}'): |
| 25 | +        logfire.info(f'log {x}') |
| 26 | +``` |
| 27 | + |
| 28 | +This outputs something like: |
| 29 | + |
| 30 | +``` |
| 31 | +11:09:29.041 span 0 |
| 32 | +11:09:29.041 log 0 |
| 33 | +11:09:29.041 span 1 |
| 34 | +11:09:29.042 log 1 |
| 35 | +11:09:29.042 span 4 |
| 36 | +11:09:29.042 log 4 |
| 37 | +11:09:29.042 span 5 |
| 38 | +11:09:29.042 log 5 |
| 39 | +11:09:29.042 span 7 |
| 40 | +11:09:29.042 log 7 |
| 41 | +``` |
| 42 | + |
| 43 | +Note that 5 out of 10 traces are kept, and that the child log is kept if and only if the parent span is kept. |
| 44 | + |
| 45 | +## Tail sampling by level and duration |
| 46 | + |
| 47 | +Random head sampling often works well, but you may not want to lose any traces which indicate problems. In this case, |
| 48 | +you can use tail sampling. Here's a simple example: |
| 49 | + |
| 50 | +```python |
| 51 | +import time |
| 52 | + |
| 53 | +import logfire |
| 54 | + |
| 55 | +logfire.configure(sampling=logfire.SamplingOptions.level_or_duration()) |
| 56 | + |
| 57 | +for x in range(3): |
| 58 | +    # None of these are logged |
| 59 | +    with logfire.span('excluded span'): |
| 60 | +        logfire.info(f'info {x}') |
| 61 | + |
| 62 | +    # All of these are logged |
| 63 | +    with logfire.span('included span'): |
| 64 | +        logfire.error(f'error {x}') |
| 65 | + |
| 66 | +for t in range(1, 10, 2): |
| 67 | +    with logfire.span(f'span with duration {t}'): |
| 68 | +        time.sleep(t) |
| 69 | +``` |
| 70 | + |
| 71 | +This outputs something like: |
| 72 | + |
| 73 | +``` |
| 74 | +11:37:45.484 included span |
| 75 | +11:37:45.484 error 0 |
| 76 | +11:37:45.485 included span |
| 77 | +11:37:45.485 error 1 |
| 78 | +11:37:45.485 included span |
| 79 | +11:37:45.485 error 2 |
| 80 | +11:37:49.493 span with duration 5 |
| 81 | +11:37:54.499 span with duration 7 |
| 82 | +11:38:01.505 span with duration 9 |
| 83 | +``` |
| 84 | + |
| 85 | +[`logfire.SamplingOptions.level_or_duration()`][logfire.sampling.SamplingOptions.level_or_duration] creates an instance |
| 86 | +of [`logfire.SamplingOptions`][logfire.sampling.SamplingOptions] with simple tail sampling. With no arguments, |
| 87 | +it means that a trace will be included if and only if it has at least one span/log that: |
| 88 | + |
| 89 | +1. has a log level greater than `info` (the default of any span), or |
| 90 | +2. has a duration greater than 5 seconds. |
| 91 | + |
| 92 | +This way you won't lose information about warnings/errors or long-running operations. You can customize what to keep |
| 93 | +with the `level_threshold` and `duration_threshold` arguments. |
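
The default rule described above can be sketched in plain Python. This is only an illustration of the decision logic, not Logfire's implementation: `SpanRecord`, `keep_trace`, `LEVEL_RANK`, and the `above_level`/`above_seconds` parameters are hypothetical names invented for this sketch and are not the real `level_threshold`/`duration_threshold` arguments.

```python
from dataclasses import dataclass

# Hypothetical ranks for level names; only their relative order matters here.
LEVEL_RANK = {'debug': 0, 'info': 1, 'warn': 2, 'error': 3}


@dataclass
class SpanRecord:
    """Hypothetical stand-in for a span or log, not a Logfire type."""

    name: str
    level: str = 'info'
    duration: float = 0.0  # seconds


def keep_trace(spans, above_level='info', above_seconds=5.0):
    """Keep the trace if any span/log is above the level or the duration."""
    return any(
        LEVEL_RANK[s.level] > LEVEL_RANK[above_level] or s.duration > above_seconds
        for s in spans
    )


quiet = [SpanRecord('excluded span'), SpanRecord('info 0')]
noisy = [SpanRecord('included span'), SpanRecord('error 0', level='error')]
slow = [SpanRecord('span with duration 7', duration=7.0)]

print(keep_trace(quiet), keep_trace(noisy), keep_trace(slow))  # False True True
```

A single notable span or log is enough to keep the whole trace, which is why `included span` appears in the output even though the span itself only has the default `info` level.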
| 94 | + |
| 95 | +## Combining head and tail sampling |
| 96 | + |
| 97 | +You can combine head and tail sampling. For example: |
| 98 | + |
| 99 | +```python |
| 100 | +import logfire |
| 101 | + |
| 102 | +logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(head=0.1)) |
| 103 | +``` |
| 104 | + |
| 105 | +This keeps at most 10% of traces: head sampling discards the other 90% even if they have a high log level or |
| 106 | +duration. Of the traces that pass head sampling, those that don't meet the tail sampling criteria are always discarded. |
| 107 | + |
| 108 | +## Keeping a fraction of all traces |
| 109 | + |
| 110 | +To keep some traces even if they don't meet the tail sampling criteria, you can use the `background_rate` argument. For |
| 111 | +example, this script: |
| 112 | + |
| 113 | +```python |
| 114 | +import logfire |
| 115 | + |
| 116 | +logfire.configure(sampling=logfire.SamplingOptions.level_or_duration(background_rate=0.3)) |
| 117 | + |
| 118 | +for x in range(10): |
| 119 | +    logfire.info(f'info {x}') |
| 120 | +for x in range(5): |
| 121 | +    logfire.error(f'error {x}') |
| 122 | +``` |
| 123 | + |
| 124 | +will output something like: |
| 125 | + |
| 126 | +``` |
| 127 | +12:24:40.293 info 2 |
| 128 | +12:24:40.293 info 3 |
| 129 | +12:24:40.293 info 7 |
| 130 | +12:24:40.294 error 0 |
| 131 | +12:24:40.294 error 1 |
| 132 | +12:24:40.294 error 2 |
| 133 | +12:24:40.294 error 3 |
| 134 | +12:24:40.295 error 4 |
| 135 | +``` |
| 136 | + |
| 137 | +i.e. about 30% of the info logs and 100% of the error logs are kept. |
| 138 | + |
| 139 | +(Technical note: the trace ID is compared against the head and background rates to determine inclusion, so the |
| 140 | +probabilities don't depend on the number of spans in the trace, and the rates give the probabilities directly without |
| 141 | +needing any further calculations. For example, with a head sample rate of `0.6` and a background rate of `0.3`, the |
| 142 | +chance of a non-notable trace being included is `0.3`, not `0.6 * 0.3`.) |
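
To see why the rates do not multiply, here is a small plain-Python sketch of a trace-ID comparison. It assumes the decision is made by comparing the low 64 bits of the trace ID against `rate * 2**64`, in the style of OpenTelemetry's ratio-based sampler; the helper name `within_rate` is hypothetical and the exact comparison Logfire uses may differ.

```python
import random

TRACE_ID_SPACE = 1 << 64  # assumption: only the low 64 bits are compared


def within_rate(trace_id: int, rate: float) -> bool:
    """Deterministic check: the same trace ID always gives the same answer."""
    return (trace_id & (TRACE_ID_SPACE - 1)) < round(rate * TRACE_ID_SPACE)


rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]

head_kept = {t for t in trace_ids if within_rate(t, 0.6)}
background_kept = {t for t in trace_ids if within_rate(t, 0.3)}

# Any trace within the background rate is also within the head rate,
# so about 30% of all traces pass both checks, not 0.6 * 0.3 = 18%.
print(background_kept <= head_kept)
print(len(background_kept) / len(trace_ids))
```

Because both checks look at the same number, the smaller rate selects a subset of the traces selected by the larger rate rather than an independent random sample.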
| 143 | + |
| 144 | +## Caveats of tail sampling |
| 145 | + |
| 146 | +### Memory usage |
| 147 | + |
| 148 | +For tail sampling to work, all the spans in a trace must be kept in memory until either the trace is included by |
| 149 | +sampling or the trace is completed and discarded. In the first tail sampling example above, the spans named `included span` don't have a |
| 150 | +high enough level to be included, so they are kept in memory until the error logs cause the entire trace to be included. |
| 151 | +This means that traces with a large number of spans can consume a lot of memory, whereas without tail sampling the spans |
| 152 | +would be regularly exported and freed from memory without waiting for the rest of the trace. |
| 153 | + |
| 154 | +In practice this is usually OK, because such large traces will usually exceed the duration threshold, at which point the |
| 155 | +trace will be included and the spans will be exported and freed. This works because the duration is measured as the time |
| 156 | +between the start of the trace and the start/end of the most recent span, so the tail sampler can know that a span will |
| 157 | +exceed the duration threshold even before it's complete. For example, running this script: |
12 | 158 |
|
13 | 159 | ```python |
| 160 | +import time |
| 161 | + |
14 | 162 | import logfire |
15 | 163 |
|
16 | | -logfire.configure(trace_sample_rate=0.5) |
| 164 | +logfire.configure(sampling=logfire.SamplingOptions.level_or_duration()) |
| 165 | + |
| 166 | +with logfire.span('span'): |
| 167 | +    for x in range(1, 10): |
| 168 | +        time.sleep(1) |
| 169 | +        logfire.info(f'info {x}') |
| 170 | +``` |
| 171 | + |
| 172 | +will do nothing for the first 5 seconds, before suddenly logging all this at once: |
17 | 173 |
|
18 | | -with logfire.span("my_span"): # This span will be sampled 50% of the time |
19 | | - pass |
20 | 174 | ``` |
| 175 | +12:29:43.063 span |
| 176 | +12:29:44.065 info 1 |
| 177 | +12:29:45.066 info 2 |
| 178 | +12:29:46.072 info 3 |
| 179 | +12:29:47.076 info 4 |
| 180 | +12:29:48.082 info 5 |
| 181 | +``` |
| 182 | + |
| 183 | +followed by additional logs once per second. This is despite the fact that at this stage the outer span hasn't completed |
| 184 | +yet and the inner logs each have 0 duration. |
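
The buffering behaviour described above can be simulated without Logfire. This is a simplified sketch under stated assumptions: `simulate` and `DURATION_THRESHOLD` are hypothetical names, each event carries its time since the start of the trace, and the small `0.01` offsets stand in for the scheduling jitter visible in the timestamps above.

```python
DURATION_THRESHOLD = 5.0  # seconds, matching the default described above


def simulate(events):
    """events: ordered list of (seconds_since_trace_start, message).

    Returns the batches of messages in the order they would be emitted.
    """
    buffered, batches, included = [], [], False
    for elapsed, message in events:
        if included:
            batches.append([message])  # trace already included: emit right away
            continue
        buffered.append(message)  # held in memory until a decision is made
        if elapsed > DURATION_THRESHOLD:
            included = True
            batches.append(buffered)  # flush everything held so far
            buffered = []
    return batches


events = [(0.0, 'span')] + [(x + 0.01, f'info {x}') for x in range(1, 10)]
batches = simulate(events)
print(batches[0])   # ['span', 'info 1', 'info 2', 'info 3', 'info 4', 'info 5']
print(batches[1:])  # [['info 6'], ['info 7'], ['info 8'], ['info 9']]
```

Everything in the first batch had to stay in memory for about five seconds, which is the cost that the cases listed next can make worse.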
| 185 | + |
| 186 | +However, memory usage can still be a problem in any of the following cases: |
| 187 | + |
| 188 | +- The duration threshold is set to a high value |
| 189 | +- Spans are produced extremely rapidly |
| 190 | +- Spans contain large attributes |
21 | 191 |
|
22 | | -<!-- ## Fine grained sampling |
| 192 | +### Distributed tracing |
23 | 193 |
|
24 | | -You can tweak sampling on a per module or per code block basis using |
25 | | -[`logfire.with_trace_sample_rate()`][logfire.Logfire.with_trace_sample_rate]. |
| 194 | +Logfire's tail sampling is implemented in the SDK and only works for traces within one process. If you need tail |
| 195 | +sampling with distributed tracing, consider deploying |
| 196 | +the [Tail Sampling Processor in the OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md). |
| 197 | + |
| 198 | +If a trace was started on another process and its context was propagated to the process using the Logfire SDK tail |
| 199 | +sampling, the whole trace will be included. |
| 200 | + |
| 201 | +If you start a trace with the Logfire SDK with tail sampling, and then propagate the context to another process, the |
| 202 | +spans generated by the SDK may be discarded, while the spans generated by the other process may be included, leading to |
| 203 | +an incomplete trace. |
| 204 | + |
| 205 | +## Custom head sampling |
| 206 | + |
| 207 | +If you need more control than random sampling, you can pass an [OpenTelemetry |
| 208 | +`Sampler`](https://opentelemetry-python.readthedocs.io/en/latest/sdk/trace.sampling.html). For example: |
26 | 209 |
|
27 | 210 | ```python |
| 211 | +from opentelemetry.sdk.trace.sampling import ( |
| 212 | +    ALWAYS_OFF, |
| 213 | +    ALWAYS_ON, |
| 214 | +    ParentBased, |
| 215 | +    Sampler, |
| 216 | +) |
| 217 | + |
28 | 218 | import logfire |
29 | 219 |
|
30 | | -logfire.configure() |
31 | 220 |
|
32 | | -sampled = logfire.with_trace_sample_rate(0.5) |
| 221 | +class MySampler(Sampler): |
| 222 | +    def should_sample( |
| 223 | +        self, |
| 224 | +        parent_context, |
| 225 | +        trace_id, |
| 226 | +        name, |
| 227 | +        *args, |
| 228 | +        **kwargs, |
| 229 | +    ): |
| 230 | +        if name == 'exclude me': |
| 231 | +            sampler = ALWAYS_OFF |
| 232 | +        else: |
| 233 | +            sampler = ALWAYS_ON |
| 234 | +        return sampler.should_sample( |
| 235 | +            parent_context, |
| 236 | +            trace_id, |
| 237 | +            name, |
| 238 | +            *args, |
| 239 | +            **kwargs, |
| 240 | +        ) |
| 241 | + |
| 242 | +    def get_description(self): |
| 243 | +        return 'MySampler' |
| 244 | + |
| 245 | + |
| 246 | +logfire.configure( |
| 247 | +    sampling=logfire.SamplingOptions( |
| 248 | +        head=ParentBased( |
| 249 | +            MySampler(), |
| 250 | +        ) |
| 251 | +    ) |
| 252 | +) |
| 253 | + |
| 254 | +with logfire.span('keep me'): |
| 255 | +    logfire.info('kept child') |
| 256 | + |
| 257 | +with logfire.span('exclude me'): |
| 258 | +    logfire.info('excluded child') |
| 259 | +``` |
| 260 | + |
| 261 | +This will output something like: |
| 262 | + |
| 263 | +``` |
| 264 | +10:37:30.897 keep me |
| 265 | +10:37:30.898 kept child |
| 266 | +``` |
| 267 | + |
| 268 | +Note that the sampler explicitly excluded only the span named `exclude me`. The reason that the `excluded child` log is |
| 269 | +not included is that `MySampler` was wrapped in a `ParentBased` sampler, which excludes spans whose parents are |
| 270 | +excluded. If you remove that and simply pass `head=MySampler()`, the `excluded child` log will be included, resulting in |
| 271 | +an incomplete trace. |
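
The effect of the `ParentBased` wrapper can be sketched in a few lines of plain Python. This is a simplification of its default behaviour (root spans are passed to the wrapped sampler, child spans inherit the parent's decision); the functions `my_sampler` and `parent_based` are hypothetical stand-ins, not the OpenTelemetry classes.

```python
def my_sampler(name):
    """Mirrors MySampler above: drop only spans named 'exclude me'."""
    return name != 'exclude me'


def parent_based(name, parent_sampled):
    """parent_sampled is None for a root span, else the parent's decision."""
    if parent_sampled is None:
        return my_sampler(name)  # root span: ask the wrapped sampler
    return parent_sampled  # child span: inherit the parent's decision


root = parent_based('exclude me', None)
child = parent_based('excluded child', root)
print(root, child)  # False False

# Without the wrapper the child is judged by name alone and is kept,
# which is how an incomplete trace arises.
print(my_sampler('excluded child'))  # True
```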
| 272 | + |
| 273 | +You can also pass a `Sampler` to the `head` argument of `SamplingOptions.level_or_duration` to combine tail sampling |
| 274 | +with custom head sampling. |
| 275 | + |
| 276 | +## Custom tail sampling |
| 277 | + |
| 278 | +If you want tail sampling with more control than `level_or_duration`, you can pass a function to [ |
| 279 | +`tail`][logfire.sampling.SamplingOptions.tail] which will accept an instance of [ |
| 280 | +`TailSamplingSpanInfo`][logfire.sampling.TailSamplingSpanInfo] and return a float between 0 and 1 representing the |
| 281 | +probability that the trace should be included. For example: |
| 282 | + |
| 283 | +```python |
| 284 | +import logfire |
| 285 | + |
| 286 | + |
| 287 | +def get_tail_sample_rate(span_info): |
| 288 | +    if span_info.duration >= 1: |
| 289 | +        return 0.5  # (1)! |
| 290 | + |
| 291 | +    if span_info.level > 'warn':  # (2)! |
| 292 | +        return 0.3  # (3)! |
| 293 | + |
| 294 | +    return 0.1  # (4)! |
| 295 | + |
| 296 | + |
| 297 | +logfire.configure( |
| 298 | +    sampling=logfire.SamplingOptions( |
| 299 | +        head=0.5,  # (5)! |
| 300 | +        tail=get_tail_sample_rate, |
| 301 | +    ), |
| 302 | +) |
| 303 | +``` |
33 | 304 |
|
34 | | -with sampled.span("outer"): # This span will be sampled 50% of the time |
35 | | - # `with sampled.with_trace_sample_rate(0.1).span("inner")` would also work |
36 | | - with logfire.with_trace_sample_rate(0.1).span("inner"): # This span will be sampled 10% of the time |
37 | | - pass |
38 | | -``` --> |
| 305 | +1. Keep 50% of traces with duration >= 1 second |
| 306 | +2. `span_info.level` is a [special object][logfire.sampling.SpanLevel] that can be compared to log level names |
| 307 | +3. Keep 30% of traces with a level above `warn` (for example an error) and with duration < 1 second |
| 308 | +4. Keep 10% of other traces |
| 309 | +5. Discard 50% of traces at the beginning to reduce the overhead of generating spans. This is optional, but improves |
| 310 | + performance, and we know that `get_tail_sample_rate` will always return at most 0.5 so the other 50% of traces will |
| 311 | + be discarded anyway. The probabilities are not independent - this will not discard traces that would otherwise have |
| 312 | + been kept by tail sampling. |