fix(sampling): ensure the rate limiter operates on positive time intervals (#9416)
## Motivation
Currently, the RateLimiter samples the first span in a trace by taking
`Span.start_ns` and evaluating this timestamp against the last seen
timestamp.
[RateLimiter.is_allowed(...)](https://github.com/DataDog/dd-trace-py/blob/v2.9.0rc7/ddtrace/internal/rate_limiter.py#L60)
works as expected if it receives monotonically increasing timestamps.
However, if this method receives a timestamp that is less than a previous
value, it will compute a [negative time
window](https://github.com/DataDog/dd-trace-py/blob/v2.9.0rc7/ddtrace/internal/rate_limiter.py#L126)
and then set an [incorrect
rate_limit](https://github.com/DataDog/dd-trace-py/blob/v2.9.0rc7/ddtrace/internal/rate_limiter.py#L136).
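
To make the failure mode concrete, here is a minimal sketch of a token bucket that trusts caller-supplied timestamps. It is a simplified stand-in, not the ddtrace implementation: `NaiveTokenBucket`, `SECOND_NS`, and the replenish formula are assumptions chosen for illustration.

```python
SECOND_NS = 1_000_000_000  # hypothetical constant: one second in nanoseconds


class NaiveTokenBucket:
    """Illustrative token bucket that replenishes from caller-supplied timestamps."""

    def __init__(self, rate_limit, time_window_ns=SECOND_NS):
        self.rate_limit = rate_limit
        self.time_window_ns = time_window_ns
        self.max_tokens = float(rate_limit)
        self.tokens = float(rate_limit)
        self.last_update_ns = 0

    def is_allowed(self, timestamp_ns):
        # If timestamp_ns is older than the last one seen, elapsed is negative
        # and the "replenish" step below removes tokens instead of adding them.
        elapsed = timestamp_ns - self.last_update_ns
        self.last_update_ns = timestamp_ns
        replenished = self.tokens + (elapsed / self.time_window_ns) * self.rate_limit
        self.tokens = min(self.max_tokens, replenished)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = NaiveTokenBucket(rate_limit=1)
first = bucket.is_allowed(10 * SECOND_NS)  # increasing timestamp: allowed
stale = bucket.is_allowed(2 * SECOND_NS)   # older timestamp: negative interval
tokens_after_stale = bucket.tokens         # the bucket is now in debt
following = bucket.is_allowed(3 * SECOND_NS)  # still rejected while in debt
```

In this sketch a single out-of-order timestamp pushes the token count below zero, so calls that follow it are rejected even though the configured limit has not been reached.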
ddtrace v2.8.0 introduced support for lazy sampling. With this feature,
sample rates and rate limits are no longer applied on span start. This
increased the frequency of this bug:
9707da1.
## Description
This PR resolves this issue by:
- Deprecating the timestamp argument in `RateLimiter.is_allowed`. The
current time will always be used to compute span rate limits (instead of
`Span.start_ns`). This ensures rate limits are computed ONLY on
increasing time intervals.
- Ensuring a lock is acquired when computing rate limits and updating
rate counts. Currently we only acquire a lock to compute
`RateLimiter._replenish`. This is not sufficient.
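
The two changes above can be sketched together as follows. This is an illustration under stated assumptions, not the code from this PR: the class name `MonotonicTokenBucket` is hypothetical, and the use of `time.monotonic_ns()` as "the current time" is an assumption. The point is that the limiter reads the clock itself and holds one lock across both the replenish step and the token consumption.

```python
import threading
import time


class MonotonicTokenBucket:
    """Illustrative limiter: reads its own clock and locks the whole check."""

    def __init__(self, rate_limit, time_window_ns=1_000_000_000):
        self.rate_limit = rate_limit
        self.time_window_ns = time_window_ns
        self.max_tokens = float(rate_limit)
        self.tokens = float(rate_limit)
        self.last_update_ns = time.monotonic_ns()
        self._lock = threading.Lock()

    def is_allowed(self):
        # No timestamp argument: callers cannot inject an out-of-order time.
        with self._lock:
            now_ns = time.monotonic_ns()
            elapsed = now_ns - self.last_update_ns  # monotonic clock: never negative
            self.last_update_ns = now_ns
            replenished = self.tokens + (elapsed / self.time_window_ns) * self.rate_limit
            self.tokens = min(self.max_tokens, replenished)
            # Consuming a token happens under the same lock as replenishing,
            # so two threads cannot both spend the last token.
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


limiter = MonotonicTokenBucket(rate_limit=5)
decisions = [limiter.is_allowed() for _ in range(20)]
```

Because the elapsed interval can no longer be negative, the token count in this sketch never drops below zero, and a burst of calls is capped at roughly the configured limit.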
## Reproduction
- This bug can be reproduced by generating two spans with different
start times but the same end time. The span with the earliest start time
should be finished last.
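
The ordering problem behind this reproduction can be sketched without a tracer. `FakeSpan` below is a hypothetical stand-in for a span, and the start times are made-up values; the sketch only shows that when the rate limiter is consulted in finish order, the start timestamps it receives can go backwards.

```python
from dataclasses import dataclass


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a span; only the start time matters here."""

    name: str
    start_ns: int


long_running = FakeSpan("long_running", start_ns=1_000_000_000)
short_lived = FakeSpan("short_lived", start_ns=9_000_000_000)

# The span with the earliest start time is finished last.
finish_order = [short_lived, long_running]

# With lazy sampling, the limiter would see start times in finish order.
timestamps = [span.start_ns for span in finish_order]
intervals = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
```

Here `intervals` contains a single negative value, which is the kind of interval that triggers the bug described in the Motivation section.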
Failing regression test:
https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/62701/workflows/915c8cc5-6968-4069-a379-84929b239df8/jobs/3906251
## Checklist
- [x] Change(s) are motivated and described in the PR description
- [x] Testing strategy is described if automated tests are not included
in the PR
- [x] Risks are described (performance impact, potential for breakage,
maintainability)
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed or label `changelog/no-changelog` is set
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/))
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
- [x] If this PR changes the public interface, I've notified
`@DataDog/apm-tees`.
## Reviewer Checklist
- [x] Title is accurate
- [x] All changes are related to the pull request's stated goal
- [x] Description motivates each change
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- [x] Testing strategy adequately addresses listed risks
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] Release note makes sense to a user of the library
- [x] Author has acknowledged and discussed the performance implications
of this PR as reported in the benchmarks PR comment
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
tracing: Ensures spans are rate limited at the expected rate (100 spans per second by default). Previously, long-running spans could cause the rate limiter to compute an invalid time window, which could cause the next trace to be dropped.