FastCodahale Timers miscalculate percentiles #2053

@nicmichael

BUG REPORT

Describe the bug

The FastCodahale Timer implementation can miscalculate percentiles when the values captured in a snapshot are slightly out of sync. The effect is most visible when only a few events have been recorded.

FastCodahale Timers use fine-grained locking and are designed to tolerate (some) values changing while events are being recorded or while snapshots are created. Currently, however, the total count of requests is not synchronized with the number of requests recorded in the percentile buckets. If a snapshot is created while the timer's total count has been incremented beyond the sum of the values in the percentile buckets, the percentile calculation may produce wrong values.

For example, if 3 values have been recorded in the percentile buckets but the overall count is 4, the percentile calculation is based on 4 values. This is most obvious for any percentile above 0.75 (e.g. p95): the implementation tries to find 0.95 * 4 = 3.8 values, which is more than the 3 values recorded in the buckets. Since no bucket meets this threshold, the bound of the last (overflow) bucket is returned, i.e. Long.MAX_VALUE.
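The arithmetic in the example above can be reproduced with a small, self-contained sketch. This is not the FastCodahale code: the class `StaleCountDemo`, the `BOUNDS` array, and the `percentile` method are hypothetical names, and the bucket layout is an assumption chosen only to make the effect visible.

```java
// Hypothetical, simplified model of a bucketed percentile lookup.
// It is NOT the actual FastCodahale implementation.
final class StaleCountDemo {
    // Assumed upper bounds of each bucket; the last one is the overflow bucket.
    static final long[] BOUNDS = {10, 100, 1000, Long.MAX_VALUE};

    // Returns the bound of the first bucket whose cumulative count
    // reaches p * count, where count is supplied by the caller.
    static long percentile(long[] buckets, long count, double p) {
        double target = p * count;
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) {
                return BOUNDS[i];
            }
        }
        // No bucket reached the target: fall back to the overflow bound.
        return BOUNDS[BOUNDS.length - 1];
    }

    public static void main(String[] args) {
        long[] buckets = {1, 1, 1, 0}; // three recorded values

        // Count in sync with the buckets: target is 0.95 * 3 = 2.85.
        System.out.println(percentile(buckets, 3, 0.95)); // prints 1000

        // Count one ahead of the buckets: target is 0.95 * 4 = 3.8.
        System.out.println(percentile(buckets, 4, 0.95)); // prints 9223372036854775807
    }
}
```

With the count one ahead, p75 still resolves correctly in this sketch (the target is exactly 3), which matches the observation that only percentiles above 0.75 are affected in this particular case.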

To Reproduce

The pull request that fixes this bug includes a test case demonstrating the behavior.

Expected behavior

Snapshots should base the percentile calculation on the sum of all values contained in the buckets, not on a count supplied by the caller (which might be wrong or out of sync).
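One way to picture the expected behavior is a snapshot that never accepts a count from outside and instead derives it from its own copy of the buckets. The following is a minimal sketch under that assumption; `BucketSnapshot` and its methods are hypothetical names, not the API of the actual fix, and returning 0 for an empty snapshot is an arbitrary choice made for the illustration.

```java
// Hypothetical sketch: a snapshot that derives its count from the buckets.
// It is NOT the actual FastCodahale snapshot class.
final class BucketSnapshot {
    private final long[] bounds; // upper bound per bucket, last is overflow
    private final long[] values; // number of events per bucket
    private final long count;    // always the sum of values, never caller-supplied

    BucketSnapshot(long[] bounds, long[] values) {
        if (bounds.length != values.length || bounds.length == 0) {
            throw new IllegalArgumentException("bounds and values must be non-empty and match");
        }
        // Copy first, so concurrent writers cannot change what we sum over.
        this.bounds = bounds.clone();
        this.values = values.clone();
        long sum = 0;
        for (long v : this.values) {
            sum += v;
        }
        this.count = sum;
    }

    long count() {
        return count;
    }

    long percentile(double p) {
        if (count == 0) {
            return 0; // assumption: an empty snapshot reports 0
        }
        double target = p * count;
        long seen = 0;
        for (int i = 0; i < values.length; i++) {
            seen += values[i];
            if (seen >= target) {
                return bounds[i];
            }
        }
        // Only reachable for p > 1.0, since seen ends up equal to count.
        return bounds[bounds.length - 1];
    }

    public static void main(String[] args) {
        long[] bounds = {10, 100, 1000, Long.MAX_VALUE};
        long[] values = {1, 1, 1, 0};
        BucketSnapshot snapshot = new BucketSnapshot(bounds, values);
        System.out.println(snapshot.count());          // prints 3
        System.out.println(snapshot.percentile(0.95)); // prints 1000
    }
}
```

Because the count and the bucket contents come from the same copied array, they cannot disagree, so any percentile up to 1.0 resolves to a bucket that actually holds a value.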
