Description
BUG REPORT
Describe the bug
The FastCodahale Timer implementation may miscalculate percentiles if snapshots of values are slightly out of sync and only a few events have been recorded.
FastCodahale Timers use fine-grained locking and are meant to tolerate (some) values changing while they are being recorded or while snapshots are created. Currently, the total count of requests is not synchronized with the number of requests recorded in the percentile buckets. If a snapshot is created while the total count of the timer has already been incremented beyond the sum of the counts in the percentile buckets, the percentile calculation may produce wrong values.
For example, if 3 values have been recorded in the percentile buckets but the overall count is 4, the percentile calculation is based on 4 values. This becomes most obvious when a percentile > .75 (e.g. p95) is calculated. For p95, the implementation will try to find 0.95 * 4 = 3.8 values, which is more than the 3 values recorded in the buckets. Since no bucket fulfills the criterion, the bound of the last (overflow) bucket is returned, i.e. Long.MAX_VALUE.
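The effect can be illustrated with a minimal, self-contained sketch. This is not the actual FastCodahale code: the class name, the bucket bounds, and the `percentile` method are hypothetical and only model a bucket scan that trusts a caller-supplied count.

```java
// Hypothetical model of a bucket-based percentile lookup (not the real implementation).
final class PercentileSketch {
    // Assumed upper bounds of the buckets; the last entry is the overflow bucket.
    static final long[] BOUNDS = {10L, 100L, 1000L, Long.MAX_VALUE};

    // Returns the bound of the first bucket at which the cumulative bucket count
    // reaches p * count, where count is supplied by the caller.
    static long percentile(long[] buckets, long count, double p) {
        double target = p * count;
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) {
                return BOUNDS[i];
            }
        }
        // No bucket satisfied the target: fall back to the overflow bound.
        return BOUNDS[BOUNDS.length - 1];
    }

    public static void main(String[] args) {
        long[] buckets = {1, 1, 1, 0}; // 3 values recorded in the buckets

        // Count in sync with the buckets: p95 lands in the third bucket.
        System.out.println(percentile(buckets, 3, 0.95)); // prints 1000

        // Count one ahead of the buckets: target is 3.8, no bucket reaches it.
        System.out.println(percentile(buckets, 4, 0.95)); // prints 9223372036854775807
    }
}
```

With the stale count of 4, p75 still resolves correctly in this model (the target is exactly 3.0), which matches the observation that only percentiles above .75 are affected in this scenario.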
To Reproduce
I added a test case in my pull request to fix this bug which demonstrates the behavior.
Expected behavior
Snapshots should base the percentile calculation on the sum of the counts contained in the buckets, not on a count provided by a caller (which might be wrong or out of sync).
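A rough sketch of the expected behavior, under the same assumptions as before: the class name, bucket bounds, and method are hypothetical, and returning 0 for an empty snapshot is an arbitrary choice made for this illustration. The only point is that the total is derived from the bucket counts captured in the snapshot, so it cannot exceed them.

```java
// Hypothetical model of a snapshot that derives its total from its own buckets.
final class BucketCountSketch {
    // Assumed upper bounds of the buckets; the last entry is the overflow bucket.
    static final long[] BOUNDS = {10L, 100L, 1000L, Long.MAX_VALUE};

    static long percentile(long[] buckets, double p) {
        // Derive the total from the buckets instead of trusting an external counter.
        long total = 0;
        for (long b : buckets) {
            total += b;
        }
        if (total == 0) {
            return 0L; // assumption for this sketch: an empty snapshot reports 0
        }
        double target = p * total;
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) {
                return BOUNDS[i];
            }
        }
        // Reachable only if values really were recorded in the overflow bucket
        // and floating-point rounding pushed the target past the total.
        return BOUNDS[BOUNDS.length - 1];
    }

    public static void main(String[] args) {
        long[] buckets = {1, 1, 1, 0}; // 3 values recorded, external count irrelevant
        System.out.println(percentile(buckets, 0.95)); // prints 1000
    }
}
```

Because the target never exceeds the sum of the bucket counts, the scan always terminates in a bucket that actually holds recorded values, regardless of how far the timer's overall counter has advanced.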