1717
1818//! Frequency sketches for finding heavy hitters in data streams.
1919//!
20- //! This module implements the Frequent Items sketch from Apache DataSketches. It tracks
21- //! approximate frequencies in a stream and can report heavy hitters with explicit
22- //! error guarantees (no false negatives or no false positives).
20+ //! # Overview
2321//!
24- //! For background, see the Java documentation:
25- //! <https://apache.github.io/datasketches-java/9.0.0/org/apache/datasketches/frequencies/FrequentItemsSketch.html>
22+ //! This sketch is based on the paper ["A High-Performance Algorithm for Identifying Frequent Items
23+ //! in Data Streams"](https://arxiv.org/abs/1705.07001) by Daniel Anderson, Pryce Bevan, Kevin Lang,
24+ //! Edo Liberty, Lee Rhodes, and Justin Thaler.
2625//!
27- //! # Usage
26+ //! This sketch is useful for tracking approximate frequencies of items of type `T`, where `T`
27+ //! implements [`FrequentItemValue`]. Each update is a pair (`T` item, `u64` count) drawn from a
28+ //! multiset of such items; the count is optional. The true frequency of an item is defined to be
29+ //! the sum of its associated counts.
2830//!
29- //! ```rust
31+ //! This implementation provides the following capabilities:
32+ //! * Estimate the frequency of an item.
33+ //! * Return upper and lower bounds on the frequency of any item, such that the true frequency is
34+ //! always between them.
35+ //! * Return a global maximum error that holds for all items in the stream.
36+ //! * Return an array of frequent items that qualify under either
37+ //! [`ErrorType::NoFalsePositives`] or [`ErrorType::NoFalseNegatives`].
38+ //! * Merge itself with another sketch created from this module.
39+ //! * Serialize to bytes, or deserialize from bytes, for storage or transmission.
40+ //!
41+ //! # Accuracy
42+ //!
43+ //! If fewer than `0.75 * max_map_size` distinct items are inserted into the sketch, the estimated
44+ //! frequencies returned by the sketch will be exact.
45+ //!
46+ //! The logic of the frequent items sketch ensures that the stored counts never stray far from the
47+ //! true counts. More specifically, for any item, the sketch can return an estimate of the true
48+ //! frequency of that item, along with upper and lower bounds on the frequency that hold
49+ //! deterministically.
50+ //!
51+ //! For this implementation and for a specific active item, it is guaranteed that the true frequency
52+ //! will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
53+ //! Specifically, `(UB - LB) ≤ W * epsilon`, where `W` denotes the sum of all item counts, and
54+ //! `epsilon = 3.5/M`, where `M` is the `max_map_size`.
55+ //!
56+ //! This is the worst-case guarantee, and it applies to arbitrary inputs.[^1]
57+ //! For inputs typically seen in practice, `UB - LB` is usually much smaller.
58+ //!
59+ //! [^1]: For speed we do employ some randomization that introduces a small probability that our
60+ //! proof of the worst-case bound might not apply to a given run. However, we have ensured that this
61+ //! probability is extremely small. For example, if the stream causes one table purge (rebuild),
62+ //! our proof of the worst-case bound applies with probability at least `1 - 1E-14`. If the stream
63+ //! causes `1E9` purges, our proof applies with probability at least `1 - 1E-5`.
64+ //!
65+ //! # Background
66+ //!
67+ //! This code implements a variant of what is commonly known as the "Misra-Gries algorithm".
68+ //! Variants of it were discovered and rediscovered and redesigned several times over the years:
69+ //! * "Finding repeated elements", Misra, Gries, 1982
70+ //! * "Frequency estimation of Internet packet streams with limited space", Demaine, Lopez-Ortiz,
71+ //! Munro, 2002
72+ //! * "A simple algorithm for finding frequent elements in streams and bags", Karp, Shenker,
73+ //! Papadimitriou, 2003
74+ //! * "Efficient Computation of Frequent and Top-k Elements in Data Streams", Metwally, Agrawal,
75+ //! El Abbadi, 2006
76+ //!
77+ //! # Examples
78+ //!
79+ //! ```
3080//! # use datasketches::frequencies::ErrorType;
3181//! # use datasketches::frequencies::FrequentItemsSketch;
3282//! let mut sketch = FrequentItemsSketch::<i64>::new(64);
3888//!
3989//! # Serialization
4090//!
41- //! ```rust
91+ //! ```
4292//! # use datasketches::frequencies::FrequentItemsSketch;
4393//! let mut sketch = FrequentItemsSketch::<i64>::new(64);
4494//! sketch.update_with_count(42, 2);
@@ -52,6 +102,7 @@ mod reverse_purge_item_hash_map;
52102mod serialization;
53103mod sketch;
54104
105+ pub use self::serialization::FrequentItemValue;
55106pub use self::sketch::ErrorType;
56107pub use self::sketch::FrequentItemsSketch;
57108pub use self::sketch::Row;
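
For readers unfamiliar with the Misra-Gries family named in the Background section of the new docs, the following is a minimal standalone sketch of the classic algorithm. It is an illustration only, not the code in this module: `MisraGries` and its methods are hypothetical names, it counts unit updates rather than weighted ones, and the module's own purge logic (see `reverse_purge_item_hash_map`) is a different, faster variant. What it does show is why the lower and upper bounds always bracket the true frequency.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Classic Misra-Gries summary holding at most `k` counters.
/// Hypothetical type for illustration; NOT the datasketches implementation.
pub struct MisraGries<T> {
    k: usize,
    counters: HashMap<T, u64>,
    /// Number of "decrement all" steps performed so far. Each step can
    /// cost any single item at most one unit of its true count.
    offset: u64,
}

impl<T: Eq + Hash> MisraGries<T> {
    pub fn new(k: usize) -> Self {
        assert!(k > 0, "need at least one counter");
        Self { k, counters: HashMap::new(), offset: 0 }
    }

    /// Process one occurrence of `item`.
    pub fn update(&mut self, item: T) {
        if let Some(c) = self.counters.get_mut(&item) {
            // Item is already tracked: count it.
            *c += 1;
        } else if self.counters.len() < self.k {
            // Free slot available: start tracking the item.
            self.counters.insert(item, 1);
        } else {
            // Table full: decrement every counter, drop those that reach
            // zero, and discard the incoming item.
            self.counters.retain(|_, c| {
                *c -= 1;
                *c > 0
            });
            self.offset += 1;
        }
    }

    /// The stored counter never exceeds the true frequency.
    pub fn lower_bound(&self, item: &T) -> u64 {
        self.counters.get(item).copied().unwrap_or(0)
    }

    /// The true frequency never exceeds the counter plus the offset.
    pub fn upper_bound(&self, item: &T) -> u64 {
        self.lower_bound(item) + self.offset
    }

    /// Global error bound: `upper_bound - lower_bound` for every item.
    pub fn max_error(&self) -> u64 {
        self.offset
    }
}
```

In this simplified form every decrement step removes `k + 1` units of stream weight (one per counter plus the discarded item), so `max_error()` is at most `W / (k + 1)` for a stream of total weight `W`. The constant differs from the `3.5/M` bound documented above because the real sketch purges in batches rather than one unit at a time.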