Commit 2bae25d

DOC-5227 added Jedis probabilistic data type examples
1 parent 21d5cf5 commit 2bae25d

1 file changed: +223, -0 lines changed
  • content/develop/clients/jedis
@@ -0,0 +1,223 @@
1+
---
2+
categories:
3+
- docs
4+
- develop
5+
- stack
6+
- oss
7+
- rs
8+
- rc
9+
- oss
10+
- kubernetes
11+
- clients
12+
description: Learn how to use approximate calculations with Redis.
13+
linkTitle: Probabilistic data types
14+
title: Probabilistic data types
15+
weight: 5
16+
---
17+
18+
Redis supports several
19+
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
20+
that let you calculate values approximately rather than exactly.
21+
The types fall into two basic categories:
22+
23+
- [Set operations](#set-operations): These types let you calculate (approximately)
24+
the number of items in a set of distinct values, and whether or not a given value is
25+
a member of a set.
26+
- [Statistics](#statistics): These types give you an approximation of
27+
statistics such as the quantiles, ranks, and frequencies of numeric data points in
28+
a list.
29+
30+
To see why these approximate calculations would be useful, consider the task of
31+
counting the number of distinct IP addresses that access a website in one day.
32+
33+
Assuming that you already have code that supplies you with each IP
34+
address as a string, you could record the addresses in Redis using
35+
a [set]({{< relref "/develop/data-types/sets" >}}):
36+
37+
```java
38+
jedis.sadd("ip_tracker", new_ip_address);
39+
```
40+
41+
The set can only contain each member once, so if the same address
42+
appears again during the day, the new instance will not change
43+
the set. At the end of the day, you could get the exact number of
44+
distinct addresses using the `scard()` method:
45+
46+
```java
47+
long num_distinct_ips = jedis.scard("ip_tracker");
48+
```
49+
50+
This approach is simple, effective, and precise, but if your website
51+
is very busy, the `ip_tracker` set could become very large and consume
52+
a lot of memory.
53+
54+
You would probably round the count of distinct IP addresses to the
55+
nearest thousand or more to deliver the usage statistics, so
56+
getting it exactly right is not important. It would be useful
57+
if you could trade off some accuracy in exchange for lower memory
58+
consumption. The probabilistic data types provide exactly this kind of
59+
trade-off. Specifically, you can count the approximate number of items in a
60+
set using the [HyperLogLog](#set-cardinality) data type, as described below.
61+
62+
In general, the probabilistic data types let you perform approximations with a
63+
bounded degree of error that have much lower memory consumption or execution
64+
time than the equivalent precise calculations.
65+
66+
## Set operations
67+
68+
Redis supports the following approximate set operations:
69+
70+
- [Membership](#set-membership): The
71+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
72+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
73+
data types let you track whether or not a given item is a member of a set.
74+
- [Cardinality](#set-cardinality): The
75+
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
76+
data type gives you an approximate value for the number of items in a set, also
77+
known as the *cardinality* of the set.
78+
79+
The sections below describe these operations in more detail.
80+
81+
### Set membership
82+
83+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
84+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
85+
objects provide a set membership operation that lets you track whether or not a
86+
particular item has been added to a set. These two types provide different
87+
trade-offs for memory usage and speed, so you can select the best one for your
88+
use case. Note that for both types, there is an asymmetry between presence and
89+
absence of items in the set. If an item is reported as absent, then it is definitely
90+
absent, but if it is reported as present, then there is a small chance it may really be
91+
absent.
92+
93+
Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
94+
a Bloom filter records the presence or absence of the
95+
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
96+
This gives a very compact representation of the
97+
set's membership with a fixed memory size, regardless of how many items you
98+
add. The following example adds some names to a Bloom filter representing
99+
a list of users and checks for the presence or absence of users in the list.
100+
101+
{{< clients-example home_prob_dts bloom Java-Sync >}}
102+
{{< /clients-example >}}
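To see where the asymmetry between "present" and "absent" comes from, here is a minimal toy Bloom filter in plain Java. It is only a sketch for illustration: the class name `ToyBloomFilter`, the bit-array size, and the hashing scheme are all assumptions made for this example, and it is not how Redis implements its Bloom filter.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only (hypothetical class, not the Redis implementation).
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    ToyBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position for an item from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so positions differ between i values
        return Math.floorMod(h1 + i * h2, size);
    }

    // Adding an item only sets bits; the item itself is never stored.
    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // false means "definitely absent"; true means "probably present",
    // because other items may have set the same bits.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Every item that was added has all of its bits set, so a lookup can never miss it. An item that was never added is reported as present only if other items happened to set all of its bit positions, which is the source of the small false-positive rate.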
103+
104+
A Cuckoo filter has similar features to a Bloom filter, but also supports
105+
a deletion operation to remove hashes from a set, as shown in the example
106+
below.
107+
108+
{{< clients-example home_prob_dts cuckoo Java-Sync >}}
109+
{{< /clients-example >}}
110+
111+
Which of these two data types you choose depends on your use case.
112+
Bloom filters are generally faster than Cuckoo filters when adding new items,
113+
and also have better memory usage. Cuckoo filters are generally faster
114+
at checking membership and also support the delete operation. See the
115+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
116+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
117+
reference pages for more information and comparison between the two types.
118+
119+
### Set cardinality
120+
121+
A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
122+
object calculates the cardinality of a set. As you add
123+
items, the HyperLogLog tracks the number of distinct set members but
124+
doesn't let you retrieve them or query which items have been added.
125+
You can also merge two or more HyperLogLogs to find the cardinality of the
126+
[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
127+
represent.
128+
129+
{{< clients-example home_prob_dts hyperloglog Java-Sync >}}
130+
{{< /clients-example >}}
131+
132+
The main benefit that HyperLogLogs offer is their very low
133+
memory usage. They can count up to 2^64 items with less than
134+
1% standard error using a maximum of 12KB of memory. This makes
135+
them very useful for counting things like the number of distinct
136+
IP addresses that access a website or the number of distinct
137+
bank card numbers that make purchases within a day.
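As a rough check on these figures, the calculation below assumes that a full-size HyperLogLog uses $m = 2^{14} = 16384$ registers of 6 bits each, and that the standard error follows the usual HyperLogLog estimate of about $1.04/\sqrt{m}$. Both are assumptions for this sketch rather than details stated on this page.

$$
\text{memory} = 16384 \times 6\ \text{bits} = 98304\ \text{bits} = 12288\ \text{bytes} = 12\ \text{KB},
\qquad
\sigma \approx \frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.81\%
$$

Under those assumptions, the memory bound and the sub-1% error are both fixed by the register count and do not grow with the number of items counted.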
138+
139+
## Statistics
140+
141+
Redis supports several approximate statistical calculations
142+
on numeric data sets:
143+
144+
- [Frequency](#frequency): The
145+
[Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
146+
data type lets you find the approximate frequency of a labeled item in a data stream.
147+
- [Quantiles](#quantiles): The
148+
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
149+
data type estimates the quantile of a query value in a data stream.
150+
- [Ranking](#ranking): The
151+
[Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
152+
estimates the ranking of labeled items by frequency in a data stream.
153+
154+
The sections below describe these operations in more detail.
155+
156+
### Frequency
157+
158+
A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
159+
(CMS) object keeps count of a set of related items represented by
160+
string labels. The count is approximate, but you can specify
161+
how close you want to keep the count to the true value (as a fraction)
162+
and the acceptable probability of failing to keep it in this
163+
desired range. For example, you can request that the count should
164+
stay within 0.1% of the true value and have a 0.05% probability
165+
of going outside this limit. The example below shows how to create
166+
a Count-min sketch object, add data to it, and then query it.
167+
168+
{{< clients-example home_prob_dts cms Java-Sync >}}
169+
{{< /clients-example >}}
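To give some intuition for why the count is approximate, here is a minimal toy Count-min sketch in plain Java. It is a sketch under stated assumptions: the class name `ToyCountMinSketch`, the hash mixing, and the way width and depth are chosen are all hypothetical and are not the Redis implementation.

```java
// Toy Count-min sketch for illustration only (hypothetical class, not the Redis implementation).
final class ToyCountMinSketch {
    private final long[][] table;
    private final int width;
    private final int depth;

    ToyCountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.table = new long[depth][width];
    }

    // Mix the row index into the hash so that each row acts like a different hash function.
    private int column(String item, int row) {
        int h = item.hashCode() + row * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    // Add the amount to one counter per row.
    void incrBy(String item, long amount) {
        for (int row = 0; row < depth; row++) {
            table[row][column(item, row)] += amount;
        }
    }

    // The estimate is the smallest of the item's counters. Collisions with other
    // items can only inflate a counter, so the estimate never undercounts.
    long query(String item) {
        long estimate = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            estimate = Math.min(estimate, table[row][column(item, row)]);
        }
        return estimate;
    }
}
```

The table size is fixed when the sketch is created, which is why memory use does not grow with the number of distinct items. A wider table reduces collisions (tighter error), and more rows reduce the chance that every counter for an item is inflated at once.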
170+
171+
The advantage of using a CMS over keeping an exact count with a
172+
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
173+
is that a CMS has very low and fixed memory usage, even for
174+
large numbers of items. Use CMS objects to keep daily counts of
175+
items sold, accesses to individual web pages on your site, and
176+
other similar statistics.
177+
178+
### Quantiles
179+
180+
A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
181+
below which a certain fraction of samples lie. For example, with
182+
a set of measurements of people's heights, the quantile of 0.75 is
183+
the value of height below which 75% of all people's heights lie.
184+
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
185+
to quantiles, except that the fraction is expressed as a percentage.
186+
187+
A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
188+
object can estimate quantiles from a set of values added to it
189+
without having to store each value in the set explicitly. This can
190+
save a lot of memory when you have a large number of samples.
191+
192+
The example below shows how to add data samples to a t-digest
193+
object and obtain some basic statistics, such as the minimum and
194+
maximum values, the quantile of 0.75, and the
195+
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
196+
(CDF), which is effectively the inverse of the quantile function. It also
197+
shows how to merge two or more t-digest objects to query the combined
198+
data set.
199+
200+
{{< clients-example home_prob_dts tdigest Java-Sync >}}
201+
{{< /clients-example >}}
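The relationship between the quantile and the CDF can be sketched with an exact calculation in plain Java. This is not a t-digest: the hypothetical `ExactQuantiles` class below stores and sorts every sample, which is exactly the cost a t-digest avoids, and a t-digest may return slightly different (interpolated) values. It assumes at least one sample is supplied.

```java
import java.util.Arrays;

// Exact quantile/CDF calculation for illustration only (hypothetical class, not a t-digest).
final class ExactQuantiles {
    private final double[] sorted;

    // Assumes at least one sample.
    ExactQuantiles(double... samples) {
        sorted = samples.clone();
        Arrays.sort(sorted);
    }

    // Smallest sample v such that at least a fraction q of the samples are <= v.
    double quantile(double q) {
        int rank = (int) Math.ceil(q * sorted.length);
        int index = Math.min(Math.max(rank - 1, 0), sorted.length - 1);
        return sorted[index];
    }

    // Fraction of the samples that are <= x.
    double cdf(double x) {
        int count = 0;
        for (double v : sorted) {
            if (v <= x) {
                count++;
            }
        }
        return (double) count / sorted.length;
    }
}
```

With this definition, `quantile()` maps a fraction to a value and `cdf()` maps a value back to a fraction, so applying one after the other on a stored sample returns the starting point.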
202+
203+
A t-digest object also supports several other related commands, such
204+
as querying by rank. See the
205+
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
206+
reference for more information.
207+
208+
### Ranking
209+
210+
A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
211+
object estimates the rankings of different labeled items in a data
212+
stream according to frequency. For example, you could use this to
213+
track the top ten most frequently accessed pages on a website, or the
214+
top five most popular items sold.
215+
216+
The example below adds several different items to a Top-K object
217+
that tracks the top three items (this is the second parameter to
218+
the `topkReserve()` method). It also shows how to list the
219+
top *k* items and query whether or not a given item is in the
220+
list.
221+
222+
{{< clients-example home_prob_dts topk Java-Sync >}}
223+
{{< /clients-example >}}
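For comparison, the exact version of this calculation can be sketched in plain Java. The hypothetical `ExactTopK` class below is not the algorithm behind the Top-K data type: it keeps one counter for every distinct item it sees, so its memory use grows with the number of distinct items, whereas a Top-K object trades exactness for a bounded memory footprint.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Exact top-k by frequency for illustration only (hypothetical class, not the Top-K algorithm).
final class ExactTopK {
    static List<String> topK(List<String> stream, int k) {
        // One counter per distinct item: exact, but unbounded memory.
        Map<String, Long> counts = new HashMap<>();
        for (String item : stream) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts.entrySet().stream()
            // Highest count first; break ties alphabetically so the result is deterministic.
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```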
