Commit a4d2d15

Merge pull request #1743 from redis/DOC-5227-prob-dt-jedis
DOC-5227 Jedis probabilistic examples
2 parents 0787c24 + aa5a78a commit a4d2d15

2 files changed

+385
-1
lines changed

Lines changed: 384 additions & 0 deletions
@@ -0,0 +1,384 @@
1+
---
2+
categories:
3+
- docs
4+
- develop
5+
- stack
6+
- oss
7+
- rs
8+
- rc
9+
- oss
10+
- kubernetes
11+
- clients
12+
description: Learn how to use approximate calculations with Redis.
13+
linkTitle: Probabilistic data types
14+
title: Probabilistic data types
15+
weight: 5
16+
---
17+
18+
Redis supports several
19+
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
20+
that let you calculate values approximately rather than exactly.
21+
The types fall into two basic categories:
22+
23+
- [Set operations](#set-operations): These types let you calculate (approximately)
24+
the number of items in a set of distinct values, and whether or not a given value is
25+
a member of a set.
26+
- [Statistics](#statistics): These types give you an approximation of
27+
statistics such as the quantiles, ranks, and frequencies of numeric data points in
28+
a list.
29+
30+
To see why these approximate calculations would be useful, consider the task of
31+
counting the number of distinct IP addresses that access a website in one day.
32+
33+
Assuming that you already have code that supplies you with each IP
34+
address as a string, you could record the addresses in Redis using
35+
a [set]({{< relref "/develop/data-types/sets" >}}):
36+
37+
```java
38+
jedis.sadd("ip_tracker", new_ip_address);
39+
```
40+
41+
The set can only contain each value once, so if the same address
42+
appears again during the day, the new instance will not change
43+
the set. At the end of the day, you could get the exact number of
44+
distinct addresses using the `scard()` function:
45+
46+
```java
47+
long num_distinct_ips = jedis.scard("ip_tracker");
48+
```
49+
50+
This approach is simple, effective, and precise, but if your website
51+
is very busy, the `ip_tracker` set could become very large and consume
52+
a lot of memory.
53+
54+
You would probably round the count of distinct IP addresses to the
55+
nearest thousand or more to deliver the usage statistics, so
56+
getting it exactly right is not important. It would be useful
57+
if you could trade off some accuracy in exchange for lower memory
58+
consumption. The probabilistic data types provide exactly this kind of
59+
trade-off. Specifically, you can count the approximate number of items in a
60+
set using the [HyperLogLog](#set-cardinality) data type, as described below.
61+
62+
In general, the probabilistic data types let you perform approximations with a
63+
bounded degree of error that have much lower memory consumption or execution
64+
time than the equivalent precise calculations.
65+
66+
## Set operations
67+
68+
Redis supports the following approximate set operations:
69+
70+
- [Membership](#set-membership): The
71+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
72+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
73+
data types let you track whether or not a given item is a member of a set.
74+
- [Cardinality](#set-cardinality): The
75+
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
76+
data type gives you an approximate value for the number of items in a set, also
77+
known as the *cardinality* of the set.
78+
79+
The sections below describe these operations in more detail.
80+
81+
### Set membership
82+
83+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
84+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
85+
objects provide a set membership operation that lets you track whether or not a
86+
particular item has been added to a set. These two types provide different
87+
trade-offs for memory usage and speed, so you can select the best one for your
88+
use case. Note that for both types, there is an asymmetry between presence and
89+
absence of items in the set. If an item is reported as absent, then it is definitely
90+
absent, but if it is reported as present, then there is a small chance it may really be
91+
absent.
92+
93+
Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
94+
a Bloom filter records the presence or absence of the
95+
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
96+
This gives a very compact representation of the
97+
set's membership with a fixed memory size, regardless of how many items you
98+
add. The following example adds some names to a Bloom filter representing
99+
a list of users and checks for the presence or absence of users in the list.
100+
101+
```java
102+
List<Boolean> res1 = jedis.bfMAdd(
103+
"recorded_users",
104+
"andy", "cameron", "david", "michelle"
105+
);
106+
System.out.println(res1); // >>> [true, true, true, true]
107+
108+
boolean res2 = jedis.bfExists("recorded_users", "cameron");
109+
System.out.println(res2); // >>> true
110+
111+
boolean res3 = jedis.bfExists("recorded_users", "kaitlyn");
112+
System.out.println(res3); // >>> false
113+
```
114+
<!--< clients-example home_prob_dts bloom Java-Sync >}}
115+
< /clients-example >}}-->
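
The asymmetry between "definitely absent" and "probably present" follows from how a Bloom filter stores hashes. The sketch below is a toy, in-memory illustration only and is not how Redis implements the type: the class name `ToyBloomFilter`, the bit-array size, and the hashing scheme are all assumptions chosen for brevity. Adding an item sets a few bits; a lookup reports "present" only if all of those bits are set, so an added item can never be reported absent, but unrelated items can collide on the same bits.

```java
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public ToyBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    // This scheme is illustrative, not the one Redis uses.
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record the item by setting one bit per hash function.
    public void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // False means "definitely not added"; true means "probably added".
    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter users = new ToyBloomFilter(1024, 3);
        for (String name : new String[] {"andy", "cameron", "david", "michelle"}) {
            users.add(name);
        }
        // Always true, because add() set every bit this lookup checks.
        System.out.println(users.mightContain("cameron"));
        // Usually false, but a hash collision could make it true.
        System.out.println(users.mightContain("kaitlyn"));
    }
}
```

Note that the filter stores only bits, never the strings themselves, which is why its memory use does not grow as you add items.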
116+
117+
A Cuckoo filter has similar features to a Bloom filter, but also supports
118+
a deletion operation to remove hashes from a set, as shown in the example
119+
below.
120+
121+
<!--< clients-example home_prob_dts cuckoo Java-Sync >}}
122+
< /clients-example >}}-->
123+
```java
124+
boolean res4 = jedis.cfAdd("other_users", "paolo");
125+
System.out.println(res4); // >>> true
126+
127+
boolean res5 = jedis.cfAdd("other_users", "kaitlyn");
128+
System.out.println(res5); // >>> true
129+
130+
boolean res6 = jedis.cfAdd("other_users", "rachel");
131+
System.out.println(res6); // >>> true
132+
133+
List<Boolean> res7 = jedis.cfMExists(
134+
"other_users",
135+
"paolo", "rachel", "andy"
136+
);
137+
System.out.println(res7); // >>> [true, true, false]
138+
139+
boolean res8 = jedis.cfDel("other_users", "paolo");
140+
System.out.println(res8); // >>> true
141+
142+
boolean res9 = jedis.cfExists("other_users", "paolo");
143+
System.out.println(res9); // >>> false
144+
```
145+
146+
Which of these two data types you choose depends on your use case.
147+
Bloom filters are generally faster than Cuckoo filters when adding new items,
148+
and also have better memory usage. Cuckoo filters are generally faster
149+
at checking membership and also support the delete operation. See the
150+
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
151+
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
152+
reference pages for more information and comparison between the two types.
153+
154+
### Set cardinality
155+
156+
A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
157+
object calculates the cardinality of a set. As you add
158+
items, the HyperLogLog tracks the number of distinct set members but
159+
doesn't let you retrieve them or query which items have been added.
160+
You can also merge two or more HyperLogLogs to find the cardinality of the
161+
[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
162+
represent.
163+
164+
<!--< clients-example home_prob_dts hyperloglog Java-Sync >}}
165+
< /clients-example >}}-->
166+
```java
167+
long res10 = jedis.pfadd("group:1", "andy", "cameron", "david");
168+
System.out.println(res10); // >>> 1
169+
170+
long res11 = jedis.pfcount("group:1");
171+
System.out.println(res11); // >>> 3
172+
173+
long res12 = jedis.pfadd(
174+
"group:2",
175+
"kaitlyn", "michelle", "paolo", "rachel"
176+
);
177+
System.out.println(res12); // >>> 1
178+
179+
long res13 = jedis.pfcount("group:2");
180+
System.out.println(res13); // >>> 4
181+
182+
String res14 = jedis.pfmerge("both_groups", "group:1", "group:2");
183+
System.out.println(res14); // >>> OK
184+
185+
long res15 = jedis.pfcount("both_groups");
186+
System.out.println(res15); // >>> 7
187+
```
188+
189+
The main benefit that HyperLogLogs offer is their very low
190+
memory usage. They can count up to 2^64 items with less than
191+
1% standard error using a maximum of 12KB of memory. This makes
192+
them very useful for counting things like the number of distinct
193+
IP addresses that access a website or the number of distinct
194+
bank card numbers that make purchases within a day.
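
As a rough check on the 12KB and sub-1% figures, the arithmetic can be sketched as below. It assumes the commonly described dense HyperLogLog layout of 16384 registers of 6 bits each and the standard HyperLogLog error estimate of 1.04/sqrt(m); treat both as assumptions for illustration rather than a statement about any particular Redis version. The class name `HllFootprint` and the 15-bytes-per-address figure are hypothetical.

```java
public class HllFootprint {
    // Assumed dense layout: 2^14 registers, 6 bits per register.
    static final int REGISTERS = 16384;
    static final int BITS_PER_REGISTER = 6;

    // Size of the register array in bytes.
    static long denseBytes() {
        return (long) REGISTERS * BITS_PER_REGISTER / 8;
    }

    // Standard HyperLogLog error estimate: 1.04 / sqrt(number of registers).
    static double standardError() {
        return 1.04 / Math.sqrt(REGISTERS);
    }

    // Lower bound for an exact set: raw member bytes only, ignoring overhead.
    static long exactSetBytes(long members, int bytesPerMember) {
        return members * bytesPerMember;
    }

    public static void main(String[] args) {
        System.out.println("HyperLogLog bytes: " + denseBytes());
        System.out.printf("Standard error: %.2f%%%n", standardError() * 100);
        // Hypothetical workload: one million addresses of about 15 bytes each.
        System.out.println("Exact set bytes (at least): " + exactSetBytes(1_000_000L, 15));
    }
}
```

Under these assumptions the HyperLogLog stays at 12288 bytes however many addresses you add, while the exact set grows linearly with the number of distinct members.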
195+
196+
## Statistics
197+
198+
Redis supports several approximate statistical calculations
199+
on numeric data sets:
200+
201+
- [Frequency](#frequency): The
202+
[Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
203+
data type lets you find the approximate frequency of a labeled item in a data stream.
204+
- [Quantiles](#quantiles): The
205+
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
206+
data type estimates the quantile of a query value in a data stream.
207+
- [Ranking](#ranking): The
208+
[Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
209+
estimates the ranking of labeled items by frequency in a data stream.
210+
211+
The sections below describe these operations in more detail.
212+
213+
### Frequency
214+
215+
A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
216+
(CMS) object keeps count of a set of related items represented by
217+
string labels. The count is approximate, but you can specify
218+
how close you want to keep the count to the true value (as a fraction)
219+
and the acceptable probability of failing to keep it in this
220+
desired range. For example, you can request that the count should
221+
stay within 0.1% of the true value and have a 0.05% probability
222+
of going outside this limit. The example below shows how to create
223+
a Count-min sketch object, add data to it, and then query it.
224+
225+
<!--< clients-example home_prob_dts cms Java-Sync >}}
226+
< /clients-example >}}-->
227+
```java
228+
// Specify that you want to keep the counts within 0.01
229+
// (1%) of the true value with a 0.005 (0.5%) chance
230+
// of going outside this limit.
231+
String res16 = jedis.cmsInitByProb("items_sold", 0.01, 0.005);
232+
System.out.println(res16); // >>> OK
233+
234+
Map<String, Long> firstItemIncrements = new HashMap<>();
235+
firstItemIncrements.put("bread", 300L);
236+
firstItemIncrements.put("tea", 200L);
237+
firstItemIncrements.put("coffee", 200L);
238+
firstItemIncrements.put("beer", 100L);
239+
240+
List<Long> res17 = jedis.cmsIncrBy("items_sold",
241+
firstItemIncrements
242+
);
243+
res17.sort(null);
244+
System.out.println(res17); // >>> [100, 200, 200, 300]
245+
246+
Map<String, Long> secondItemIncrements = new HashMap<>();
247+
secondItemIncrements.put("bread", 100L);
248+
secondItemIncrements.put("coffee", 150L);
249+
250+
List<Long> res18 = jedis.cmsIncrBy("items_sold",
251+
secondItemIncrements
252+
);
253+
res18.sort(null);
254+
System.out.println(res18); // >>> [350, 400]
255+
256+
List<Long> res19 = jedis.cmsQuery(
257+
"items_sold",
258+
"bread", "tea", "coffee", "beer"
259+
);
260+
res19.sort(null);
261+
System.out.println(res19); // >>> [100, 200, 350, 400]
262+
```
263+
264+
The advantage of using a CMS over keeping an exact count with a
265+
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
266+
is that a CMS has very low and fixed memory usage, even for
267+
large numbers of items. Use CMS objects to keep daily counts of
268+
items sold, accesses to individual web pages on your site, and
269+
other similar statistics.
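
To give a feel for why the memory use is fixed, the sketch below computes the counter-grid dimensions from the textbook Count-min sketch analysis: width `ceil(e / error)` and depth `ceil(ln(1 / probability))`. This is the textbook sizing only; Redis may derive its dimensions differently, and the class name `CmsSizing` is hypothetical. In the textbook analysis, the error bound is expressed as a fraction of the total of all counts added to the sketch.

```java
public class CmsSizing {
    // Textbook width: counters per row needed for error fraction epsilon.
    static int width(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // Textbook depth: number of rows needed for failure probability delta.
    static int depth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }

    // Total counters in the grid; this does not depend on how many items you add.
    static long counters(double epsilon, double delta) {
        return (long) width(epsilon) * depth(delta);
    }

    public static void main(String[] args) {
        // The same parameters as the cmsInitByProb() call in the example above.
        double error = 0.01;
        double probability = 0.005;
        System.out.println("width = " + width(error));
        System.out.println("depth = " + depth(probability));
        System.out.println("counters = " + counters(error, probability));
    }
}
```

The grid size depends only on the two accuracy parameters, so tracking ten labels or ten million labels uses the same number of counters.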
270+
271+
### Quantiles
272+
273+
A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
274+
below which a certain fraction of samples lie. For example, with
275+
a set of measurements of people's heights, the quantile of 0.75 is
276+
the value of height below which 75% of all people's heights lie.
277+
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
278+
to quantiles, except that the fraction is expressed as a percentage.
279+
280+
A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
281+
object can estimate quantiles from a set of values added to it
282+
without having to store each value in the set explicitly. This can
283+
save a lot of memory when you have a large number of samples.
284+
285+
The example below shows how to add data samples to a t-digest
286+
object and obtain some basic statistics, such as the minimum and
287+
maximum values, the quantile of 0.75, and the
288+
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
289+
(CDF), which is effectively the inverse of the quantile function. It also
290+
shows how to merge two or more t-digest objects to query the combined
291+
data set.
292+
293+
<!--< clients-example home_prob_dts tdigest Java-Sync >}}
294+
< /clients-example >}}-->
295+
```java
296+
String res20 = jedis.tdigestCreate("male_heights");
297+
System.out.println(res20); // >>> OK
298+
299+
String res21 = jedis.tdigestAdd("male_heights",
300+
175.5, 181, 160.8, 152, 177, 196, 164);
301+
System.out.println(res21); // >>> OK
302+
303+
double res22 = jedis.tdigestMin("male_heights");
304+
System.out.println(res22); // >>> 152.0
305+
306+
double res23 = jedis.tdigestMax("male_heights");
307+
System.out.println(res23); // >>> 196.0
308+
309+
List<Double> res24 = jedis.tdigestQuantile("male_heights", 0.75);
310+
System.out.println(res24); // >>> [181.0]
311+
312+
// Note that the CDF value for 181 is not exactly 0.75.
313+
// Both values are estimates.
314+
List<Double> res25 = jedis.tdigestCDF("male_heights", 181);
315+
System.out.println(res25); // >>> [0.7857142857142857]
316+
317+
String res26 = jedis.tdigestCreate("female_heights");
318+
System.out.println(res26); // >>> OK
319+
320+
String res27 = jedis.tdigestAdd("female_heights",
321+
155.5, 161, 168.5, 170, 157.5, 163, 171);
322+
System.out.println(res27); // >>> OK
323+
324+
List<Double> res28 = jedis.tdigestQuantile("female_heights", 0.75);
325+
System.out.println(res28); // >>> [170.0]
326+
327+
String res29 = jedis.tdigestMerge(
328+
"all_heights",
329+
"male_heights", "female_heights"
330+
);
331+
System.out.println(res29); // >>> OK
332+
List<Double> res30 = jedis.tdigestQuantile("all_heights", 0.75);
333+
System.out.println(res30); // >>> [175.5]
334+
```
335+
336+
A t-digest object also supports several other related commands, such
337+
as querying by rank. See the
338+
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
339+
reference for more information.
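
For comparison with the estimates above, you can compute reference values exactly when the data set is small enough to keep in memory. The sketch below is plain Java with no Redis calls. The class name `ExactQuantiles` is hypothetical, and the two conventions it uses (nearest-rank quantiles, and counting a sample equal to the query value as half below it in the CDF) are assumptions; other definitions give slightly different answers.

```java
import java.util.Arrays;

public class ExactQuantiles {
    // Nearest-rank quantile: the smallest sample such that at least
    // a fraction q of the samples are less than or equal to it.
    static double quantile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Empirical CDF: fraction of samples below x, counting ties as half.
    static double cdf(double[] samples, double x) {
        double below = 0;
        for (double s : samples) {
            if (s < x) {
                below += 1.0;
            } else if (s == x) {
                below += 0.5;
            }
        }
        return below / samples.length;
    }

    public static void main(String[] args) {
        // The same sample heights as the "male_heights" t-digest example.
        double[] maleHeights = {175.5, 181, 160.8, 152, 177, 196, 164};
        System.out.println(quantile(maleHeights, 0.75));
        System.out.println(cdf(maleHeights, 181));
    }
}
```

The exact approach must keep and sort every sample, which is the memory cost that a t-digest avoids.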
340+
341+
### Ranking
342+
343+
A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
344+
object estimates the rankings of different labeled items in a data
345+
stream according to frequency. For example, you could use this to
346+
track the top ten most frequently accessed pages on a website, or the
347+
top five most popular items sold.
348+
349+
The example below adds several different items to a Top-K object
350+
that tracks the top three items (this is the second parameter to
351+
the `topkReserve()` method). It also shows how to list the
352+
top *k* items and query whether or not a given item is in the
353+
list.
354+
355+
<!--< clients-example home_prob_dts topk Java-Sync >}}
356+
< /clients-example >}}-->
357+
```java
358+
String res31 = jedis.topkReserve("top_3_songs", 3L, 2000L, 7L, 0.925D);
359+
System.out.println(res31); // >>> OK
360+
361+
Map<String, Long> songIncrements = new HashMap<>();
362+
songIncrements.put("Starfish Trooper", 3000L);
363+
songIncrements.put("Only one more time", 1850L);
364+
songIncrements.put("Rock me, Handel", 1325L);
365+
songIncrements.put("How will anyone know?", 3890L);
366+
songIncrements.put("Average lover", 4098L);
367+
songIncrements.put("Road to everywhere", 770L);
368+
369+
List<String> res32 = jedis.topkIncrBy("top_3_songs",
370+
songIncrements
371+
);
372+
System.out.println(res32);
373+
// >>> [null, null, null, null, null, Rock me, Handel]
374+
375+
List<String> res33 = jedis.topkList("top_3_songs");
376+
System.out.println(res33);
377+
// >>> [Average lover, How will anyone know?, Starfish Trooper]
378+
379+
List<Boolean> res34 = jedis.topkQuery("top_3_songs",
380+
"Starfish Trooper", "Road to everywhere"
381+
);
382+
System.out.println(res34);
383+
// >>> [true, false]
384+
```
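
For contrast, an exact ranking needs a counter for every distinct item, not just the top few. The sketch below computes the exact top three from the same song counts using only the Java standard library; the class name `ExactTopK` and its `topK()` helper are hypothetical and are not part of Jedis.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExactTopK {
    // Sort every entry by count, highest first, and keep the first k labels.
    static List<String> topK(Map<String, Long> counts, int k) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> songCounts = new HashMap<>();
        songCounts.put("Starfish Trooper", 3000L);
        songCounts.put("Only one more time", 1850L);
        songCounts.put("Rock me, Handel", 1325L);
        songCounts.put("How will anyone know?", 3890L);
        songCounts.put("Average lover", 4098L);
        songCounts.put("Road to everywhere", 770L);

        System.out.println(topK(songCounts, 3));
        // >>> [Average lover, How will anyone know?, Starfish Trooper]
    }
}
```

This works well for six songs, but the map grows with every new label in the stream, whereas a Top-K object keeps its memory use bounded at the cost of approximate results.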

content/develop/clients/redis-py/prob.md

Lines changed: 1 addition & 1 deletion
@@ -222,7 +222,7 @@ sketch commands.
222222

223223
```py
224224
# Specify that you want to keep the counts within 0.01
225-
# (0.1%) of the true value with a 0.005 (0.05%) chance
225+
# (1%) of the true value with a 0.005 (0.5%) chance
226226
# of going outside this limit.
227227
res16 = r.cms().initbyprob("items_sold", 0.01, 0.005)
228228
print(res16) # >>> True
