Merge pull request #1743 from redis/DOC-5227-prob-dt-jedis

andy-stark-redis · web-flow · commit a4d2d158f43d · 2025-06-24T15:03:42.000+01:00
DOC-5227 Jedis probabilistic examples
diff --git a/content/develop/clients/jedis/prob.md b/content/develop/clients/jedis/prob.md
@@ -0,0 +1,384 @@
+---
+categories:
+- docs
+- develop
+- stack
+- oss
+- rs
+- rc
+- kubernetes
+- clients
+description: Learn how to use approximate calculations with Redis.
+linkTitle: Probabilistic data types
+title: Probabilistic data types
+weight: 5
+---
+
+Redis supports several
+[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
+that let you calculate values approximately rather than exactly.
+The types fall into two basic categories:
+
+-   [Set operations](#set-operations): These types let you calculate (approximately)
+    the number of items in a set of distinct values, and whether or not a given value is
+    a member of a set.
+-   [Statistics](#statistics): These types give you an approximation of
+    statistics such as the quantiles, ranks, and frequencies of numeric data points in
+    a list.
+
+To see why these approximate calculations would be useful, consider the task of
+counting the number of distinct IP addresses that access a website in one day.
+
+Assuming that you already have code that supplies you with each IP
+address as a string, you could record the addresses in Redis using
+a [set]({{< relref "/develop/data-types/sets" >}}):
+
+```java
+jedis.sadd("ip_tracker", new_ip_address);
+```
+
+A set stores each distinct member only once, so if the same address
+appears again during the day, the new instance will not change
+the set. At the end of the day, you could get the exact number of
+distinct addresses using the `scard()` method:
+
+```java
+long num_distinct_ips = jedis.scard("ip_tracker");
+```
+
+This approach is simple, effective, and precise, but if your website
+is very busy, the `ip_tracker` set could become very large and consume
+a lot of memory.
+
+You would probably round the count of distinct IP addresses to the
+nearest thousand or more to deliver the usage statistics, so
+getting it exactly right is not important. It would be useful
+if you could trade off some accuracy in exchange for lower memory
+consumption. The probabilistic data types provide exactly this kind of
+trade-off. Specifically, you can count the approximate number of items in a
+set using the [HyperLogLog](#set-cardinality) data type, as described below.
+
+In general, the probabilistic data types let you perform approximations with a
+bounded degree of error that have much lower memory consumption or execution
+time than the equivalent precise calculations.
+
+## Set operations
+
+Redis supports the following approximate set operations:
+
+-   [Membership](#set-membership): The
+    [Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
+    [Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
+    data types let you track whether or not a given item is a member of a set.
+-   [Cardinality](#set-cardinality): The
+    [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
+    data type gives you an approximate value for the number of items in a set, also
+    known as the *cardinality* of the set.
+
+The sections below describe these operations in more detail.
+
+### Set membership
+
+[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
+[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
+objects provide a set membership operation that lets you track whether or not a
+particular item has been added to a set. These two types provide different
+trade-offs for memory usage and speed, so you can select the best one for your
+use case. Note that for both types, there is an asymmetry between presence and
+absence of items in the set. If an item is reported as absent, then it is definitely
+absent, but if it is reported as present, then there is a small chance it may really be
+absent.
+
+Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
+a Bloom filter records the presence or absence of the
+[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
+This gives a very compact representation of the
+set's membership with a fixed memory size, regardless of how many items you
+add. The following example adds some names to a Bloom filter representing
+a list of users and checks for the presence or absence of users in the list.
+
+```java
+List<Boolean> res1 = jedis.bfMAdd(
+    "recorded_users",
+    "andy", "cameron", "david", "michelle"
+);
+System.out.println(res1);  // >>> [true, true, true, true]
+
+boolean res2 = jedis.bfExists("recorded_users", "cameron");
+System.out.println(res2);  // >>> true
+
+boolean res3 = jedis.bfExists("recorded_users", "kaitlyn");
+System.out.println(res3);  // >>> false
+```
+<!--< clients-example home_prob_dts bloom Java-Sync >}}
+< /clients-example >}}-->
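+
+The asymmetry between presence and absence follows from the way a filter
+shares its bits between items. As a rough sketch (plain Java with no Redis
+calls; the class name `ToyBloomFilter`, the bit-array size, and the hashing
+scheme are illustrative assumptions, not the way Redis implements its filters),
+adding an item sets a few bit positions, and a lookup reports "present" only
+if all of those positions are set:
+
+```java
+import java.util.BitSet;
+
+// Toy Bloom filter for illustration only. The sizing and hashing here are
+// assumptions chosen for brevity, not the scheme that Redis uses.
+public class ToyBloomFilter {
+    private final BitSet bits;
+    private final int size;
+    private final int numHashes;
+
+    public ToyBloomFilter(int size, int numHashes) {
+        this.bits = new BitSet(size);
+        this.size = size;
+        this.numHashes = numHashes;
+    }
+
+    // Derive the i-th bit position from the item's hash code.
+    private int position(String item, int i) {
+        int h1 = item.hashCode();
+        int h2 = (h1 >>> 16) | 1;  // Force an odd step so it is never zero.
+        return Math.floorMod(h1 + i * h2, size);
+    }
+
+    public void add(String item) {
+        for (int i = 0; i < numHashes; i++) {
+            bits.set(position(item, i));
+        }
+    }
+
+    // false means "definitely absent"; true means "probably present".
+    public boolean mightContain(String item) {
+        for (int i = 0; i < numHashes; i++) {
+            if (!bits.get(position(item, i))) {
+                return false;
+            }
+        }
+        return true;
+    }
+
+    public static void main(String[] args) {
+        ToyBloomFilter users = new ToyBloomFilter(1024, 3);
+        for (String name : new String[] {"andy", "cameron", "david", "michelle"}) {
+            users.add(name);
+        }
+        // Always true: an added item can never be reported as absent.
+        System.out.println(users.mightContain("cameron"));
+        // Usually false, but a collision on every bit would report true.
+        System.out.println(users.mightContain("kaitlyn"));
+    }
+}
+```
+
+The memory used stays at `size` bits however many items you add. The cost of
+adding more items is a higher chance of false positives, not more storage.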
+
+A Cuckoo filter has similar features to a Bloom filter, but also supports
+a deletion operation to remove hashes from a set, as shown in the example
+below.
+
+<!--< clients-example home_prob_dts cuckoo Java-Sync >}}
+< /clients-example >}}-->
+```java
+boolean res4 = jedis.cfAdd("other_users", "paolo");
+System.out.println(res4);  // >>> true
+
+boolean res5 = jedis.cfAdd("other_users", "kaitlyn");
+System.out.println(res5);  // >>> true
+
+boolean res6 = jedis.cfAdd("other_users", "rachel");
+System.out.println(res6);  // >>> true
+
+List<Boolean> res7 = jedis.cfMExists(
+    "other_users",
+    "paolo", "rachel", "andy"
+);
+System.out.println(res7);  // >>> [true, true, false]
+
+boolean res8 = jedis.cfDel("other_users", "paolo");
+System.out.println(res8);  // >>> true
+
+boolean res9 = jedis.cfExists("other_users", "paolo");
+System.out.println(res9);  // >>> false
+```
+
+Which of these two data types you choose depends on your use case.
+Bloom filters are generally faster than Cuckoo filters when adding new items,
+and also have better memory usage. Cuckoo filters are generally faster
+at checking membership and also support the delete operation. See the
+[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
+[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
+reference pages for more information and comparison between the two types.
+
+### Set cardinality
+
+A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
+object calculates the cardinality of a set. As you add
+items, the HyperLogLog tracks the number of distinct set members but
+doesn't let you retrieve them or query which items have been added.
+You can also merge two or more HyperLogLogs to find the cardinality of the
+[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
+represent.
+
+<!--< clients-example home_prob_dts hyperloglog Java-Sync >}}
+< /clients-example >}}-->
+```java
+long res10 = jedis.pfadd("group:1", "andy", "cameron", "david");
+System.out.println(res10);  // >>> 1
+
+long res11 = jedis.pfcount("group:1");
+System.out.println(res11);  // >>> 3
+
+long res12 = jedis.pfadd(
+    "group:2",
+    "kaitlyn", "michelle", "paolo", "rachel"
+);
+System.out.println(res12);  // >>> 1
+
+long res13 = jedis.pfcount("group:2");
+System.out.println(res13);  // >>> 4
+
+String res14 = jedis.pfmerge("both_groups", "group:1", "group:2");
+System.out.println(res14);  // >>> OK
+
+long res15 = jedis.pfcount("both_groups");
+System.out.println(res15);  // >>> 7
+```
+
+The main benefit that HyperLogLogs offer is their very low
+memory usage. They can count up to 2^64 items with less than
+1% standard error using a maximum of 12KB of memory. This makes
+them very useful for counting things like the number of distinct
+IP addresses that access a website or the number of distinct
+bank card numbers that make purchases within a day.
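+
+You can check these figures with some quick arithmetic. The sketch below
+assumes the commonly cited HyperLogLog parameters for Redis (16384 registers
+of 6 bits each) and the usual standard error estimate of 1.04/√m for m
+registers. The class and constant names are illustrative only.
+
+```java
+public class HllFigures {
+    static final int REGISTERS = 16384;      // Assumed register count (2^14).
+    static final int BITS_PER_REGISTER = 6;  // Assumed register width.
+
+    // Size of the register array when every register is stored.
+    static long denseSizeBytes() {
+        return (long) REGISTERS * BITS_PER_REGISTER / 8;
+    }
+
+    // Usual HyperLogLog standard error estimate: 1.04 / sqrt(m).
+    static double standardError() {
+        return 1.04 / Math.sqrt(REGISTERS);
+    }
+
+    public static void main(String[] args) {
+        System.out.println(denseSizeBytes());  // >>> 12288
+        System.out.println(standardError());   // Roughly 0.0081, or about 0.81%.
+    }
+}
+```
+
+Under these assumptions, 12288 bytes is the 12KB maximum and the estimated
+error stays below 1%, independent of how many items you count.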
+
+## Statistics
+
+Redis supports several approximate statistical calculations
+on numeric data sets:
+
+-   [Frequency](#frequency): The
+    [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
+    data type lets you find the approximate frequency of a labeled item in a data stream.
+-   [Quantiles](#quantiles): The
+    [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
+    data type estimates the quantile of a query value in a data stream.
+-   [Ranking](#ranking): The
+    [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
+    estimates the ranking of labeled items by frequency in a data stream.
+
+The sections below describe these operations in more detail.
+
+### Frequency
+
+A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
+(CMS) object keeps count of a set of related items represented by
+string labels. The count is approximate, but you can specify
+how close you want to keep the count to the true value (as a fraction)
+and the acceptable probability of failing to keep it in this
+desired range. For example, you can request that the count should
+stay within 1% of the true value and have a 0.5% probability
+of going outside this limit. The example below shows how to create
+a Count-min sketch object, add data to it, and then query it.
+
+<!--< clients-example home_prob_dts cms Java-Sync >}}
+< /clients-example >}}-->
+```java
+// Specify that you want to keep the counts within 0.01
+// (1%) of the true value with a 0.005 (0.5%) chance
+// of going outside this limit.
+String res16 = jedis.cmsInitByProb("items_sold", 0.01, 0.005);
+System.out.println(res16);  // >>> OK
+
+Map<String, Long> firstItemIncrements = new HashMap<>();
+firstItemIncrements.put("bread", 300L);
+firstItemIncrements.put("tea", 200L);
+firstItemIncrements.put("coffee", 200L);
+firstItemIncrements.put("beer", 100L);
+
+List<Long> res17 = jedis.cmsIncrBy("items_sold",
+    firstItemIncrements
+);
+res17.sort(null);
+System.out.println(res17);  // >>> [100, 200, 200, 300]
+
+Map<String, Long> secondItemIncrements = new HashMap<>();
+secondItemIncrements.put("bread", 100L);
+secondItemIncrements.put("coffee", 150L);
+
+List<Long> res18 = jedis.cmsIncrBy("items_sold",
+    secondItemIncrements
+);
+res18.sort(null);
+System.out.println(res18);  // >>> [350, 400]
+
+List<Long> res19 = jedis.cmsQuery(
+    "items_sold",
+    "bread", "tea", "coffee", "beer"
+);
+res19.sort(null);
+System.out.println(res19);  // >>> [100, 200, 350, 400]
+```
+
+The advantage of using a CMS over keeping an exact count with a
+[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
+is that a CMS has very low and fixed memory usage, even for
+large numbers of items. Use CMS objects to keep daily counts of
+items sold, accesses to individual web pages on your site, and
+other similar statistics.
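+
+To get a feel for why the memory usage is fixed, note that the error and
+probability settings alone determine the size of the counter table. The
+sketch below uses the textbook Count-min sketch sizing (width = ⌈e/error⌉,
+depth = ⌈ln(1/probability)⌉). This is an assumption for illustration: Redis
+may size its sketches differently, and the class name `CmsSizing` is
+hypothetical.
+
+```java
+public class CmsSizing {
+    // Textbook sizing: width = ceil(e / error).
+    static int width(double error) {
+        return (int) Math.ceil(Math.E / error);
+    }
+
+    // Textbook sizing: depth = ceil(ln(1 / probability)).
+    static int depth(double probability) {
+        return (int) Math.ceil(Math.log(1.0 / probability));
+    }
+
+    public static void main(String[] args) {
+        int w = width(0.01);
+        int d = depth(0.005);
+        System.out.println(w);      // >>> 272
+        System.out.println(d);      // >>> 6
+        System.out.println(w * d);  // >>> 1632 (total counters)
+    }
+}
+```
+
+With these formulas, the number of counters depends only on the two
+parameters, not on how many distinct items you count.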
+
+### Quantiles
+
+A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
+below which a certain fraction of samples lie. For example, with
+a set of measurements of people's heights, the quantile of 0.75 is
+the value of height below which 75% of all people's heights lie.
+[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
+to quantiles, except that the fraction is expressed as a percentage.
+
+A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
+object can estimate quantiles from a set of values added to it
+without having to store each value in the set explicitly. This can
+save a lot of memory when you have a large number of samples.
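+
+For comparison, an exact quantile calculation has to keep and sort every
+sample. The sketch below is plain Java with no Redis calls and uses the
+nearest-rank convention, which is only one of several common quantile
+definitions. The class name `ExactQuantile` is illustrative, and the sample
+values are the same heights used in the t-digest example on this page.
+
+```java
+import java.util.Arrays;
+
+public class ExactQuantile {
+    // Nearest-rank quantile: the smallest sample that has at least
+    // a fraction q of all the samples at or below it.
+    static double quantile(double[] samples, double q) {
+        double[] sorted = samples.clone();
+        Arrays.sort(sorted);
+        int rank = (int) Math.ceil(q * sorted.length);
+        int index = Math.min(Math.max(rank - 1, 0), sorted.length - 1);
+        return sorted[index];
+    }
+
+    public static void main(String[] args) {
+        double[] heights = {175.5, 181, 160.8, 152, 177, 196, 164};
+        System.out.println(quantile(heights, 0.75));  // >>> 181.0
+    }
+}
+```
+
+The memory needed by this approach grows with the number of samples, which
+is the cost that a t-digest avoids.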
+
+The example below shows how to add data samples to a t-digest
+object and obtain some basic statistics, such as the minimum and
+maximum values, the quantile of 0.75, and the 
+[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
+(CDF), which is effectively the inverse of the quantile function. It also
+shows how to merge two or more t-digest objects to query the combined
+data set.
+
+<!--< clients-example home_prob_dts tdigest Java-Sync >}}
+< /clients-example >}}-->
+```java
+String res20 = jedis.tdigestCreate("male_heights");
+System.out.println(res20);  // >>> OK
+
+String res21 = jedis.tdigestAdd("male_heights", 
+    175.5, 181, 160.8, 152, 177, 196, 164);
+System.out.println(res21);  // >>> OK
+
+double res22 = jedis.tdigestMin("male_heights");
+System.out.println(res22);  // >>> 152.0
+
+double res23 = jedis.tdigestMax("male_heights");
+System.out.println(res23);  // >>> 196.0
+
+List<Double> res24 = jedis.tdigestQuantile("male_heights", 0.75);
+System.out.println(res24);  // >>> [181.0]
+
+// Note that the CDF value for 181 is not exactly 0.75.
+// Both values are estimates.
+List<Double> res25 = jedis.tdigestCDF("male_heights", 181);
+System.out.println(res25);  // >>> [0.7857142857142857]
+
+String res26 = jedis.tdigestCreate("female_heights");
+System.out.println(res26);  // >>> OK
+
+String res27 = jedis.tdigestAdd("female_heights",
+    155.5, 161, 168.5, 170, 157.5, 163, 171);
+System.out.println(res27);  // >>> OK
+
+List<Double> res28 = jedis.tdigestQuantile("female_heights", 0.75);
+System.out.println(res28);  // >>> [170.0]
+
+String res29 = jedis.tdigestMerge(
+    "all_heights",
+    "male_heights", "female_heights"
+);
+System.out.println(res29);  // >>> OK
+List<Double> res30 = jedis.tdigestQuantile("all_heights", 0.75);
+System.out.println(res30);  // >>> [175.5]
+```
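+
+The CDF result above is worth a closer look. The returned value,
+0.7857142857142857, equals 5.5/7, which is what you get from seven samples
+if you count the five heights below 181 fully and the sample at exactly 181
+as half a sample. The plain-Java sketch below reproduces that arithmetic.
+Treating this midpoint convention as the explanation is an assumption based
+on the numbers, not a description of the t-digest implementation, and the
+class name `MidpointCdf` is illustrative.
+
+```java
+public class MidpointCdf {
+    // Fraction of samples below the value, counting ties as half a sample.
+    static double cdf(double[] samples, double value) {
+        double weight = 0.0;
+        for (double s : samples) {
+            if (s < value) {
+                weight += 1.0;
+            } else if (s == value) {
+                weight += 0.5;
+            }
+        }
+        return weight / samples.length;
+    }
+
+    public static void main(String[] args) {
+        double[] heights = {175.5, 181, 160.8, 152, 177, 196, 164};
+        System.out.println(cdf(heights, 181));  // Approximately 0.7857
+    }
+}
+```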
+
+A t-digest object also supports several other related commands, such
+as querying by rank. See the
+[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
+reference for more information.
+
+### Ranking
+
+A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
+object estimates the rankings of different labeled items in a data
+stream according to frequency. For example, you could use this to
+track the top ten most frequently accessed pages on a website, or the
+top five most popular items sold.
+
+The example below adds several different items to a Top-K object
+that tracks the top three items (this is the second parameter to
+the `topkReserve()` method). It also shows how to list the
+top *k* items and query whether or not a given item is in the
+list.
+
+<!--< clients-example home_prob_dts topk Java-Sync >}}
+< /clients-example >}}-->
+```java
+String res31 = jedis.topkReserve("top_3_songs", 3L, 2000L, 7L, 0.925D);
+System.out.println(res31);  // >>> OK
+
+Map<String, Long> songIncrements = new HashMap<>();
+songIncrements.put("Starfish Trooper", 3000L);
+songIncrements.put("Only one more time", 1850L);
+songIncrements.put("Rock me, Handel", 1325L);
+songIncrements.put("How will anyone know?", 3890L);
+songIncrements.put("Average lover", 4098L);
+songIncrements.put("Road to everywhere", 770L);
+
+List<String> res32 = jedis.topkIncrBy("top_3_songs",
+    songIncrements
+);
+System.out.println(res32);
+// >>> [null, null, null, null, null, Rock me, Handel]
+
+List<String> res33 = jedis.topkList("top_3_songs");
+System.out.println(res33);
+// >>> [Average lover, How will anyone know?, Starfish Trooper]
+
+List<Boolean> res34 = jedis.topkQuery("top_3_songs",
+    "Starfish Trooper", "Road to everywhere"
+);
+System.out.println(res34);
+// >>> [true, false]
+```
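+
+As a point of comparison, an exact ranking needs a counter for every distinct
+item before it can pick the top *k*. The sketch below is plain Java with no
+Redis calls and ranks the same song counts used in the example above. The
+class name `ExactTopK` is illustrative.
+
+```java
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+public class ExactTopK {
+    // Sort all items by count, highest first, and keep the first k names.
+    static List<String> topK(Map<String, Long> counts, int k) {
+        return counts.entrySet().stream()
+            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
+            .limit(k)
+            .map(Map.Entry::getKey)
+            .collect(Collectors.toList());
+    }
+
+    public static void main(String[] args) {
+        Map<String, Long> plays = new HashMap<>();
+        plays.put("Starfish Trooper", 3000L);
+        plays.put("Only one more time", 1850L);
+        plays.put("Rock me, Handel", 1325L);
+        plays.put("How will anyone know?", 3890L);
+        plays.put("Average lover", 4098L);
+        plays.put("Road to everywhere", 770L);
+
+        System.out.println(topK(plays, 3));
+        // >>> [Average lover, How will anyone know?, Starfish Trooper]
+    }
+}
+```
+
+The exact version must hold every item in memory, whereas a Top-K object
+tracks only the leading items in a fixed amount of space.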
diff --git a/content/develop/clients/redis-py/prob.md b/content/develop/clients/redis-py/prob.md
@@ -222,7 +222,7 @@ sketch commands.
 
 ```py
 # Specify that you want to keep the counts within 0.01
-# (0.1%) of the true value with a 0.005 (0.05%) chance
+# (1%) of the true value with a 0.005 (0.5%) chance
 # of going outside this limit.
 res16 = r.cms().initbyprob("items_sold", 0.01, 0.005)
 print(res16)  # >>> True