| 1 | +--- |
| 2 | +categories: |
| 3 | +- docs |
| 4 | +- develop |
| 5 | +- stack |
| 6 | +- oss |
| 7 | +- rs |
| 8 | +- rc |
| 9 | +- oss |
| 10 | +- kubernetes |
| 11 | +- clients |
| 12 | +description: Learn how to use approximate calculations with Redis. |
| 13 | +linkTitle: Probabilistic data types |
| 14 | +title: Probabilistic data types |
| 15 | +weight: 5 |
| 16 | +--- |
| 17 | + |
| 18 | +Redis supports several |
| 19 | +[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}}) |
| 20 | +that let you calculate values approximately rather than exactly. |
| 21 | +The types fall into two basic categories: |
| 22 | + |
| 23 | +- [Set operations](#set-operations): These types let you calculate (approximately) |
| 24 | + the number of items in a set of distinct values, and whether or not a given value is |
| 25 | + a member of a set. |
| 26 | +- [Statistics](#statistics): These types give you an approximation of |
| 27 | + statistics such as the quantiles, ranks, and frequencies of numeric data points in |
| 28 | + a list. |
| 29 | + |
| 30 | +To see why these approximate calculations would be useful, consider the task of |
| 31 | +counting the number of distinct IP addresses that access a website in one day. |
| 32 | + |
| 33 | +Assuming that you already have code that supplies you with each IP |
| 34 | +address as a string, you could record the addresses in Redis using |
| 35 | +a [set]({{< relref "/develop/data-types/sets" >}}): |
| 36 | + |
| 37 | +```java |
| 38 | +jedis.sadd("ip_tracker", new_ip_address); |
| 39 | +``` |
| 40 | + |
| 41 | +The set can only contain each member once, so if the same address |
| 42 | +appears again during the day, the new instance will not change |
| 43 | +the set. At the end of the day, you could get the exact number of |
| 44 | +distinct addresses using the `scard()` function: |
| 45 | + |
| 46 | +```java |
| 47 | +long num_distinct_ips = jedis.scard("ip_tracker"); |
| 48 | +``` |
| 49 | + |
| 50 | +This approach is simple, effective, and precise, but if your website |
| 51 | +is very busy, the `ip_tracker` set could become very large and consume |
| 52 | +a lot of memory. |
| 53 | + |
| 54 | +You would probably round the count of distinct IP addresses to the |
| 55 | +nearest thousand or more to deliver the usage statistics, so |
| 56 | +getting it exactly right is not important. It would be useful |
| 57 | +if you could trade off some accuracy in exchange for lower memory |
| 58 | +consumption. The probabilistic data types provide exactly this kind of |
| 59 | +trade-off. Specifically, you can count the approximate number of items in a |
| 60 | +set using the [HyperLogLog](#set-cardinality) data type, as described below. |
| 61 | + |
| 62 | +In general, the probabilistic data types let you perform approximations with a |
| 63 | +bounded degree of error that have much lower memory consumption or execution |
| 64 | +time than the equivalent precise calculations. |
| 65 | + |
| 66 | +## Set operations |
| 67 | + |
| 68 | +Redis supports the following approximate set operations: |
| 69 | + |
| 70 | +- [Membership](#set-membership): The |
| 71 | + [Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and |
| 72 | + [Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) |
| 73 | + data types let you track whether or not a given item is a member of a set. |
| 74 | +- [Cardinality](#set-cardinality): The |
| 75 | + [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}}) |
| 76 | + data type gives you an approximate value for the number of items in a set, also |
| 77 | + known as the *cardinality* of the set. |
| 78 | + |
| 79 | +The sections below describe these operations in more detail. |
| 80 | + |
| 81 | +### Set membership |
| 82 | + |
| 83 | +[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and |
| 84 | +[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) |
| 85 | +objects provide a set membership operation that lets you track whether or not a |
| 86 | +particular item has been added to a set. These two types provide different |
| 87 | +trade-offs for memory usage and speed, so you can select the best one for your |
| 88 | +use case. Note that for both types, there is an asymmetry between presence and |
| 89 | +absence of items in the set. If an item is reported as absent, then it is definitely |
| 90 | +absent, but if it is reported as present, then there is a small chance it may really be |
| 91 | +absent. |
| 92 | + |
| 93 | +Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}), |
| 94 | +a Bloom filter records the presence or absence of the |
| 95 | +[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string. |
| 96 | +This gives a very compact representation of the |
| 97 | +set's membership with a fixed memory size, regardless of how many items you |
| 98 | +add. The following example adds some names to a Bloom filter representing |
| 99 | +a list of users and checks for the presence or absence of users in the list. |
| 100 | + |
| 101 | +```java |
| 102 | +List<Boolean> res1 = jedis.bfMAdd( |
| 103 | + "recorded_users", |
| 104 | + "andy", "cameron", "david", "michelle" |
| 105 | +); |
| 106 | +System.out.println(res1); // >>> [true, true, true, true] |
| 107 | + |
| 108 | +boolean res2 = jedis.bfExists("recorded_users", "cameron"); |
| 109 | +System.out.println(res2); // >>> true |
| 110 | + |
| 111 | +boolean res3 = jedis.bfExists("recorded_users", "kaitlyn"); |
| 112 | +System.out.println(res3); // >>> false |
| 113 | +``` |
| 114 | +<!--< clients-example home_prob_dts bloom Java-Sync >}} |
| 115 | +< /clients-example >}}--> |
| 116 | + |
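The asymmetry between "present" and "absent" answers can be seen in a minimal, self-contained sketch of the idea. The `ToyBloomFilter` class below is hypothetical and for illustration only: it is not part of Jedis, and it does not reflect how Redis implements its Bloom filter type. It sets a few bit positions derived from each item's hash, so an added item always tests as present, while an item that was never added can occasionally collide with bits set by other items.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only (hypothetical class,
// not the Redis implementation).
public class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd, so it is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Set every bit position associated with the item.
    public void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        for (String name : new String[] {"andy", "cameron", "david", "michelle"}) {
            filter.add(name);
        }
        // An added item always tests as present.
        System.out.println(filter.mightContain("cameron"));
        // An item that was never added usually tests as absent,
        // but a false positive is possible.
        System.out.println(filter.mightContain("kaitlyn"));
    }
}
```

Note that the memory used depends only on `numBits`, not on the length or number of the strings added.
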
| 117 | +A Cuckoo filter has similar features to a Bloom filter, but also supports |
| 118 | +a deletion operation to remove hashes from a set, as shown in the example |
| 119 | +below. |
| 120 | + |
| 121 | +<!--< clients-example home_prob_dts cuckoo Java-Sync >}} |
| 122 | +< /clients-example >}}--> |
| 123 | +```java |
| 124 | +boolean res4 = jedis.cfAdd("other_users", "paolo"); |
| 125 | +System.out.println(res4); // >>> true |
| 126 | + |
| 127 | +boolean res5 = jedis.cfAdd("other_users", "kaitlyn"); |
| 128 | +System.out.println(res5); // >>> true |
| 129 | + |
| 130 | +boolean res6 = jedis.cfAdd("other_users", "rachel"); |
| 131 | +System.out.println(res6); // >>> true |
| 132 | + |
| 133 | +List<Boolean> res7 = jedis.cfMExists( |
| 134 | + "other_users", |
| 135 | + "paolo", "rachel", "andy" |
| 136 | +); |
| 137 | +System.out.println(res7); // >>> [true, true, false] |
| 138 | + |
| 139 | +boolean res8 = jedis.cfDel("other_users", "paolo"); |
| 140 | +System.out.println(res8); // >>> true |
| 141 | + |
| 142 | +boolean res9 = jedis.cfExists("other_users", "paolo"); |
| 143 | +System.out.println(res9); // >>> false |
| 144 | +``` |
| 145 | + |
| 146 | +Which of these two data types you choose depends on your use case. |
| 147 | +Bloom filters are generally faster than Cuckoo filters when adding new items, |
| 148 | +and also have better memory usage. Cuckoo filters are generally faster |
| 149 | +at checking membership and also support the delete operation. See the |
| 150 | +[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and |
| 151 | +[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}}) |
| 152 | +reference pages for more information and a comparison of the two types. |
| 153 | + |
| 154 | +### Set cardinality |
| 155 | + |
| 156 | +A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}}) |
| 157 | +object estimates the cardinality of a set. As you add |
| 158 | +items, the HyperLogLog tracks the number of distinct set members but |
| 159 | +doesn't let you retrieve them or query which items have been added. |
| 160 | +You can also merge two or more HyperLogLogs to find the cardinality of the |
| 161 | +[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they |
| 162 | +represent. |
| 163 | + |
| 164 | +<!--< clients-example home_prob_dts hyperloglog Java-Sync >}} |
| 165 | +< /clients-example >}}--> |
| 166 | +```java |
| 167 | +long res10 = jedis.pfadd("group:1", "andy", "cameron", "david"); |
| 168 | +System.out.println(res10); // >>> 1 |
| 169 | + |
| 170 | +long res11 = jedis.pfcount("group:1"); |
| 171 | +System.out.println(res11); // >>> 3 |
| 172 | + |
| 173 | +long res12 = jedis.pfadd( |
| 174 | + "group:2", |
| 175 | + "kaitlyn", "michelle", "paolo", "rachel" |
| 176 | +); |
| 177 | +System.out.println(res12); // >>> 1 |
| 178 | + |
| 179 | +long res13 = jedis.pfcount("group:2"); |
| 180 | +System.out.println(res13); // >>> 4 |
| 181 | + |
| 182 | +String res14 = jedis.pfmerge("both_groups", "group:1", "group:2"); |
| 183 | +System.out.println(res14); // >>> OK |
| 184 | + |
| 185 | +long res15 = jedis.pfcount("both_groups"); |
| 186 | +System.out.println(res15); // >>> 7 |
| 187 | +``` |
| 188 | + |
| 189 | +The main benefit that HyperLogLogs offer is their very low |
| 190 | +memory usage. They can count up to 2^64 items with less than |
| 191 | +1% standard error using a maximum of 12KB of memory. This makes |
| 192 | +them very useful for counting things like the total number of distinct |
| 193 | +IP addresses that access a website or the total number of distinct |
| 194 | +bank card numbers that make purchases within a day. |
| 195 | + |
| 196 | +## Statistics |
| 197 | + |
| 198 | +Redis supports several approximate statistical calculations |
| 199 | +on numeric data sets: |
| 200 | + |
| 201 | +- [Frequency](#frequency): The |
| 202 | + [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}}) |
| 203 | + data type lets you find the approximate frequency of a labeled item in a data stream. |
| 204 | +- [Quantiles](#quantiles): The |
| 205 | + [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) |
| 206 | + data type estimates the quantile of a query value in a data stream. |
| 207 | +- [Ranking](#ranking): The |
| 208 | + [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type |
| 209 | + estimates the ranking of labeled items by frequency in a data stream. |
| 210 | + |
| 211 | +The sections below describe these operations in more detail. |
| 212 | + |
| 213 | +### Frequency |
| 214 | + |
| 215 | +A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}}) |
| 216 | +(CMS) object keeps count of a set of related items represented by |
| 217 | +string labels. The count is approximate, but you can specify |
| 218 | +how close you want to keep the count to the true value (as a fraction) |
| 219 | +and the acceptable probability of failing to keep it in this |
| 220 | +desired range. For example, you can request that the count should |
| 221 | +stay within 1% of the true value and have a 0.5% probability |
| 222 | +of going outside this limit. The example below shows how to create |
| 223 | +a Count-min sketch object, add data to it, and then query it. |
| 224 | + |
| 225 | +<!--< clients-example home_prob_dts cms Java-Sync >}} |
| 226 | +< /clients-example >}}--> |
| 227 | +```java |
| 228 | +// Specify that you want to keep the counts within 0.01 |
| 229 | +// (1%) of the true value with a 0.005 (0.5%) chance |
| 230 | +// of going outside this limit. |
| 231 | +String res16 = jedis.cmsInitByProb("items_sold", 0.01, 0.005); |
| 232 | +System.out.println(res16); // >>> OK |
| 233 | + |
| 234 | +Map<String, Long> firstItemIncrements = new HashMap<>(); |
| 235 | +firstItemIncrements.put("bread", 300L); |
| 236 | +firstItemIncrements.put("tea", 200L); |
| 237 | +firstItemIncrements.put("coffee", 200L); |
| 238 | +firstItemIncrements.put("beer", 100L); |
| 239 | + |
| 240 | +List<Long> res17 = jedis.cmsIncrBy("items_sold", |
| 241 | + firstItemIncrements |
| 242 | +); |
| 243 | +res17.sort(null); |
| 244 | +System.out.println(res17); // >>> [100, 200, 200, 300] |
| 245 | + |
| 246 | +Map<String, Long> secondItemIncrements = new HashMap<>(); |
| 247 | +secondItemIncrements.put("bread", 100L); |
| 248 | +secondItemIncrements.put("coffee", 150L); |
| 249 | + |
| 250 | +List<Long> res18 = jedis.cmsIncrBy("items_sold", |
| 251 | + secondItemIncrements |
| 252 | +); |
| 253 | +res18.sort(null); |
| 254 | +System.out.println(res18); // >>> [350, 400] |
| 255 | + |
| 256 | +List<Long> res19 = jedis.cmsQuery( |
| 257 | + "items_sold", |
| 258 | + "bread", "tea", "coffee", "beer" |
| 259 | +); |
| 260 | +res19.sort(null); |
| 261 | +System.out.println(res19); // >>> [100, 200, 350, 400] |
| 262 | +``` |
| 263 | + |
| 264 | +The advantage of using a CMS over keeping an exact count with a |
| 265 | +[sorted set]({{< relref "/develop/data-types/sorted-sets" >}}) |
| 266 | +is that a CMS has very low and fixed memory usage, even for |
| 267 | +large numbers of items. Use CMS objects to keep daily counts of |
| 268 | +items sold, accesses to individual web pages on your site, and |
| 269 | +other similar statistics. |
| 270 | + |
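The fixed memory usage follows from the way a Count-min sketch stores counts. The following is a minimal sketch of the general technique, assuming a hypothetical `ToyCountMinSketch` class that is not part of Jedis and is not the Redis implementation. Each item increments one counter in every row of a fixed-size table, and a query returns the smallest of those counters. Collisions between items can only inflate a counter, so with non-negative increments the estimate never falls below the true count.

```java
// Toy Count-min sketch for illustration only (hypothetical class,
// not the Redis implementation).
public class ToyCountMinSketch {
    private final long[][] counters;
    private final int width;
    private final int depth;

    public ToyCountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        // Memory is fixed at depth * width counters, however many
        // distinct items are added later.
        this.counters = new long[depth][width];
    }

    // Choose a column for the item in the given row, using a
    // different hash mix for each row.
    private int column(String item, int row) {
        int h = item.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    // Add the amount to one counter in every row.
    public void incrBy(String item, long amount) {
        for (int row = 0; row < depth; row++) {
            counters[row][column(item, row)] += amount;
        }
    }

    // The smallest counter is the one least inflated by collisions.
    public long query(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counters[row][column(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        ToyCountMinSketch sketch = new ToyCountMinSketch(2000, 5);
        sketch.incrBy("bread", 300);
        sketch.incrBy("tea", 200);
        sketch.incrBy("bread", 100);
        // Each estimate is at least the true count (400 and 200 here).
        System.out.println(sketch.query("bread"));
        System.out.println(sketch.query("tea"));
    }
}
```

A wider table reduces the chance of collisions and a deeper table reduces the chance that every row is affected at once, which is the intuition behind the error and probability parameters described above.
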
| 271 | +### Quantiles |
| 272 | + |
| 273 | +A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value |
| 274 | +below which a certain fraction of samples lie. For example, with |
| 275 | +a set of measurements of people's heights, the quantile of 0.75 is |
| 276 | +the value of height below which 75% of all people's heights lie. |
| 277 | +[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent |
| 278 | +to quantiles, except that the fraction is expressed as a percentage. |
| 279 | + |
| 280 | +A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) |
| 281 | +object can estimate quantiles from a set of values added to it |
| 282 | +without having to store each value in the set explicitly. This can |
| 283 | +save a lot of memory when you have a large number of samples. |
| 284 | + |
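For comparison, an exact calculation has to keep every sample. The sketch below uses a hypothetical `ExactQuantiles` helper class, which is not part of Jedis, with the simple "nearest rank" definition of a quantile. A t-digest interpolates between compressed clusters of values instead, so its estimates will not necessarily agree with these exact figures.

```java
import java.util.Arrays;

// Exact quantile and CDF over a fully stored sample array
// (hypothetical helper for illustration only).
public class ExactQuantiles {

    // Nearest-rank quantile: the smallest sample such that at least
    // a fraction q of all samples are less than or equal to it.
    static double quantile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Empirical CDF: the fraction of samples less than or equal to value.
    static double cdf(double[] samples, double value) {
        int count = 0;
        for (double s : samples) {
            if (s <= value) {
                count++;
            }
        }
        return (double) count / samples.length;
    }

    public static void main(String[] args) {
        double[] heights = {175.5, 181, 160.8, 152, 177, 196, 164};
        System.out.println(quantile(heights, 0.75)); // >>> 181.0
        // Six of the seven samples are <= 181, so this prints 6/7.
        System.out.println(cdf(heights, 181));
    }
}
```

The memory needed here grows with the number of samples, which is the cost that a t-digest is designed to avoid.
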
| 285 | +The example below shows how to add data samples to a t-digest |
| 286 | +object and obtain some basic statistics, such as the minimum and |
| 287 | +maximum values, the quantile of 0.75, and the |
| 288 | +[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) |
| 289 | +(CDF), which is effectively the inverse of the quantile function. It also |
| 290 | +shows how to merge two or more t-digest objects to query the combined |
| 291 | +data set. |
| 292 | + |
| 293 | +<!--< clients-example home_prob_dts tdigest Java-Sync >}} |
| 294 | +< /clients-example >}}--> |
| 295 | +```java |
| 296 | +String res20 = jedis.tdigestCreate("male_heights"); |
| 297 | +System.out.println(res20); // >>> OK |
| 298 | + |
| 299 | +String res21 = jedis.tdigestAdd("male_heights", |
| 300 | + 175.5, 181, 160.8, 152, 177, 196, 164); |
| 301 | +System.out.println(res21); // >>> OK |
| 302 | + |
| 303 | +double res22 = jedis.tdigestMin("male_heights"); |
| 304 | +System.out.println(res22); // >>> 152.0 |
| 305 | + |
| 306 | +double res23 = jedis.tdigestMax("male_heights"); |
| 307 | +System.out.println(res23); // >>> 196.0 |
| 308 | + |
| 309 | +List<Double> res24 = jedis.tdigestQuantile("male_heights", 0.75); |
| 310 | +System.out.println(res24); // >>> [181.0] |
| 311 | + |
| 312 | +// Note that the CDF value for 181 is not exactly 0.75. |
| 313 | +// Both values are estimates. |
| 314 | +List<Double> res25 = jedis.tdigestCDF("male_heights", 181); |
| 315 | +System.out.println(res25); // >>> [0.7857142857142857] |
| 316 | + |
| 317 | +String res26 = jedis.tdigestCreate("female_heights"); |
| 318 | +System.out.println(res26); // >>> OK |
| 319 | + |
| 320 | +String res27 = jedis.tdigestAdd("female_heights", |
| 321 | + 155.5, 161, 168.5, 170, 157.5, 163, 171); |
| 322 | +System.out.println(res27); // >>> OK |
| 323 | + |
| 324 | +List<Double> res28 = jedis.tdigestQuantile("female_heights", 0.75); |
| 325 | +System.out.println(res28); // >>> [170.0] |
| 326 | + |
| 327 | +String res29 = jedis.tdigestMerge( |
| 328 | + "all_heights", |
| 329 | + "male_heights", "female_heights" |
| 330 | +); |
| 331 | +System.out.println(res29); // >>> OK |
| 332 | +List<Double> res30 = jedis.tdigestQuantile("all_heights", 0.75); |
| 333 | +System.out.println(res30); // >>> [175.5] |
| 334 | +``` |
| 335 | + |
| 336 | +A t-digest object also supports several other related commands, such |
| 337 | +as querying by rank. See the |
| 338 | +[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}}) |
| 339 | +reference for more information. |
| 340 | + |
| 341 | +### Ranking |
| 342 | + |
| 343 | +A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) |
| 344 | +object estimates the rankings of different labeled items in a data |
| 345 | +stream according to frequency. For example, you could use this to |
| 346 | +track the top ten most frequently accessed pages on a website, or the |
| 347 | +top five most popular items sold. |
| 348 | + |
| 349 | +The example below adds several different items to a Top-K object |
| 350 | +that tracks the top three items (this is the second parameter to |
| 351 | +the `topkReserve()` method). It also shows how to list the |
| 352 | +top *k* items and query whether or not a given item is in the |
| 353 | +list. |
| 354 | + |
| 355 | +<!--< clients-example home_prob_dts topk Java-Sync >}} |
| 356 | +< /clients-example >}}--> |
| 357 | +```java |
| 358 | +String res31 = jedis.topkReserve("top_3_songs", 3L, 2000L, 7L, 0.925D); |
| 359 | +System.out.println(res31); // >>> OK |
| 360 | + |
| 361 | +Map<String, Long> songIncrements = new HashMap<>(); |
| 362 | +songIncrements.put("Starfish Trooper", 3000L); |
| 363 | +songIncrements.put("Only one more time", 1850L); |
| 364 | +songIncrements.put("Rock me, Handel", 1325L); |
| 365 | +songIncrements.put("How will anyone know?", 3890L); |
| 366 | +songIncrements.put("Average lover", 4098L); |
| 367 | +songIncrements.put("Road to everywhere", 770L); |
| 368 | + |
| 369 | +List<String> res32 = jedis.topkIncrBy("top_3_songs", |
| 370 | + songIncrements |
| 371 | +); |
| 372 | +System.out.println(res32); |
| 373 | +// >>> [null, null, null, null, null, Rock me, Handel] |
| 374 | + |
| 375 | +List<String> res33 = jedis.topkList("top_3_songs"); |
| 376 | +System.out.println(res33); |
| 377 | +// >>> [Average lover, How will anyone know?, Starfish Trooper] |
| 378 | + |
| 379 | +List<Boolean> res34 = jedis.topkQuery("top_3_songs", |
| 380 | + "Starfish Trooper", "Road to everywhere" |
| 381 | +); |
| 382 | +System.out.println(res34); |
| 383 | +// >>> [true, false] |
| 384 | +``` |
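
For comparison, an exact ranking needs a counter for every distinct item and a sort over all of them. The sketch below uses a hypothetical `ExactTopK` helper class, which is not part of Jedis, with the same song counts as the example above. A Top-K object trades this exactness for a fixed amount of memory, which matters when the number of distinct items is very large.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Exact top-k ranking over a full map of counters
// (hypothetical helper for illustration only).
public class ExactTopK {

    // Sort all entries by count, highest first, and keep the first k labels.
    static List<String> topK(Map<String, Long> counts, int k) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("Starfish Trooper", 3000L);
        counts.put("Only one more time", 1850L);
        counts.put("Rock me, Handel", 1325L);
        counts.put("How will anyone know?", 3890L);
        counts.put("Average lover", 4098L);
        counts.put("Road to everywhere", 770L);

        System.out.println(topK(counts, 3));
        // >>> [Average lover, How will anyone know?, Starfish Trooper]
    }
}
```
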