MemCS bloom aggregate #5500

@TarantoolBot

Since: 3.6

The MemCS primary index now supports bloom aggregates. This aggregate
type enables data skipping based on a Bloom filter for requests with
equality filters. The aggregate also has a tunable fpr parameter: the
false-positive rate of the underlying Bloom filter. It must be in the
(0..1) range. The higher the fpr, the lower the memory consumption. The
default value is 0.05 (5%).
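
For intuition about the fpr/memory trade-off, the textbook sizing of an optimally tuned Bloom filter is shown below. This is only a rough estimate: whether MemCS sizes its filters exactly this way is an assumption, and the symbols m (filter size in bits), n (number of stored values) and p (the fpr) are introduced here for illustration.

```latex
\frac{m}{n} = -\frac{\ln p}{(\ln 2)^2} \approx 1.44 \log_2 \frac{1}{p}
\qquad
\begin{aligned}
p = 0.1  &\;\Rightarrow\; m/n \approx 4.79 \text{ bits} \\
p = 0.05 &\;\Rightarrow\; m/n \approx 6.24 \text{ bits} \\
p = 0.01 &\;\Rightarrow\; m/n \approx 9.59 \text{ bits}
\end{aligned}
```

Under this model the cost per stored value depends only on p, not on the width of the value being hashed.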

Note that bloom aggregates support all fixed-size types and the string
type (minmax supports only fixed-size types).

Example:

local s = box.schema.create_space('test', {
    engine = 'memcs', field_count = 4,
    format = {{'a', 'uint64'}, {'b', 'uint64'}, {'c', 'uint64'},
              {'d', 'string'}},
})
s:create_index('pk', {aggregates = {
    {type = 'bloom', field = 2, name = 'bloom_2', fpr = 0.1},
    {type = 'bloom', field = 3, name = 'bloom_3', fpr = 0.01},
    {type = 'bloom', field = 4, name = 'bloom_4'},
}})

A filter with an equality condition will then automatically use bloom
aggregates, if any are defined on the filtered field:

/* Create arrow stream options. */
box_arrow_options_t *options = box_arrow_options_new();

/*
 * Set filter `[2] = 42` so some rows with `[2] != 42` can be skipped.
 */
box_filter_t filter;
filter.type = FILTER_TYPE_EQ;
filter.field_no = 1; /* 0-based, i.e. field 2 in the space format. */

/* MsgPack-encode the value to compare against. */
char buf[16];
mp_encode_uint(buf, 42);
filter.value = buf;

box_arrow_options_set_filter(options, &filter);

/* Create stream. */
struct ArrowArrayStream stream;
int rc = box_index_arrow_stream(space_id, index_id, field_count, fields,
				key, key + key_size, options, &stream);
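
To illustrate why an equality filter can skip data, here is a conceptual sketch of per-block Bloom filters. It is not the MemCS implementation: the names `block_bloom`, `bloom_add`, `bloom_may_contain` and `blocks_to_scan`, the fixed filter size, and the hash function are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block Bloom filter; sizes are arbitrary for the demo. */
enum { BLOOM_BITS = 1024, BLOOM_HASHES = 4 };

struct block_bloom {
	uint8_t bits[BLOOM_BITS / 8];
};

/* 64-bit mixer (the splitmix64 finalizer), used as the hash here. */
static uint64_t
mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Bit position of the i-th probe, derived by double hashing. */
static size_t
bloom_probe(uint64_t value, int i)
{
	uint64_t h1 = mix64(value);
	uint64_t h2 = mix64(h1) | 1;
	return (size_t)((h1 + (uint64_t)i * h2) % BLOOM_BITS);
}

/* Record a value stored in the block. */
static void
bloom_add(struct block_bloom *b, uint64_t value)
{
	for (int i = 0; i < BLOOM_HASHES; i++) {
		size_t pos = bloom_probe(value, i);
		b->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
	}
}

/* false: the value is definitely absent; true: it may be present. */
static bool
bloom_may_contain(const struct block_bloom *b, uint64_t value)
{
	for (int i = 0; i < BLOOM_HASHES; i++) {
		size_t pos = bloom_probe(value, i);
		if ((b->bits[pos / 8] & (1u << (pos % 8))) == 0)
			return false;
	}
	return true;
}

/* Count the blocks an equality filter `field = value` must still read. */
static size_t
blocks_to_scan(const struct block_bloom *blooms, size_t n, uint64_t value)
{
	assert(blooms != NULL || n == 0);
	size_t count = 0;
	for (size_t i = 0; i < n; i++) {
		if (bloom_may_contain(&blooms[i], value))
			count++;
	}
	return count;
}
```

A block whose filter answers false cannot contain a matching row and can be skipped entirely. A true answer may be a false positive, so the rows of that block still have to be checked against the filter.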

Memory consumption is the same for all field types: only the fpr
parameter matters. Here are some memory consumption measurements:
