Bloom filter size table does not match implementations #547

@pitrou

Describe the bug, including details regarding any error messages, version, and platform.

The bits/key values in this table do not seem to match results given by the formula used in the Parquet C++ and Parquet Java implementations.

Using that formula (the same in both implementations) I get this table:

| Bits of space per insert | False positive probability |
|--------------------------|----------------------------|
| 5.8                      | 10 %                       |
| 9.7                      | 1 %                        |
| 14.6                     | 0.1 %                      |
| 21                       | 0.01 %                     |
| 29.6                     | 0.001 %                    |

In Python:

>>> import math
>>> fpp = 0.1 ; 8/math.log(1/(1 - fpp**0.125))
5.7725418439029506
>>> fpp = 0.01 ; 8/math.log(1/(1 - fpp**0.125))
9.681526738735679
>>> fpp = 0.001 ; 8/math.log(1/(1 - fpp**0.125))
14.607697478479535
>>> fpp = 0.0001 ; 8/math.log(1/(1 - fpp**0.125))
21.045409233894773
>>> fpp = 0.00001 ; 8/math.log(1/(1 - fpp**0.125))
29.555488704606017
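
For anyone who wants to reproduce the whole table in one go, here is a minimal sketch that wraps the expression from the session above in a function. The names `bits_per_insert` and `false_positive_probability` are made up for illustration and are not part of either Parquet implementation; the parameter `k = 8` is simply the constant that appears in the expression (`8` and `0.125 = 1/8`), and the second function is just the algebraic inverse of the first.

```python
import math


def bits_per_insert(fpp: float, k: int = 8) -> float:
    """Bits of space per inserted value for a target false positive probability.

    Same expression as in the interactive session: k / ln(1 / (1 - fpp**(1/k))).
    The name and the `k` parameter are illustrative assumptions.
    """
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return k / math.log(1.0 / (1.0 - fpp ** (1.0 / k)))


def false_positive_probability(bits: float, k: int = 8) -> float:
    """Algebraic inverse of bits_per_insert: (1 - exp(-k / bits)) ** k."""
    if bits <= 0.0:
        raise ValueError("bits must be positive")
    return (1.0 - math.exp(-k / bits)) ** k


if __name__ == "__main__":
    # Print one row per false positive probability used in the table.
    for fpp in (0.1, 0.01, 0.001, 0.0001, 0.00001):
        print(f"{bits_per_insert(fpp):5.1f} bits/insert  {fpp * 100:g} %")
```

The inverse makes it easy to check a bits/key figure from any table against this formula: feed the bits value in and compare the resulting probability with the one the table claims.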
