Implement t-digest algorithm for online histogram calculation #603

@tasansal

This would be used to calculate the dataset histogram during ingestion. The t-digest algorithm is widely used, especially for map-reduce style operations in Apache Spark.

Computing Extremely Accurate Quantiles Using t-Digests (Dunning & Ertl 2019):
https://arxiv.org/abs/1902.04023

A lightweight explainer:
https://www.gresearch.com/news/approximate-percentiles-with-t-digests

There are two Python libraries that implement it:

For example, each distributed worker would compute a digest of its part of the data. At the end, the per-worker digests would be combined to produce the final histogram.
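
To make the per-worker and merge steps concrete, here is a minimal pure-Python sketch. It is a simplified merging digest that enforces the t-digest size bound with the arcsine scale function from the paper. It is not the reference implementation, and it is not the API of any existing library. The names `compress`, `build_digest`, `merge_digests`, `quantile`, and the `delta` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def compress(centroids, delta=100):
    """Merge (mean, weight) centroids so each spans at most one unit of k.

    Uses the scale function k(q) = delta / (2*pi) * asin(2q - 1), which keeps
    centroids small near the tails and larger near the median.
    """
    centroids = sorted(centroids)
    if not centroids:
        return []
    total = sum(w for _, w in centroids)

    def k(q):
        q = min(max(q, 0.0), 1.0)  # guard against float drift outside [0, 1]
        return delta / (2 * math.pi) * math.asin(2 * q - 1)

    merged = []
    mean, weight = centroids[0]
    done = 0.0  # total weight of centroids already emitted
    for m, w in centroids[1:]:
        if k((done + weight + w) / total) - k(done / total) <= 1.0:
            # Absorb the next centroid into the current one.
            mean = (mean * weight + m * w) / (weight + w)
            weight += w
        else:
            merged.append((mean, weight))
            done += weight
            mean, weight = m, w
    merged.append((mean, weight))
    return merged


def build_digest(values, delta=100):
    """Digest computed by one worker over its share of the data."""
    return compress([(float(v), 1.0) for v in values], delta)


def merge_digests(digests, delta=100):
    """Combine per-worker digests by pooling centroids and recompressing."""
    pooled = [c for d in digests for c in d]
    return compress(pooled, delta)


def quantile(digest, q):
    """Estimate a quantile by interpolating between centroid midpoints."""
    total = sum(w for _, w in digest)
    target = q * total
    cum = 0.0
    prev_mean = prev_center = None
    for mean, w in digest:
        center = cum + w / 2
        if target <= center:
            if prev_mean is None:
                return mean
            frac = (target - prev_center) / (center - prev_center)
            return prev_mean + frac * (mean - prev_mean)
        prev_mean, prev_center = mean, center
        cum += w
    return digest[-1][0]


# Four simulated workers, each digesting a quarter of the data.
random.seed(0)
data = [random.random() for _ in range(20000)]
partials = [build_digest(data[i::4]) for i in range(4)]
merged = merge_digests(partials)
approx_median = quantile(merged, 0.5)
```

In this sketch, merging is just compression over the pooled centroids, which is what makes the structure suitable for a map-reduce style ingestion. Histogram bin counts could be derived from the merged digest by estimating the cumulative distribution at the bin edges; that step is left out here.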
