@asantoni asantoni commented Sep 26, 2025

Hi all, I quickly hacked in SAMPLE BY support for model creation and added a missing datetime function. I also implemented support for the SAMPLE clause in queries.

This adds a new "sample_by" parameter to the constructor of BaseMergeTree, which can be used to enable sampling on your table.

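from django.utils import timezone

from clickhouse_backend import models
from clickhouse_backend.models.functions.hashes import farmFingerprint64
# toStartOfMonth is assumed to be exported from the same module as toStartOfDay
from clickhouse_backend.models.functions.datetime import toStartOfDay, toStartOfMonth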

class DemoLog(models.ClickhouseModel):
    timestamp = models.DateTimeField(default=timezone.now)
    ip = models.GenericIPAddressField(default="::")

    class Meta:
        engine = models.MergeTree(
            primary_key=("timestamp", toStartOfDay("timestamp"), farmFingerprint64("ip")),
            order_by=("timestamp", toStartOfDay("timestamp"), farmFingerprint64("ip")),
            partition_by=toStartOfMonth("timestamp"),
            sample_by=(farmFingerprint64("ip"),),
            index_granularity=8192,
        )

You can query with the SAMPLE clause through the new .sample function, like so:

# Read a 10% sample, then scale the count by 10 to estimate the full total
session_count_estimate = (
    DemoLog.objects
    .filter(timestamp__gte=time_start, timestamp__lte=time_end)
    .sample(0.1)
    .aggregate(session_count=Count("id") * 10)
)
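The `* 10` above is just the inverse of the 0.1 sample fraction. A minimal sketch of that scaling, assuming a relative sample fraction between 0 and 1; `scale_sampled_count` is a hypothetical helper for illustration and is not part of this PR:

```python
def scale_sampled_count(sampled_count: int, sample_fraction: float) -> float:
    """Extrapolate a count computed on a sampled subset to the full table.

    Hypothetical helper: a count taken over a fraction k of the data is
    multiplied by 1/k to approximate the count over all of it.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in the range (0, 1]")
    return sampled_count / sample_fraction


# 1234 rows seen in a 10% sample suggests roughly 12340 rows overall
estimate = scale_sampled_count(1234, 0.1)
```

Keeping the fraction and the multiplier in one place avoids the two drifting apart when the sample rate changes.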

The new sample function takes two parameters:

    def sample(self, sample_fraction, sample_offset=None):

which generates either a SAMPLE k or a SAMPLE k OFFSET m clause, as per the ClickHouse docs on SAMPLE.
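To illustrate the two clause shapes, here is a rough sketch of how the parameters could map onto SQL text. `render_sample_clause` is a hypothetical standalone function written for this example; it is not the code in this PR, which hooks into the query compiler instead:

```python
from typing import Optional, Union

Number = Union[int, float]


def render_sample_clause(
    sample_fraction: Number, sample_offset: Optional[Number] = None
) -> str:
    """Build the SAMPLE part of a query (illustrative only).

    Returns "SAMPLE k" when no offset is given, and
    "SAMPLE k OFFSET m" when one is.
    """
    if sample_fraction <= 0:
        raise ValueError("sample_fraction must be positive")
    clause = f"SAMPLE {sample_fraction}"
    if sample_offset is not None:
        if sample_offset < 0:
            raise ValueError("sample_offset must not be negative")
        clause += f" OFFSET {sample_offset}"
    return clause


print(render_sample_clause(0.1))       # SAMPLE 0.1
print(render_sample_clause(0.1, 0.5))  # SAMPLE 0.1 OFFSET 0.5
```

With this mapping, `.sample(0.1)` corresponds to the first form and `.sample(0.1, 0.5)` to the second.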

I didn't include unit tests because I'm too lazy to spin up all the Docker stuff and I'm in a hurry to get a bare minimum working here. I'm hoping this PR is useful for others and for the project. Thanks!
