duckdb-behavioral

Behavioral analytics functions for DuckDB, inspired by ClickHouse.

Provides sessionize, retention, window_funnel, sequence_match, sequence_count, sequence_match_events, and sequence_next_node as a loadable DuckDB extension written in Rust, giving complete parity with ClickHouse's behavioral analytics functions.

Personal Project Disclaimer: This is a personal project developed on my own time. It is not affiliated with, endorsed by, or related to my employer or professional role in any way.

AI-Assisted Development: Built with Claude (Anthropic). Correctness is validated by automated testing — not assumed from AI output. See Quality.

Quick Start

-- Install from the DuckDB Community Extensions repository
INSTALL behavioral FROM community;
LOAD behavioral;

Or build from source:

# Build the extension
cargo build --release

# Load in DuckDB (locally-built extensions require -unsigned)
duckdb -unsigned -cmd "LOAD 'target/release/libbehavioral.so';"

Example queries:

-- Assign session IDs with a 30-minute inactivity gap
SELECT user_id, event_time,
  sessionize(event_time, INTERVAL '30 minutes') OVER (
    PARTITION BY user_id ORDER BY event_time
  ) as session_id
FROM events;

-- Track conversion funnel steps within a 1-hour window
SELECT user_id,
  window_funnel(INTERVAL '1 hour', event_time,
    event_type = 'page_view',
    event_type = 'add_to_cart',
    event_type = 'purchase'
  ) as furthest_step
FROM events
GROUP BY user_id;
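
To make the inactivity-gap rule in the first query concrete, here is a minimal Rust sketch of gap-based session numbering for one user's timestamps. It is an illustration, not the extension's code: the function name `assign_session_ids`, the 1-based numbering, and the choice that a gap must strictly exceed the threshold to open a new session are all assumptions.

```rust
/// Assign session IDs to one user's event timestamps (microseconds),
/// which must already be sorted ascending.
///
/// Assumptions (not confirmed by the README): IDs start at 1, and a new
/// session begins only when the gap to the previous event is strictly
/// greater than `gap_us`.
pub fn assign_session_ids(timestamps_us: &[i64], gap_us: i64) -> Vec<i64> {
    let mut ids = Vec::with_capacity(timestamps_us.len());
    let mut current = 0i64;
    let mut previous: Option<i64> = None;

    for &ts in timestamps_us {
        let starts_new_session = match previous {
            None => true,                     // first event always opens a session
            Some(prev) => ts - prev > gap_us, // inactivity gap exceeded
        };
        if starts_new_session {
            current += 1;
        }
        ids.push(current);
        previous = Some(ts);
    }
    ids
}
```

With a 30-minute gap, events at minutes 0, 10, 45, 50, and 120 fall into sessions 1, 1, 2, 2, 3 under these assumptions.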

Functions

| Function | Signature | Returns | Description |
|---|---|---|---|
| sessionize | (TIMESTAMP, INTERVAL) | BIGINT | Window function assigning session IDs based on inactivity gaps |
| retention | (BOOLEAN, BOOLEAN, ...) | BOOLEAN[] | Cohort retention analysis |
| window_funnel | (INTERVAL [, VARCHAR], TIMESTAMP, BOOLEAN, ...) | INTEGER | Conversion funnel step tracking with 6 combinable modes |
| sequence_match | (VARCHAR, TIMESTAMP, BOOLEAN, ...) | BOOLEAN | NFA-based pattern matching over event sequences |
| sequence_count | (VARCHAR, TIMESTAMP, BOOLEAN, ...) | BIGINT | Count non-overlapping pattern matches |
| sequence_match_events | (VARCHAR, TIMESTAMP, BOOLEAN, ...) | LIST(TIMESTAMP) | Return matched condition timestamps |
| sequence_next_node | (VARCHAR, VARCHAR, TIMESTAMP, VARCHAR, BOOLEAN, ...) | VARCHAR | Next event value after pattern match |

Every condition-based function accepts 2 to 32 boolean conditions, matching ClickHouse's limit; sessionize takes no conditions. Detailed documentation, examples, and edge-case behavior for each function: Function Reference
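
As a rough picture of what window_funnel's furthest-step result means, the sketch below computes funnel depth for one user in Rust. It is a simplified model of the default behavior only: the `Event` type, the `funnel_depth` name, and the rule that every step must fall within the window measured from the chain's first step are assumptions, and none of the six modes are modeled.

```rust
/// Hypothetical event: a timestamp plus one bit per boolean condition.
#[derive(Clone, Copy)]
pub struct Event {
    pub ts_us: i64,
    pub conditions: u32,
}

/// Return the deepest funnel step reached (0 = no step matched).
/// `events` must be sorted by timestamp; `steps` is the condition count.
pub fn funnel_depth(events: &[Event], window_us: i64, steps: usize) -> usize {
    assert!(steps <= 32, "conditions are packed into a u32");
    // chain_start[i]: start time of the most recent chain that reached step i.
    let mut chain_start: Vec<Option<i64>> = vec![None; steps];
    let mut best = 0usize;

    for ev in events {
        // Visit steps from last to first so that a single event can
        // advance a chain by at most one step.
        for step in (0..steps).rev() {
            if ev.conditions & (1u32 << step) == 0 {
                continue;
            }
            if step == 0 {
                chain_start[0] = Some(ev.ts_us);
                best = best.max(1);
            } else if let Some(start) = chain_start[step - 1] {
                if ev.ts_us - start <= window_us {
                    chain_start[step] = Some(start);
                    best = best.max(step + 1);
                }
            }
        }
    }
    best
}
```

For the three-step funnel in Quick Start, a user who views a page, adds to cart ten minutes later, and purchases ninety minutes after the page view would reach step 2 with a one-hour window in this model.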

Performance

All measurements below are Criterion.rs 0.8.2 with 95% confidence intervals, validated across multiple runs on commodity hardware.

| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| sessionize | 1 billion | 1.20 s | 830 Melem/s |
| retention (combine) | 100 million | 274 ms | 365 Melem/s |
| window_funnel | 100 million | 791 ms | 126 Melem/s |
| sequence_match | 100 million | 1.05 s | 95 Melem/s |
| sequence_count | 100 million | 1.18 s | 85 Melem/s |
| sequence_match_events | 100 million | 1.07 s | 93 Melem/s |
| sequence_next_node | 10 million | 546 ms | 18 Melem/s |

Key design choices:

  • 16-byte Copy events with u32 bitmask conditions — four events per cache line, zero heap allocation per event
  • O(1) combine for sessionize and retention via boundary tracking and bitmask OR
  • In-place combine for event-collecting functions — O(N) amortized instead of O(N^2) from repeated allocation
  • NFA fast paths — common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
  • Presorted detection — O(n) check skips O(n log n) sort when events arrive in timestamp order
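
Two of the design choices above, the 16-byte Copy event and presorted detection, can be sketched in a few lines of Rust. The field names, the `Event::new` constructor, and `sort_if_needed` are hypothetical and not taken from the extension's source; the 16-byte size check assumes a 64-bit target where `i64` is 8-byte aligned.

```rust
use std::mem::size_of;

/// Hypothetical event layout: 8-byte timestamp, 4-byte condition bitmask,
/// 4 bytes of padding. Four of these fit in a 64-byte cache line.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Event {
    pub timestamp_us: i64,
    pub conditions: u32,
}

// Compile-time check of the size claim (holds on 64-bit targets).
const _: () = assert!(size_of::<Event>() == 16);

impl Event {
    /// Pack up to 32 boolean conditions into the bitmask; no heap allocation.
    pub fn new(timestamp_us: i64, flags: &[bool]) -> Self {
        assert!(flags.len() <= 32, "at most 32 conditions fit in a u32");
        let mut conditions = 0u32;
        for (i, &flag) in flags.iter().enumerate() {
            if flag {
                conditions |= 1u32 << i;
            }
        }
        Event { timestamp_us, conditions }
    }

    /// Test condition `i` (0-based, must be below 32).
    pub fn condition(&self, i: u32) -> bool {
        self.conditions & (1u32 << i) != 0
    }
}

/// O(n) presorted check; fall back to an O(n log n) sort only when needed.
/// Returns true when the input was already in timestamp order.
pub fn sort_if_needed(events: &mut [Event]) -> bool {
    let presorted = events
        .windows(2)
        .all(|pair| pair[0].timestamp_us <= pair[1].timestamp_us);
    if !presorted {
        events.sort_by_key(|e| e.timestamp_us);
    }
    presorted
}
```

Because the struct is `Copy` and holds no pointers, a `Vec<Event>` is one contiguous allocation regardless of how many conditions each event carries.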

Optimization highlights:

| Optimization | Speedup | Technique |
|---|---|---|
| Event bitmask | 5–13x | `Vec<bool>` replaced with u32 bitmask, enabling Copy semantics |
| In-place combine | up to 2,436x | O(N) amortized extend instead of O(N^2) merge-allocate |
| NFA lazy matching | 1,961x at 1M events | Swapped exploration order so `.*` tries advancing before consuming |
| `Arc<str>` values | 2.1–5.8x | Reference-counted strings for O(1) clone in sequence_next_node |
| NFA fast paths | 39–61% | Pattern classification dispatches common shapes to O(n) linear scans |
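
The two combine strategies named above (bitmask OR for retention, in-place extend for event-collecting functions) might look roughly like this in Rust. The type and method names are hypothetical, and the retention finalization follows the ClickHouse convention that result `i` is true only when both condition 1 and condition `i` were seen.

```rust
/// Hypothetical retention state: one bit per condition ever observed.
#[derive(Default)]
pub struct RetentionState {
    pub seen: u32,
}

impl RetentionState {
    pub fn update(&mut self, conditions: u32) {
        self.seen |= conditions;
    }

    /// O(1) combine: merging two partial states is a single bitwise OR.
    pub fn combine(&mut self, other: &RetentionState) {
        self.seen |= other.seen;
    }

    /// result[i] is true when condition 0 (the anchor) and condition i
    /// were both seen somewhere in the group.
    pub fn finalize(&self, n: usize) -> Vec<bool> {
        let anchor = (self.seen & 1) != 0;
        (0..n)
            .map(|i| anchor && (self.seen & (1u32 << i)) != 0)
            .collect()
    }
}

/// Hypothetical state for event-collecting functions.
#[derive(Default)]
pub struct EventLog {
    pub events: Vec<(i64, u32)>, // (timestamp_us, condition bitmask)
}

impl EventLog {
    /// In-place combine: append to the existing buffer. Vec growth is
    /// amortized, so folding many partial states costs O(N) in total.
    pub fn combine_in_place(&mut self, other: &EventLog) {
        self.events.extend_from_slice(&other.events);
    }

    /// Merge-allocate, shown for contrast: every call copies everything
    /// accumulated so far into a fresh Vec, which is O(N^2) over a fold.
    pub fn combine_allocating(&self, other: &EventLog) -> EventLog {
        let mut merged = Vec::with_capacity(self.events.len() + other.events.len());
        merged.extend_from_slice(&self.events);
        merged.extend_from_slice(&other.events);
        EventLog { events: merged }
    }
}
```

Both `EventLog` methods produce the same contents; the difference is only in how much data is recopied when many partial states are folded into one.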

Five attempted optimizations were measured, found to be regressions, and reverted. All negative results are documented in PERF.md.

Full methodology, per-session optimization history with confidence intervals, and reproducible benchmark instructions: PERF.md.
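
For readers unfamiliar with the lazy-matching idea listed among the optimizations, the following Rust sketch shows a tiny backtracking matcher in which `.*` first tries to advance to the next pattern step and only consumes an event if that fails. It is a toy model, not the extension's NFA: the `Step` enum, the function names, and the reduction of each event to a condition bitmask are assumptions, and time constraints are left out.

```rust
/// Hypothetical pattern step: `(?N)` as Cond(N-1), `.*` as AnyEvents.
pub enum Step {
    Cond(u32),
    AnyEvents,
}

/// Try to match `pattern` starting exactly at the front of `events`
/// (each event reduced to its condition bitmask).
fn matches_from(pattern: &[Step], events: &[u32]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: matched. Trailing events are allowed.
        None => true,
        Some((Step::Cond(i), rest)) => match events.split_first() {
            Some((&ev, tail)) => (ev & (1u32 << *i)) != 0 && matches_from(rest, tail),
            None => false,
        },
        Some((Step::AnyEvents, rest)) => {
            // Lazy order: first try to move past `.*` without consuming
            // anything; only on failure consume one event and retry.
            matches_from(rest, events)
                || (!events.is_empty() && matches_from(pattern, &events[1..]))
        }
    }
}

/// True when the pattern matches starting at any position in the sequence.
pub fn sequence_matches(pattern: &[Step], events: &[u32]) -> bool {
    (0..=events.len()).any(|start| matches_from(pattern, &events[start..]))
}
```

With the greedy order (consume first, advance second), a pattern such as `(?1).*(?2)` would walk to the end of the input before checking for `(?2)`; the lazy order stops at the first event that satisfies it.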

Community Extension

This extension is listed in the DuckDB Community Extensions repository (PR #1306, merged 2026-02-15). Install with:

INSTALL behavioral FROM community;
LOAD behavioral;

No build tools, compilation, or -unsigned flag required.

Update Process

The community-submission.yml workflow automates the full pre-submission pipeline in 5 phases:

| Phase | Purpose |
|---|---|
| Validate | description.yml schema, version consistency, required files |
| Quality Gate | cargo test, clippy, fmt, doc |
| Build & Test | make configure && make release && make test_release |
| Pin Ref | Updates description.yml ref to the validated commit SHA |
| Submission Package | Uploads artifact, generates step-by-step PR commands |

Updating the Published Extension

Push changes to this repository, re-run the submission workflow to pin the new ref, then open a new PR against duckdb/community-extensions updating the ref field in extensions/behavioral/description.yml. When DuckDB releases a new version, update libduckdb-sys, TARGET_DUCKDB_VERSION, and the extension-ci-tools submodule.

Quality

| Metric | Value |
|---|---|
| Unit tests | 453 + 1 doc-test |
| E2E tests | 27 (against real DuckDB CLI) |
| Property-based tests | 26 (proptest) |
| Mutation testing | 88.4% kill rate (130/147, cargo-mutants) |
| Clippy warnings | 0 (pedantic + nursery + cargo lint groups) |
| CI jobs | 13 (check, test, clippy, fmt, doc, MSRV, bench, deny, semver, coverage, cross-platform, extension-build) |
| Benchmark files | 7 (Criterion.rs, up to 1 billion elements) |
| Release platforms | 4 (Linux x86_64/ARM64, macOS x86_64/ARM64) |

CI runs on every push and PR: 6 workflows across .github/workflows/ including E2E tests against real DuckDB, CodeQL static analysis, SemVer validation, and 4-platform release builds with provenance attestation.

ClickHouse Parity Status

COMPLETE — All ClickHouse behavioral analytics functions are implemented.

| Function | Status |
|---|---|
| retention | Complete |
| window_funnel (6 modes) | Complete |
| sequence_match | Complete |
| sequence_count | Complete |
| sequence_match_events | Complete |
| sequence_next_node | Complete |
| 32-condition support | Complete |
| sessionize | Extension-only (no ClickHouse equivalent) |

Building

Prerequisites: Rust 1.84.1+ (MSRV), a C compiler (for DuckDB sys bindings)

# Build the extension (release mode)
cargo build --release

# The loadable extension will be at:
# target/release/libbehavioral.so   (Linux)
# target/release/libbehavioral.dylib (macOS)

Development

cargo test                  # Unit tests + doc-tests
cargo clippy --all-targets  # Zero warnings required
cargo fmt                   # Format
cargo bench                 # Criterion.rs benchmarks

# Build extension via community Makefile
git submodule update --init
make configure && make release && make test_release

This project follows Semantic Versioning. See the versioning policy for the full SemVer rules applied to SQL function signatures.

Requirements

  • Rust 1.84.1+ (MSRV)
  • DuckDB 1.5.0 (pinned dependency)
  • Python 3.x (for extension metadata tooling)

License

MIT
