Behavioral analytics functions for DuckDB, inspired by ClickHouse. Provides `sessionize`, `retention`, `window_funnel`, `sequence_match`, `sequence_count`, `sequence_match_events`, and `sequence_next_node` as a loadable DuckDB extension written in Rust. Complete ClickHouse behavioral analytics parity.
**Personal Project Disclaimer:** This is a personal project developed on my own time. It is not affiliated with, endorsed by, or related to my employer or professional role in any way.

**AI-Assisted Development:** Built with Claude (Anthropic). Correctness is validated by automated testing — not assumed from AI output. See Quality.
- Quick Start
- Functions
- Performance
- Community Extension
- Quality
- ClickHouse Parity Status
- Building
- Development
- Documentation
- Requirements
- License
## Quick Start

```sql
-- Install from the DuckDB Community Extensions repository
INSTALL behavioral FROM community;
LOAD behavioral;
```

Or build from source:

```sh
# Build the extension
cargo build --release

# Load in DuckDB (locally-built extensions require -unsigned)
duckdb -unsigned -cmd "LOAD 'target/release/libbehavioral.so';"
```

```sql
-- Assign session IDs with a 30-minute inactivity gap
SELECT user_id, event_time,
       sessionize(event_time, INTERVAL '30 minutes') OVER (
           PARTITION BY user_id ORDER BY event_time
       ) AS session_id
FROM events;

-- Track conversion funnel steps within a 1-hour window
SELECT user_id,
       window_funnel(INTERVAL '1 hour', event_time,
           event_type = 'page_view',
           event_type = 'add_to_cart',
           event_type = 'purchase'
       ) AS furthest_step
FROM events
GROUP BY user_id;
```

## Functions

| Function | Signature | Returns | Description |
|---|---|---|---|
| `sessionize` | `(TIMESTAMP, INTERVAL)` | `BIGINT` | Window function assigning session IDs based on inactivity gaps |
| `retention` | `(BOOLEAN, BOOLEAN, ...)` | `BOOLEAN[]` | Cohort retention analysis |
| `window_funnel` | `(INTERVAL [, VARCHAR], TIMESTAMP, BOOLEAN, ...)` | `INTEGER` | Conversion funnel step tracking with 6 combinable modes |
| `sequence_match` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `BOOLEAN` | NFA-based pattern matching over event sequences |
| `sequence_count` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `BIGINT` | Count non-overlapping pattern matches |
| `sequence_match_events` | `(VARCHAR, TIMESTAMP, BOOLEAN, ...)` | `LIST(TIMESTAMP)` | Return matched condition timestamps |
| `sequence_next_node` | `(VARCHAR, VARCHAR, TIMESTAMP, VARCHAR, BOOLEAN, ...)` | `VARCHAR` | Next event value after pattern match |
All functions support 2 to 32 boolean conditions, matching ClickHouse's limit. Detailed documentation, examples, and edge-case behavior for each function: Function Reference.

## Performance

All measurements below are from Criterion.rs 0.8.2 with 95% confidence intervals, validated across multiple runs on commodity hardware.
| Function | Scale | Wall Clock | Throughput |
|---|---|---|---|
| `sessionize` | 1 billion | 1.20 s | 830 Melem/s |
| `retention` (combine) | 100 million | 274 ms | 365 Melem/s |
| `window_funnel` | 100 million | 791 ms | 126 Melem/s |
| `sequence_match` | 100 million | 1.05 s | 95 Melem/s |
| `sequence_count` | 100 million | 1.18 s | 85 Melem/s |
| `sequence_match_events` | 100 million | 1.07 s | 93 Melem/s |
| `sequence_next_node` | 10 million | 546 ms | 18 Melem/s |
Key design choices:

- 16-byte `Copy` events with `u32` bitmask conditions — four events per cache line, zero heap allocation per event
- O(1) combine for `sessionize` and `retention` via boundary tracking and bitmask OR
- In-place combine for event-collecting functions — O(N) amortized instead of O(N^2) from repeated allocation
- NFA fast paths — common pattern shapes dispatch to specialized O(n) linear scans instead of full NFA backtracking
- Presorted detection — O(n) check skips O(n log n) sort when events arrive in timestamp order
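The event layout and presorted check can be sketched roughly as follows. This is an illustrative sketch, not the extension's actual source; the `Event` struct and `is_presorted` helper are hypothetical names:

```rust
// Illustrative sketch (hypothetical names, not the extension's actual code):
// a Copy event packing a timestamp and a u32 condition bitmask into 16 bytes.
#[derive(Clone, Copy)]
struct Event {
    micros: i64,     // event timestamp, microseconds since epoch
    conditions: u32, // bit i set => boolean condition i matched (up to 32)
}

// Presorted detection: one O(n) pass; when it returns true, the
// O(n log n) timestamp sort can be skipped entirely.
fn is_presorted(events: &[Event]) -> bool {
    events.windows(2).all(|w| w[0].micros <= w[1].micros)
}

fn main() {
    // i64 alignment pads the struct to 16 bytes: four events per 64-byte cache line.
    assert_eq!(std::mem::size_of::<Event>(), 16);

    let events = [
        Event { micros: 10, conditions: 0b01 }, // condition 0 matched
        Event { micros: 20, conditions: 0b10 }, // condition 1 matched
    ];
    assert!(is_presorted(&events));

    // Retention-style state merges are a single bitmask OR.
    let merged = events[0].conditions | events[1].conditions;
    assert_eq!(merged, 0b11);
    println!("ok");
}
```

Because `Event` is `Copy` with no heap-owned fields, moving or duplicating events during sorting and combining never allocates.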
Optimization highlights:
| Optimization | Speedup | Technique |
|---|---|---|
| Event bitmask | 5–13x | Vec<bool> replaced with u32 bitmask, enabling Copy semantics |
| In-place combine | up to 2,436x | O(N) amortized extend instead of O(N^2) merge-allocate |
| NFA lazy matching | 1,961x at 1M events | Swapped exploration order so .* tries advancing before consuming |
| `Arc<str>` values | 2.1–5.8x | Reference-counted strings for O(1) clone in `sequence_next_node` |
| NFA fast paths | 39–61% | Pattern classification dispatches common shapes to O(n) linear scans |
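The in-place combine idea can be sketched as below. This is a hypothetical helper, not the project's actual code; it assumes event order is restored by a later sort pass:

```rust
// Illustrative sketch of in-place combine (hypothetical helper).
// Appending the smaller buffer into the larger one keeps N combines at
// O(N) amortized, versus O(N^2) when every combine allocates a fresh
// merged vector.
fn combine_in_place(target: &mut Vec<i64>, mut source: Vec<i64>) {
    // Keep the larger allocation; ordering is restored by a later sort.
    if source.len() > target.len() {
        std::mem::swap(target, &mut source);
    }
    target.extend(source);
}

fn main() {
    let mut acc: Vec<i64> = vec![10, 20, 30];
    combine_in_place(&mut acc, vec![40, 50]);
    assert_eq!(acc, vec![10, 20, 30, 40, 50]);

    // When the accumulator is the smaller side, buffers are swapped first.
    let mut small: Vec<i64> = vec![99];
    combine_in_place(&mut small, vec![1, 2, 3]);
    assert_eq!(small, vec![1, 2, 3, 99]);
    println!("ok");
}
```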
Five attempted optimizations were measured, found to be regressions, and reverted.
All negative results are documented in PERF.md.
Full methodology, per-session optimization history with confidence intervals, and
reproducible benchmark instructions: PERF.md.
## Community Extension

This extension is listed in the DuckDB Community Extensions repository (PR #1306, merged 2026-02-15). Install with:

```sql
INSTALL behavioral FROM community;
LOAD behavioral;
```

No build tools, compilation, or `-unsigned` flag required.
The `community-submission.yml` workflow automates the full pre-submission pipeline in 5 phases:
| Phase | Purpose |
|---|---|
| Validate | description.yml schema, version consistency, required files |
| Quality Gate | cargo test, clippy, fmt, doc |
| Build & Test | make configure && make release && make test_release |
| Pin Ref | Updates description.yml ref to the validated commit SHA |
| Submission Package | Uploads artifact, generates step-by-step PR commands |
To release an update: push changes to this repository, re-run the submission workflow to pin the new ref, then open a new PR against `duckdb/community-extensions` updating the `ref` field in `extensions/behavioral/description.yml`. When DuckDB releases a new version, update `libduckdb-sys`, `TARGET_DUCKDB_VERSION`, and the `extension-ci-tools` submodule.
## Quality
| Metric | Value |
|---|---|
| Unit tests | 453 + 1 doc-test |
| E2E tests | 27 (against real DuckDB CLI) |
| Property-based tests | 26 (proptest) |
| Mutation testing | 88.4% kill rate (130/147, cargo-mutants) |
| Clippy warnings | 0 (pedantic + nursery + cargo lint groups) |
| CI jobs | 13 (check, test, clippy, fmt, doc, MSRV, bench, deny, semver, coverage, cross-platform, extension-build) |
| Benchmark files | 7 (Criterion.rs, up to 1 billion elements) |
| Release platforms | 4 (Linux x86_64/ARM64, macOS x86_64/ARM64) |
CI runs on every push and PR: 6 workflows across .github/workflows/ including
E2E tests against real DuckDB, CodeQL static analysis, SemVer validation, and
4-platform release builds with provenance attestation.
## ClickHouse Parity Status

COMPLETE — All ClickHouse behavioral analytics functions are implemented.

| Function | Status |
|---|---|
| `retention` | Complete |
| `window_funnel` (6 modes) | Complete |
| `sequence_match` | Complete |
| `sequence_count` | Complete |
| `sequence_match_events` | Complete |
| `sequence_next_node` | Complete |
| 32-condition support | Complete |
| `sessionize` | Extension-only (no ClickHouse equivalent) |
## Building

Prerequisites: Rust 1.84.1+ (MSRV), a C compiler (for DuckDB sys bindings)

```sh
# Build the extension (release mode)
cargo build --release

# The loadable extension will be at:
#   target/release/libbehavioral.so    (Linux)
#   target/release/libbehavioral.dylib (macOS)
```

## Development

```sh
cargo test                   # Unit tests + doc-tests
cargo clippy --all-targets   # Zero warnings required
cargo fmt                    # Format
cargo bench                  # Criterion.rs benchmarks

# Build extension via community Makefile
git submodule update --init
make configure && make release && make test_release
```

This project follows Semantic Versioning. See the versioning policy for the full SemVer rules applied to SQL function signatures.
## Documentation
- Getting Started — installation, loading, troubleshooting
- Function Reference — detailed docs for all 7 functions
- Use Cases — 5 complete real-world examples with sample data
- Engineering Overview — architecture, testing philosophy, design trade-offs
- Performance — benchmarks, optimization history, methodology
- ClickHouse Compatibility — syntax mapping, semantic parity
- Contributing — development setup, testing, PR process
## Requirements
- Rust 1.84.1+ (MSRV)
- DuckDB 1.5.0 (pinned dependency)
- Python 3.x (for extension metadata tooling)
## License

MIT