| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +This is a DuckDB extension called "stochastic", developed by Query.Farm, that adds a comprehensive set of statistical distribution functions to DuckDB. The extension enables advanced statistical analysis, probability calculations, and random sampling directly within SQL queries. |
| 8 | + |
| 9 | +The extension supports 20+ probability distributions (normal, beta, binomial, poisson, etc.) with functions for PDF/PMF, CDF, quantile, sampling, and distribution properties (mean, variance, etc.). |
| 10 | + |
| 11 | +## Build Commands |
| 12 | + |
| 13 | +This project uses a Makefile wrapper around CMake. All build commands are run from the project root: |
| 14 | + |
| 15 | +```bash |
| 16 | +# Build release version (default); required before running tests |
| 17 | +VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake GEN=ninja make release |
| 18 | + |
| 19 | +# Build debug version |
| 20 | +VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake GEN=ninja make debug |
| 21 | + |
| 22 | +# Run tests (uses release build) |
| 23 | +make test |
| 24 | + |
| 25 | +# Run tests with debug build |
| 26 | +make test_debug |
| 27 | +``` |
| 28 | + |
| 29 | +### Running a Single Test |
| 30 | + |
| 31 | +To run a specific test file: |
| 32 | +```bash |
| 33 | +./build/release/test/unittest "test/sql/normal.test" |
| 34 | +``` |
| 35 | + |
| 36 | + |
| 37 | +All extension functions should be documented inside DuckDB using `CreateScalarFunctionInfo`, `CreateAggregateFunctionInfo`, or the appropriate info type for the function. The documentation should include examples, parameter names, and parameter types, and each function should be assigned a category. |
| 38 | + |
| 39 | +When making changes, always update the version to the current date followed by an ordinal counter, in the form `YYYYMMDDCC`. |
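The `YYYYMMDDCC` format can be composed arithmetically from a date and a counter. A minimal sketch, assuming a hypothetical helper `MakeVersion` that is not part of the extension, and assuming the counter fits in two digits:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs a date and a per-day ordinal counter into YYYYMMDDCC.
// Assumption: the counter occupies exactly two digits (0-99).
std::int64_t MakeVersion(int year, int month, int day, int counter) {
    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);
    assert(counter >= 0 && counter <= 99);
    // YYYYMMDD as a single integer, then shifted left by two decimal digits.
    std::int64_t date = static_cast<std::int64_t>(year) * 10000 + month * 100 + day;
    return date * 100 + counter;
}
```

For example, `MakeVersion(2025, 1, 15, 1)` yields `2025011501`.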
| 40 | + |
| 41 | + |
| 42 | +## Architecture |
| 43 | + |
| 44 | +### Extension Structure |
| 45 | + |
| 46 | +The extension follows the DuckDB extension template structure with these key components: |
| 47 | + |
| 48 | +**Entry Point (`src/stochastic_extension.cpp`)**: Registers all distribution functions via `Load_<distribution>_distribution()` functions. Also includes Query.Farm telemetry (users can opt out by setting the `QUERY_FARM_TELEMETRY_OPT_OUT` environment variable). |
| 49 | + |
| 50 | +**Distribution Implementations (`src/distribution_*.cpp`)**: Each distribution is in its own file (e.g., `distribution_normal.cpp`, `distribution_beta.cpp`). Each file: |
| 51 | +- Uses preprocessor macros to define distribution name and type mappings |
| 52 | +- Specializes `distribution_traits` template for both Boost.Math and Boost.Random versions |
| 53 | +- Registers all functions (sample, pdf, cdf, quantile, mean, variance, etc.) using the `REGISTER` macro |
| 54 | + |
| 55 | +**Function Registration System (`src/include/utils.hpp`)**: Template-based system that: |
| 56 | +- Uses `distribution_traits` to determine parameter types and names |
| 57 | +- Automatically generates function names with pattern: `dist_<prefix>_<function>` (e.g., `dist_normal_pdf`) |
| 58 | +- Handles both unary (1 param) and binary (2 param) distributions |
| 59 | +- Optimizes for constant vectors by creating distribution once and reusing |
| 60 | + |
| 61 | +**RNG Management (`src/rng_utils.cpp`, `src/include/rng_utils.hpp`)**: Thread-local Mersenne Twister RNG (`boost::random::mt19937`) with: |
| 62 | +- Fixed global seed (12345) for reproducibility |
| 63 | +- Unique per-thread seeding to ensure thread-safety |
| 64 | +- Thread index mapping for deterministic parallel execution |
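The RNG scheme described above can be sketched with the standard library. This is an illustration only: it uses `std::mt19937` in place of `boost::random::mt19937` to stay dependency-free, and the way the global seed is combined with the thread index (`std::seed_seq` here) is an assumption, as are the names `MakeThreadRng` and `ThreadLocalRng`.

```cpp
#include <cstdint>
#include <random>

// The seed value comes from the text above; everything else is hypothetical.
constexpr std::uint32_t kGlobalSeed = 12345;

// Builds an engine whose state depends on both the global seed and the
// thread index, so each thread gets a distinct but reproducible stream.
inline std::mt19937 MakeThreadRng(std::uint32_t thread_index) {
    std::seed_seq seq{kGlobalSeed, thread_index};
    return std::mt19937(seq);
}

// Returns the calling thread's engine. The engine is seeded on the first
// call made by that thread; later calls ignore thread_index and reuse it.
inline std::mt19937 &ThreadLocalRng(std::uint32_t thread_index) {
    thread_local std::mt19937 rng = MakeThreadRng(thread_index);
    return rng;
}
```

Because each thread owns its engine, no lock is needed when drawing samples, and the same thread index always reproduces the same stream.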
| 65 | + |
| 66 | +**Type System (`src/include/distribution_traits.hpp`, `src/include/callable_traits.hpp`)**: |
| 67 | +- `distribution_traits`: Maps C++ types to DuckDB LogicalTypes |
| 68 | +- `callable_traits`: Template metaprogramming for extracting function signatures |
| 69 | +- `logical_type_map`: Supports double, int64_t, and pair<double,double> (for ranges) |
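The type mapping can be pictured as a set of template specializations. The sketch below is a hypothetical stand-in: the extension's `logical_type_map` presumably produces DuckDB `LogicalType` values, whereas this version returns type names as strings so that it compiles without DuckDB headers. The representation chosen for `pair<double,double>` is an assumption.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Primary template is declared but not defined, so using an unsupported
// C++ type is a compile-time error rather than a runtime surprise.
template <typename T>
struct logical_type_map;

template <>
struct logical_type_map<double> {
    static std::string Name() { return "DOUBLE"; }
};

template <>
struct logical_type_map<std::int64_t> {
    static std::string Name() { return "BIGINT"; }
};

template <>
struct logical_type_map<std::pair<double, double>> {
    // Assumption: a range is exposed as a two-element array of doubles.
    static std::string Name() { return "DOUBLE[2]"; }
};
```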
| 70 | + |
| 71 | +### Adding a New Distribution |
| 72 | + |
| 73 | +To add a new distribution: |
| 74 | + |
| 75 | +1. Create `src/distribution_<name>.cpp` following the pattern in existing files |
| 76 | +2. Define the distribution traits with parameter names, types, and validation |
| 77 | +3. Register all desired functions (sample, pdf, cdf, etc.) in the `Load_<name>_distribution` function |
| 78 | +4. Add the source file to `EXTENSION_SOURCES` in `CMakeLists.txt` |
| 79 | +5. Add forward declaration and call to `Load_<name>_distribution()` in `src/stochastic_extension.cpp` |
| 80 | +6. Create test file `test/sql/<name>.test` with comprehensive test cases |
| 81 | + |
| 82 | +### Testing |
| 83 | + |
| 84 | +Tests use DuckDB's `.test` format (see `test/sql/normal.test` for examples). Test files: |
| 85 | +- Start with metadata comments (`# name:`, `# description:`, `# group:`) |
| 86 | +- Use `require stochastic` to ensure extension is loaded |
| 87 | +- Support `statement error`, `query R` (float), `query I` (integer), `query T` (text) |
| 88 | +- Can test multiple columns with `query RR`, `query RRR`, etc. |
| 89 | + |
| 90 | +### Dependencies |
| 91 | + |
| 92 | +The extension depends on: |
| 93 | +- **Boost.Math**: For statistical distribution calculations (PDF, CDF, quantile) |
| 94 | +- **Boost.Random**: For random number generation from distributions |
| 95 | +- Dependencies are managed via vcpkg (see `vcpkg.json`) |
| 96 | + |
| 97 | +### Build System |
| 98 | + |
| 99 | +The project uses: |
| 100 | +- **CMakeLists.txt**: Defines extension sources and links Boost libraries |
| 101 | +- **Makefile**: Thin wrapper that delegates to `extension-ci-tools/makefiles/duckdb_extension.Makefile` |
| 102 | +- **extension_config.cmake**: Tells DuckDB build system to load the stochastic extension |
| 103 | +- **extension-ci-tools/**: Git submodule with shared DuckDB extension build infrastructure |
| 104 | + |
| 105 | +The `duckdb/` directory is a git submodule containing the DuckDB source code that the extension builds against. |
| 106 | + |
| 107 | +### Function Naming Convention |
| 108 | + |
| 109 | +All functions follow the pattern: `dist_<distribution>_<operation>` |
| 110 | + |
| 111 | +Examples: |
| 112 | +- `dist_normal_sample(mean, stddev)` - Generate random sample |
| 113 | +- `dist_normal_pdf(mean, stddev, x)` - Probability density at x |
| 114 | +- `dist_normal_cdf(mean, stddev, x)` - Cumulative distribution at x |
| 115 | +- `dist_normal_quantile(mean, stddev, p)` - Inverse CDF (p-th quantile) |
| 116 | +- `dist_normal_mean(mean, stddev)` - Distribution mean |
| 117 | +- `dist_normal_variance(mean, stddev)` - Distribution variance |
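To make the quantities behind `dist_normal_pdf` and `dist_normal_cdf` concrete, here is a small sketch of the closed-form normal density and cumulative distribution, using the same `(mean, stddev, x)` argument order. The extension itself delegates to Boost.Math; the function names `NormalPdf` and `NormalCdf` are illustrative only.

```cpp
#include <cmath>

// Density of a normal distribution at x:
//   exp(-z^2 / 2) / (stddev * sqrt(2 * pi)), with z = (x - mean) / stddev.
double NormalPdf(double mean, double stddev, double x) {
    const double pi = 3.14159265358979323846;
    const double z = (x - mean) / stddev;
    return std::exp(-0.5 * z * z) / (stddev * std::sqrt(2.0 * pi));
}

// Cumulative distribution at x, expressed through the complementary
// error function: 0.5 * erfc(-z / sqrt(2)).
double NormalCdf(double mean, double stddev, double x) {
    const double z = (x - mean) / stddev;
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}
```

For the standard normal distribution, `NormalCdf(0.0, 1.0, 0.0)` evaluates to 0.5, since half of the probability mass lies below the mean.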
| 118 | + |
| 119 | +### Code Organization |
| 120 | + |
| 121 | +``` |
| 122 | +src/ |
| 123 | +├── stochastic_extension.cpp # Extension entry point |
| 124 | +├── distribution_*.cpp # Individual distribution implementations (20+ files) |
| 125 | +├── rng_utils.cpp # Thread-local RNG management |
| 126 | +├── query_farm_telemetry.cpp # Optional telemetry |
| 127 | +└── include/ |
| 128 | + ├── utils.hpp # Template-based function registration system |
| 129 | + ├── distribution_traits.hpp # Type system for distributions |
| 130 | + ├── callable_traits.hpp # Template metaprogramming utilities |
| 131 | + ├── rng_utils.hpp # RNG declarations |
| 132 | + └── ... |
| 133 | + |
| 134 | +test/sql/ |
| 135 | +└── *.test # DuckDB test files |
| 136 | + |
| 137 | +duckdb/ # Git submodule - DuckDB source |
| 138 | +extension-ci-tools/ # Git submodule - shared build infrastructure |
| 139 | +``` |
| 140 | + |
| 141 | +### Key Implementation Details |
| 142 | + |
| 143 | +**Constant Vector Optimization**: When all parameters are constant, the distribution object is created once and reused for all rows (see `DistributionSampleUnary` and `DistributionSampleBinary` in `utils.hpp`). |
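The idea behind this optimization can be sketched outside DuckDB. The functions below are hypothetical and operate on plain `std::vector`s with `std::normal_distribution`; the real `DistributionSampleUnary` and `DistributionSampleBinary` work on DuckDB vectors and Boost distributions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Constant parameters: the distribution object is constructed once
// and reused for every output row.
std::vector<double> SampleNormalConstant(double mean, double stddev,
                                         std::size_t count, std::mt19937 &rng) {
    std::normal_distribution<double> dist(mean, stddev);
    std::vector<double> out(count);
    for (auto &value : out) {
        value = dist(rng);
    }
    return out;
}

// Varying parameters: a distribution object is constructed for each row.
std::vector<double> SampleNormalPerRow(const std::vector<double> &means,
                                       const std::vector<double> &stddevs,
                                       std::mt19937 &rng) {
    assert(means.size() == stddevs.size());
    std::vector<double> out(means.size());
    for (std::size_t i = 0; i < means.size(); ++i) {
        std::normal_distribution<double> dist(means[i], stddevs[i]);
        out[i] = dist(rng);
    }
    return out;
}
```

The constant path avoids repeated construction cost. Note that the two paths are not guaranteed to produce identical sample sequences from the same seed, because a distribution object may carry internal state between draws.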
| 144 | + |
| 145 | +**Parameter Validation**: Each distribution validates its parameters in `distribution_traits::ValidateParameters()` and throws `InvalidInputException` for invalid inputs. |
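A validation routine of this kind might look like the sketch below. The specific checks are assumptions for a normal distribution, the function name is hypothetical, and `std::invalid_argument` stands in for DuckDB's `InvalidInputException` so the example compiles on its own.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical validation for a normal distribution's parameters.
// Rejects non-finite values and a non-positive standard deviation.
void ValidateNormalParameters(double mean, double stddev) {
    if (!std::isfinite(mean)) {
        throw std::invalid_argument("normal: mean must be finite");
    }
    if (!std::isfinite(stddev) || stddev <= 0.0) {
        throw std::invalid_argument("normal: stddev must be finite and > 0");
    }
}
```

Validating before constructing the distribution keeps error messages specific to the offending parameter instead of surfacing a generic library error.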
| 146 | + |
| 147 | +**Thread Safety**: The thread-local RNG ensures thread-safe random number generation without locks in the hot path. |
| 148 | + |
| 149 | +**Telemetry**: The extension sends anonymous usage telemetry to Query.Farm once per extension load, including the DuckDB version, platform, and extension version. Users can opt out via the `QUERY_FARM_TELEMETRY_OPT_OUT` environment variable. The telemetry does not need to be documented here because Query.Farm documents it on its own website. |