Commit 9c128d4

fix: build fix
1 parent bfb2216 commit 9c128d4

2 files changed

+158
-0
lines changed

CLAUDE.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a DuckDB extension called "stochastic", developed by Query.Farm, that adds statistical distribution functions to DuckDB. It enables statistical analysis, probability calculations, and random sampling directly within SQL queries.

The extension supports 20+ probability distributions (normal, beta, binomial, Poisson, etc.) with functions for PDF/PMF, CDF, quantile, sampling, and distribution properties (mean, variance, etc.).

## Build Commands

This project uses a Makefile wrapper around CMake. Run all build commands from the project root:

```bash
# Build the release version (default); required before running the tests
VCPKG_TOOLCHAIN_PATH=$(pwd)/vcpkg/scripts/buildsystems/vcpkg.cmake GEN=ninja make release

# Build the debug version
VCPKG_TOOLCHAIN_PATH=$(pwd)/vcpkg/scripts/buildsystems/vcpkg.cmake GEN=ninja make debug

# Run tests (uses the release build)
make test

# Run tests with the debug build
make test_debug
```

### Running a Single Test

To run a specific test file:
```bash
./build/release/test/unittest "test/sql/normal.test"
```

All extension functions should be documented inside DuckDB with `CreateScalarFunctionInfo`, `CreateAggregateFunctionInfo`, or the appropriate info type for the function. The documentation should include examples, parameter names, and parameter types, and each function should be assigned a category.

Whenever you make a change, update the version to the current date plus a two-digit ordinal counter, in the form `YYYYMMDDCC` (for example, `2025120401`). The version string is `STOCHASTIC_VERSION` in `src/include/version.hpp`.

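To make the `YYYYMMDDCC` rule concrete, here is a minimal C++ sketch of building and checking such a string. Both helpers (`make_version`, `is_valid_version`) are hypothetical illustrations and are not part of the extension; the range check is deliberately coarse.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the extension): builds a YYYYMMDDCC
// version string from a date and a per-day ordinal counter.
std::string make_version(int year, int month, int day, int counter) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%04d%02d%02d%02d", year, month, day, counter);
    return buf;
}

// Hypothetical helper: checks the YYYYMMDDCC shape (ten digits, plausible
// month and day). It does not validate the number of days per month.
bool is_valid_version(const std::string &v) {
    if (v.size() != 10) {
        return false;
    }
    for (unsigned char c : v) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    const int month = std::stoi(v.substr(4, 2));
    const int day = std::stoi(v.substr(6, 2));
    return month >= 1 && month <= 12 && day >= 1 && day <= 31;
}
```
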
## Architecture

### Extension Structure

The extension follows the DuckDB extension template structure with these key components:

**Entry Point (`src/stochastic_extension.cpp`)**: Registers all distribution functions via `Load_<distribution>_distribution()` functions. It also includes Query.Farm telemetry, which can be disabled via the `QUERY_FARM_TELEMETRY_OPT_OUT` environment variable.

**Distribution Implementations (`src/distribution_*.cpp`)**: Each distribution is in its own file (e.g., `distribution_normal.cpp`, `distribution_beta.cpp`). Each file:
- Uses preprocessor macros to define the distribution name and type mappings
- Specializes the `distribution_traits` template for both the Boost.Math and Boost.Random versions
- Registers all functions (sample, pdf, cdf, quantile, mean, variance, etc.) using the `REGISTER` macro

**Function Registration System (`src/include/utils.hpp`)**: Template-based system that:
- Uses `distribution_traits` to determine parameter types and names
- Automatically generates function names with the pattern `dist_<prefix>_<function>` (e.g., `dist_normal_pdf`)
- Handles both unary (1 parameter) and binary (2 parameter) distributions
- Optimizes for constant vectors by creating the distribution once and reusing it

**RNG Management (`src/rng_utils.cpp`, `src/include/rng_utils.hpp`)**: Thread-local Mersenne Twister RNG (`boost::random::mt19937`) with:
- Fixed global seed (12345) for reproducibility
- Unique per-thread seeding to ensure thread safety
- Thread index mapping for deterministic parallel execution

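The RNG scheme described above can be sketched roughly as follows. This is not the code in `rng_utils.cpp`: it uses `std::mt19937` as a stand-in for `boost::random::mt19937` so it compiles without Boost, and the rule "global seed plus thread index" is an assumption chosen for illustration. The names `next_thread_index`, `seed_for_thread`, and `thread_rng` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <random>

constexpr std::uint32_t kGlobalSeed = 12345;  // fixed global seed from the text

// Hands out 0, 1, 2, ... to threads in the order they first ask.
inline std::uint32_t next_thread_index() {
    static std::atomic<std::uint32_t> counter{0};
    return counter.fetch_add(1);
}

// Assumed mixing rule: each thread's seed is the global seed plus its index,
// so seeds are distinct per thread but depend only on the thread index.
inline std::uint32_t seed_for_thread(std::uint32_t thread_index) {
    return kGlobalSeed + thread_index;
}

// One generator per thread, seeded on first use. After that, drawing numbers
// touches only thread-local state, so no lock is needed.
inline std::mt19937 &thread_rng() {
    thread_local std::mt19937 rng(seed_for_thread(next_thread_index()));
    return rng;
}
```
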
**Type System (`src/include/distribution_traits.hpp`, `src/include/callable_traits.hpp`)**:
- `distribution_traits`: Maps C++ types to DuckDB LogicalTypes
- `callable_traits`: Template metaprogramming for extracting function signatures
- `logical_type_map`: Supports double, int64_t, and pair<double,double> (for ranges)

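As an illustration of how a compile-time type map of this kind can work, here is a small sketch. It is a stand-in, not the real `logical_type_map`: it returns type names as text instead of DuckDB `LogicalType` values so that it compiles without DuckDB headers, and the mapping chosen for `std::pair<double, double>` is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Primary template is declared but not defined, so using an unsupported
// C++ type is a compile-time error rather than a runtime surprise.
template <typename T>
struct logical_type_map;

template <>
struct logical_type_map<double> {
    static std::string name() { return "DOUBLE"; }
};

template <>
struct logical_type_map<std::int64_t> {
    static std::string name() { return "BIGINT"; }
};

// Assumption: a range is shown here as a two-element DOUBLE array.
template <>
struct logical_type_map<std::pair<double, double>> {
    static std::string name() { return "DOUBLE[2]"; }
};
```
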
### Adding a New Distribution

To add a new distribution:

1. Create `src/distribution_<name>.cpp` following the pattern in existing files
2. Define the distribution traits with parameter names, types, and validation
3. Register all desired functions (sample, pdf, cdf, etc.) in the `Load_<name>_distribution` function
4. Add the source file to `EXTENSION_SOURCES` in `CMakeLists.txt`
5. Add a forward declaration and a call to `Load_<name>_distribution()` in `src/stochastic_extension.cpp`
6. Create a test file `test/sql/<name>.test` with comprehensive test cases

### Testing

Tests use DuckDB's `.test` format (see `test/sql/normal.test` for examples). Test files:
- Start with metadata comments (`# name:`, `# description:`, `# group:`)
- Use `require stochastic` to ensure the extension is loaded
- Support `statement error`, `query R` (float), `query I` (integer), `query T` (text)
- Can test multiple columns with `query RR`, `query RRR`, etc.

### Dependencies

The extension depends on:
- **Boost.Math**: For statistical distribution calculations (PDF, CDF, quantile)
- **Boost.Random**: For random number generation from distributions
- Dependencies are managed via vcpkg (see `vcpkg.json`)

### Build System

The project uses:
- **CMakeLists.txt**: Defines extension sources and links Boost libraries
- **Makefile**: Thin wrapper that delegates to `extension-ci-tools/makefiles/duckdb_extension.Makefile`
- **extension_config.cmake**: Tells the DuckDB build system to load the stochastic extension
- **extension-ci-tools/**: Git submodule with shared DuckDB extension build infrastructure

The `duckdb/` directory is a git submodule containing the DuckDB source code that the extension builds against.

### Function Naming Convention

All functions follow the pattern: `dist_<distribution>_<operation>`

Examples:
- `dist_normal_sample(mean, stddev)` - Generate random sample
- `dist_normal_pdf(mean, stddev, x)` - Probability density at x
- `dist_normal_cdf(mean, stddev, x)` - Cumulative distribution at x
- `dist_normal_quantile(mean, stddev, p)` - Inverse CDF (p-th quantile)
- `dist_normal_mean(mean, stddev)` - Distribution mean
- `dist_normal_variance(mean, stddev)` - Distribution variance

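For reference, the quantities behind a few of these names can be written out with the standard normal-distribution formulas. The sketch below mirrors the argument order from the list above (distribution parameters first, then the evaluation point). It is not the extension's implementation, which relies on Boost.Math; the C++ function names are hypothetical stand-ins for the SQL functions.

```cpp
#include <cassert>
#include <cmath>

// Density of Normal(mean, stddev) at x:
// exp(-z^2 / 2) / (stddev * sqrt(2 * pi)), with z = (x - mean) / stddev.
double normal_pdf(double mean, double stddev, double x) {
    const double kPi = 3.14159265358979323846;
    const double z = (x - mean) / stddev;
    return std::exp(-0.5 * z * z) / (stddev * std::sqrt(2.0 * kPi));
}

// Cumulative probability P(X <= x), expressed with the complementary
// error function: 0.5 * erfc(-z / sqrt(2)).
double normal_cdf(double mean, double stddev, double x) {
    const double z = (x - mean) / stddev;
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// The variance is the square of the standard deviation; the mean parameter
// is accepted only to keep the same argument list as the other functions.
double normal_variance(double /*mean*/, double stddev) {
    return stddev * stddev;
}
```
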
### Code Organization

```
src/
├── stochastic_extension.cpp    # Extension entry point
├── distribution_*.cpp          # Individual distribution implementations (20+ files)
├── rng_utils.cpp               # Thread-local RNG management
├── query_farm_telemetry.cpp    # Optional telemetry
└── include/
    ├── utils.hpp               # Template-based function registration system
    ├── distribution_traits.hpp # Type system for distributions
    ├── callable_traits.hpp     # Template metaprogramming utilities
    ├── rng_utils.hpp           # RNG declarations
    └── ...

test/sql/
└── *.test                      # DuckDB test files

duckdb/                         # Git submodule - DuckDB source
extension-ci-tools/             # Git submodule - shared build infrastructure
```

### Key Implementation Details

**Constant Vector Optimization**: When all parameters are constant, the distribution object is created once and reused for all rows (see `DistributionSampleUnary` and `DistributionSampleBinary` in `utils.hpp`).

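The idea can be sketched in isolation as two sampling loops, one for constant parameters and one for per-row parameters. This is an illustration only: it uses `std::normal_distribution` and `std::mt19937` instead of Boost, operates on plain `std::vector` rather than DuckDB vectors, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Constant parameters: the distribution object is built once, outside the
// loop, and reused for every row.
std::vector<double> sample_constant(double mean, double stddev, std::size_t rows,
                                    std::mt19937 &rng) {
    std::normal_distribution<double> dist(mean, stddev);
    std::vector<double> out(rows);
    for (double &value : out) {
        value = dist(rng);
    }
    return out;
}

// Per-row parameters: a distribution has to be built for each row.
// Precondition: means and stddevs have the same length.
std::vector<double> sample_per_row(const std::vector<double> &means,
                                   const std::vector<double> &stddevs,
                                   std::mt19937 &rng) {
    std::vector<double> out(means.size());
    for (std::size_t i = 0; i < means.size(); ++i) {
        std::normal_distribution<double> dist(means[i], stddevs[i]);
        out[i] = dist(rng);
    }
    return out;
}
```
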
**Parameter Validation**: Each distribution validates its parameters in `distribution_traits::ValidateParameters()` and throws `InvalidInputException` for invalid inputs.

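A validation hook of this shape might look like the sketch below. It is a stand-in rather than the extension's code: `normal_traits_sketch` is a hypothetical name, `std::invalid_argument` takes the place of DuckDB's `InvalidInputException` so the example compiles on its own, and the specific rules (finite mean, positive finite standard deviation) are assumptions for the normal distribution.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Stand-in for a distribution_traits specialization.
struct normal_traits_sketch {
    // Throws if the parameters cannot describe a normal distribution.
    static void ValidateParameters(double mean, double stddev) {
        if (!std::isfinite(mean)) {
            throw std::invalid_argument("normal: mean must be finite");
        }
        if (!std::isfinite(stddev) || stddev <= 0.0) {
            throw std::invalid_argument("normal: stddev must be finite and > 0");
        }
    }
};

// Convenience wrapper for callers that want a yes/no answer:
// returns true when validation rejects the parameters.
bool rejects(double mean, double stddev) {
    try {
        normal_traits_sketch::ValidateParameters(mean, stddev);
        return false;
    } catch (const std::invalid_argument &) {
        return true;
    }
}
```
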
**Thread Safety**: The thread-local RNG ensures thread-safe random number generation without locks in the hot path.

**Telemetry**: The extension sends anonymous usage telemetry to Query.Farm once per extension load, including the DuckDB version, platform, and extension version. Users can opt out via the `QUERY_FARM_TELEMETRY_OPT_OUT` environment variable. The telemetry does not need to be documented here because Query.Farm documents it on its own website.

src/include/version.hpp

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
#pragma once

namespace duckdb {

// Extension version in format YYYYMMDDCC (date + ordinal counter)
// Update this version when making changes
constexpr const char *STOCHASTIC_VERSION = "2025120401";

} // namespace duckdb
