README.md
SIMD-accelerated SQL engine for the JVM. Every table is a branchable, copy-on-write value.
Stratum is a columnar analytics engine that combines the performance of fused SIMD execution with the semantics of immutable data. Tables are persistent values - fork one in O(1), modify it independently, persist snapshots to named branches, and time-travel to any previous commit. It's the same model as Clojure's persistent collections and git's object store, applied to analytical data.
## 30-Second Demo
Start a PostgreSQL-compatible server and query CSV/Parquet files directly:
```bash
# Standalone JAR - no Clojure needed, just Java 21+
```
Stratum's architecture - fused SIMD execution over copy-on-write columnar data - delivers strong analytical performance.
Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, 8-core Intel Lunar Lake. Full results in [doc/benchmarks.md](doc/benchmarks.md).
Every Stratum dataset is a copy-on-write value. Fork one in O(1) to create an isolated branch - modifications only touch the changed chunks, everything else is structurally shared. Persist snapshots to named branches, load them back, or time-travel to any previous commit.
```clojure
(require '[stratum.api :as st])
```
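A sketch of that branching workflow in code. The function names below (`st/dataset`, `st/fork`, `st/append`, `st/commit!`) are hypothetical placeholders for whatever `stratum.api` actually exposes, shown only to make the model concrete:

```clojure
(require '[stratum.api :as st])

;; All function names below are hypothetical placeholders, not the confirmed API.
(def base    (st/dataset {:user (long-array [1 2 3])}))
(def branch  (st/fork base))                 ;; O(1): chunks are structurally shared
(def branch2 (st/append branch {:user 4}))   ;; CoW: only the touched chunk is copied
(st/commit! branch2 "experiment")            ;; persist a snapshot to a named branch
;; base is untouched; time-travel loads any earlier commit of a branch by id
```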
**DML**: SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT), UPDATE FROM (joined updates), CREATE TABLE, DROP TABLE
**Joins**: INNER, LEFT, RIGHT, FULL - single and multi-column keys
**Window functions**: ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST, LAG, LEAD, SUM/AVG/COUNT/MIN/MAX OVER - with PARTITION BY, ORDER BY, and frame clauses
**Subqueries and composition**: CTEs (WITH), correlated and uncorrelated subqueries, IN/NOT IN/EXISTS, set operations (UNION, INTERSECT, EXCEPT)
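To illustrate the window-function support, here is a hedged example; the `sales` table and its columns are invented for demonstration:

```sql
-- Hypothetical table and columns, for illustration only
SELECT region,
       sale_date,
       amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM sales;
```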
## Query DSL Reference
> **Note:** The DSL is still a work in progress. SQL strings are the more complete interface - use the DSL when you want to compose queries programmatically or pass in Clojure data directly without a SQL layer.
The DSL is intentionally flat. Every clause resolves column names by keyword lookup against a single merged map: `:from` establishes the base columns, `:join` merges in the dimension table's columns, and all subsequent clauses (`:where`, `:agg`, `:group`, `:select`, `:having`, `:order`) reference any column by its keyword. This makes it straightforward to build queries from Clojure data - no quoting, no SQL string interpolation, just maps and vectors. Composition (the DSL equivalent of SQL CTEs/subqueries) is done with Clojure `let`/`def` - see [Column Scoping and Composition](doc/query-engine.md#column-scoping-and-composition) for details.
```clojure
;; Full query map
{:from {:col1 data1 :col2 data2} ;; Column data (arrays, indices, or encoded)
 ...}
```
If you need help getting Stratum into production, we can help with integration, custom development, and support contracts. Contact [contact@datahike.io](mailto:contact@datahike.io) or visit [datahike.io](https://datahike.io/about).
doc/anomaly-detection.md
Train an isolation forest on columnar data.
**Parameters:**
- `:from` - map of keyword to `double[]` or `long[]` columns (required)
- `:n-trees` - number of isolation trees (default 100)
- `:sample-size` - rows subsampled per tree (default 256). Controls tree depth: `ceil(log2(sample-size))`
- `:seed` - random seed for reproducibility
- `:contamination` - expected fraction of anomalies in training data. When set, computes a score threshold automatically from the training score distribution (percentile at `1 - contamination`)
**Returns** a model map containing the flat forest array, metadata, and, if contamination was set, the threshold and the training-score min/max.
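Putting the parameters together, a training call might look like the following sketch. The entry-point name `iforest-train` and the sample data are assumptions for illustration; check the actual namespace for the real function name:

```clojure
;; Hypothetical call shape - the entry-point name is a placeholder.
(def model
  (iforest-train
    {:from          {:latency (double-array [1.0 1.2 0.9 55.0])
                     :errors  (long-array   [0 1 0 40])}
     :n-trees       100      ;; default
     :sample-size   256      ;; tree depth = ceil(log2 256) = 8
     :seed          42
     :contamination 0.01}))  ;; auto-threshold at the 99th percentile of training scores
```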
57
57
@@ -96,8 +96,8 @@ Prediction confidence based on tree agreement.
96
96
97
97
Returns `double[]` in `[0, 1]` where `1.0` means all trees fully agree on the point's isolation depth. Uses the coefficient of variation (CV) of per-tree path lengths: `confidence = 1 / (1 + CV)`.
- **High confidence** (>0.8): Trees agree - the prediction is reliable
- **Low confidence** (<0.5): Trees disagree - the point is in an ambiguous region
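The coefficient-of-variation formula is easy to reproduce. A minimal, self-contained sketch in plain Clojure, independent of Stratum's actual implementation:

```clojure
(defn path-confidence
  "Confidence = 1 / (1 + CV), where CV is stddev/mean of per-tree path lengths.
  Sketch only - not Stratum's implementation."
  [path-lengths]
  (let [n    (double (count path-lengths))
        mean (/ (reduce + path-lengths) n)
        var  (/ (reduce + (map (fn [x] (let [d (- x mean)] (* d d))) path-lengths)) n)
        cv   (/ (Math/sqrt var) mean)]
    (/ 1.0 (+ 1.0 cv))))

(path-confidence [8.0 8.0 8.0])  ;; => 1.0  (all trees agree, so CV = 0)
(path-confidence [2.0 8.0 14.0]) ;; trees disagree: CV > 0, confidence well below 1
```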
### `iforest-rotate`
All inputs are validated against malli schemas (`stratum.specification`).
## Related Documentation
- [Query Engine](query-engine.md) - Using anomaly scores in queries
doc/architecture.md

A persistent sorted set (PSS) tree of `ChunkEntry` records, each containing:
- **PersistentColChunk**: CoW wrapper around a `long[]` or `double[]` (8192 elements default)
- **ChunkStats**: per-chunk count, sum, sum-of-squares, min, max
Indices support O(1) fork via structural sharing and copy-on-write on mutation. The query engine can stream over chunks without materializing the full array (64KB per chunk fits L2 cache). When persisted, the PSS tree is stored in konserve and lazy-loaded on demand - opening a billion-row index costs nothing until chunks are actually accessed.
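Per-chunk statistics of this kind can be maintained in a single pass over a chunk's backing array. A minimal sketch - the map shape here is illustrative, not Stratum's actual `ChunkStats` record:

```clojure
(defn chunk-stats
  "One-pass stats over a chunk's backing array.
  Illustrative map shape, not Stratum's ChunkStats record."
  [^doubles xs]
  (areduce xs i acc
           {:count 0, :sum 0.0, :sum-sq 0.0,
            :min Double/POSITIVE_INFINITY, :max Double/NEGATIVE_INFINITY}
           (let [x (aget xs i)]
             {:count  (inc (:count acc))
              :sum    (+ (:sum acc) x)
              :sum-sq (+ (:sum-sq acc) (* x x))
              :min    (min (:min acc) x)
              :max    (max (:max acc) x)})))

(chunk-stats (double-array [1.0 2.0 3.0]))
;; => {:count 3, :sum 6.0, :sum-sq 14.0, :min 1.0, :max 3.0}
```

Keeping sum and sum-of-squares per chunk lets count/sum/avg/variance aggregates and min/max range pruning be answered without touching the chunk data at all.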