Commit 38c9a3f

Improve docs: consistent punctuation and CTA phrasing
Replace em dashes with hyphens for consistency, align "Work with us" section with datahike.io site copy.
1 parent 9ef2270 commit 38c9a3f

12 files changed: +109 -109 lines changed

AGENTS.md

Lines changed: 9 additions & 9 deletions
@@ -149,8 +149,8 @@ User → stratum.api/q
 ```

 **Data representations:**
-- `long[]` / `double[]` heap arrays (JVM GC managed)
-- `PersistentColumnIndex` chunked B-tree with per-chunk statistics and zone maps
+- `long[]` / `double[]` - heap arrays (JVM GC managed)
+- `PersistentColumnIndex` - chunked B-tree with per-chunk statistics and zone maps
 - `String[]` → dictionary-encoded `long[]` for group-by and LIKE

 ## Important Constraints
@@ -203,10 +203,10 @@ clj -M:release:test
 ## Technical Documentation

 See `doc/` for in-depth documentation:
-- [Architecture](doc/architecture.md) System overview, module map, walkthrough
-- [SIMD Internals](doc/simd-internals.md) Java Vector API patterns, JIT lessons
-- [Query Engine](doc/query-engine.md) Dispatch logic, expressions, optimization
-- [Storage and Indices](doc/storage-and-indices.md) Chunks, CoW, zone maps
-- [Benchmarks](doc/benchmarks.md) Methodology, results, reproducing
-- [SQL Interface](doc/sql-interface.md) PgWire server, SQL translation
-- [Anomaly Detection](doc/anomaly-detection.md) Isolation forest training, scoring, online rotation
+- [Architecture](doc/architecture.md) - System overview, module map, walkthrough
+- [SIMD Internals](doc/simd-internals.md) - Java Vector API patterns, JIT lessons
+- [Query Engine](doc/query-engine.md) - Dispatch logic, expressions, optimization
+- [Storage and Indices](doc/storage-and-indices.md) - Chunks, CoW, zone maps
+- [Benchmarks](doc/benchmarks.md) - Methodology, results, reproducing
+- [SQL Interface](doc/sql-interface.md) - PgWire server, SQL translation
+- [Anomaly Detection](doc/anomaly-detection.md) - Isolation forest training, scoring, online rotation

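Reviewer note: the AGENTS.md hunk above lists `String[]` → dictionary-encoded `long[]` as one of the data representations. As a minimal sketch of what dictionary encoding means in general (this is not Stratum's implementation; `dict-encode` and its return shape are hypothetical names chosen for illustration), each distinct string is assigned a small integer code in first-seen order, and the column is stored as those codes:

```clojure
;; Hypothetical helper, NOT part of Stratum's API: illustrates the general
;; idea of dictionary-encoding a String[] column into a long[] of codes.
(defn dict-encode
  "Returns {:dict [distinct strings in first-seen order], :codes long[]}."
  [^"[Ljava.lang.String;" strings]
  (let [n     (alength strings)
        codes (long-array n)]
    (loop [i 0, ids {}, dict []]
      (if (< i n)
        (let [s (aget strings i)]
          (if-let [id (ids s)]
            ;; String seen before: reuse its code.
            (do (aset codes i (long id))
                (recur (inc i) ids dict))
            ;; New string: its code is the current dictionary size.
            (let [id (count dict)]
              (aset codes i (long id))
              (recur (inc i) (assoc ids s id) (conj dict s)))))
        {:dict dict :codes codes}))))

(def encoded (dict-encode (into-array String ["eu" "us" "eu"])))
(vec (:codes encoded)) ;; => [0 1 0]
(:dict encoded)        ;; => ["eu" "us"]
```

With codes in place, a group-by can bucket on the `long` codes, and a LIKE predicate only needs to be evaluated once per dictionary entry rather than once per row.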
NOTEBOOKS.md

Lines changed: 9 additions & 9 deletions
@@ -10,7 +10,7 @@ Interactive notebooks for exploring Stratum's features.
 ;; In a running REPL (clj -M:repl):
 (require '[scicloj.clay.v2.api :as clay])

-;; Render to HTML opens in browser at http://localhost:1971/
+;; Render to HTML - opens in browser at http://localhost:1971/
 (clay/make! {:source-path "notebooks/stratum_intro.clj" :show true})

 ;; Generate Quarto document
@@ -38,17 +38,17 @@ clj -M:dev -i notebooks/test_persistence.clj

 Introduction for Clojure data science practitioners:

-- **Column maps & SQL** DSL query maps and SQL strings, same engine
-- **Tablecloth interop** Pass `tc/dataset` directly, zero copy
-- **Fused SIMD execution** Why it's fast, live timing on 1M rows
-- **Zone map pruning** Range queries skip irrelevant chunks automatically
-- **Persistence** `st/sync!`, `st/fork`, `st/load`, time-travel by commit UUID
-- **Statistics** STDDEV, VARIANCE, CORR natively in a single pass
-- **Hash joins** INNER, LEFT, RIGHT, FULL
+- **Column maps & SQL** - DSL query maps and SQL strings, same engine
+- **Tablecloth interop** - Pass `tc/dataset` directly, zero copy
+- **Fused SIMD execution** - Why it's fast, live timing on 1M rows
+- **Zone map pruning** - Range queries skip irrelevant chunks automatically
+- **Persistence** - `st/sync!`, `st/fork`, `st/load`, time-travel by commit UUID
+- **Statistics** - STDDEV, VARIANCE, CORR natively in a single pass
+- **Hash joins** - INNER, LEFT, RIGHT, FULL

 ### `datahike_integration.clj`

-Datahike + Stratum entity queries alongside OLAP analytics, auto-sync
+Datahike + Stratum - entity queries alongside OLAP analytics, auto-sync
 via `d/listen!`, Yggdrasil composite for atomic snapshots, SQL via PgWire.

 ## Writing Notebooks

README.md

Lines changed: 19 additions & 19 deletions
@@ -9,14 +9,14 @@

 SIMD-accelerated SQL engine for the JVM. Every table is a branchable, copy-on-write value.

-Stratum is a columnar analytics engine that combines the performance of fused SIMD execution with the semantics of immutable data. Tables are persistent values fork one in O(1), modify it independently, persist snapshots to named branches, and time-travel to any previous commit. It's the same model as Clojure's persistent collections and git's object store, applied to analytical data.
+Stratum is a columnar analytics engine that combines the performance of fused SIMD execution with the semantics of immutable data. Tables are persistent values - fork one in O(1), modify it independently, persist snapshots to named branches, and time-travel to any previous commit. It's the same model as Clojure's persistent collections and git's object store, applied to analytical data.

 ## 30-Second Demo

 Start a PostgreSQL-compatible server and query CSV/Parquet files directly:

 ```bash
-# Standalone JAR no Clojure needed, just Java 21+
+# Standalone JAR - no Clojure needed, just Java 21+
 java --add-modules jdk.incubator.vector -jar stratum-standalone.jar --demo

 # Or with your own data
@@ -50,7 +50,7 @@ clj -M:server --demo

 ## Performance

-Stratum's architecture fused SIMD execution over copy-on-write columnar data delivers strong analytical performance.
+Stratum's architecture - fused SIMD execution over copy-on-write columnar data - delivers strong analytical performance.

 Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, 8-core Intel Lunar Lake. Full results in [doc/benchmarks.md](doc/benchmarks.md).

@@ -134,7 +134,7 @@ clj -M:olap cb # ClickBench tier only

 ## Snapshots and Branching

-Every Stratum dataset is a copy-on-write value. Fork one in O(1) to create an isolated branch modifications only touch the changed chunks, everything else is structurally shared. Persist snapshots to named branches, load them back, or time-travel to any previous commit.
+Every Stratum dataset is a copy-on-write value. Fork one in O(1) to create an isolated branch - modifications only touch the changed chunks, everything else is structurally shared. Persist snapshots to named branches, load them back, or time-travel to any previous commit.

 ```clojure
 (require '[stratum.api :as st])
@@ -144,7 +144,7 @@ Every Stratum dataset is a copy-on-write value. Fork one in O(1) to create an is
 :qty (long-array [1 2 3])}
 {:name "orders"}))

-;; O(1) fork structural sharing, independent mutations
+;; O(1) fork - structural sharing, independent mutations
 (def experiment (st/fork ds))

 ;; Persist to storage
@@ -161,9 +161,9 @@ Every Stratum dataset is a copy-on-write value. Fork one in O(1) to create an is

 **DML**: SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT), UPDATE FROM (joined updates), CREATE TABLE, DROP TABLE

-**Joins**: INNER, LEFT, RIGHT, FULL single and multi-column keys
+**Joins**: INNER, LEFT, RIGHT, FULL - single and multi-column keys

-**Window functions**: ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST, LAG, LEAD, SUM/AVG/COUNT/MIN/MAX OVER with PARTITION BY, ORDER BY, and frame clauses
+**Window functions**: ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST, LAG, LEAD, SUM/AVG/COUNT/MIN/MAX OVER - with PARTITION BY, ORDER BY, and frame clauses

 **Subqueries and composition**: CTEs (WITH), correlated and uncorrelated subqueries, IN/NOT IN/EXISTS, set operations (UNION, INTERSECT, EXCEPT)

@@ -201,15 +201,15 @@ Bidirectional support: query `tech.ml.dataset` datasets directly with the Stratu

 ## Query DSL Reference

-> **Note:** The DSL is still a work in progress. SQL strings are the more complete interface use the DSL when you want to compose queries programmatically or pass in Clojure data directly without a SQL layer.
+> **Note:** The DSL is still a work in progress. SQL strings are the more complete interface - use the DSL when you want to compose queries programmatically or pass in Clojure data directly without a SQL layer.

-The DSL is intentionally flat. Every clause resolves column names by keyword lookup against a single merged map: `:from` establishes the base columns, `:join` merges in the dimension table's columns, and all subsequent clauses (`:where`, `:agg`, `:group`, `:select`, `:having`, `:order`) reference any column by its keyword. This makes it straightforward to build queries from Clojure data no quoting, no SQL string interpolation, just maps and vectors. Composition (the DSL equivalent of SQL CTEs/subqueries) is done with Clojure `let`/`def` see [Column Scoping and Composition](doc/query-engine.md#column-scoping-and-composition) for details.
+The DSL is intentionally flat. Every clause resolves column names by keyword lookup against a single merged map: `:from` establishes the base columns, `:join` merges in the dimension table's columns, and all subsequent clauses (`:where`, `:agg`, `:group`, `:select`, `:having`, `:order`) reference any column by its keyword. This makes it straightforward to build queries from Clojure data - no quoting, no SQL string interpolation, just maps and vectors. Composition (the DSL equivalent of SQL CTEs/subqueries) is done with Clojure `let`/`def` - see [Column Scoping and Composition](doc/query-engine.md#column-scoping-and-composition) for details.

 ```clojure
 ;; Full query map
 {:from {:col1 data1 :col2 data2} ;; Column data (arrays, indices, or encoded)
 :join [{:with {:k data} ;; Dimension table columns
-:on [:= :col1 :k] ;; :col1 from :from, :k from :with both visible after join
+:on [:= :col1 :k] ;; :col1 from :from, :k from :with - both visible after join
 :type :inner}]
 :where [[:< :col1 100] [:like :name "%foo%"]] ;; Predicates
 :select [:col1 [:as [:* :col2 100] :pct]] ;; Projection
@@ -234,12 +234,12 @@ The DSL is intentionally flat. Every clause resolves column names by keyword loo

 ## Ecosystem

-Stratum is part of the [Replikativ](https://github.com/replikativ) ecosystem a set of composable, immutable data systems:
+Stratum is part of the [Replikativ](https://github.com/replikativ) ecosystem - a set of composable, immutable data systems:

-- **[Datahike](https://github.com/replikativ/datahike)** immutable graph database with Datalog queries
-- **[Yggdrasil](https://github.com/replikativ/yggdrasil)** branching protocol for multi-system snapshots
-- **[Scriptum](https://github.com/replikativ/scriptum)** full-text search
-- **[Proximum](https://github.com/replikativ/proximum)** vector search
+- **[Datahike](https://github.com/replikativ/datahike)** - immutable graph database with Datalog queries
+- **[Yggdrasil](https://github.com/replikativ/yggdrasil)** - branching protocol for multi-system snapshots
+- **[Scriptum](https://github.com/replikativ/scriptum)** - full-text search
+- **[Proximum](https://github.com/replikativ/proximum)** - vector search

 All share copy-on-write semantics and can be branched together via Yggdrasil.

@@ -268,8 +268,8 @@ User → stratum.api/q
 ```

 **Data representations:**
-- `long[]` / `double[]` heap arrays for raw columnar data
-- `PersistentColumnIndex` chunked B-tree with per-chunk statistics and zone maps
+- `long[]` / `double[]` - heap arrays for raw columnar data
+- `PersistentColumnIndex` - chunked B-tree with per-chunk statistics and zone maps
 - `String[]` → dictionary-encoded `long[]` for group-by and LIKE

 ## Installation
@@ -326,9 +326,9 @@ javac --add-modules jdk.incubator.vector -d target/classes \
 # Restart REPL (JVM can't reload classes)
 ```

-## Commercial Support
+## Work with us

-Need SIMD-accelerated analytics in your JVM stack? We offer integration support, custom development, and commercial licensing. Contact [contact@datahike.io](mailto:contact@datahike.io) or visit [datahike.io](https://datahike.io/about).
+If you need help getting Stratum into production, we can help with integration, custom development, and support contracts. Contact [contact@datahike.io](mailto:contact@datahike.io) or visit [datahike.io](https://datahike.io/about).

 ## License

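Reviewer note: the README hunks above describe forks as O(1) with structural sharing, "the same model as Clojure's persistent collections". A minimal sketch of that model using only `clojure.core` maps and vectors follows. It does not call Stratum's `st/fork`; the names `ds`, `experiment`, and `modified` are illustrative and hold plain Clojure data here.

```clojure
;; Plain Clojure data standing in for a two-column table (assumption:
;; this illustrates the persistent-collection model, not Stratum itself).
(def ds {:price [10.0 20.0 30.0]
         :qty   [1 2 3]})

;; "Forking" a persistent value is just holding another reference: O(1).
(def experiment ds)

;; Changing one column yields a new map; the original is untouched.
(def modified (assoc experiment :qty [1 2 4]))

;; The unchanged :price column is the very same object in both maps.
(identical? (:price ds) (:price modified)) ;; => true

;; The original still sees its own :qty column.
(:qty ds) ;; => [1 2 3]
```

The README states that Stratum applies this idea at chunk granularity, so a modification copies only the chunks it touches.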
doc/anomaly-detection.md

Lines changed: 10 additions & 10 deletions
@@ -47,11 +47,11 @@ Train an isolation forest on columnar data.
 ```

 **Parameters:**
-- `:from` map of keyword to `double[]` or `long[]` columns (required)
-- `:n-trees` number of isolation trees (default 100)
-- `:sample-size` rows subsampled per tree (default 256). Controls tree depth: `ceil(log2(sample-size))`
-- `:seed` random seed for reproducibility
-- `:contamination` expected fraction of anomalies in training data. When set, computes a score threshold automatically from the training score distribution (percentile at `1 - contamination`)
+- `:from` - map of keyword to `double[]` or `long[]` columns (required)
+- `:n-trees` - number of isolation trees (default 100)
+- `:sample-size` - rows subsampled per tree (default 256). Controls tree depth: `ceil(log2(sample-size))`
+- `:seed` - random seed for reproducibility
+- `:contamination` - expected fraction of anomalies in training data. When set, computes a score threshold automatically from the training score distribution (percentile at `1 - contamination`)

 **Returns** a model map containing the flat forest array, metadata, and (if contamination was set) the threshold, training score min/max.

@@ -96,8 +96,8 @@ Prediction confidence based on tree agreement.

 Returns `double[]` in `[0, 1]` where `1.0` means all trees fully agree on the point's isolation depth. Uses the coefficient of variation (CV) of per-tree path lengths: `confidence = 1 / (1 + CV)`.

-- **High confidence** (>0.8): Trees agree the prediction is reliable
-- **Low confidence** (<0.5): Trees disagree the point is in an ambiguous region
+- **High confidence** (>0.8): Trees agree - the prediction is reliable
+- **Low confidence** (<0.5): Trees disagree - the point is in an ambiguous region

 ### `iforest-rotate`

@@ -222,6 +222,6 @@ All inputs are validated against malli schemas (`stratum.specification`):

 ## Related Documentation

-- [Query Engine](query-engine.md) Using anomaly scores in queries
-- [SQL Interface](sql-interface.md) SQL anomaly functions
-- [Architecture](architecture.md) System overview
+- [Query Engine](query-engine.md) - Using anomaly scores in queries
+- [SQL Interface](sql-interface.md) - SQL anomaly functions
+- [Architecture](architecture.md) - System overview

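Reviewer note: two formulas appear in the anomaly-detection hunks above, the tree depth `ceil(log2(sample-size))` and `confidence = 1 / (1 + CV)`. The sketch below works both through in plain Clojure. It is an illustration under stated assumptions, not Stratum's code: `depth-limit` and `confidence` are hypothetical names, and the CV is computed here with the population standard deviation, which the documentation does not specify.

```clojure
;; Hypothetical helpers for illustration only; not Stratum's API.

(defn depth-limit
  "ceil(log2(sample-size)) for sample-size >= 1, computed with integer
   bit operations to avoid floating-point rounding."
  [sample-size]
  (- 64 (Long/numberOfLeadingZeros (long (dec sample-size)))))

(defn confidence
  "1 / (1 + CV) over per-tree path lengths, where CV = stddev / mean.
   Assumption: population standard deviation (divide by n)."
  [path-lengths]
  (let [n        (count path-lengths)
        mean     (/ (reduce + path-lengths) (double n))
        variance (/ (reduce + (map (fn [x] (let [d (- x mean)] (* d d)))
                                   path-lengths))
                    n)
        cv       (/ (Math/sqrt variance) mean)]
    (/ 1.0 (+ 1.0 cv))))

(depth-limit 256)            ;; => 8, the depth implied by the default sample size
(confidence [5.0 5.0 5.0])   ;; => 1.0, all trees agree exactly
(confidence [2.0 4.0])       ;; roughly 0.75 (CV = 1/3)
```

Identical path lengths give a CV of zero and therefore a confidence of `1.0`, which matches the documented meaning of the upper bound.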
doc/architecture.md

Lines changed: 7 additions & 7 deletions
@@ -65,7 +65,7 @@ A persistent sorted set (PSS) tree of `ChunkEntry` records, each containing:
 - **PersistentColChunk**: CoW wrapper around a `long[]` or `double[]` (8192 elements default)
 - **ChunkStats**: per-chunk count, sum, sum-of-squares, min, max

-Indices support O(1) fork via structural sharing and copy-on-write on mutation. The query engine can stream over chunks without materializing the full array (64KB per chunk fits L2 cache). When persisted, the PSS tree is stored in konserve and lazy-loaded on demand opening a billion-row index costs nothing until chunks are actually accessed.
+Indices support O(1) fork via structural sharing and copy-on-write on mutation. The query engine can stream over chunks without materializing the full array (64KB per chunk fits L2 cache). When persisted, the PSS tree is stored in konserve and lazy-loaded on demand - opening a billion-row index costs nothing until chunks are actually accessed.

 ### Dictionary-Encoded Strings

@@ -127,9 +127,9 @@ Total time: ~4ms single-threaded, ~1ms multi-threaded (6M rows).

 ## Related Documentation

-- [SIMD Internals](simd-internals.md) Java Vector API patterns, fused filter+aggregate, morsel-driven parallelism
-- [Query Engine](query-engine.md) Dispatch logic, expression evaluation, optimization
-- [Storage and Indices](storage-and-indices.md) Chunks, CoW semantics, zone maps, Konserve
-- [Benchmarks](benchmarks.md) Methodology, results, reproducing
-- [SQL Interface](sql-interface.md) PgWire server, SQL translation, supported subset
-- [Anomaly Detection](anomaly-detection.md) Isolation forest training, scoring, online rotation
+- [SIMD Internals](simd-internals.md) - Java Vector API patterns, fused filter+aggregate, morsel-driven parallelism
+- [Query Engine](query-engine.md) - Dispatch logic, expression evaluation, optimization
+- [Storage and Indices](storage-and-indices.md) - Chunks, CoW semantics, zone maps, Konserve
+- [Benchmarks](benchmarks.md) - Methodology, results, reproducing
+- [SQL Interface](sql-interface.md) - PgWire server, SQL translation, supported subset
+- [Anomaly Detection](anomaly-detection.md) - Isolation forest training, scoring, online rotation

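Reviewer note: the architecture hunks above mention per-chunk statistics (min, max, and so on), zone maps, and 64KB chunks of 8192 elements. The sketch below shows the general zone-map idea in plain Clojure: keep min/max per chunk and skip any chunk whose range cannot overlap the query range. It is an illustration, not Stratum's `PersistentColumnIndex`; `chunk-stats`, `build-index`, `may-match?`, and `count-in-range` are hypothetical names.

```clojure
;; Hypothetical helpers for illustration only; not Stratum's API.

(defn chunk-stats
  "Per-chunk statistics; only min and max are needed for pruning."
  [chunk]
  {:count (count chunk)
   :min   (reduce min chunk)
   :max   (reduce max chunk)})

(defn build-index
  "Splits xs into fixed-size chunks and attaches statistics to each."
  [xs chunk-size]
  (mapv (fn [chunk] {:stats (chunk-stats chunk) :data (vec chunk)})
        (partition-all chunk-size xs)))

(defn may-match?
  "True when the chunk's [min, max] overlaps the query range [lo, hi]."
  [{lo-c :min hi-c :max} lo hi]
  (and (<= lo hi-c) (<= lo-c hi)))

(defn count-in-range
  "Counts values in [lo, hi], scanning only chunks the zone map cannot rule out."
  [index lo hi]
  (let [candidates (filter (fn [entry] (may-match? (:stats entry) lo hi)) index)]
    {:chunks-scanned (count candidates)
     :matches (reduce + 0 (map (fn [entry]
                                 (count (filter (fn [x] (<= lo x hi))
                                                (:data entry))))
                               candidates))}))

;; 100 values in 10 chunks of 10; the range 25..34 touches only two chunks.
(count-in-range (build-index (range 100) 10) 25 34)
;; => {:chunks-scanned 2, :matches 10}

;; The 64KB figure follows from 8192 eight-byte elements per chunk.
(def bytes-per-chunk (* 8192 Long/BYTES)) ;; => 65536
```

Pruning pays off most when values are clustered by chunk, as with sorted or time-ordered columns; on randomly ordered data most chunks overlap most ranges.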
doc/benchmarks.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ Standard decision-support queries on TPC-H lineitem data (6M rows from CSV).
 | B1 | TPC-H Q6: filter + SUM(price*discount) | **12.9ms** | 7.3ms | 27.9ms | 5.4ms | **2.2x** |
 | B2 | TPC-H Q1: GROUP BY + 7 aggregates | **74.6ms** | 23.4ms | 92.5ms | 16.8ms | **1.2x** |
 | B3 | SSB Q1.1: filter + SUM(price*discount) | **12.9ms** | 4.8ms | 28.3ms | 5.7ms | **2.2x** |
-| B4 | COUNT(*) no filter | **0.1ms** | | 0.4ms | 0.3ms | **4.0x** |
+| B4 | COUNT(*) no filter | **0.1ms** | - | 0.4ms | 0.3ms | **4.0x** |
 | B5 | Filtered COUNT (NEQ predicate) | **3.1ms** | 1.7ms | 12.2ms | 2.9ms | **4.0x** |
 | B6 | Low-cardinality GROUP BY + COUNT | **16.9ms** | 7.3ms | 24.0ms | 4.6ms | **1.4x** |
 | SSB-Q1.2 | Tighter filter + SUM(price*discount) | **12.5ms** | 4.8ms | 23.3ms | 4.5ms | **1.9x** |

doc/dataset.md

Lines changed: 2 additions & 2 deletions
@@ -69,14 +69,14 @@ Only index-backed columns support persistence (`st/sync!`) and O(1) forking (`st
 Like Clojure collections, mutations require transient mode:

 ```clojure
-;; CORRECT transient → mutate → persistent
+;; CORRECT - transient → mutate → persistent
 (-> ds
 dataset/ds-transient
 (dataset/ds-set! :price 0 99.0)
 (dataset/ds-append! {:price 40.0 :qty 4})
 dataset/ds-persistent!)

-;; WRONG - will throw IllegalStateException
+;; WRONG - will throw IllegalStateException
 (dataset/ds-set! ds :price 0 99.0)
 ```

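Reviewer note: the dataset.md hunk above says mutations follow the transient → mutate → persistent pattern "like Clojure collections". For readers unfamiliar with that pattern, here is the `clojure.core` analogue on a plain vector. This shows only the core library, not Stratum's `dataset/ds-*` functions, and in core Clojure the failure mode for mutating a persistent value is a `ClassCastException` rather than the `IllegalStateException` the hunk documents for Stratum.

```clojure
;; clojure.core transients on a plain vector (not Stratum's dataset API).
(def v [1 2 3])

;; transient -> mutate -> persistent!
(def v2
  (-> v
      transient
      (assoc! 0 99)   ; overwrite index 0
      (conj! 4)       ; append
      persistent!))

v2 ;; => [99 2 3 4]
v  ;; => [1 2 3], the original is unchanged

;; Calling a transient operation on a persistent vector fails.
(def mutation-error
  (try
    (assoc! v 0 99)
    nil
    (catch ClassCastException e e)))
```

After evaluation, `mutation-error` holds the caught exception, confirming that the bang operations are only valid between `transient` and `persistent!`.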