Commit c9943f6

Merge branch 'storage'
2 parents e484f4e + cf50876 commit c9943f6

20 files changed: +1390 -293 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
 # WIP
 
+- JVM: Storage! `d/storage`, `d/store`, `d/restore`, `d/addresses`, `d/collect-garbage`, `d/file-storage`, `d/restore-conn`. See [docs/storage.md](docs/storage.md) for details
+- `d/settings` and per-database `:branching-factor` (passed via :opts)
 - New API: `d/find-datom`. Works like `d/datoms`, but only returns single datom, but is faster than `(first (d/datoms ...))`
-- Optimized various parts of CLJS version related to compilation and index access
+- CLJS: Optimized various parts of CLJS version related to compilation and index access
 
 # 1.4.2

README.md

Lines changed: 3 additions & 0 deletions
@@ -46,6 +46,9 @@ Books:
 Docs:
 
 - API Docs [![cljdoc badge](https://cljdoc.org/badge/datascript/datascript)](https://cljdoc.org/d/datascript/datascript/CURRENT)
+- [docs/queries.md](docs/queries.md)
+- [docs/tuples.md](docs/tuples.md)
+- [docs/storage.md](docs/storage.md)
 - [Getting started](https://github.com/tonsky/datascript/wiki/Getting-started)
 - [Tutorials](https://github.com/kristianmandrup/datascript-tutorial)
 - [Tips & tricks](https://github.com/tonsky/datascript/wiki/Tips-&-tricks)

bench/datascript/bench/datascript.cljc

Lines changed: 2 additions & 1 deletion
@@ -246,7 +246,8 @@
 (defn ^:export -main
   "clj -A:bench -M -m datascript.bench.datascript [--profile] (add-1 | add-5 | ...)*"
   [& args]
-  (let [profile? (.contains (or args ()) "--profile")
+  (let [args     (or args ())
+        profile? (.contains ^java.util.List args "--profile")
         args     (remove #{"--profile"} args)
         names    (or (not-empty args) (sort (keys benches)))
         _        (apply println #?(:clj "CLJ:" :cljs "CLJS:") names)

deps.edn

Lines changed: 14 additions & 1 deletion
@@ -1,6 +1,6 @@
 {
   :deps {
-    persistent-sorted-set/persistent-sorted-set {:mvn/version "0.2.3"}
+    persistent-sorted-set/persistent-sorted-set {:mvn/version "0.3.0"}
   }
 
   :aliases {
@@ -16,6 +16,19 @@
       }
     }
 
+    :1.11.1 {
+      :override-deps {
+        org.clojure/clojure {:mvn/version "1.11.1"}
+      }
+    }
+
+    :dev {
+      :extra-paths ["dev"]
+      :extra-deps {
+        org.clojure/tools.namespace {:mvn/version "1.3.0"}
+      }
+    }
+
     :test {
       :extra-paths ["test"]
       :extra-deps {

dev/data_readers.clj

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
{p user/p}

dev/user.clj

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
(ns user
  (:require
    [clojure.tools.namespace.repl :as ns]))

(ns/set-refresh-dirs "src" "bench" "test" #_"bench_datomic" #_"test_datomic")

(defn reload []
  (set! *warn-on-reflection* true)
  (let [res (ns/refresh)]
    (if (instance? Throwable res)
      (do
        (.printStackTrace ^Throwable res)
        (throw res))
      res)))

(def lock
  (Object.))

(defn position []
  (let [trace (->> (Thread/currentThread)
                (.getStackTrace)
                (seq))
        el    ^StackTraceElement (nth trace 4)]
    (str "[" (clojure.lang.Compiler/demunge (.getClassName el)) " " (.getFileName el) ":" (.getLineNumber el) "]")))

(defn p [form]
  `(let [t#   (System/currentTimeMillis)
         res# ~form]
     (locking lock
       (println (str "#p" (position) " " '~form " => (" (- (System/currentTimeMillis) t#) " ms) " res#)))
     res#))

docs/storage.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
# Storing DataScript on disk

There are three ways to store a DataScript database on disk. The first two are really just serialization: they turn your DB into a string, which you can then store on disk or wherever you want.

## print/read

The oldest and slowest way. Works in both Clojure and ClojureScript. Just

```
(def string
  (pr-str db))
```

and later

```
(read-string string)
```

or, better,

```
(clojure.edn/read-string
  {:readers d/data-readers}
  string)
```

Upsides:

- Always works.

Downsides:

- Not incremental, always stores entire db.
- EDN serialization only.
- Reading back is slow.
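
To actually get that string onto disk, a pair of helpers around `spit` and `slurp` is enough. This is a minimal sketch, not part of DataScript: `save-edn!` and `load-edn` are hypothetical names, and the wiring in the `comment` form assumes `datascript.core` is required as `d` and that a `db` already exists.

```clojure
(require '[clojure.edn :as edn])

(defn save-edn!
  "Prints `x` with pr-str and writes the whole string to `path`."
  [path x]
  (spit path (pr-str x)))

(defn load-edn
  "Reads one EDN value back from `path`, resolving tagged literals via `readers`."
  [path readers]
  (edn/read-string {:readers readers} (slurp path)))

(comment
  ;; Assumes (require '[datascript.core :as d]) and an existing `db`.
  (save-edn! "/tmp/db.edn" db)
  (def db' (load-edn "/tmp/db.edn" d/data-readers)))
```

Every save rewrites the whole file, which is exactly the "not incremental" downside listed above.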

## serializable/from-serializable

An interesting idea I don’t think I’ve seen anywhere else: bring your own serialization. Basically, when you call

```
(d/serializable db)
```

it returns a data structure that is serialization-friendly, even for JSON (it doesn’t use keywords, for example):

```
{"count" 1, "tx0" 536870912, "max-eid" 1, "max-tx" 536870913, "schema" "nil", "attrs" [":name"], "keywords" [], "eavt" [[1 0 "Ivan" 1]], "aevt" [0], "avet" [], "branching-factor" 512, "ref-type" "soft"}
```

After that, you are free to call

```
(cheshire/generate-string
  (d/serializable db))
```

or `pr-str`, or `transit/write`. Doesn’t matter, really. Up to you. It’s also pretty convenient, because DS itself doesn’t have to depend on these serialization libraries.

When it’s time to restore, you do the same procedure in reverse:

```
(def db
  (d/from-serializable
    (cheshire/parse-string
      string)))
```

What’s good about this approach is that it’s also faster to deserialize than the print/read method.

Upsides:

- Works in both Clojure and ClojureScript.
- Can choose your own serialization format and implementation.
- Faster than print/read method.

Downsides:

- Not incremental, always stores entire db.

## Storage (NEW! HOT! BEST!)

This approach pretends we are a real database and does things properly: incrementally and lazily, just like the big boys. JVM only, at least for now. The rest of the article discusses this approach.

Storing a database is easy. First, you have to implement the `datascript.storage/IStorage` protocol:

```
(def storage
  (reify datascript.storage/IStorage
    (-store [_ addr+data-seq]
      (doseq [[addr data] addr+data-seq]
        ;; ... serialize and store <addr> -> <data> somehow ...
        ))
    (-restore [_ addr]
      ;; ... load and de-serialize <data> stored at <addr>
      )))
```

`-store` is batched so that you can, you know, wrap it in a transaction or something. `addr` is a 64-bit number. `data` is EDN-serializable: it contains vectors, maps, integers, keywords and whatever you yourself put into values. So, not JSON-safe, but EDN-safe (as long as you don’t put weird stuff in values).
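
To make the two methods concrete, here is a minimal in-memory sketch. The helper names `store-batch!` and `restore-addr` are hypothetical, and `pr-str` plus `clojure.edn/read-string` is only one possible serialization choice; the `comment` form shows how they could be plugged into the protocol, assuming DataScript is on the classpath.

```clojure
(require '[clojure.edn :as edn])

(defn store-batch!
  "Serializes every [addr data] pair and commits the whole batch in one swap!,
   so readers never observe a half-written batch."
  [state addr+data-seq]
  (swap! state into (for [[addr data] addr+data-seq]
                      [addr (pr-str data)]))
  nil)

(defn restore-addr
  "Loads and de-serializes the data stored at `addr`, or returns nil if absent."
  [state addr]
  (some-> (get @state addr) edn/read-string))

(comment
  ;; Assumes DataScript is on the classpath; `state` is a plain map in an atom.
  (def storage
    (let [state (atom {})]
      (reify datascript.storage/IStorage
        (-store [_ addr+data-seq]
          (store-batch! state addr+data-seq))
        (-restore [_ addr]
          (restore-addr state addr))))))
```

Swapping the atom for a SQL table or a key-value store only changes the bodies of the two helpers. The single `swap!` per batch is where a real backend would open and commit its transaction.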

After you’re done implementing a storage, call

```
(d/store db storage)
```

and you are done!

`d/store` is slightly mutable under the hood, in the sense that the database remembers the storage, which nodes are already stored, etc. It shouldn’t affect your regular use of a database, but still, I think you should know.

The good part about this is, if you modify your database, it’ll remember which parts of the B-tree were stored, and on the next store do an incremental update!

```
(let [db' (d/db-with db
            tx-data)]
  (d/store db' storage))
```

The code above will make far fewer `-store` calls because only some parts of the trees have changed. That’s one of the main selling points of this approach.

BTW, you can also specify storage during db creation, and then call `d/store` without the storage argument:

```
(let [schema nil
      db     (d/empty-db schema {:storage storage})]
  (d/store db))
```

Just a possibility. Once stored, a DB will remember its storage. Actually, storing into a different storage is not supported (yet? ever?), so you can just as well start calling `(d/store db)` without the storage argument.

Ok, now for the fun stuff. How do you read a database from disk? Simple:

```
(def db
  (d/restore storage))
```

The fun part is that `d/restore` does exactly zero reads! That’s right, restoring a database is lazy (and thus, super-fast). Only when you start accessing it will the relevant parts be fetched. E.g. this:

```
(first (d/datoms db :eavt))
```

will do approximately 3-6 reads depending on how deep your db is (~ 2 + log512(datoms)).
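
To get a feel for that estimate, the formula can be evaluated directly. This is only a back-of-the-envelope sketch: `tree-depth` and `estimated-reads` are hypothetical helpers, the logarithm is rounded up, and every node is assumed to be completely full, so a real tree may be one level deeper.

```clojure
(defn tree-depth
  "Number of levels a B-tree needs to hold `datoms` keys when every node
   holds up to `branching-factor` keys (the logarithm rounded up, minimum 1)."
  [datoms branching-factor]
  (loop [depth    1
         capacity branching-factor]
    (if (>= capacity datoms)
      depth
      (recur (inc depth) (* capacity branching-factor)))))

(defn estimated-reads
  "The ~ 2 + log_b(datoms) estimate from the text, with the log rounded up."
  [datoms branching-factor]
  (+ 2 (tree-depth datoms branching-factor)))

(estimated-reads 1000 512)       ;; => 4
(estimated-reads 1000000 512)    ;; => 5
(estimated-reads 1000000000 512) ;; => 6
```

Going from a thousand datoms to a billion adds only two reads, which is why a lazy restore stays cheap on large databases.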

A restored DB should work exactly like a normal DB. You can read whatever you want, query as much as you want: `entity`, `pull`, `datoms` should all work as usual. If you modify a restored DB and then `store` it, that store will be incremental as well.

And that’s it! That’s all you need to know.

## Storage + conn

Storage also works with `conn`, and, as a convenience, if you specify `:storage` when creating a `conn`, it’ll then `store` after each `transact!`:

```
(def conn
  (d/create-conn schema {:storage storage}))

(d/transact! conn
  [[:db/add 1 :name "Ivan"]]) ;; <- will be stored automatically

(d/transact! conn
  [[:db/add 2 :name "Oleg"]]) ;; <- will be stored again
```

You can restore `conn`, too:

```
(def conn
  (d/restore-conn storage))
```

## What is storage?

An important thing to understand: one storage can only hold one database. This is because it uses a constant addr to write its root. If you try to store two or more databases, the last one will essentially overwrite everything else.

You can, of course, solve it on your level. If you need to store two or more databases, make a storage that uses two or more different file system directories (file storage) or two or more SQL tables (SQL storage). It’s entirely up to you.
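
For the file-system case, the "one storage per database" rule can be as simple as one directory per database name. A small sketch, where `db-dir`, the root path and the database names are all made up for illustration; the `comment` form uses `d/file-storage`, which is described in the next section.

```clojure
(require '[clojure.java.io :as io])

(defn db-dir
  "Directory for the database called `db-name` under `root`."
  [root db-name]
  (.getPath (io/file root db-name)))

(comment
  ;; Assumes (require '[datascript.core :as d]).
  ;; Each database gets its own storage, so their roots never collide.
  (def users-storage  (d/file-storage (db-dir "/var/data" "users")))
  (def orders-storage (d/file-storage (db-dir "/var/data" "orders"))))
```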

## File Storage

DataScript comes with one default implementation of storage: `file-storage`. It takes a directory and stores everything about a database in it. Use it like this:

```
(def storage
  (d/file-storage "/tmp/db"))

(d/store db storage)

(def db'
  (d/restore storage))
```

It accepts a couple of options. If you want your own serializer (it uses EDN by default):

```
:freeze-fn :: (data) -> String. A serialization function
:thaw-fn   :: (String) -> data. A deserialization function
```

If you want to read/write from/to input/output stream yourself:

```
:write-fn :: (OutputStream data) -> void. Implement your own writer to FileOutputStream
:read-fn  :: (InputStream) -> Object. Implement your own reader from FileInputStream
```

And finally, if you want to control how addresses are converted to file names:

```
:addr->filename-fn :: (Long) -> String. Construct file name from address
:filename->addr-fn :: (String) -> Long. Reconstruct address from file name
```

All of these are optional.
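
As an illustration of the last pair, here is a sketch of an address-to-file-name mapping that round-trips any 64-bit address. The function names are hypothetical, fixed-width hex is just one possible scheme, and passing the options as a map in the second argument of `d/file-storage` is an assumption; `pr-str` and `edn/read-string` are stand-ins for your own serializer.

```clojure
(require '[clojure.edn :as edn])

(defn addr->filename
  "Fixed-width hex file name for a 64-bit address.
   Negative addresses print in two's complement, e.g. -1 -> ffffffffffffffff."
  [addr]
  (format "%016x" addr))

(defn filename->addr
  "Inverse of addr->filename."
  [filename]
  (Long/parseUnsignedLong filename 16))

(comment
  ;; Assumes (require '[datascript.core :as d]) and that options are
  ;; passed as a map in the second argument.
  (def storage
    (d/file-storage "/tmp/db"
      {:freeze-fn         pr-str
       :thaw-fn           edn/read-string
       :addr->filename-fn addr->filename
       :filename->addr-fn filename->addr})))
```

Whatever scheme you pick, the two functions have to be exact inverses of each other, because the file name is the only place the address is recorded.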

Also, remember, it’s never safe to write to the file system yourself. Always do it through a battle-tested layer like SQLite or RocksDB or something. If you put `file-storage` in production, expect problems :)

## Garbage collection

The IStorage protocol actually has two more methods to implement:

```
(-list-addresses [_]
  "Return seq that lists all addresses currently stored in your storage.
   Will be used during GC to remove keys that are no longer used.")

(-delete [_ addrs-seq]
  "Delete data stored under `addrs` (seq). Will be called during GC")
```

What’s that about?

Well, when you store your database for the first time, every node gets an address. Then you add or remove some stuff from it, which creates a new tree, which reuses parts of the old tree. You know, good old Clojure persistent data structures (it’s not literally Clojure, I had to roll my own, but the idea is the same).

Now, the new tree consists of some old reused nodes that already have addresses and some new nodes that have no addresses yet. You save it again, and the new nodes get their addresses, and the old nodes are just skipped, which makes the whole process efficient.

Noticed the catch? The new tree lost some nodes because it no longer needs them, but they are still in storage! That means eventually this kind of garbage will accumulate, and your storage will hold way more nodes than are needed to build the last version of the database.

(you might think: going back in history? but no. That’s not how we go about that)

That’s why garbage collection exists. Your storage needs to provide us with a list of all the addresses it currently holds, and a way to delete them. Then you call:

```
(d/collect-garbage storage)
```

and you are done! It will clean up everything that is not referenced by the current version of the DB stored there or by any past versions that were restored from it and are still alive.

In the current implementation, expect that `d/collect-garbage` might be slow.
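
The idea boils down to a set difference: everything the storage holds minus everything still reachable. The sketch below shows that idea for an atom-backed storage. All function names are hypothetical, and how DataScript itself works out the reachable addresses is not covered here, so `reachable` is simply given.

```clojure
(require '[clojure.set :as set])

(defn garbage-addrs
  "Addresses present in storage but not reachable from any live database."
  [stored-addrs reachable-addrs]
  (set/difference (set stored-addrs) (set reachable-addrs)))

(defn list-addresses
  "What -list-addresses could return for an atom-backed {addr -> data} map."
  [state]
  (keys @state))

(defn delete-addrs!
  "What -delete could do for the same atom: drop all `addrs` in one swap!."
  [state addrs]
  (swap! state #(apply dissoc % addrs))
  nil)

(let [state     (atom {1 "root" 2 "leaf-a" 3 "leaf-b" 4 "old-leaf"})
      reachable [1 2 3]]
  (delete-addrs! state (garbage-addrs (list-addresses state) reachable))
  (sort (keys @state)))
;; => (1 2 3)
```

Listing every stored address is a full scan of the storage, which is one plausible reason a collection pass can be slow on a large store.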

## Options

Databases support two primary options:

```
:branching-factor <int>, default 512
```

How wide your B-trees are. By default it means that each node will contain 256...512 keys. You can change it to e.g. 1024 and have fewer, bigger nodes if e.g. writing to storage has a high per-write overhead. Or to something like 64 to have 32...64 keys in each node, if you want very high granularity and it’s cheaper to write small entries to your storage.
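
To see what a given value means for your storage, you can estimate how many leaf nodes one index needs. A rough sketch that only looks at the leaf level and takes the "half full to full" range from the paragraph above; `ceil-div` and `leaf-count-bounds` are hypothetical helpers, and passing `:branching-factor` in the options map of `d/empty-db` is an assumption based on the changelog entry.

```clojure
(defn ceil-div
  "Integer division rounded up."
  [n d]
  (quot (+ n (dec d)) d))

(defn leaf-count-bounds
  "Fewest and most leaf nodes needed for `datoms` keys when every leaf
   holds between branching-factor/2 and branching-factor keys."
  [datoms branching-factor]
  {:min (ceil-div datoms branching-factor)
   :max (ceil-div datoms (quot branching-factor 2))})

(leaf-count-bounds 1000000 512)  ;; => {:min 1954, :max 3907}
(leaf-count-bounds 1000000 1024) ;; => {:min 977, :max 1954}
(leaf-count-bounds 1000000 64)   ;; => {:min 15625, :max 31250}

(comment
  ;; Assumes (require '[datascript.core :as d]) and a `schema`.
  (d/empty-db schema {:branching-factor 1024}))
```

Each leaf is one entry in your storage, so with a million datoms per index the choice is roughly between a thousand large writes and tens of thousands of small ones.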

```
:ref-type :strong | :soft | :weak, default :soft
```

This is harder to explain. DataScript consists of three B-trees. Each one is, well, a tree. We store them per-node, so each node in a tree has an address if it was previously stored (new nodes have null).

Now, if some node was read from storage (e.g. the very first leaf and all the nodes leading to it, if you requested `(first (d/datoms db :eavt))`), then it will keep both its address and its value. But the value could, at any moment, also be restored from storage, with some deserialization penalty.

So what `:ref-type` controls is how values are held in nodes that have both an address and a value.

`:strong` means just a normal Java reference, meaning the node value will never be unloaded from memory once read. You should use it if you are absolutely sure your entire database easily fits into memory. You will still benefit from lazy loading, but once loaded, it will never unload.

`:soft` is the sweet spot. It uses `SoftReference` to store the values. It means that normally values will not be unloaded, but under memory pressure, they just might.

`:weak` is similar to `:soft`, but uses `WeakReference` and will unload your nodes more aggressively. It was intended to be used in conjunction with pluggable caches like LRU, but this has not been implemented yet. TBD.
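
The difference between the three is easiest to see with the JDK reference classes directly. This is only an illustration of the idea, not DataScript’s actual node implementation: `make-ref` and `node-value` are hypothetical, and `restore-fn` stands in for a read from storage.

```clojure
(import '[java.lang.ref Reference SoftReference WeakReference])

(defn make-ref
  "Wraps a node value according to `ref-type`."
  [ref-type value]
  (case ref-type
    :strong value                    ;; plain reference, never unloaded
    :soft   (SoftReference. value)   ;; may be cleared under memory pressure
    :weak   (WeakReference. value))) ;; may be cleared at the next GC

(defn node-value
  "Returns the cached value, or re-reads it with (restore-fn addr)
   if the reference has been cleared."
  [ref addr restore-fn]
  (let [cached (if (instance? Reference ref)
                 (.get ^Reference ref)
                 ref)]
    (if (some? cached)
      cached
      (restore-fn addr))))
```

A cleared reference costs one extra storage read plus deserialization, which is the penalty mentioned above; `:strong` never pays it but also never frees memory.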

I think the default options are fine for most, but in case you need to change them, you now know how.

## Conclusion

Storing gigabytes of data in DataScript and only loading the parts that are needed for a query was my vision for DataScript from the very beginning. I’m so happy we can finally get closer to that!

Let me know if you implement something cool with it.

project.clj

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
   :dependencies [
     [org.clojure/clojure "1.10.2" :scope "provided"]
     [org.clojure/clojurescript "1.10.844" :scope "provided"]
-    [persistent-sorted-set "0.2.3"]
+    [persistent-sorted-set "0.3.0"]
   ]
 
   :plugins [

script/repl.sh

Lines changed: 2 additions & 1 deletion
@@ -2,4 +2,5 @@
 set -o errexit -o nounset -o pipefail
 cd "`dirname $0`/.."
 
-clj -A:test:bench -J--add-opens=java.base/java.io=ALL-UNNAMED -X clojure.core.server/start-server :name repl :port 5555 :accept clojure.core.server/repl :server-daemon false
+echo "Starting Socket REPL server on port 5555"
+clj -A:1.11.1:dev:test:bench -X clojure.core.server/start-server :name repl :port 5555 :accept clojure.core.server/repl :server-daemon false
