Skip to content

Commit dcfa67e

Browse files
authored
Add documentation for autocompressor (#70)
1 parent 9bfd786 commit dcfa67e

File tree

9 files changed

+398
-99
lines changed

9 files changed

+398
-99
lines changed

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,8 @@ rayon = "1.3.0"
2020
string_cache = "0.8.0"
2121
env_logger = "0.9.0"
2222
log = "0.4.14"
23+
pyo3-log = "0.4.0"
24+
log-panics = "2.0.0"
2325

2426
[dependencies.state-map]
2527
git = "https://github.com/matrix-org/rust-matrix-state-map"

README.md

Lines changed: 200 additions & 87 deletions
Large diffs are not rendered by default.

auto_compressor/src/main.rs

Lines changed: 12 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -65,11 +65,15 @@ fn main() {
6565
.arg(
6666
Arg::with_name("postgres-url")
6767
.short("p")
68-
.value_name("URL")
69-
.help("The url for connecting to the postgres database.")
68+
.value_name("POSTGRES_LOCATION")
69+
.help("The configruation for connecting to the postgres database.")
7070
.long_help(concat!(
71-
"The url for connecting to the postgres database.This should be of",
72-
" the form \"postgresql://username:[email protected]/database\""))
71+
"The configuration for connecting to the Postgres database. This should be of the form ",
72+
r#""postgresql://username:[email protected]/database" or a key-value pair "#,
73+
r#"string: "user=username password=password dbname=database host=mydomain.com" "#,
74+
"See https://docs.rs/tokio-postgres/0.7.2/tokio_postgres/config/struct.Config.html ",
75+
"for the full details."
76+
))
7377
.takes_value(true)
7478
.required(true),
7579
).arg(
@@ -111,7 +115,10 @@ fn main() {
111115
.short("n")
112116
.value_name("CHUNKS_TO_COMPRESS")
113117
.help("The number of chunks to compress")
114-
.long_help("This many chunks of the database will be compressed ")
118+
.long_help(concat!(
119+
"This many chunks of the database will be compressed. The higher this number is set to, ",
120+
"the longer the compressor will run for."
121+
))
115122
.takes_value(true)
116123
.required(true),
117124
).get_matches();

compressor_integration_tests/src/lib.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -358,7 +358,7 @@ fn functions_are_self_consistent() {
358358
pub fn setup_logger() {
359359
// setup the logger for the auto_compressor
360360
// The default can be overwritten with RUST_LOG
361-
// see the README for more information <--- TODO
361+
// see the README for more information
362362
if env::var("RUST_LOG").is_err() {
363363
let mut log_builder = env_logger::builder();
364364
// set is_test(true) so that the output is hidden by cargo test (unless the test fails)

docs/algorithm.md

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
# Compression algorithm
2+
3+
## What is state?
4+
State is things like who is in a room, what the room topic/name is, who has
5+
what privilege levels etc. Synapse keeps track of it for various reasons such as
6+
spotting invalid events (e.g. ones sent by banned users) and providing room membership
7+
information to clients.
8+
9+
## What is a state group?
10+
11+
Synapse needs to keep track of the state at the moment of each event. A state group
12+
corresponds to a unique state. The database table `event_to_state_groups` keeps track
13+
of the mapping from event ids to state group ids.
14+
15+
Consider the following simplified example:
16+
```
17+
State group id | State
18+
_____________________________________________
19+
1 | Alice in room
20+
2 | Alice in room, Bob in room
21+
3 | Bob in room
22+
23+
24+
Event id | What the event was
25+
______________________________________
26+
1 | Alice sends a message
27+
3 | Bob joins the room
28+
4 | Bob sends a message
29+
5 | Alice leaves the room
30+
6 | Bob sends a message
31+
32+
33+
Event id | State group id
34+
_________________________
35+
1 | 1
36+
2 | 1
37+
3 | 2
38+
4 | 2
39+
5 | 3
40+
6 | 3
41+
```
42+
43+
## What are deltas and predecessors?
44+
When a new state event happens (e.g. Bob joins the room) a new state group is created.
45+
BUT instead of copying all of the state from the previous state group, we just store
46+
the change from the previous group (saving on lots of storage space!). The difference
47+
from the previous state group is called the "delta".
48+
49+
So for the previous example, we would have the following (Note only rows 1 and 2 will
50+
make sense at this point):
51+
52+
```
53+
State group id | Previous state group id | Delta
54+
____________________________________________________________
55+
1 | NONE | Alice in room
56+
2 | 1 | Bob in room
57+
3 | NONE | Bob in room
58+
```
59+
60+
So why is state group 3's previous state group NONE and not 2? Well, the way that deltas
61+
work in Synapse is that they can only add in new state or overwrite old state, but they
62+
cannot remove it. (So if the room topic is changed then that is just overwriting state,
63+
but removing Alice from the room is neither an addition nor an overwriting). If it is
64+
impossible to find a delta, then you just start from scratch again with a "snapshot" of
65+
the entire state.
66+
67+
(NOTE this is not documentation on how synapse handles leaving rooms but is purely for illustrative
68+
purposes)
69+
70+
The state of a state group is worked out by following the previous state group's and adding
71+
together all of the deltas (with the most recent taking precedence).
72+
73+
The mapping from state group to previous state group takes place in `state_group_edges`
74+
and the deltas are stored in `state_groups_state`.
75+
76+
## What are we compressing then?
77+
In order to speed up the conversion from state group id to state, there is a limit of 100
78+
hops set by synapse (that is: we will only ever have to look up the deltas for a maximum of
79+
100 state groups). It does this by taking another "snapshot" every 100 state groups.
80+
81+
However, it is these snapshots that take up the bulk of the storage in a synapse database,
82+
so we want to find a way to reduce the number of them without dramatically increasing the
83+
maximum number of hops needed to do lookups.
84+
85+
86+
## Compression Algorithm
87+
88+
The algorithm works by attempting to create a *tree* of deltas, produced by
89+
appending state groups to different "levels". Each level has a maximum size, where
90+
each state group is appended to the lowest level that is not full. This tool calls a
91+
state group "compressed" once it has been added to
92+
one of these levels.
93+
94+
This produces a graph that looks approximately like the following, in the case
95+
of having two levels with the bottom level (L1) having a maximum size of 3:
96+
97+
```
98+
L2 <-------------------- L2 <---------- ...
99+
^--- L1 <--- L1 <--- L1 ^--- L1 <--- L1 <--- L1
100+
101+
NOTE: A <--- B means that state group B's predecessor is A
102+
```
103+
The structure that synapse creates by default would be equivalent to having one level with
104+
a maximum length of 100.
105+
106+
**Note**: Increasing the sum of the sizes of levels will increase the time it
107+
takes to query the full state of a given state group.

docs/python.md

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
# Running the compressor tools from python
2+
3+
Both the automatic and manual tools use PyO3 to allow the compressor
4+
to be run from Python.
5+
6+
To see any output from the tools, logging must be setup in Python before
7+
the compressor is run.
8+
9+
## Setting things up
10+
11+
1. Create a virtual environment in the place you want to use the compressor from
12+
(if it doesn't already exist)
13+
`$ virtualenv -p python3 venv`
14+
15+
2. Activate the virtual environment and install `maturin` (if you haven't already)
16+
`$ source venv/bin/activate`
17+
`$ pip install maturin`
18+
19+
3. Navigate to the correct location
20+
For the automatic tool:
21+
`$ cd /home/synapse/rust-synapse-compress-state/auto_compressor`
22+
For the manual tool:
23+
`$ cd /home/synapse/rust-synapse-compress-state`
24+
25+
3. Build and install the library
26+
`$ maturin develop`
27+
28+
This will install the relevant compressor tool into the activated virtual environment.
29+
30+
## Automatic tool example:
31+
32+
```python
33+
import auto_compressor
34+
35+
auto_compressor.compress_state_events_table(
36+
db_url="postgresql://localhost/synapse",
37+
chunk_size=500,
38+
default_levels="100,50,25",
39+
number_of_chunks=100
40+
)
41+
```
42+
43+
# Manual tool example:
44+
45+
```python
46+
import synapse_compress_state
47+
48+
synapse_compress_state.run_compression(
49+
db_url="postgresql://localhost/synapse",
50+
room_id="!some_room:example.com",
51+
output_file="out.sql",
52+
transactions=True
53+
)
54+
```

src/lib.rs

Lines changed: 19 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
// of arguments - this hopefully doesn't make the code unclear
2121
// #[allow(clippy::too_many_arguments)] is therefore used around some functions
2222

23-
use log::{info, warn};
23+
use log::{info, warn, LevelFilter};
2424
use pyo3::{exceptions, prelude::*};
2525

2626
use clap::{crate_authors, crate_description, crate_name, crate_version, value_t, App, Arg};
@@ -121,11 +121,15 @@ impl Config {
121121
.arg(
122122
Arg::with_name("postgres-url")
123123
.short("p")
124-
.value_name("URL")
125-
.help("The url for connecting to the postgres database.")
124+
.value_name("POSTGRES_LOCATION")
125+
.help("The configruation for connecting to the postgres database.")
126126
.long_help(concat!(
127-
"The url for connecting to the postgres database.This should be of",
128-
" the form \"postgresql://username:[email protected]/database\""))
127+
"The configuration for connecting to the Postgres database. This should be of the form ",
128+
r#""postgresql://username:[email protected]/database" or a key-value pair "#,
129+
r#"string: "user=username password=password dbname=database host=mydomain.com" "#,
130+
"See https://docs.rs/tokio-postgres/0.7.2/tokio_postgres/config/struct.Config.html ",
131+
"for the full details."
132+
))
129133
.takes_value(true)
130134
.required(true),
131135
).arg(
@@ -781,6 +785,16 @@ fn run_compression(
781785
/// Python module - "import synapse_compress_state" to use
782786
#[pymodule]
783787
fn synapse_compress_state(_py: Python, m: &PyModule) -> PyResult<()> {
788+
let _ = pyo3_log::Logger::default()
789+
// don't send out anything lower than a warning from other crates
790+
.filter(LevelFilter::Warn)
791+
// don't log warnings from synapse_compress_state, the auto_compressor handles these
792+
// situations and provides better log messages
793+
.filter_target("synapse_compress_state".to_owned(), LevelFilter::Debug)
794+
.install();
795+
// ensure any panics produce error messages in the log
796+
log_panics::init();
797+
784798
m.add_function(wrap_pyfunction!(run_compression, m)?)?;
785799
Ok(())
786800
}

src/main.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ use synapse_compress_state as comp_state;
2929
fn main() {
3030
// setup the logger
3131
// The default can be overwritten with RUST_LOG
32-
// see the README for more information <---- TODO
32+
// see the README for more information
3333
if env::var("RUST_LOG").is_err() {
3434
let mut log_builder = env_logger::builder();
3535
// Only output the log message (and not the prefixed timestamp etc.)

0 commit comments

Comments
 (0)