This repository contains reference implementations of the LDBC Social Network Benchmark's Interactive v2 workload. For details on the benchmark, see the [SIGMOD 2015 paper](https://ldbcouncil.org/docs/papers/ldbc-snb-interactive-sigmod-2015.pdf), the [specification on GitHub Pages](https://ldbcouncil.org/ldbc_snb_docs/), and the [specification on arXiv](https://arxiv.org/pdf/2001.02299.pdf).
To get started with the LDBC SNB benchmarks, check out our introductory presentation: [The LDBC Social Network Benchmark](https://docs.google.com/presentation/d/1NilxSrKQnFq4WzWMY2-OodZQ2TEksKzKBmgB20C_0Nw/) ([PDF](https://ldbcouncil.org/docs/presentations/ldbc-snb-2022-11.pdf)).
:warning: This workload is still under design. If you are looking for a stable, auditable version, use the [Interactive v1 workload](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).
## Notes
:warning: Please keep in mind the following when using this repository.
* The goal of the implementations in this repository is to serve as **reference implementations** which other implementations can be cross-validated against. Therefore, our primary objective when formulating the queries was readability, not absolute performance.
* The default workload contains updates which change the state of the database. Therefore, **the database needs to be reloaded or restored from backup before each run**. Use the provided `scripts/backup-database.sh` and `scripts/restore-database.sh` scripts to achieve this.
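As a sketch, a typical sequence of runs therefore looks like this (illustrative only; run the scripts from the directory of the system under test, and consult that system's README for specifics):

```bash
# Illustrative sequence; assumes the initial data set has already been loaded.
scripts/backup-database.sh     # snapshot the freshly loaded state
driver/benchmark.sh            # the run applies the updates
scripts/restore-database.sh    # roll back to the snapshot
driver/benchmark.sh            # the next run starts from the same state
```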
* [Microsoft SQL Server (Transact-SQL) implementation](mssql/README.md)
* [Umbra (SQL) implementation](umbra/README.md)
For detailed instructions, consult the READMEs of the projects.
## User's guide
### Building the project
This project uses Java 17.
To build the entire project, run:
```bash
scripts/build.sh
```
To build a subset of the projects, e.g. to build the PostgreSQL implementation, run its individual build script:
```bash
postgres/scripts/build.sh
```
### Inputs
The benchmark framework relies on the initial data set, update streams, and substitution parameters produced by the [SNB Datagen's Spark-based version](https://github.com/ldbc/ldbc_snb_datagen_spark/).
Currently, the initial data set, update streams, and parameters can be generated with the following command:

```bash
export SF= # The scale factor to generate
export LDBC_SNB_DATAGEN_DIR= # Path to the LDBC SNB Datagen directory
export LDBC_SNB_DATAGEN_MAX_MEM= # Maximum memory the Datagen may use, e.g. 16G
export LDBC_SNB_DRIVER_DIR= # Path to the LDBC SNB driver directory
export DATA_INPUT_TYPE=parquet
# If using the Docker Datagen version, set the env variable:
export USE_DATAGEN_DOCKER=true

scripts/generate-all.sh
```
### Pre-generated data sets
[Pre-generated SF1-SF300 data sets](snb-interactive-pre-generated-data-sets.md) are available.
### Loading the data
Select the system to be tested, e.g. [PostgreSQL](postgres/).
Load the data set as described in the README file of the selected system.
For most systems, this involves setting an environment variable to the correct location and invoking the `scripts/load-in-one-step.sh` script.
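For example, for PostgreSQL this might look as follows (the variable name below is a placeholder; the actual name is documented in `postgres/README.md`):

```bash
# Placeholder variable name -- check postgres/README.md for the actual one.
export POSTGRES_DATA_DIR=/path/to/generated/social-network-data
postgres/scripts/load-in-one-step.sh
```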
### Driver modes
For each implementation, it is possible to perform the run in one of the [SNB driver's](https://github.com/ldbc/ldbc_snb_interactive_driver) three modes: create validation parameters, validate, and benchmark.
All of these runs should be started with the initial data set loaded into the database.
1. Create validation parameters with the `driver/create-validation-parameters.sh` script.
    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
        * The update streams are the files in the `inserts` and `deletes` directories under the directory set in the `ldbc.snb.interactive.updates_dir` configuration property.
        * For this mode, the query frequencies are set to a uniform `1` value to ensure the best average test coverage.
    * **Output:** The results will be stored in the validation parameters file (e.g. `validation_params.json`) set in the `validate_database` configuration property.
    * **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.
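A minimal driver configuration sketch for this mode. The two directory property names are those cited above; the output key and the example paths are assumptions, so verify them against the driver's documentation:

```properties
# Sketch only -- verify key names against the SNB driver's documentation.
ldbc.snb.interactive.parameters_dir=substitution_parameters/
ldbc.snb.interactive.updates_dir=update_streams/
validate_database=validation_params.json
thread_count=1
```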
2. Validate against existing validation parameters with the `driver/validate.sh` script.
    * **Input:**
        * The query substitution parameters are taken from the validation parameters file (e.g. `validation_params.json`) set in the `validate_database` configuration property.
        * The update operations are also based on the content of the validation parameters file.
    * **Output:**
        * The validation either passes or fails.
        * The per-query results of the validation are printed to the console.
        * If the validation failed, the results are saved to the `validation_params-failed-expected.json` and `validation_params-failed-actual.json` files.
    * **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.
3. Run the benchmark with the `driver/benchmark.sh` script.
    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
        * The update streams are the files in the `inserts` and `deletes` directories under the directory set in the `ldbc.snb.interactive.updates_dir` configuration property.
        * The goal of the benchmark is to achieve the best (lowest possible) `time_compression_ratio` value while ensuring that the 95% on-time requirement is kept (i.e. 95% of the queries can be started within 1 second of their scheduled time). If your benchmark run returns "failed schedule audit", increase this number (which lowers the time compression rate) until it passes.
        * Set the `thread_count` property to the size of the thread pool for read operations.
        * For audited benchmarks, ensure that the `warmup` and `operation_count` properties are set so that the warmup and benchmark phases last for 30+ minutes and 2+ hours, respectively.
    * **Output:**
        * The detailed results of the benchmark are printed to the console and saved in the `results/` directory.
    * **Parallelism:** Multi-threaded execution is recommended to achieve the best result.
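Tying these together, a benchmark-mode configuration might look like the following sketch (the property names are the ones cited above; all values are placeholders to tune for your system, not recommendations):

```properties
# Sketch only -- values are placeholders, tune them for the system under test.
ldbc.snb.interactive.parameters_dir=substitution_parameters/
ldbc.snb.interactive.updates_dir=update_streams/
# Size of the thread pool for read operations.
thread_count=8
# Lower is better, but the run must still pass the schedule audit.
time_compression_ratio=0.02
# Warmup and benchmark lengths; for audited runs these phases should span
# 30+ minutes and 2+ hours, respectively.
warmup=50000
operation_count=500000
```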
For more details on validating and benchmarking, visit the [driver's documentation](https://github.com/ldbc/ldbc_snb_interactive_driver/tree/v1-dev/docs).
## Developer's guide
To create a new implementation, it is recommended to use one of the existing ones: the Neo4j implementation for graph database management systems and the PostgreSQL implementation for RDBMSs.
109
122
110
123
The implementation process looks roughly as follows:
111
124
112
125
1. Create a bulk loader which loads the initial data set to the database.
1. Add the required glue code to the Java driver that allows parameterized execution of queries and operators.
1. Implement the complex and short read queries (21 in total).
1. Implement the insert and delete operations (16 in total).
1. Test the implementation against the reference implementations using various scale factors.
1. Optimize the implementation.
## Preparing for an audited run
Implementations of the Interactive workload can be audited by a certified LDBC auditor.
The [Auditing Policies chapter](https://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf#chapter.7) of the specification describes the auditing process and the required artifacts.
If you plan to get your system audited, please reach out to the [LDBC Steering Committee](https://ldbcouncil.org/organizational-members/).
:warning: Audited runs are currently only possible with the [v1 version](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).