Reference implementations of the LDBC Social Network Benchmark's Interactive workload ([paper](https://homepages.cwi.nl/~boncz/snb-challenge/snb-sigmod.pdf), [specification on GitHub pages](https://ldbcouncil.org/ldbc_snb_docs/), [specification on arXiv](https://arxiv.org/pdf/2001.02299.pdf)).

To get started with the LDBC SNB benchmarks, check out our introductory presentation: [The LDBC Social Network Benchmark](https://docs.google.com/presentation/d/1NilxSrKQnFq4WzWMY2-OodZQ2TEksKzKBmgB20C_0Nw/) ([PDF](https://ldbcouncil.org/docs/presentations/ldbc-snb-2021-12.pdf)).

## Notes

:warning: Please keep in mind the following when using this repository.

* The goal of the implementations in this repository is to serve as **reference implementations** which other implementations can be cross-validated against. Therefore, our primary objective when formulating the queries was readability, not absolute performance.

* The default workload contains updates which are persisted in the database. Therefore, **the database needs to be reloaded or restored from backup before each run**. Use the provided `scripts/backup-database.sh` and `scripts/restore-database.sh` scripts to achieve this.

* We expect most systems under test to use multi-threaded execution for their benchmark runs. **To allow running the updates on multiple threads, the update stream files need to be partitioned accordingly by the generator.** We have [pre-generated](#benchmark-data-sets) these for common partition numbers (1, 2, ..., 1024 and 24, 48, 96, ..., 768) and scale factors up to 1000.
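The idea behind the partitioning can be illustrated with a small sketch (this is illustrative only, not the Datagen's actual algorithm): a timestamped update stream is split round-robin into *n* sub-streams, so that each writer thread can replay its own sub-stream sequentially in timestamp order.

```python
# Illustrative only: round-robin partitioning of a timestamped update
# stream into n per-thread streams (not the Datagen's actual algorithm).
def partition_updates(events, n):
    """Split (timestamp, operation) pairs into n partitions.

    Each partition preserves the global timestamp order of the events
    assigned to it, so a single thread can replay it sequentially.
    """
    partitions = [[] for _ in range(n)]
    for i, event in enumerate(sorted(events, key=lambda e: e[0])):
        partitions[i % n].append(event)
    return partitions

events = [(3, "addPost"), (1, "addPerson"), (2, "addComment"), (4, "addLike")]
print(partition_updates(events, 2))
# [[(1, 'addPerson'), (3, 'addPost')], [(2, 'addComment'), (4, 'addLike')]]
```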
For detailed instructions, consult the READMEs of the projects.

To build a subset of the projects, use Maven profiles, e.g. to build the reference implementations, run:

```bash
mvn clean package -DskipTests -Pcypher,postgres
```

## User's guide

### Building the project

This project uses Java 11.

To build the project, run:

```bash
scripts/build.sh
```

### Inputs

The benchmark framework relies on the following inputs produced by the [SNB Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/):

* **Initial data set:** the SNB graph in CSV format (`social_network/{static,dynamic}`)
* **Update streams:** the input for the update operations (`social_network/updateStream_*.csv`)
* **Substitution parameters:** the input parameters for the complex queries. These are produced by the Datagen (`substitution_parameters/`)

### Driver modes

For each implementation, it is possible to perform the run in one of the [SNB driver's](https://github.com/ldbc/ldbc_snb_interactive_v1_driver) three modes: create validation parameters, validate, and benchmark.
The execution in all three modes should be started after the initial data set was loaded into the system under test.

1. Create validation parameters with the `driver/create-validation-parameters.sh` script.
    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
        * The update streams are the `updateStream_0_0_{forum,person}.csv` files from the location set in the `ldbc.snb.interactive.updates_dir` configuration property.
        * For this mode, the query frequencies are set to a uniform `1` value to ensure the best average test coverage.
    * **Output:** The results will be stored in the validation parameters file (e.g. `validation_params.csv`) set in the `create_validation_parameters` configuration property.
    * **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.
2. Validate against an existing reference output (called "validation parameters") with the `driver/validate.sh` script.
    * **Input:**
        * The query substitution parameters are taken from the validation parameters file (e.g. `validation_params.csv`) set in the `validate_database` configuration property.
        * The update operations are also based on the content of the validation parameters file.
    * **Output:**
        * The validation either passes or fails.
3. Run the benchmark with the `driver/benchmark.sh` script.
    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
        * The update streams are the `updateStream_*_{forum,person}.csv` files from the location set in the `ldbc.snb.interactive.updates_dir` configuration property.
            * To get *2n* write threads, the framework requires *n* `updateStream_*_forum.csv` and *n* `updateStream_*_person.csv` files.
            * If you are generating the data sets from scratch, set `ldbc.snb.datagen.serializer.numUpdatePartitions` to *n* in the [data generator](https://github.com/ldbc/ldbc_snb_datagen_hadoop) to produce these.
        * The goal of the benchmark is to achieve the best (lowest possible) `time_compression_ratio` value while ensuring that the 95% on-time requirement is kept (i.e. 95% of the queries can be started within 1 second of their scheduled time). If your benchmark run returns "failed schedule audit", increase this value (which lowers the time compression rate) until it passes.
        * Set the `thread_count` property to the size of the thread pool for read operations.
        * For audited benchmarks, ensure that the `warmup` and `operation_count` properties are set so that the warmup and benchmark phases last for 30+ minutes and 2+ hours, respectively.
    * **Output:**
        * The detailed results of the benchmark are printed to the console and saved in the `results/` directory.
    * **Parallelism:** Multi-threaded execution is recommended to achieve the best result.

For more details on validating and benchmarking, visit the [driver's documentation](https://github.com/ldbc/ldbc_snb_interactive_v1_driver/tree/main/docs).
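For reference, the configuration properties mentioned in this section are set in the driver's properties file (e.g. `driver/benchmark.properties`). A minimal sketch follows — the property names appear in this document, but the values below are purely illustrative and must be tuned per system under test and scale factor:

```properties
# Illustrative values only -- tune per system under test and scale factor
ldbc.snb.interactive.parameters_dir=substitution_parameters/
ldbc.snb.interactive.updates_dir=social_network/
thread_count=8
time_compression_ratio=0.02
warmup=50000
operation_count=250000
```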
## Developer's guide

To create a new implementation, it is recommended to use one of the existing ones: the Neo4j implementation for graph database management systems and the PostgreSQL implementation for RDBMSs.

The implementation process looks roughly as follows:

1. Create a bulk loader which loads the initial data set to the database.
2. Implement the complex and short read queries (22 in total).
3. Implement the 7 update queries.
4. Test the implementation against the reference implementations using various scale factors.
5. Optimize the implementation.

## Data sets

### Benchmark data sets

To generate the benchmark data sets, use the [Hadoop-based LDBC SNB Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/releases/tag/v1.0.0).

The key configurations are the following:

* `ldbc.snb.datagen.generator.scaleFactor`: set this to `snb.interactive.${SCALE_FACTOR}` where `${SCALE_FACTOR}` is the desired scale factor
* `ldbc.snb.datagen.serializer.numUpdatePartitions`: set this to the number of write threads used in the benchmark runs
* serializers: set these to the required format, e.g. the ones starting with `CsvMergeForeign` or `CsvComposite`

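As a sketch, a Datagen configuration covering the first two settings might look as follows (the Hadoop Datagen reads a `params.ini` file in Java properties syntax; the values here are illustrative, and the serializer entries are omitted — consult the Datagen's own documentation for the exact class names):

```ini
# Illustrative params.ini sketch -- serializer entries omitted
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.100
ldbc.snb.datagen.serializer.numUpdatePartitions:24
```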
Producing large-scale data sets requires non-trivial amounts of memory and computing resources (e.g. SF100 requires 24GB memory and takes about 4 hours to generate on a single machine).
To mitigate this, we have pre-generated data sets using 9 different serializers and the update streams using 17 different partition numbers.
The data sets are available at the [SURF/CWI data repository](https://repository.surfsara.nl/datasets/cwi/ldbc-snb-interactive-v1-datagen-v100). We also provide [direct links](https://ldbcouncil.org/data-sets-surf-repository/snb-interactive-v1-datagen-v100) and a [download script](https://ldbcouncil.org/data-sets-surf-repository/#usage) (which stages the data sets from tape storage if they are not immediately available).

We pre-generated [**validation parameters for SF0.1 to SF10**](https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/interactive-v1/validation_params-interactive-v1.0.0-sf0.1-to-sf10.tar.zst) using the Neo4j reference implementation.
### Test data set

The test data sets are placed in the `cypher/test-data/` directory for Neo4j and in the `postgres/test-data/` directory for the SQL systems.

To generate a data set with the same characteristics, see the [documentation on generating the test data set](test-data).
## Preparing for an audited run
Implementations of the Interactive workload can be audited by a certified LDBC auditor.
The [Auditing Policies chapter](http://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf#chapter.7) of the specification describes the auditing process and the required artifacts.
If you are considering commissioning an LDBC SNB audit, please study the [auditing process document](https://ldbcouncil.org/docs/ldbc-snb-auditing-process.pdf) and the [audit questionnaire](snb-interactive-audit-questionnaire.md).
### Determining the best TCR
1. Select a scale factor and configure the `driver/benchmark.properties` file as described in the [Driver modes](#driver-modes) section.
158
+
2. Load the data set with `scripts/load-in-one-step.sh`.
159
+
3. Create a backup with `scripts/backup-database.sh`.
160
+
4. Run the `driver/determine-best-tcr.sh`.
161
+
5. Once the "best TCR" value has been determined, test it with a full workload (at least 0.5h for warmup operation and at least 2h of benchmark time), and make further adjustments if necessary.
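Conceptually, the check performed by the `driver/determine-best-tcr.sh` script can be sketched as follows (an illustrative model, not the driver's actual implementation): a run passes the schedule audit if at least 95% of operations start within 1 second of their scheduled time, and the search looks for the lowest TCR that still passes.

```python
# Illustrative sketch of the schedule audit and the TCR search;
# not the driver's actual implementation.
def passes_schedule_audit(delays, tolerance=1.0, required=0.95):
    """delays: per-operation start delays in seconds (actual - scheduled)."""
    on_time = sum(1 for d in delays if d <= tolerance)
    return on_time / len(delays) >= required

def best_tcr(run, candidates):
    """Return the lowest TCR whose run passes; run(tcr) yields delays."""
    for tcr in sorted(candidates):  # lowest TCR = most compressed schedule
        if passes_schedule_audit(run(tcr)):
            return tcr
    return None

# Toy model: a lower TCR compresses the schedule, inflating delays.
delays = lambda tcr: [0.05 / tcr] * 96 + [5.0] * 4
print(best_tcr(delays, [0.01, 0.05, 0.2, 1.0]))
# -> 0.05
```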
### Recommendations
We have a few recommendations for creating audited implementations. (These are not requirements – implementations are allowed to deviate from these recommendations.)
* The implementation should target a popular Linux distribution (e.g. Ubuntu LTS, CentOS, Fedora).
* Use a containerized setup, where the DBMS is running in a Docker container.
* Instead of specific hardware, target a cloud virtual machine instance (e.g. AWS `r5d.12xlarge`). Both bare-metal and regular instances can be used for audited runs.
0 commit comments