Commit 54522c9

Update README with new data sets, fix version
1 parent 2844a0f commit 54522c9


README.md

Lines changed: 88 additions & 58 deletions
@@ -1,100 +1,81 @@
![LDBC logo](ldbc-logo.png)
-# LDBC SNB Interactive v2 workload implementations
+# LDBC SNB Interactive v1 workload implementations

[![Build Status](https://circleci.com/gh/ldbc/ldbc_snb_interactive_v1_impls.svg?style=svg)](https://circleci.com/gh/ldbc/ldbc_snb_interactive_v1_impls)

-This repository contains reference implementations of the LDBC Social Network Benchmark's Interactive v2 workload. See details on the benchmark, see the [SIGMOD 2015 paper](https://ldbcouncil.org/docs/papers/ldbc-snb-interactive-sigmod-2015.pdf), [specification on GitHub Pages](https://ldbcouncil.org/ldbc_snb_docs/), and [specification on arXiv](https://arxiv.org/pdf/2001.02299.pdf).
+Reference implementations of the LDBC Social Network Benchmark's Interactive workload ([paper](https://homepages.cwi.nl/~boncz/snb-challenge/snb-sigmod.pdf), [specification on GitHub pages](https://ldbcouncil.org/ldbc_snb_docs/), [specification on arXiv](https://arxiv.org/pdf/2001.02299.pdf)).

-To get started with the LDBC SNB benchmarks, check out our introductory presentation: [The LDBC Social Network Benchmark](https://docs.google.com/presentation/d/1NilxSrKQnFq4WzWMY2-OodZQ2TEksKzKBmgB20C_0Nw/) ([PDF](https://ldbcouncil.org/docs/presentations/ldbc-snb-2022-11.pdf)).
-
-:warning:
-This workload is still under design. If you are looking for a stable, auditable version, use the [Interactive v1 workload](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).
+To get started with the LDBC SNB benchmarks, check out our introductory presentation: [The LDBC Social Network Benchmark](https://docs.google.com/presentation/d/1NilxSrKQnFq4WzWMY2-OodZQ2TEksKzKBmgB20C_0Nw/) ([PDF](https://ldbcouncil.org/docs/presentations/ldbc-snb-2021-12.pdf)).

## Notes

:warning: Please keep in mind the following when using this repository.

* The goal of the implementations in this repository is to serve as **reference implementations** which other implementations can be cross-validated against. Therefore, our primary objective was readability and not absolute performance when formulating the queries.

-* The default workload contains updates which change the state of the database. Therefore, **the database needs to be reloaded or restored from backup before each run**. Use the provided `scripts/backup-database.sh` and `scripts/restore-database.sh` scripts to achieve this.
+* The default workload contains updates which are persisted in the database. Therefore, **the database needs to be reloaded or restored from backup before each run**. Use the provided `scripts/backup-database.sh` and `scripts/restore-database.sh` scripts to achieve this.
+
+* We expect most systems-under-test to use multi-threaded execution for their benchmark runs. **To allow running the updates on multiple threads, the update stream files need to be partitioned accordingly by the generator.** We have [pre-generated](#benchmark-data-sets) these for common partition numbers (1, 2, ..., 1024 and 24, 48, 96, ..., 768) and scale factors up to 1000.
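
For repeated benchmark runs, the backup is restored before every run. A minimal sketch, assuming one of the implementation directories (e.g. `postgres/`) and the backup/restore and load scripts named in this README; the `driver/benchmark.sh` launcher name is an assumption and may differ per implementation:

```bash
# Illustrative sketch only: repeat benchmark runs against a freshly restored database.
# scripts/load-in-one-step.sh, backup-database.sh and restore-database.sh are the scripts
# referenced in this README; driver/benchmark.sh is an assumed name for the benchmark launcher.
cd postgres                       # or cypher/, etc.

scripts/load-in-one-step.sh       # load the initial data set once
scripts/backup-database.sh        # snapshot the freshly loaded state

for run in 1 2 3; do
  scripts/restore-database.sh     # every run starts from the same database state
  driver/benchmark.sh             # assumed benchmark-mode launcher
done
```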

## Implementations

-We provide two reference implementations:
+We provide three reference implementations:

* [Neo4j (Cypher) implementation](cypher/README.md)
* [PostgreSQL (SQL) implementation](postgres/README.md)
+* [GraphDB (SPARQL) implementation](graphdb/README.md)

Additional implementations:

-* [Microsoft SQL Server (Transact-SQL) implementation](mssql/README.md)
+* [DuckDB (SQL) implementation](duckdb/README.md)
+* [TigerGraph (GSQL) implementation](tigergraph/README.md)
* [Umbra (SQL) implementation](umbra/README.md)

For detailed instructions, consult the READMEs of the projects.

-## User's guide
-
-### Building the project
-This project uses Java 17.
-
-To build the entire project, run:
+To build a subset of the projects, use Maven profiles, e.g. to build the reference implementations, run:

```bash
-scripts/build.sh
+mvn clean package -DskipTests -Pcypher,postgres
```

-To build a subset of the projects, e.g. to build the PostgreSQL implementation, run its individual build script:
-
-```bash
-postgres/scripts/build.sh
-```
-
-### Inputs
+## User's guide

-The benchmark framework relies on the following inputs produced by the [SNB Datagen's new (Spark) version](https://github.com/ldbc/ldbc_snb_datagen_spark/).
+### Building the project
+This project uses Java 11.

-Currently, the initial data set, update streams, and parameters can generated with the following command:
+To build the project, run:

```bash
-export SF= #The scale factor to generate
-export LDBC_SNB_DATAGEN_DIR= # Path to the LDBC SNB datagen directory
-export LDBC_SNB_DATAGEN_MAX_MEM= #Maximum memory the datagen could use, e.g. 16G
-export LDBC_SNB_DRIVER_DIR= # Path to the LDBC SNB driver directory
-export DATA_INPUT_TYPE=parquet
-# If using the Docker Datagen version, set the env variable:
-export USE_DATAGEN_DOCKER=true
-
-scripts/generate-all.sh
+scripts/build.sh
```

-### Pre-generate data sets
-
-[Pre-generated SF1-SF300 data sets](snb-interactive-pre-generated-data-sets.md) are available.
+### Inputs

-### Loading the data
+The benchmark framework relies on the following inputs produced by the [SNB Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/):

-Select the system to be tested, e.g. [PostgreSQL](postgres/).
-Load the data set as described in the README file of the selected system.
-For most systems, this involves setting an environment variable to the correct location and invoking the `scripts/load-in-one-step.sh` script.
+* **Initial data set:** the SNB graph in CSV format (`social_network/{static,dynamic}`)
+* **Update streams:** the input for the update operations (`social_network/updateStream_*.csv`)
+* **Substitution parameters:** the input parameters for the complex queries, produced by the Datagen (`substitution_parameters/`)
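
For reference, these inputs typically sit next to each other in the Datagen's output directory; the layout below is an illustrative sketch (the `sf1/` root and the single update stream partition are assumptions):

```bash
# Illustrative layout of the Datagen outputs listed above (names follow the bullets;
# the number of updateStream_* files depends on the chosen number of partitions).
find sf1/ -maxdepth 2
# sf1/social_network/static/                       <- initial data set (static part)
# sf1/social_network/dynamic/                      <- initial data set (dynamic part)
# sf1/social_network/updateStream_0_0_forum.csv    <- update streams
# sf1/social_network/updateStream_0_0_person.csv
# sf1/substitution_parameters/                     <- parameters for the complex queries
```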

### Driver modes

-For each implementation, it is possible to perform the run in one of the [SNB driver's](https://github.com/ldbc/ldbc_snb_interactive_driver) three modes.
-All of these runs should be started with the initial data set loaded to the database.
+For each implementation, it is possible to perform the run in one of the [SNB driver's](https://github.com/ldbc/ldbc_snb_interactive_v1_driver) three modes: create validation parameters, validate, and benchmark.
+The execution in all three modes should be started after the initial data set has been loaded into the system under test.

1. Create validation parameters with the `driver/create-validation-parameters.sh` script.

    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
-        * The update streams are the files from the `inserts` and `deletes` directories in the directory `ldbc.snb.interactive.updates_dir` configuration property.
-        * For this mode, the query frequencies are set to a uniform `1` value to ensure the best average test coverage. [TODO]
-    * **Output:** The results will be stored in the validation parameters file (e.g. `validation_params.json`) file set in the `validate_database` configuration property.
+        * The update streams are the `updateStream_0_0_{forum,person}.csv` files from the location set in the `ldbc.snb.interactive.updates_dir` configuration property.
+        * For this mode, the query frequencies are set to a uniform `1` value to ensure the best average test coverage.
+    * **Output:** The results will be stored in the validation parameters file (e.g. `validation_params.csv`) set in the `create_validation_parameters` configuration property.
    * **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.

-2. Validate against existing validation parameters with the `driver/validate.sh` script.
+2. Validate against an existing reference output (called "validation parameters") with the `driver/validate.sh` script.

    * **Input:**
-        * The query substitution parameters are taken from the validation parameters file (e.g. `validation_params.json`) file set in the `validate_database` configuration property.
+        * The query substitution parameters are taken from the validation parameters file (e.g. `validation_params.csv`) set in the `validate_database` configuration property.
        * The update operations are also based on the content of the validation parameters file.
    * **Output:**
        * The validation either passes or fails.

@@ -106,8 +87,10 @@ All of these runs should be started with the initial data set loaded to the database

    * **Inputs:**
        * The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
-        * The update streams are the files from the `inserts` and `deletes` directories in the directory `ldbc.snb.interactive.updates_dir` configuration property.
-        * The goal of the benchmark is to achieve the best (lowest possible) `time_compression_ratio` value while ensuring that the 95% on-time requirement is kept (i.e. 95% of the queries can be started within 1 second of their scheduled time). If your benchmark run returns "failed schedule audit", increase this number (which lowers the time compression rate) until it passes.
+        * The update streams are the `updateStream_*_{forum,person}.csv` files from the location set in the `ldbc.snb.interactive.updates_dir` configuration property.
+        * To get *2n* write threads, the framework requires *n* `updateStream_*_forum.csv` and *n* `updateStream_*_person.csv` files.
+        * If you are generating the data sets from scratch, set `ldbc.snb.datagen.serializer.numUpdatePartitions` to *n* in the [data generator](https://github.com/ldbc/ldbc_snb_datagen_hadoop) to produce these.
+        * The goal of the benchmark is to achieve the best (lowest possible) `time_compression_ratio` value while ensuring that the 95% on-time requirement is kept (i.e. 95% of the queries can be started within 1 second of their scheduled time). If your benchmark run returns "failed schedule audit", increase this number (which lowers the time compression rate) until it passes.
    * Set the `thread_count` property to the size of the thread pool for read operations.
    * For audited benchmarks, ensure that the `warmup` and `operation_count` properties are set so that the warmup and benchmark phases last for 30+ minutes and 2+ hours, respectively.
    * **Output:**

@@ -116,24 +99,71 @@ All of these runs should be started with the initial data set loaded to the database
        * The detailed results of the benchmark are printed to the console and saved in the `results/` directory.
    * **Parallelism:** Multi-threaded execution is recommended to achieve the best result.

+For more details on validating and benchmarking, visit the [driver's documentation](https://github.com/ldbc/ldbc_snb_interactive_v1_driver/tree/main/docs).
+
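
The driver reads these settings from a properties file. A minimal sketch, using the property names referenced above; the file name and all values are placeholders, not recommended settings:

```bash
# Sketch only: property names come from the sections above, values are placeholders.
cat > my-benchmark.properties <<'EOF'
# input locations
ldbc.snb.interactive.parameters_dir=/data/substitution_parameters
ldbc.snb.interactive.updates_dir=/data/social_network

# benchmark-mode settings
thread_count=8
time_compression_ratio=0.02
warmup=120000
operation_count=1000000

# file holding the validation parameters (create/validate modes)
validate_database=validation_params.csv
EOF
```
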
## Developer's guide

To create a new implementation, it is recommended to use one of the existing ones: the Neo4j implementation for graph database management systems and the PostgreSQL implementation for RDBMSs.

The implementation process looks roughly as follows:

1. Create a bulk loader which loads the initial data set to the database.
-1. Add the required glue code to the Java driver that allows parameterized execution of queries and operators.
-1. Implement the complex and short reads queries (21 in total).
-1. Implement the insert and delete operations (16 in total).
-1. Test the implementation against the reference implementations using various scale factors.
-1. Optimize the implementation.
+2. Implement the complex and short read queries (22 in total).
+3. Implement the 7 update queries.
+4. Test the implementation against the reference implementations using various scale factors.
+5. Optimize the implementation.
+
+## Data sets
+
+### Benchmark data sets
+
+To generate the benchmark data sets, use the [Hadoop-based LDBC SNB Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/releases/tag/v1.0.0).
+
+The key configurations are the following:
+
+* `ldbc.snb.datagen.generator.scaleFactor`: set this to `snb.interactive.${SCALE_FACTOR}` where `${SCALE_FACTOR}` is the desired scale factor
+* `ldbc.snb.datagen.serializer.numUpdatePartitions`: set this to the number of write threads used in the benchmark runs
+* serializers: set these to the required format, e.g. the ones starting with `CsvMergeForeign` or `CsvComposite`
+    * `ldbc.snb.datagen.serializer.dynamicActivitySerializer`
+    * `ldbc.snb.datagen.serializer.dynamicPersonSerializer`
+    * `ldbc.snb.datagen.serializer.staticSerializer`
+
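
A minimal sketch of such a configuration, assuming the Hadoop Datagen's `params.ini` `key:value` format; the serializer values are abbreviated placeholders and must be replaced with the full class names of the `CsvMergeForeign*` or `CsvComposite*` serializers shipped with the Datagen:

```bash
# Sketch only: assumed params.ini format; serializer class names abbreviated with "...".
cat > params.ini <<'EOF'
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
ldbc.snb.datagen.serializer.numUpdatePartitions:4
ldbc.snb.datagen.serializer.dynamicActivitySerializer:...CsvMergeForeignDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:...CsvMergeForeignDynamicPersonSerializer
ldbc.snb.datagen.serializer.staticSerializer:...CsvMergeForeignStaticSerializer
EOF
```
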
+### Pre-generated data sets
+
+Producing large-scale data sets requires non-trivial amounts of memory and computing resources (e.g. SF100 requires 24 GB of memory and takes about 4 hours to generate on a single machine).
+To mitigate this, we have pre-generated data sets using 9 different serializers and the update streams using 17 different partition numbers:
+
+* Serializers: csv_basic, csv_basic-longdateformatter, csv_composite, csv_composite-longdateformatter, csv_composite_merge_foreign, csv_composite_merge_foreign-longdateformatter, csv_merge_foreign, csv_merge_foreign-longdateformatter, ttl
+* Partition numbers: 2^k (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024) and 6×2^k (24, 48, 96, 192, 384, 768).
+
+The data sets are available at the [SURF/CWI data repository](https://repository.surfsara.nl/datasets/cwi/ldbc-snb-interactive-v1-datagen-v100). We also provide [direct links](https://ldbcouncil.org/data-sets-surf-repository/snb-interactive-v1-datagen-v100) and a [download script](https://ldbcouncil.org/data-sets-surf-repository/#usage) (which stages the data sets from tape storage if they are not immediately available).
+
+We pre-generated [**validation parameters for SF0.1 to SF10**](https://pub-383410a98aef4cb686f0c7601eddd25f.r2.dev/interactive-v1/validation_params-interactive-v1.0.0-sf0.1-to-sf10.tar.zst) using the Neo4j reference implementation.
+
+### Test data set
+
+The test data sets are placed in the `cypher/test-data/` directory for Neo4j and in the `postgres/test-data/` directory for the SQL systems.
+
+To generate a data set with the same characteristics, see the [documentation on generating the test data set](test-data).

## Preparing for an audited run

Implementations of the Interactive workload can be audited by a certified LDBC auditor.
-The [Auditing Policies chapter of the specification](https://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf) describes the auditing process and the required artifacts.
+The [Auditing Policies chapter](http://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf#chapter.7) of the specification describes the auditing process and the required artifacts.
+If you are considering commissioning an LDBC SNB audit, please study the [auditing process document](https://ldbcouncil.org/docs/ldbc-snb-auditing-process.pdf) and the [audit questionnaire](snb-interactive-audit-questionnaire.md).
+
+### Determining the best TCR
+
+1. Select a scale factor and configure the `driver/benchmark.properties` file as described in the [Driver modes](#driver-modes) section.
+2. Load the data set with `scripts/load-in-one-step.sh`.
+3. Create a backup with `scripts/backup-database.sh`.
+4. Run the `driver/determine-best-tcr.sh` script.
+5. Once the "best TCR" value has been determined, test it with a full workload (at least 0.5 h of warmup and at least 2 h of benchmark time), and make further adjustments if necessary.
+
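
Run from the chosen implementation directory, these steps correspond roughly to the following sketch:

```bash
# Sketch of the TCR-determination steps above, run from an implementation directory
# (e.g. postgres/) after editing driver/benchmark.properties for the chosen scale factor.
scripts/load-in-one-step.sh      # step 2: load the data set
scripts/backup-database.sh       # step 3: snapshot the loaded database
driver/determine-best-tcr.sh     # step 4: search for the best time_compression_ratio
# step 5: re-run a full workload (30+ min warmup, 2+ h benchmark) with the chosen TCR
```
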
+### Recommendations

-If you plan to get your system audited, please reach to the [LDBC Steering Committee](https://ldbcouncil.org/organizational-members/).
+We have a few recommendations for creating audited implementations. (These are not requirements – implementations are allowed to deviate from these recommendations.)

-:warning: Audited runs are currently only possible with the [v1 version](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).
+* The implementation should target a popular Linux distribution (e.g. Ubuntu LTS, CentOS, Fedora).
+* Use a containerized setup, where the DBMS is running in a Docker container.
+* Instead of specific hardware, target a cloud virtual machine instance (e.g. AWS `r5d.12xlarge`). Both bare-metal and regular instances can be used for audited runs.
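
For the containerized setup recommended above, a minimal sketch using the stock PostgreSQL image; the image tag, port, and password are illustrative placeholders, not the repository's actual setup:

```bash
# Illustration of the "DBMS in a Docker container" recommendation; values are placeholders.
docker run --name ldbc-snb-postgres \
  -e POSTGRES_PASSWORD=changeme \
  -p 5432:5432 \
  -d postgres:16
```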
