Commit 8c3a67a: Update README
1 parent e7751cb

1 file changed (+58 -89 lines)

README.md

![LDBC logo](ldbc-logo.png)
# LDBC SNB Interactive v2 workload implementations

[![Build Status](https://circleci.com/gh/ldbc/ldbc_snb_interactive_impls.svg?style=svg)](https://circleci.com/gh/ldbc/ldbc_snb_interactive_impls)

This repository contains reference implementations of the LDBC Social Network Benchmark's Interactive v2 workload. For details on the benchmark, see the [SIGMOD 2015 paper](https://ldbcouncil.org/docs/papers/ldbc-snb-interactive-sigmod-2015.pdf), the [specification on GitHub Pages](https://ldbcouncil.org/ldbc_snb_docs/), and the [specification on arXiv](https://arxiv.org/pdf/2001.02299.pdf).

To get started with the LDBC SNB benchmarks, check out our introductory presentation: [The LDBC Social Network Benchmark](https://docs.google.com/presentation/d/1NilxSrKQnFq4WzWMY2-OodZQ2TEksKzKBmgB20C_0Nw/) ([PDF](https://ldbcouncil.org/docs/presentations/ldbc-snb-2022-11.pdf)).

:warning: This workload is still under design. If you are looking for a stable, auditable version, use the [Interactive v1 workload](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).

## Notes

:warning: Please keep in mind the following when using this repository.

* The goal of the implementations in this repository is to serve as **reference implementations** which other implementations can be cross-validated against. Therefore, our primary objective when formulating the queries was readability rather than absolute performance.

* The default workload contains updates which change the state of the database. Therefore, **the database needs to be reloaded or restored from backup before each run**. Use the provided `scripts/backup-database.sh` and `scripts/restore-database.sh` scripts to achieve this, as in the sketch below.
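
A minimal sketch of this reload-between-runs cycle, assuming the data set has already been loaded once with the system's load script:

```bash
# Snapshot the freshly loaded database once.
scripts/backup-database.sh

# Before every run, roll back to the snapshot (the run applies inserts
# and deletes), then start the driver.
scripts/restore-database.sh
driver/benchmark.sh
```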

## Implementations

We provide two reference implementations:

* [Neo4j (Cypher) implementation](cypher/README.md)
* [PostgreSQL (SQL) implementation](postgres/README.md)

Additional implementations:

* [Microsoft SQL Server (Transact-SQL) implementation](mssql/README.md)
* [Umbra (SQL) implementation](umbra/README.md)

For detailed instructions, consult the READMEs of the projects.

## User's guide

### Building the project
This project uses Java 17.

To build the entire project, run:

```bash
scripts/build.sh
```

To build a subset of the projects, e.g. to build the PostgreSQL implementation, run its individual build script:

```bash
postgres/scripts/build.sh
```
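
Assuming the other implementations follow the same layout (each directory shipping its own `scripts/build.sh`), building only the two reference implementations would look like:

```bash
cypher/scripts/build.sh
postgres/scripts/build.sh
```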
### Inputs

The benchmark framework relies on the following inputs produced by the [SNB Datagen's new (Spark) version](https://github.com/ldbc/ldbc_snb_datagen_spark/): the initial data set, the update streams, and the substitution parameters.

Currently, these can be generated with the following commands:
```bash
export SF= # The scale factor to generate
export LDBC_SNB_DATAGEN_DIR= # Path to the LDBC SNB Datagen directory
export LDBC_SNB_DATAGEN_MAX_MEM= # Maximum memory the Datagen can use, e.g. 16G
export LDBC_SNB_DRIVER_DIR= # Path to the LDBC SNB driver directory
export DATA_INPUT_TYPE=parquet
# If using the Docker Datagen version, set the following environment variable:
export USE_DATAGEN_DOCKER=true

scripts/generate-all.sh
```
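
For example, a concrete invocation for SF10 might look as follows; the directory paths and memory limit are placeholders to adapt to your environment:

```bash
export SF=10
export LDBC_SNB_DATAGEN_DIR=${HOME}/ldbc_snb_datagen_spark
export LDBC_SNB_DATAGEN_MAX_MEM=16G
export LDBC_SNB_DRIVER_DIR=${HOME}/ldbc_snb_interactive_driver
export DATA_INPUT_TYPE=parquet
export USE_DATAGEN_DOCKER=true

scripts/generate-all.sh
```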

### Pre-generated data sets

[Pre-generated SF1-SF300 data sets](snb-interactive-pre-generated-data-sets.md) are available.

### Loading the data

Select the system to be tested, e.g. [PostgreSQL](postgres/).
Load the data set as described in the README file of the selected system.
For most systems, this involves pointing an environment variable at the data set location and invoking the `scripts/load-in-one-step.sh` script.
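
As a hypothetical example for PostgreSQL (the exact variable name is defined in that implementation's README, so treat it as an assumption):

```bash
# Point the loader at the generated data set, then load it in one step.
export POSTGRES_CSV_DIR=/path/to/generated/social-network  # hypothetical variable name
cd postgres
scripts/load-in-one-step.sh
```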

### Driver modes

For each implementation, it is possible to perform the run in one of the [SNB driver's](https://github.com/ldbc/ldbc_snb_interactive_driver) three modes: create validation parameters, validate, and benchmark (a combined example is sketched after the list below).
All of these runs should be started with the initial data set loaded into the database.

1. Create validation parameters with the `driver/create-validation-parameters.sh` script.

* **Inputs:**
* The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
* The update streams are the files from the `inserts` and `deletes` directories in the directory set in the `ldbc.snb.interactive.updates_dir` configuration property.
* For this mode, the query frequencies are set to a uniform value of `1` to ensure the best average test coverage. [TODO]
* **Output:** The results will be stored in the validation parameters file (e.g. `validation_params.json`) set in the `validate_database` configuration property.
* **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.

2. Validate against existing validation parameters with the `driver/validate.sh` script.

* **Inputs:**
* The query substitution parameters are taken from the validation parameters file (e.g. `validation_params.json`) set in the `validate_database` configuration property.
* The update operations are also based on the content of the validation parameters file.
* **Output:**
* The validation either passes or fails.
* The per-query results of the validation are printed to the console.
* If the validation failed, the results are saved to the `validation_params-failed-expected.json` and `validation_params-failed-actual.json` files.
* **Parallelism:** The execution must be single-threaded to ensure a deterministic order of operations.

3. Run the benchmark with the `driver/benchmark.sh` script.

* **Inputs:**
* The query substitution parameters are taken from the directory set in the `ldbc.snb.interactive.parameters_dir` configuration property.
* The update streams are the files from the `inserts` and `deletes` directories in the directory set in the `ldbc.snb.interactive.updates_dir` configuration property.
* The goal of the benchmark is to achieve the best (lowest possible) `time_compression_ratio` value while ensuring that the 95% on-time requirement is kept (i.e. 95% of the queries can be started within 1 second of their scheduled time). If your benchmark run returns "failed schedule audit", increase this number (which lowers the time compression rate) until it passes.
* Set the `thread_count` property to the size of the thread pool for read operations.
* For audited benchmarks, ensure that the `warmup` and `operation_count` properties are set so that the warmup and benchmark phases last for 30+ minutes and 2+ hours, respectively.
* **Output:**
* The detailed results of the benchmark are printed to the console and saved in the `results/` directory.
* **Parallelism:** Multi-threaded execution is recommended to achieve the best result.
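
Putting the three modes together, a typical session against a single implementation might look like the following sketch; the backup and restore helpers are the per-system scripts mentioned in the Notes section, and the configuration properties are the ones described in the bullets above:

```bash
# Load the data set once and snapshot it.
scripts/load-in-one-step.sh
scripts/backup-database.sh

# Mode 1: single-threaded, produces the validation parameters file (validation_params.json).
driver/create-validation-parameters.sh

# Mode 2: single-threaded, replays the validation parameters and reports per-query results.
scripts/restore-database.sh
driver/validate.sh

# Mode 3: multi-threaded timed run; results end up in the results/ directory.
scripts/restore-database.sh
driver/benchmark.sh
```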

## Developer's guide

To create a new implementation, it is recommended to use one of the existing ones: the Neo4j implementation for graph database management systems and the PostgreSQL implementation for RDBMSs.

The implementation process looks roughly as follows:

1. Create a bulk loader which loads the initial data set into the database.
1. Add the required glue code to the Java driver that allows parameterized execution of queries and operators.
1. Implement the complex and short read queries (21 in total).
1. Implement the insert and delete operations (16 in total).
1. Test the implementation against the reference implementations using various scale factors.
1. Optimize the implementation.
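
For the cross-validation step, one possible approach is to create the validation parameters with a reference implementation and replay them against the new system; the `my-new-impl` directory name is a placeholder:

```bash
# Create the reference output with PostgreSQL (single-threaded mode).
cd postgres
driver/create-validation-parameters.sh
cd ..

# Replay it against the new implementation, with its validate_database
# property pointing at the file produced above.
cd my-new-impl
driver/validate.sh
```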

## Preparing for an audited run

Implementations of the Interactive workload can be audited by a certified LDBC auditor.
The [Auditing Policies chapter](https://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf#chapter.7) of the specification describes the auditing process and the required artifacts.

If you plan to get your system audited, please reach out to the [LDBC Steering Committee](https://ldbcouncil.org/organizational-members/).

:warning: Audited runs are currently only possible with the [v1 version](https://github.com/ldbc/ldbc_snb_interactive_v1_impls).
