Datagen is part of the [LDBC project](https://ldbcouncil.org/).
The LDBC SNB Data Generator (Datagen) produces the datasets for the [LDBC Social Network Benchmark's workloads](https://ldbcouncil.org/benchmarks/snb/). The generator is designed to produce directed labelled graphs that mimic the characteristics of real-world graphs. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official [LDBC SNB specification document](https://github.com/ldbc/ldbc_snb_docs).
:scroll: If you wish to cite the LDBC SNB, please refer to the [documentation repository](https://github.com/ldbc/ldbc_snb_docs#how-to-cite-ldbc-benchmarks).
:warning: There are two different versions of the Datagen:
* The [Hadoop-based Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/) generates the Interactive workload's SF1-1000 data sets.
* For the BI workload, use the Spark-based Datagen (in this repository).
* For the Interactive workload's larger data sets, there is no out-of-the-box solution (see [this issue](https://github.com/ldbc/ldbc_snb_interactive/issues/173)).
[Generated small data sets](https://ldbcouncil.org/ldbc_snb_datagen_spark/) are deployed by the CI.
## Quick start
### Build the JAR

You can build the JAR with both Maven and SBT.
* To assemble the JAR file with Maven, run:
```bash
./tools/build.sh
```
* For faster builds during development, consider using SBT. To assemble the JAR file with SBT, run:
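  A minimal sketch, assuming the standard `sbt-assembly` plugin (the exact task name is an assumption, not shown in this snippet):

  ```bash
  # assumes the sbt-assembly plugin is configured in the build
  sbt assembly
  ```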
### Install Python tools

To use the Python tools, create a Python virtual environment and install the dependencies.
E.g. with [pyenv](https://github.com/pyenv/pyenv) and [pyenv-virtualenv](https://github.com/pyenv/pyenv-virtualenv):
```bash
pyenv install 3.7.13
pyenv virtualenv 3.7.13 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
pip install -U pip
pip install ./tools
```
### Running locally
The `./tools/run.py` script is intended for **local runs**. To use it, download and extract Spark as follows.
#### Spark 3.2.x
Spark 3.2.x is the recommended runtime to use. The rest of the instructions are provided assuming Spark 3.2.x.
To place Spark under `/opt/`:
```bash
curl https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
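# Assumed follow-up (not in the original snippet): expose Spark's binaries
# on the PATH so that run.py can find spark-submit.
export SPARK_HOME="/opt/spark-3.2.0-bin-hadoop3.2"
export PATH="${SPARK_HOME}/bin:${PATH}"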
```

With Spark and the JAR in place, invoke the generator through the `run.py` wrapper; `${PLATFORM_VERSION}` and `${DATAGEN_VERSION}` must match the name of the assembled JAR in `./target/`. For example, to generate the SF0.003 BI data set in Parquet format:

```bash
./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format parquet --scale-factor 0.003 --mode bi
```
* For the `interactive` and `bi` formats, the `--format-options` argument allows passing formatting options such as timestamp/date formats, the presence/absence of headers (see the [Spark formatting options](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.sql.DataFrameWriter) for details), and whether quoting the fields in the CSV is required:
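  A hypothetical invocation overriding the timestamp/date formats and requesting quoted fields (the option names are standard Spark CSV writer options; the values are illustrative):

  ```bash
  # illustrative values; Spark CSV writer options are passed as a comma-separated list
  ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- \
    --format csv --scale-factor 0.003 --mode bi \
    --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y,header=false,quoteAll=true
  ```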
### Docker images

<!-- SNB Datagen images are available via [Docker Hub](https://hub.docker.com/r/ldbc/datagen/) (currently outdated). -->
The Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:
```bash
./tools/docker-build.sh
```
See [Build the JAR](#build-the-jar) to build the library (e.g. by invoking `./tools/build.sh`). Then, run the following:
```bash
./tools/docker-run.sh
```
### Elastic MapReduce
We provide scripts to run Datagen on AWS EMR. See the README in the [`./tools/emr`](tools/emr) directory for details.
## Graph schema
The graph schema is as follows:

## Troubleshooting
* When running the tests, they might throw a `java.net.UnknownHostException: your_hostname: your_hostname: Name or service not known` coming from `org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal`. The solution is to add an entry for your machine's hostname to the `/etc/hosts` file: `127.0.1.1 your_hostname`.
* If you are using Docker and Spark runs out of space, make sure that Docker has enough space to store its containers. To move the location of the Docker containers to a larger disk, stop Docker, edit (or create) the `/etc/docker/daemon.json` file and add `{ "data-root": "/path/to/new/docker/data/dir" }`, then sync the old folder if needed, and restart Docker (see [more detailed instructions](https://www.guguweb.com/2019/02/07/how-to-move-docker-data-directory-to-another-location-on-ubuntu/) and the sketch after this list).
* If you are using a local Spark installation and run out of space in `/tmp` (`java.io.IOException: No space left on device`), set the `SPARK_LOCAL_DIRS` environment variable to point to a directory with enough free space, e.g. `export SPARK_LOCAL_DIRS=/path/to/large/disk`.
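For the Docker bullet above, a sketch of the relocation steps on a systemd-based Linux host (the target path is illustrative; the `rsync` step is only needed to keep existing images and containers):

```bash
# stop Docker before touching its data directory
sudo systemctl stop docker
# create /etc/docker/daemon.json (merge by hand if the file already exists)
echo '{ "data-root": "/path/to/new/docker/data/dir" }' | sudo tee /etc/docker/daemon.json
# optionally carry over existing images and containers from the default location
sudo rsync -a /var/lib/docker/ /path/to/new/docker/data/dir
sudo systemctl start docker
```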
This README is deployed to <https://ldbcouncil.org/ldbc_snb_datagen_spark>.
## Generated data sets
The following data sets were generated for the **LDBC Social Network Benchmark's BI (Business Intelligence) workload** by the latest commit at <https://github.com/ldbc/ldbc_snb_datagen_spark>.
If you are looking for data sets for the **SNB Interactive workload**, please use the [legacy Hadoop-based Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop) or download them from the [SURF/CWI data repository](https://hdl.handle.net/11112/e6e00558-a2c3-9214-473e-04a16de09bf8).