
Commit 7a2fe3e

Merge branch 'main' into spark3.2
2 parents: 4fcad8f + 95be89f


49 files changed: +1591 -714 lines

.circleci/config.yml

Lines changed: 11 additions & 4 deletions
@@ -42,15 +42,20 @@ jobs:
             sudo apt install -y openjdk-8-jdk zip
             sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/java-8-openjdk-amd64/bin/java 1
             sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/bin/java
-            java -version
-      - run: docker build . -t ldbc/spark
+      - run:
+          name: Build Docker container
+          command: |
+            docker build . -t ldbc/spark
       - restore_cache:
           keys:
             - m2-dep-branch:{{ .Branch }}-pom:{{ checksum "pom.xml" }}-
             - m2-dep-branch:dev-pom:{{ checksum "pom.xml" }}-
             - m2-dep-branch:{{ .Branch }}-
             - m2-dep-branch:dev-
-      - run: mvn -ntp clean test-compile assembly:assembly
+      - run:
+          name: Build JAR file
+          command: |
+            mvn -ntp clean test-compile assembly:assembly
       - save_cache:
           key: m2-dep-branch:{{ .Branch }}-pom:{{ checksum "pom.xml" }}-{{ epoch }}
           paths:
@@ -74,6 +79,7 @@ jobs:
       - run:
           name: Generate SF0.003 / BI / composite-merged CSVs
           command: |
+            # we generate factors here but they are moved to a separate archive (social-network-sf0.003-bi-factors.zip)
             tools/docker-run.sh --mode bi --scale-factor 0.003 --generate-factors
             mv out/ social-network-sf0.003-bi-composite-merged-fk/
       - run:
@@ -118,7 +124,8 @@ jobs:
             # include the CircleCI configuration in the deployed package to provide the 'filters' instructions (and prevent failed builds on the gh-pages branch)
             mv .circleci dist/
             # move factors to a separate directory
-            mv social-network-sf0.003-bi-composite-merged-fk/factors social-network-sf0.003-bi-factors
+            mkdir social-network-sf0.003-bi-factors
+            mv social-network-sf0.003-bi-composite-merged-fk/factors social-network-sf0.003-bi-factors/factors
             # compress each directory
             for d in social-network-sf0.003*; do
               echo "Generated with <https://github.com/ldbc/ldbc_snb_datagen_spark/commit/${CIRCLE_SHA1}>" > $d/README.md

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -44,7 +44,10 @@ local.properties
 
 /*.crc
 /*.csv
-out/
+/out/
+/out-*/
+/out.tar.zst
+/out-*.tar.zst
 datagen_output/
 
 /sf*/

NOTICE.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-Copyright [2020-]2021 Linked Data Benchmark Council
+Copyright [2020-]2022 Linked Data Benchmark Council
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

README.md

Lines changed: 40 additions & 30 deletions
@@ -4,18 +4,16 @@
 
 [![Build Status](https://circleci.com/gh/ldbc/ldbc_snb_datagen_spark.svg?style=svg)](https://circleci.com/gh/ldbc/ldbc_snb_datagen_spark)
 
-Datagen is part of the [LDBC project](https://ldbcouncil.org/).
+The LDBC SNB Data Generator (Datagen) produces the datasets for the [LDBC Social Network Benchmark's workloads](https://ldbcouncil.org/benchmarks/snb/). The generator is designed to produce directed labelled graphs that mimic the characteristics of those graphs of real data. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of the official [LDBC SNB specification document](https://github.com/ldbc/ldbc_snb_docs).
 
 :scroll: If you wish to cite the LDBC SNB, please refer to the [documentation repository](https://github.com/ldbc/ldbc_snb_docs#how-to-cite-ldbc-benchmarks).
 
 :warning: There are two different versions of the Datagen:
 
-* The [Hadoop-based Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/) generates the Interactive SF1-1000 data sets
+* The [Hadoop-based Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop/) generates the Interactive workload's SF1-1000 data sets.
 * For the BI workload, use the Spark-based Datagen (in this repository).
 * For the Interactive workload's larger data sets, there is no out-of-the-box solution (see [this issue](https://github.com/ldbc/ldbc_snb_interactive/issues/173)).
 
-The LDBC SNB Data Generator (Datagen) is responsible for providing the datasets used by all the LDBC benchmarks. This data generator is designed to produce directed labelled graphs that mimic the characteristics of those graphs of real data. A detailed description of the schema produced by Datagen, as well as the format of the output files, can be found in the latest version of official [LDBC SNB specification document](https://github.com/ldbc/ldbc_snb_docs).
-
 [Generated small data sets](https://ldbcouncil.org/ldbc_snb_datagen_spark/) are deployed by the CI.
 
 ## Quick start
@@ -27,7 +25,7 @@ You can build the JAR with both Maven and SBT.
 * To assemble the JAR file with Maven, run:
 
   ```bash
-  tools/build.sh
+  ./tools/build.sh
   ```
 
 * For faster builds during development, consider using SBT. To assemble the JAR file with SBT, run:
@@ -45,48 +43,52 @@ and install the dependencies.
 
 E.g. with [pyenv](https://github.com/pyenv/pyenv) and [pyenv-virtualenv](https://github.com/pyenv/pyenv-virtualenv):
 ```bash
-pyenv install 3.7.7
-pyenv virtualenv 3.7.7 ldbc_datagen_tools
+pyenv install 3.7.13
+pyenv virtualenv 3.7.13 ldbc_datagen_tools
 pyenv local ldbc_datagen_tools
 pip install -U pip
 pip install ./tools
 ```
 ### Running locally
 
-The `tools/run.py` is intended for **local runs**. To use it, download and extract Spark as follows.
+The `./tools/run.py` script is intended for **local runs**. To use it, download and extract Spark as follows.
 
 #### Spark 3.2.x
 
 Spark 3.2.x is the recommended runtime to use. The rest of the instructions are provided assuming Spark 3.2.x.
 
+To place Spark under `/opt/`:
+
 ```bash
 curl https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
 export SPARK_HOME="/opt/spark-3.2.0-bin-hadoop3.2"
 export PATH="$SPARK_HOME/bin":"$PATH"
 ```
 
-Both Java 8 and Java 11 work.
-
-To build, run
+To place Spark under `~/`:
 
 ```bash
-tools/build.sh
+curl https://downloads.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz | tar -xz -C ~/
+export SPARK_HOME=~/spark-3.2.0-bin-hadoop3.2
+export PATH="$SPARK_HOME/bin":"$PATH"
 ```
 
-Run the script with:
+Both Java 8 and Java 11 are supported.
+
+Once you have Spark in place and built the JAR file, run the generator as follows:
 
 ```bash
 export PLATFORM_VERSION=2.12_spark3.2
 export DATAGEN_VERSION=0.5.0-SNAPSHOT
-tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>
+./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>
 ```
 
 #### Runtime configuration arguments
 
 The runtime configuration arguments determine the amount of memory, the number of threads, and the degree of parallelism. For a list of arguments, see:
 
 ```bash
-tools/run.py --help
+./tools/run.py --help
 ```
 
 To generate a single `part-*.csv` file, reduce the parallelism (number of Spark partitions) to 1.
@@ -104,12 +106,6 @@ To get a complete list of the arguments, pass `--help` to the JAR file:
 ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --help
 ```
 
-* Passing `params.ini` files:
-
-  ```bash
-  ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --param-file params.ini
-  ```
-
 * Generating `CsvBasic` files in **Interactive mode**:
 
   ```bash
@@ -122,12 +118,24 @@ To get a complete list of the arguments, pass `--help` to the JAR file:
 ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip
 ```
 
+* Generating `CsvCompositeMergeForeign` files in **BI mode** and generating factors:
+
+  ```bash
+  ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --generate-factors
+  ```
+
 * Generating CSVs in **raw mode**:
 
   ```bash
  ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw
  ```
 
+* Generating Parquet files:
+
+  ```bash
+  ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format parquet --scale-factor 0.003 --mode bi
+  ```
+
 * For the `interactive` and `bi` formats, the `--format-options` argument allows passing formatting options such as timestamp/date formats, the presence/absence of headers (see the [Spark formatting options](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.sql.DataFrameWriter) for details), and whether quoting the fields in the CSV is required:
 
 ```bash
@@ -143,29 +151,31 @@ export SPARK_CONF_DIR=./conf
 ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode interactive --scale-factor 0.003
 ```
 
+It is also possible to pass a parameter file:
+
+```bash
+./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --param-file params.ini
+```
+
 ### Docker image
 
 <!-- SNB Datagen images are available via [Docker Hub](https://hub.docker.com/r/ldbc/datagen/) (currently outdated). -->
 
 The Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:
 
 ```bash
-tools/docker-build.sh
+./tools/docker-build.sh
 ```
 
-See [Build the JAR](#build-the-jar) to build the library. Then, run the following:
+See [Build the JAR](#build-the-jar) to build the library (e.g. by invoking `./tools/build.sh`). Then, run the following:
 
 ```bash
-tools/docker-run.sh
+./tools/docker-run.sh
 ```
 
 ### Elastic MapReduce
 
-We provide scripts to run Datagen on AWS EMR. See the README in the [`tools/emr`](tools/emr) directory for details.
-
-## Larger scale factors
-
-The scale factors SF3k+ are currently being fine-tuned, both regarding optimizing the generator and also for tuning the distributions.
+We provide scripts to run Datagen on AWS EMR. See the README in the [`./tools/emr`](tools/emr) directory for details.
 
 ## Graph schema
 
@@ -177,4 +187,4 @@ The graph schema is as follows:
 
 * When running the tests, they might throw a `java.net.UnknownHostException: your_hostname: your_hostname: Name or service not known` coming from `org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal`. The solution is to add an entry for your machine's hostname to the `/etc/hosts` file: `127.0.1.1 your_hostname`.
 * If you are using Docker and Spark runs out of space, make sure that Docker has enough space to store its containers. To move the location of the Docker containers to a larger disk, stop Docker, edit (or create) the `/etc/docker/daemon.json` file and add `{ "data-root": "/path/to/new/docker/data/dir" }`, then sync the old folder if needed, and restart Docker. (See [more detailed instructions](https://www.guguweb.com/2019/02/07/how-to-move-docker-data-directory-to-another-location-on-ubuntu/)).
-* If you are using a local Spark installation and run out of space in `/tmp`, set the `SPARK_LOCAL_DIRS` to point to a directory with enough free space.
+* If you are using a local Spark installation and run out of space in `/tmp` (`java.io.IOException: No space left on device`), set the `SPARK_LOCAL_DIRS` environment variable to point to a directory with enough free space.
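
For the `UnknownHostException` item above, the failing lookup can be reproduced (and the `/etc/hosts` fix verified) outside Hadoop with a few lines of plain Java, since `InetAddress.getLocalHost()` performs the same local-hostname resolution that the job submitter relies on. This is a minimal diagnostic sketch; the `HostnameLookupCheck` class name is ours, not part of the repository:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostnameLookupCheck {
    public static void main(String[] args) {
        try {
            // Hadoop's JobSubmitter performs a lookup like this during job submission.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("Resolved " + local.getHostName() + " -> " + local.getHostAddress());
        } catch (UnknownHostException e) {
            // The failure described above: add "127.0.1.1 your_hostname" to /etc/hosts.
            System.err.println("Hostname lookup failed: " + e.getMessage());
        }
    }
}
```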

code_of_conduct.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+For our code of conduct, see: https://github.com/ldbc/community/blob/main/code_of_conduct.md

contributing.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+For our contributor's guide, see: https://github.com/ldbc/community/blob/main/contributing.md

dist/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 # LDBC SNB Datagen (Spark variant) – Latest artefacts
 
-This README is deployed to <http://ldbcouncil.org/ldbc_snb_datagen_spark>.
+This README is deployed to <https://ldbcouncil.org/ldbc_snb_datagen_spark>.
 
 ## Generated data sets
 
-The following data sets are generated for the `dev` variant, to be used for the BI workload.
+The following data sets were generated for the **LDBC Social Network Benchmark's BI (Business Intelligence) workload** by the latest commit at <https://github.com/ldbc/ldbc_snb_datagen_spark>.
 
-If you are looking for data sets to implement the Interactive workload, please use the [Hadoop-based legacy Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop) or reach out to us.
+If you are looking for data sets of the **SNB Interactive workload**, please use the [legacy Hadoop-based Datagen](https://github.com/ldbc/ldbc_snb_datagen_hadoop) or download them from the [SURF/CWI data repository](https://hdl.handle.net/11112/e6e00558-a2c3-9214-473e-04a16de09bf8).
 
 {% for file in site.static_files %}
 {% if file.extname == ".zip" -%}

src/main/java/ldbc/snb/datagen/entities/statictype/place/PopularPlace.java

Lines changed: 9 additions & 3 deletions
@@ -47,14 +47,20 @@ public PopularPlace(String name, double latitude, double longitude) {
         this.longitude = longitude;
     }
 
-    public String getName() { return name; }
+    public String getName() {
+        return name;
+    }
 
     public void setName(String name) {
         this.name = name;
     }
 
-    public double getLatitude() { return latitude; }
+    public double getLatitude() {
+        return latitude;
+    }
 
-    public double getLongitude() { return longitude; }
+    public double getLongitude() {
+        return longitude;
+    }
 
 }

src/main/java/ldbc/snb/datagen/generator/generators/PersonGenerator.java

Lines changed: 5 additions & 2 deletions
@@ -151,11 +151,14 @@ private Person generatePerson() {
         base = base.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
         base = base.replaceAll(" ", ".");
         base = base.replaceAll("[.]+", ".");
-        for (int i = 0; i < numEmails; i++) {
+        while (person.getEmails().size() < numEmails) {
             String email = base + "" + person.getAccountId() + "@" +
                     Dictionaries.emails.getRandomEmail(randomFarm.get(RandomGeneratorFarm.Aspect.TOP_EMAIL),
                             randomFarm.get(RandomGeneratorFarm.Aspect.EMAIL));
-            person.getEmails().add(email);
+            // avoid duplicates
+            if (!person.getEmails().contains(email)) {
+                person.getEmails().add(email);
+            }
         }
 
         // Set class year
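
The switch from a fixed-count `for` loop to a `while` loop over the collection size guarantees that each person ends up with exactly `numEmails` distinct addresses: under the old loop, a discarded duplicate draw would simply have left the person with fewer emails. Below is a minimal, self-contained sketch of the same pattern; the `EmailDedupSketch` class and its domain pool are illustrative stand-ins, not Datagen code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class EmailDedupSketch {
    // Illustrative stand-in for Dictionaries.emails.getRandomEmail(...)
    private static final String[] DOMAINS = {"example.com", "example.org", "example.net", "example.edu"};

    private static String randomDomain(Random random) {
        return DOMAINS[random.nextInt(DOMAINS.length)];
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        long accountId = 1234L;
        String base = "jane.doe";
        int numEmails = 3; // must not exceed the number of distinct addresses the pool can yield

        List<String> emails = new ArrayList<>();
        // A fixed-count for-loop draws numEmails times, so a discarded duplicate
        // leaves fewer than numEmails addresses. Looping on the collection size
        // instead retries until numEmails distinct addresses are collected.
        while (emails.size() < numEmails) {
            String email = base + accountId + "@" + randomDomain(random);
            if (!emails.contains(email)) { // skip duplicate draws
                emails.add(email);
            }
        }
        System.out.println(emails);
    }
}
```

Note that such a loop terminates only if at least `numEmails` distinct addresses can be drawn, which the email dictionary presumably guarantees here.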
