Commit 2c117ae

Merge pull request #94 from ldbc/batching
Major refactoring to prepare for batching the dynamic part of the graph
2 parents e399902 + 0687e92 commit 2c117ae

146 files changed: +821 -693 lines

.gitignore

Lines changed: 6 additions & 2 deletions
@@ -24,6 +24,7 @@ scripts/
 /social_network/
 *.log
 test_log
+.DS_Store

 # Eclipse stuff
 .metadata
@@ -47,5 +48,8 @@ out/
 datagen_output/

 # Hadoop directory extracted as instructed in the Quick Start guide
-/hadoop-2.6.0.tar.gz
-/hadoop-2.6.0/
+/hadoop-*.tar.gz
+/hadoop-*/
+
+# configuration file
+params.ini

.travis.yml

Lines changed: 3 additions & 0 deletions
@@ -13,7 +13,10 @@ before_install:
 - docker build . --tag ldbc/datagen
 install: true
 script:
+- cp params-csv.ini params.ini
 - docker run --rm --mount type=bind,source="$(pwd)/",target="/opt/ldbc_snb_datagen/out" --mount type=bind,source="$(pwd)/params.ini",target="/opt/ldbc_snb_datagen/params.ini" ldbc/datagen
+- "[[ `md5sum social_network/*.csv | sort | md5sum` == 'fa046e4c44e4c3e8f6858720c45d80ed -' ]]"
+- "[[ `md5sum substitution_parameters/interactive_* | sort | md5sum` == '5cba23795df372c19688b05c5a9f318f -' ]]"
 - mkdir out
 - cp -r substitution_parameters out/
 notifications:
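
The two checksum steps added to `.travis.yml` collapse many files into a single digest: hash each file, sort the resulting lines, then hash the sorted list. A minimal sketch of that idea as reusable shell functions follows. The names `combined_md5` and `verify_md5` are hypothetical helpers, not part of the repository, and the sketch assumes GNU `md5sum`. It compares only the digest field, which avoids depending on the exact spacing of the trailing `-` that `md5sum` prints for standard input.

```shell
# Hypothetical helpers (not in the repository); assume GNU coreutils md5sum.

# Print one digest for a set of files: hash each file, sort the
# "digest  path" lines so the order is stable, then hash that list.
combined_md5() {
  md5sum "$@" | sort | md5sum | cut -d' ' -f1
}

# Succeed only if the combined digest of the files equals the expected one.
# Arguments: expected digest, then the files to check.
verify_md5() {
  expected="$1"
  shift
  [ "$(combined_md5 "$@")" = "$expected" ]
}
```

Under these assumptions, `verify_md5 fa046e4c44e4c3e8f6858720c45d80ed social_network/*.csv` would mirror the first check. Because each per-file line includes the path as it was passed in, the combined digest also changes if files are renamed or the command is run with different relative paths.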

README.md

Lines changed: 26 additions & 18 deletions
@@ -21,42 +21,50 @@ The LDBC-SNB Data Generator (Datagen) is the responsible of providing the data s

 ## Quick start

-There are three main ways to run Datagen:
-(1) using a pseudo-distributed Hadoop installation,
-(2) running the same setup in a Docker image,
-(3) running on a distributed Hadoop cluster.
+### Configuration
+
+Initialize the `params.ini` file as needed. For example, to generate the basic CSV files, issue:
+
+```bash
+cp params-csv.ini params.ini
+```
+
+There are three main ways to run Datagen, each using a different approach to configure the amount of memory available.
+
+1. using a pseudo-distributed Hadoop installation,
+2. running the same setup in a Docker image,
+3. running on a distributed Hadoop cluster.

 ### Pseudo-distributed Hadoop node

-To grab Hadoop, extract it, and set the environment values to sensible defaults, and generate the data as specified in the `params.ini` file, run the following script:
+To configure the amount of memory available, set the `HADOOP_CLIENT_OPTS` environment variable.
+To grab Hadoop, extract it, and set the environment values to sensible defaults, and generate the data as specified in the `params-csv.ini` file, run the following script:

 ```bash
-wget http://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
-tar xf hadoop-2.6.0.tar.gz
+cp params-csv.ini params.ini
+wget http://archive.apache.org/dist/hadoop/core/hadoop-2.9.2/hadoop-2.9.2.tar.gz
+tar xf hadoop-2.9.2.tar.gz
 export HADOOP_CLIENT_OPTS="-Xmx2G"
-# set this to the Hadoop 2.6.0 directory
-export HADOOP_HOME=`pwd`/hadoop-2.6.0
+# set this to the Hadoop 2.9.2 directory
+export HADOOP_HOME=`pwd`/hadoop-2.9.2
 # set this to the repository's directory
 export LDBC_SNB_DATAGEN_HOME=`pwd`
 ./run.sh
 ```

 ### Docker image

-The image can be simply built with the provided Dockerfile.
-To build, execute the following command from the repository directory:
+SNB datagen images are available via [Docker Hub](https://hub.docker.com/r/ldbc/datagen/) where you may find both the latest version of the generator as well as previous stable versions.
+
+Alternatively, the image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

 ```bash
 docker build . --tag ldbc/datagen
 ```

-#### Configuration
-
-To configure the amount of memory available, set the `HADOOP_CLIENT_OPTS` variable in the Dockerfile. The default value is `-Xmx8G`.
-
 #### Running

-In order to run the container, a `params.ini` file is required. For reference, please see the `params*.ini` files in the repository. The file will be mounted in the container by the `--mount type=bind,source="$(pwd)/params.ini,target="/opt/ldbc_snb_datagen/params.ini"` option. If required, the source path can be set to a different path.
+Set the `params.ini` in the repository as for the pseudo-distributed case. The file will be mounted in the container by the `--mount type=bind,source="$(pwd)/params.ini,target="/opt/ldbc_snb_datagen/params.ini"` option. If required, the source path can be set to a different path.

 The container outputs its results in the `/opt/ldbc_snb_datagen/out/` directory which contains two sub-directories, `social_network/` and `subsitution_parameters`. In order to save the results of the generation, a directory must be mounted in the container from the host. The driver requires the results be in the datagen repository directory. To generate the data, run the following command which includes changing the owner (`chown`) of the Docker-mounted volumes:

@@ -65,11 +73,11 @@ docker run --rm --mount type=bind,source="$(pwd)/",target="/opt/ldbc_snb_datagen
 sudo chown -R $USER:$USER social_network/ substitution_parameters/
 ```

-If you need to raise the memory limit, use the `-e HADOOP_CLIENT_OPTS="-Xmx..."` parameter to override the default value (`-Xmx8G`).
+If you need to raise the memory limit, use the `-e HADOOP_CLIENT_OPTS="-Xmx..."` parameter to override the default value (`-Xmx2G`).

 ### Hadoop cluster

-Instructions are currently not provided. (TBD)
+Instructions are currently not provided.

 ### Community provided tools

docker_run.sh

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ if [ ! -f /opt/ldbc_snb_datagen/params.ini ]; then
 fi

 # Running the generator
-/opt/hadoop-2.6.0/bin/hadoop jar /opt/ldbc_snb_datagen/target/ldbc_snb_datagen-0.2.7-jar-with-dependencies.jar /opt/ldbc_snb_datagen/params.ini
+/opt/hadoop-2.6.0/bin/hadoop jar /opt/ldbc_snb_datagen/target/ldbc_snb_datagen-0.4.0-SNAPSHOT-jar-with-dependencies.jar /opt/ldbc_snb_datagen/params.ini

 # Cleanup
 rm -f m*personFactors*
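
The change to `docker_run.sh` exists only because the jar file name embeds the project version. One way to avoid editing the script on every version bump is to resolve the jar by pattern. The following is a sketch under that assumption, not what the script does; `find_datagen_jar` is a hypothetical helper name.

```shell
# Hypothetical helper (not in docker_run.sh): resolve the fat jar by
# pattern instead of hard-coding the version string.
# Argument: the directory that holds the built jar.
find_datagen_jar() {
  jar_dir="$1"
  # Expand the pattern into the function's positional parameters.
  set -- "$jar_dir"/ldbc_snb_datagen-*-jar-with-dependencies.jar
  # With no match the pattern stays literal, so also test that the file
  # exists; refuse to guess when several versions are present.
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one datagen jar in $jar_dir" >&2
    return 1
  fi
  printf '%s\n' "$1"
}
```

With this helper, the run step could read `jar=$(find_datagen_jar /opt/ldbc_snb_datagen/target)` and pass `"$jar"` to `hadoop jar`. Failing when zero or several jars match is a deliberate choice, so that a stale build in `target/` is reported rather than silently used.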

graphalytics-generate-old.sh

Lines changed: 11 additions & 11 deletions
@@ -51,9 +51,9 @@ for VERSION in v0.2.1 v0.2.2 v0.2.3 v0.2.4 v0.2.5; do
 # vertices
 echo > params.ini
 echo ldbc.snb.datagen.generator.scaleFactor:graphalytics.$SCALE_FACTOR >> params.ini
-echo ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVPersonSerializer >> params.ini
-echo ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.empty.EmptyInvariantSerializer >> params.ini
-echo ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyPersonActivitySerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVDynamicPersonSerializer >> params.ini
+echo ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.empty.EmptyStaticSerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyDynamicActivitySerializer >> params.ini

 ./run.sh
 tail -n +2 social_network/person_0_0.csv | wc -l >> ../datagen-graphalytics.log
@@ -63,12 +63,12 @@ for VERSION in v0.2.1 v0.2.2 v0.2.3 v0.2.4 v0.2.5; do
 fi

 # edges
-# from version 0.2.2, it's also possible to use the CSVPersonSerializerWithWeights serializer, which adds edge weights
+# from version 0.2.2, it's also possible to use the CSVDynamicPersonSerializerWithWeights serializer, which adds edge weights
 echo > params.ini
 echo ldbc.snb.datagen.generator.scaleFactor:graphalytics.$SCALE_FACTOR >> params.ini
-echo ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVPersonSerializer >> params.ini
-echo ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.empty.EmptyInvariantSerializer >> params.ini
-echo ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyPersonActivitySerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVDynamicPersonSerializer >> params.ini
+echo ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.empty.EmptyStaticSerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyDynamicActivitySerializer >> params.ini

 ./run.sh
 tail -n +2 social_network/person_knows_person_0_0.csv | wc -l >> ../datagen-graphalytics.log
@@ -79,7 +79,7 @@ for VERSION in v0.2.1 v0.2.2 v0.2.3 v0.2.4 v0.2.5; do
 done

 # For versions 0.2.6-0.2.8, we only need a single run, which produces both the vertices and the edges
-# using the CSVPersonSerializerExtended class, which also produces edge weights
+# using the CSVDynamicPersonSerializerExtended class, which also produces edge weights
 for VERSION in v0.2.6 v0.2.7 v0.2.8; do
 echo $VERSION >> ../datagen-graphalytics.log

@@ -90,9 +90,9 @@ for VERSION in v0.2.6 v0.2.7 v0.2.8; do
 # vertices and edges
 echo > params.ini
 echo ldbc.snb.datagen.generator.scaleFactor:graphalytics.$SCALE_FACTOR >> params.ini
-echo ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVPersonSerializerExtended >> params.ini
-echo ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.empty.EmptyInvariantSerializer >> params.ini
-echo ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyPersonActivitySerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVDynamicPersonSerializerExtended >> params.ini
+echo ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.empty.EmptyStaticSerializer >> params.ini
+echo ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyDynamicActivitySerializer >> params.ini

 ./run.sh
 tail -n +2 social_network/person_0_0.csv | wc -l >> ../datagen-graphalytics.log
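
The script repeats the same four-line `echo` sequence three times, so each key rename has to be applied in three places. A possible consolidation is sketched here; `write_params` is a hypothetical helper that is not part of the script. It omits the leading blank line that `echo > params.ini` produces, on the assumption that the blank line is not significant to the parser.

```shell
# Hypothetical helper (not in the script): write the scale factor and the
# three serializer settings in one place.
# Arguments: output file, scale factor, then the class names for the
# dynamic person, static, and dynamic activity serializers.
write_params() {
  cat > "$1" <<EOF
ldbc.snb.datagen.generator.scaleFactor:$2
ldbc.snb.datagen.serializer.dynamicPersonSerializer:$3
ldbc.snb.datagen.serializer.staticSerializer:$4
ldbc.snb.datagen.serializer.dynamicActivitySerializer:$5
EOF
}
```

Each run would then call it once, for example with `params.ini`, `graphalytics.$SCALE_FACTOR`, and the three class names used in the corresponding block of the script.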

params-composite-foreign-key.ini

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1

-ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositeMergeForeignPersonSerializer
-ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositeMergeForeignInvariantSerializer
-ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositeMergeForeignPersonActivitySerializer
+ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.compositemergeforeign.CSVCompositeMergeForeignDynamicPersonSerializer
+ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.compositemergeforeign.CSVCompositeMergeForeignStaticSerializer
+ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.compositemergeforeign.CSVCompositeMergeForeignDynamicActivitySerializer

params-composite.ini

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1

-ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositePersonSerializer
-ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositeInvariantSerializer
-ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVCompositePersonActivitySerializer
+ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.composite.CSVCompositeDynamicPersonSerializer
+ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.composite.CSVCompositeStaticSerializer
+ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.composite.CSVCompositeDynamicActivitySerializer

params-csv.ini

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
+
+ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.basic.CSVDynamicPersonSerializer
+ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.basic.CSVStaticSerializer
+ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.basic.CSVDynamicActivitySerializer
+
+# To generate RFC 3339-compliant timestamps (https://tools.ietf.org/html/rfc3339), uncomment the following line:
+#ldbc.snb.datagen.serializer.formatter.StringDateFormatter.dateTimeFormat:yyyy-MM-dd'T'HH:mm:ss.SSS+00:00

params-foreign-key.ini

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1

-ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonSerializer
-ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignInvariantSerializer
-ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.snb.interactive.CSVMergeForeignPersonActivitySerializer
+ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.mergeforeign.CSVMergeForeignDynamicPersonSerializer
+ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.mergeforeign.CSVMergeForeignStaticSerializer
+ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.mergeforeign.CSVMergeForeignDynamicActivitySerializer

params-graphalytics.ini

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
 ldbc.snb.datagen.generator.scaleFactor:graphalytics.1

-ldbc.snb.datagen.serializer.personSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVPersonSerializerExtended
-ldbc.snb.datagen.serializer.invariantSerializer:ldbc.snb.datagen.serializer.empty.EmptyInvariantSerializer
-ldbc.snb.datagen.serializer.personActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyPersonActivitySerializer
+ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.graphalytics.CSVDynamicPersonSerializerExtended
+ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.empty.EmptyStaticSerializer
+ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.empty.EmptyDynamicActivitySerializer
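
Across the params files in this commit, the three serializer keys are renamed consistently: `personSerializer` becomes `dynamicPersonSerializer`, `invariantSerializer` becomes `staticSerializer`, and `personActivitySerializer` becomes `dynamicActivitySerializer`. For a params file kept outside the repository, the key part of that change could be applied mechanically. The sketch below uses a hypothetical helper, `migrate_param_keys`, that is not part of the repository. It rewrites only the keys; the class names after the colon were also moved or renamed in this commit and would still need to be updated by hand against the files shown in the diff.

```shell
# Hypothetical helper (not in the repository): rewrite the three renamed
# serializer keys of an old-style params file and print the result.
# Only the keys are changed; the class names after the colon are left
# untouched.
# Argument: path to the old params file.
migrate_param_keys() {
  sed \
    -e 's/^ldbc\.snb\.datagen\.serializer\.personSerializer:/ldbc.snb.datagen.serializer.dynamicPersonSerializer:/' \
    -e 's/^ldbc\.snb\.datagen\.serializer\.invariantSerializer:/ldbc.snb.datagen.serializer.staticSerializer:/' \
    -e 's/^ldbc\.snb\.datagen\.serializer\.personActivitySerializer:/ldbc.snb.datagen.serializer.dynamicActivitySerializer:/' \
    "$1"
}
```

The function writes to standard output rather than editing in place, so the result can be reviewed or diffed against the original before it replaces `params.ini`.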
