
Commit 8570f78

Merge pull request #338 from ldbc/default-spark3.1
2 parents: b64516e + 57054b7

10 files changed: +63 −40 lines

.circleci/config.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -29,6 +29,9 @@ jobs:
   test:
     resource_class: xlarge
     executor: my-executor
+    environment:
+      PLATFORM_VERSION: 2.12_spark3.1
+      DATAGEN_VERSION: 0.4.0-SNAPSHOT
     steps:
       - checkout
       - run: |
```
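These variables mirror the local workflow described in the README. As a minimal sketch, a CI step consuming them might run the following (the actual `run` step is truncated above, so this is illustrative only):

```bash
# Illustrative CI step: build the default Spark 3.1 artifact and smoke-test it,
# using the PLATFORM_VERSION / DATAGEN_VERSION values exported by the job.
tools/build.sh
tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
  --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
```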

Dockerfile

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
-FROM bde2020/spark-master:2.4.5-hadoop2.7
+FROM bde2020/spark-master:3.1.1-hadoop3.2
 
 ENV GOSU_VERSION 1.12
 
```
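The base image moves from Spark 2.4.5 / Hadoop 2.7 to Spark 3.1.1 / Hadoop 3.2. Assuming the image is tagged `ldbc/spark`, as `tools/docker-run.sh` expects, it can be rebuilt with:

```bash
# Rebuild the Datagen runtime image on the Spark 3.1.1 / Hadoop 3.2 base.
docker build -t ldbc/spark .
```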

README.md

Lines changed: 25 additions & 15 deletions
````diff
@@ -55,51 +55,61 @@ pip install ./tools
 
 The `tools/run.py` is intended for **local runs**. To use it, download and extract Spark as follows.
 
-#### Spark 2.4.x
+#### Spark 3.1.x
+
+Spark 3.1.x is the recommended runtime to use. The rest of the instructions are provided assuming Spark 3.1.x.
 
 ```bash
-curl https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz | sudo tar -xz -C /opt/
-export SPARK_HOME="/opt/spark-2.4.8-bin-hadoop2.7"
+curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | sudo tar -xz -C /opt/
+export SPARK_HOME="/opt/spark-3.1.2-bin-hadoop3.2"
 export PATH="$SPARK_HOME/bin":"$PATH"
 ```
 
-Make sure you use Java 8.
+Both Java 8 and Java 11 work.
+
+To build, run
+
+```bash
+tools/build.sh
+```
 
 Run the script with:
 
 ```bash
-export PLATFORM_VERSION=2.11_spark2.4
+export PLATFORM_VERSION=2.12_spark3.1
 export DATAGEN_VERSION=0.4.0-SNAPSHOT
-
 tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>
 ```
 
-#### Spark 3.1.x
+#### Older Spark versions
+
+##### Spark 2.4.x
+
+Spark 2.4.x with Hadoop 2.7 (Scala 2.11 / JVM 8) is supported, but it is recommended to switch to Spark 3.
 
 ```bash
-curl https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz | sudo tar -xz -C /opt/
-export SPARK_HOME="/opt/spark-3.1.2-bin-hadoop2.7"
+curl https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz | sudo tar -xz -C /opt/
+export SPARK_HOME="/opt/spark-2.4.8-bin-hadoop2.7"
 export PATH="$SPARK_HOME/bin":"$PATH"
 ```
 
-Both Java 8 and Java 11 work.
+Make sure you use Java 8.
 
 To build, run
 
 ```bash
-tools/build.sh -Pspark3.1
+tools/build.sh -Pspark2.4
 ```
 
 Run the script with:
 
 ```bash
-export PLATFORM_VERSION=2.12_spark3.1
+export PLATFORM_VERSION=2.11_spark2.4
 export DATAGEN_VERSION=0.4.0-SNAPSHOT
+
 tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>
 ```
 
-The rest of the instructions are provided assuming Spark 2.4.x.
-
 #### Runtime configuration arguments
 
 The runtime configuration arguments determine the amount of memory, number of threads, degree of parallelism. For a list of arguments, see:
@@ -111,7 +121,7 @@ tools/run.py --help
 To generate a single `part-*.csv` file, reduce the parallelism (number of Spark partitions) to 1.
 
 ```bash
-./tools/run.py ./target/ldbc_snb_datagen_2.11_spark2.4-0.4.0-SNAPSHOT.jar --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
+./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
 ```
 #### Generator configuration arguments
 
````
pom.xml

Lines changed: 15 additions & 8 deletions
```diff
@@ -14,10 +14,10 @@
         <java.version>1.8</java.version>
         <maven.compiler.source>${java.version}</maven.compiler.source>
         <maven.compiler.target>${java.version}</maven.compiler.target>
-        <scala.version>2.11.12</scala.version>
-        <scala.compat.version>2.11</scala.compat.version>
-        <spark.version>2.4.5</spark.version>
-        <spark.compat.version>2.4</spark.compat.version>
+        <scala.version>2.12.15</scala.version>
+        <scala.compat.version>2.12</scala.compat.version>
+        <spark.version>3.1.2</spark.version>
+        <spark.compat.version>3.1</spark.compat.version>
         <spec2.version>4.2.0</spec2.version>
     </properties>
 
@@ -296,10 +296,17 @@
         <profile>
             <id>spark3.1</id>
             <properties>
-                <scala.version>2.12.12</scala.version>
-                <scala.compat.version>2.12</scala.compat.version>
-                <spark.version>3.1.1</spark.version>
-                <spark.compat.version>3.1</spark.compat.version>
+                <!-- This is the default profile. -->
+            </properties>
+        </profile>
+        <profile>
+            <id>spark2.4</id>
+            <properties>
+                <java.version>1.8</java.version>
+                <scala.version>2.11.12</scala.version>
+                <scala.compat.version>2.11</scala.compat.version>
+                <spark.version>2.4.8</spark.version>
+                <spark.compat.version>2.4</spark.compat.version>
             </properties>
         </profile>
     </profiles>
```
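The Spark 3.1 properties are promoted to the build defaults, and the `spark2.4` profile restores the old toolchain. Per the README, profile selection goes through `tools/build.sh`:

```bash
# Default build: Scala 2.12 / Spark 3.1.2 (no profile flag needed).
tools/build.sh
# Legacy build: Scala 2.11 / Spark 2.4.8 on Java 8.
tools/build.sh -Pspark2.4
```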

src/main/java/ldbc/snb/datagen/dictionary/NamesDictionary.java

Lines changed: 2 additions & 2 deletions
The static constant `INVALID_LOCATION` is now referenced through its defining class, `PlaceDictionary`, rather than through the `locationDic` instance, fixing the "static member accessed via instance reference" compiler warning.

```diff
@@ -101,7 +101,7 @@ public void extractSurNames() {
                 String infos[] = line.split(",");
                 String locationName = infos[1];
                 int locationId = locationDic.getCountryId(locationName);
-                if (locationId != locationDic.INVALID_LOCATION) {
+                if (locationId != PlaceDictionary.INVALID_LOCATION) {
                     String surName = infos[2].trim();
                     surNamesByLocations.get(locationId).add(surName);
                     totalSurNames++;
@@ -127,7 +127,7 @@ public void extractGivenNames() {
                 int gender = Integer.parseInt(infos[2]);
                 int birthYearPeriod = Integer.parseInt(infos[3]);
                 int locationId = locationDic.getCountryId(locationName);
-                if (locationId != locationDic.INVALID_LOCATION) {
+                if (locationId != PlaceDictionary.INVALID_LOCATION) {
                     String givenName = infos[1].trim();
                     if (gender == 0) {
                         givenNamesByLocationsMale.get(birthYearPeriod).get(locationId).add(givenName);
```

src/main/scala/ldbc/snb/datagen/generation/generator/SparkKnowsGenerator.scala

Lines changed: 2 additions & 2 deletions
Both hunks replace APIs deprecated since Java 9, in line with the README's new claim that Java 11 works: the boxed `java.lang.Float` constructor gives way to `Float.box`, and `Class#newInstance` to `Class#getConstructor().newInstance()`.

```diff
@@ -27,7 +27,7 @@ object SparkKnowsGenerator {
     val indexed = ranker(persons)
       .map { case (k, v) => (k / blockSize, (k, v)) }
 
-    val percentagesJava = percentages.map(new java.lang.Float(_)).asJava
+    val percentagesJava = percentages.map(Float.box).asJava
 
     indexed
       // groupByKey wouldn't guarantee keeping the order inside groups
@@ -40,7 +40,7 @@
       .mapPartitions(groups => {
         DatagenContext.initialize(conf)
         val knowsGeneratorClass = Class.forName(knowsGeneratorClassName)
-        val knowsGenerator = knowsGeneratorClass.newInstance().asInstanceOf[KnowsGenerator]
+        val knowsGenerator = knowsGeneratorClass.getConstructor().newInstance().asInstanceOf[KnowsGenerator]
         knowsGenerator.initialize(conf)
         val personSimilarity = DatagenParams.getPersonSimularity
 
```

tools/datagen/lib.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,3 +1,3 @@
-platform_version = "2.11_spark2.4"
+platform_version = "2.12_spark3.1"
 version = "0.4.0-SNAPSHOT"
 main_class = 'ldbc.snb.datagen.spark.LdbcDatagen'
```

tools/docker-run.sh

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,14 +1,14 @@
 #!/bin/bash
 
-[ ! -f target/ldbc_snb_datagen_2.11_spark2.4-0.4.0-SNAPSHOT-jar-with-dependencies.jar ] && echo "target/ldbc_snb_datagen_2.11_spark2.4-0.4.0-SNAPSHOT-jar-with-dependencies.jar does not exist, exiting" && exit 1
+[ ! -f target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}-jar-with-dependencies.jar ] && echo "target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}-jar-with-dependencies.jar does not exist, exiting" && exit 1
 
 # make sure that out directory exists and clean previously generated data
 mkdir -p out/
 rm -rf out/*
 docker run \
   --env uid=`id -u` \
   --volume `pwd`/out:/mnt/data \
-  --volume `pwd`/target/ldbc_snb_datagen_2.11_spark2.4-0.4.0-SNAPSHOT-jar-with-dependencies.jar:/mnt/datagen.jar \
+  --volume `pwd`/target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}-jar-with-dependencies.jar:/mnt/datagen.jar \
   ldbc/spark \
   --output-dir /mnt/data \
   ${@} # pass arguments of this script to the submit.sh script (Docker entrypoint)
```
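Since the script now derives the JAR path from environment variables, both must be exported before it is invoked. A sketch (the generator arguments are illustrative and are passed through to the container entrypoint):

```bash
export PLATFORM_VERSION=2.12_spark3.1
export DATAGEN_VERSION=0.4.0-SNAPSHOT
tools/build.sh   # assumed to produce the jar-with-dependencies artifact
tools/docker-run.sh --format csv --scale-factor 0.003 --mode interactive
```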

tools/emr/README.md

Lines changed: 9 additions & 6 deletions
````diff
@@ -42,7 +42,7 @@ pip install -e .
 1. Upload the JAR to S3. (We don't version the JARs yet, so you can only make sure that you run the intended code this way :( )
 
 ```bash
-PLATFORM_VERSION=2.11_spark2.4 # use 2.12_spark3.1 if you want to run on emr-6.3.0
+PLATFORM_VERSION=2.12_spark3.1
 VERSION=0.4.0-SNAPHOT
 aws s3 cp target/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar s3://${BUCKET_NAME}/jars/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar
 ```
@@ -65,14 +65,17 @@ To use spot instances, add the `--use-spot` argument:
 ./tools/emr/submit_datagen_job.py --use-spot --bucket ${BUCKET_NAME} ${JOB_NAME} ${SCALE_FACTOR} csv raw
 ```
 
-### Using a different EMR version
+### Using a different Spark / EMR version
+
 
-We use EMR 5.13.0 by default. You can try out `emr-6.3.0` by specifying it with the `--emr-version` option.
-Make sure you uploaded the right JAR first!
+
+We use EMR 6.3.0 by default, which contains Spark 3.1. You can use a different version by specifying it with the `--emr-version` option.
+EMR 5.33.0 is the recommended EMR version to be used with Spark 2.4.
+Make sure that you have uploaded the right JAR first!
 
 ```bash
-PLATFORM_VERSION=2.12_spark3.1
-./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} --platform-version ${PLATFORM_VERSION} --emr-release emr-6.3.0 ${JOB_NAME} ${SCALE_FACTOR} csv raw
+PLATFORM_VERSION=2.11_spark2.4
+./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} --platform-version ${PLATFORM_VERSION} --emr-release emr-5.33.0 ${JOB_NAME} ${SCALE_FACTOR} csv raw
 ```
 
 ### Using a parameter file
````
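With `emr-6.3.0` now the default release and `2.12_spark3.1` the default platform, a Spark 3.1 job needs no version flags. A sketch combining the upload and submit commands above:

```bash
# Upload the default (2.12_spark3.1) JAR, then submit with the new defaults.
PLATFORM_VERSION=2.12_spark3.1
VERSION=0.4.0-SNAPSHOT
aws s3 cp target/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar \
  s3://${BUCKET_NAME}/jars/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar
./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} ${JOB_NAME} ${SCALE_FACTOR} csv raw
```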

tools/emr/submit_datagen_job.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -30,7 +30,7 @@
     'az': 'us-west-2c',
     'yes': False,
     'ec2_key': None,
-    'emr_release': 'emr-5.31.0'
+    'emr_release': 'emr-6.3.0'
 }
 
 pp = pprint.PrettyPrinter(indent=2)
@@ -238,13 +238,13 @@ def submit_datagen_job(name,
                         help='EC2 key name for SSH connection')
     parser.add_argument('--platform-version',
                         default=defaults['platform_version'],
-                        help='The spark platform the JAR is compiled for formatted like {scala.compat.version}_spark{spark.comapt.version}, e.g. 2.11_spark2.4, 2.12_spark3.1')
+                        help='The spark platform the JAR is compiled for formatted like {scala.compat.version}_spark{spark.compat.version}, e.g. 2.11_spark2.4, 2.12_spark3.1')
     parser.add_argument('--version',
                         default=defaults['version'],
                         help='LDBC SNB Datagen library version')
     parser.add_argument('--emr-release',
                         default=defaults['emr_release'],
-                        help='The EMR release to use. E.g emr-5.31.0, emr-6.1.0')
+                        help='The EMR release to use. E.g emr-5.33.0, emr-6.3.0')
     parser.add_argument('-y', '--yes',
                         default=defaults['yes'],
                         action='store_true',
```
