# Running Datagen on EMR

- We provide support scripts for running LDBC Datagen on EMR and storing the results
- on S3.
+ We provide support scripts for running LDBC Datagen on EMR and storing the results on S3.

## Creating the infrastructure
- Create an S3 bucket. This bucket will have the following layout:
+
+ ### S3 Bucket
+
+ Create an S3 bucket and set the `BUCKET_NAME` environment variable accordingly.
+
+ The bucket will have the following layout (created by the scripts/jobs):

- `params`: parameter files
- `jars`: application JARs
- `results`: results of successful runs
- `logs`: logs of the jobs
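A minimal sketch of this step with the AWS CLI follows. The bucket name `my-ldbc-datagen` is a placeholder, and the `layout_uris` helper and the `CREATE_BUCKET` switch are illustrative names, not part of the Datagen tooling. The bucket is only created when `CREATE_BUCKET=1` is set, since that call needs valid AWS credentials.

```bash
# Placeholder bucket name (assumption); S3 bucket names must be globally unique.
export BUCKET_NAME="${BUCKET_NAME:-my-ldbc-datagen}"

# Hypothetical helper: print the S3 URI of each top-level prefix listed above.
layout_uris() {
  for prefix in params jars results logs; do
    echo "s3://$1/${prefix}/"
  done
}

# Create the bucket only on request (requires AWS credentials).
if [ "${CREATE_BUCKET:-0}" = "1" ]; then
  aws s3 mb "s3://${BUCKET_NAME}"
fi

# Show where the scripts and jobs are expected to read and write.
layout_uris "${BUCKET_NAME}"
```

S3 has no real directories, so the prefixes do not need to be created up front; they appear once the first object is written under them.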

+ ### AWS Roles
+
+ In AWS IAM, add the following roles with **Create Role** | **AWS service** | **EMR**:
+
+ * **EMR** a.k.a. `AmazonElasticMapReduceRole`, name it `EMR_DefaultRole`
+ * **EMR Role for EC2** a.k.a. `AmazonElasticMapReduceforEC2Role`, name it `EMR_EC2_DefaultRole`
+
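If you prefer the command line to the IAM console, the AWS CLI offers `aws emr create-default-roles`, which creates roles with these two default names (it may also create related resources such as an instance profile). The sketch below is hedged: the `run` helper and the `DRY_RUN` switch are illustrative conventions, and with the default `DRY_RUN=1` the commands are only printed so that you can review them first.

```bash
# Illustrative helper: print the command when DRY_RUN=1 (default), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the default EMR roles if they do not exist yet.
run aws emr create-default-roles

# Confirm that both roles named above are present.
for role in EMR_DefaultRole EMR_EC2_DefaultRole; do
  run aws iam get-role --role-name "${role}"
done
```

Set `DRY_RUN=0` to execute the commands against your account.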
## Install the required libraries

1. From the repository root, run:
@@ -34,12 +45,6 @@ VERSION=0.4.0-SNAPSHOT
aws s3 cp target/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar s3://${BUCKET_NAME}/jars/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar
```

- 1. Upload the generator parameter file to S3 (if required).
-
- ```bash
- aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
- ```
-
1. Submit the job. Run with `--help` for customization options.

```bash
@@ -48,6 +53,10 @@ SCALE_FACTOR=10
./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
```

+ Note: scale factors below 1 are not supported.
+
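To catch an unsupported value before a cluster is started, the scale factor can be checked locally. This is a sketch under stated assumptions: `is_supported_scale_factor` is a hypothetical helper, not part of `submit_datagen_job.py`, and it only accepts plain decimal numbers such as `1`, `10` or `30`.

```bash
# Hypothetical helper: succeed when "$1" is a plain decimal number >= 1.
is_supported_scale_factor() {
  case "$1" in
    # Reject empty input, non-numeric characters, multiple dots, and a lone dot.
    ''|*[!0-9.]*|*.*.*|.) return 1 ;;
  esac
  # awk performs the numeric comparison, so fractional values are handled too.
  awk -v sf="$1" 'BEGIN { exit !(sf >= 1) }'
}

SCALE_FACTOR="${SCALE_FACTOR:-10}"
if is_supported_scale_factor "${SCALE_FACTOR}"; then
  echo "Scale factor ${SCALE_FACTOR} can be submitted."
else
  echo "Scale factor ${SCALE_FACTOR} is not supported on EMR." >&2
fi
```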
+ ### Using a different EMR version
+
We use EMR 5.13.0 by default. You can try out `emr-6.3.0` by specifying it with the `--emr-release` option.
Make sure you have uploaded the matching JAR first!

@@ -56,3 +65,10 @@ PLATFORM_VERSION=2.12_spark3.1
./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} --platform-version ${PLATFORM_VERSION} --emr-release emr-6.3.0 ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
```
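Since the job depends on a JAR that matches the platform version, it can help to derive the expected S3 location and check it before submitting. The sketch below follows the naming used in the upload command; the `jar_uri` helper, the `CHECK_S3` switch, the placeholder bucket name and the default values are assumptions for illustration only.

```bash
# Example values (assumptions); use the bucket, platform and version you actually built.
BUCKET_NAME="${BUCKET_NAME:-my-ldbc-datagen}"
PLATFORM_VERSION="${PLATFORM_VERSION:-2.12_spark3.1}"
VERSION="${VERSION:-0.4.0-SNAPSHOT}"

# Hypothetical helper: build the S3 URI of the JAR from bucket, platform version and version.
jar_uri() {
  echo "s3://$1/jars/ldbc_snb_datagen_$2-$3-jar-with-dependencies.jar"
}

JAR_URI="$(jar_uri "${BUCKET_NAME}" "${PLATFORM_VERSION}" "${VERSION}")"
echo "Expecting JAR at ${JAR_URI}"

# Query S3 only on request, because this needs AWS credentials.
if [ "${CHECK_S3:-0}" = "1" ]; then
  aws s3 ls "${JAR_URI}" || echo "JAR not found; upload it before submitting." >&2
fi
```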

+ ### Using a parameter file
+
+ The generator accepts an optional parameter file. To use one, upload it to S3 as follows.
+
+ ```bash
+ aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
+ ```