Commit ea08414

Update EMR instructions
1 parent ae9420a commit ea08414

1 file changed: 25 additions, 9 deletions

tools/emr/README.md

Lines changed: 25 additions & 9 deletions
@@ -1,16 +1,27 @@
11
# Running Datagen on EMR
22

3-
We provide support scripts for running LDBC Datagen on EMR and storing the results
4-
on S3.
3+
We provide support scripts for running LDBC Datagen on EMR and storing the results on S3.
54

65
## Creating the infrastructure
7-
Create an S3 bucket. This bucket will have the following layout:
6+
7+
### S3 Bucket
8+
9+
Create an S3 bucket and set the `BUCKET_NAME` environment variable accordingly.
10+
11+
The bucket will have the following layout (created by the scripts/jobs):
812

913
- `params`: parameter files
1014
- `jars`: application JARs
1115
- `results`: results of successful runs
1216
- `logs`: logs of the jobs
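
As a minimal sketch of the bucket setup step, assuming the AWS CLI is installed and configured: the helper name `create_datagen_bucket`, the `AWS_CLI` override variable, and the bucket name `my-datagen-bucket` are illustrative and not part of the repository's scripts.

```bash
# AWS_CLI is a hypothetical override hook; set AWS_CLI=echo to preview commands.
AWS_CLI="${AWS_CLI:-aws}"

# Create the bucket named by the first argument ("aws s3 mb" = make bucket).
create_datagen_bucket() {
  local bucket="$1"
  if [ -z "${bucket}" ]; then
    echo "usage: create_datagen_bucket <bucket-name>" >&2
    return 1
  fi
  "${AWS_CLI}" s3 mb "s3://${bucket}"
}

# Example (bucket names are globally unique, so choose your own):
#   export BUCKET_NAME=my-datagen-bucket
#   create_datagen_bucket "${BUCKET_NAME}"
```

The `params`, `jars`, `results`, and `logs` prefixes do not need to be created by hand: S3 creates a prefix implicitly when the first object is written under it.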
1317

18+
### AWS Roles
19+
20+
In AWS IAM, add the following roles with **Create Role** | **AWS service** | **EMR**:
21+
22+
* **EMR** a.k.a. `AmazonElasticMapReduceRole`, name it `EMR_DefaultRole`
23+
* **EMR Role for EC2** a.k.a. `AmazonElasticMapReduceforEC2Role`, name it `EMR_EC2_DefaultRole`
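
If you prefer the command line to the IAM console, the AWS CLI command `aws emr create-default-roles` creates roles with these same default names. The following is a sketch rather than the repository's documented procedure; the function names and the `AWS_CLI` override variable are hypothetical.

```bash
# AWS_CLI is a hypothetical override hook; set AWS_CLI=echo to preview commands.
AWS_CLI="${AWS_CLI:-aws}"

# Create the default EMR service role and EC2 instance role.
create_emr_default_roles() {
  "${AWS_CLI}" emr create-default-roles
}

# Return success only if both roles expected by this README exist.
check_emr_default_roles() {
  local role
  for role in EMR_DefaultRole EMR_EC2_DefaultRole; do
    "${AWS_CLI}" iam get-role --role-name "${role}" > /dev/null || return 1
  done
}

# Example:
#   check_emr_default_roles || create_emr_default_roles
```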
24+
1425
## Install the required libraries
1526

1627
1. From the repository root, run:
@@ -34,12 +45,6 @@ VERSION=0.4.0-SNAPHOT
3445
aws s3 cp target/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar s3://${BUCKET_NAME}/jars/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar
3546
```
3647

37-
1. Upload the generator parameter file to S3 (if required).
38-
39-
```bash
40-
aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
41-
```
42-
4348
1. Submit the job. Run with `--help` for customization options.
4449

4550
```bash
@@ -48,6 +53,10 @@ SCALE_FACTOR=10
4853
./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
4954
```
5055

56+
Note: scale factors below 1 are not supported.
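
A submission wrapper can reject unsupported scale factors before a cluster is started. The check below is a sketch; `check_scale_factor` is a hypothetical helper, not an option of `submit_datagen_job.py`.

```bash
# Succeed only if the argument is a number greater than or equal to 1.
# awk is used because shell integer tests cannot compare decimals such as 0.1.
check_scale_factor() {
  awk -v sf="$1" 'BEGIN { exit !(sf + 0 >= 1) }'
}

# Example:
#   check_scale_factor "${SCALE_FACTOR}" || echo "unsupported scale factor" >&2
```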
57+
58+
### Using a different EMR version
59+
5160
We use EMR 5.13.0 by default. You can try out `emr-6.3.0` by specifying it with the `--emr-release` option.
5261
Make sure you have uploaded the JAR built for the matching `PLATFORM_VERSION` first!
5362

@@ -56,3 +65,10 @@ PLATFORM_VERSION=2.12_spark3.1
5665
./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} --platform-version ${PLATFORM_VERSION} --emr-release emr-6.3.0 ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
5766
```
5867

68+
### Using a parameter file
69+
70+
The generator accepts an optional parameter file. To use one, upload it to the `params` directory of the bucket as follows.
71+
72+
```bash
73+
aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
74+
```
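
To avoid typing the file name twice, the destination can be derived from the local path. This is a small sketch; `param_s3_uri` is a hypothetical helper that only builds the `params/` destination used in the upload command above.

```bash
# Print the S3 destination for a local parameter file:
# s3://<bucket>/params/<file name>
param_s3_uri() {
  local bucket="$1" file="$2"
  echo "s3://${bucket}/params/$(basename "${file}")"
}

# Example:
#   aws s3 cp params-csv-basic-sf10000.ini "$(param_s3_uri "${BUCKET_NAME}" params-csv-basic-sf10000.ini)"
```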
