Update EMR instructions

szarnyasg · szarnyasg · commit ea08414484af · 2021-07-05T00:14:43.000+02:00
diff --git a/tools/emr/README.md b/tools/emr/README.md
@@ -1,16 +1,27 @@
 # Running Datagen on EMR
 
-We provide support scripts for running LDBC Datagen on EMR and storing the results
-on S3.
+We provide support scripts for running LDBC Datagen on EMR and storing the results on S3.
 
 ## Creating the infrastructure
-Create an S3 bucket. This bucket will have the following layout:
+
+### S3 Bucket
+
+Create an S3 bucket and set the `BUCKET_NAME` environment variable accordingly.
+
+The bucket will have the following layout (created by the scripts/jobs):
 
 - `params`: parameter files
 - `jars`: application JARs
 - `results`: results of successful runs
 - `logs`: logs of the jobs
 
+### AWS Roles
+
+In AWS IAM, add the following roles with **Create Role** | **AWS service** | **EMR**:
+
+* **EMR** a.k.a. `AmazonElasticMapReduceRole`, name it `EMR_DefaultRole`
+* **EMR Role for EC2** a.k.a. `AmazonElasticMapReduceforEC2Role`, name it `EMR_EC2_DefaultRole`
+
 ## Install the required libraries
 
 1. From the repository root, run:
@@ -34,12 +45,6 @@ VERSION=0.4.0-SNAPHOT
 aws s3 cp target/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar s3://${BUCKET_NAME}/jars/ldbc_snb_datagen_${PLATFORM_VERSION}-${VERSION}-jar-with-dependencies.jar
 ```
 
-1. Upload the generator parameter file to S3 (if required).
-
-```bash
-aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
-```
-
 1. Submit the job. Run with `--help` for customization options.
 
 ```bash
@@ -48,6 +53,10 @@ SCALE_FACTOR=10
 ./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
 ```
 
+Note: scale factors below 1 are not supported.
+
+### Using a different EMR version
+
 We use EMR 5.13.0 by default. You can try out `emr-6.3.0` by specifying it with the `--emr-release` option.
 Make sure you uploaded the right JAR first!
 
@@ -56,3 +65,10 @@ PLATFORM_VERSION=2.12_spark3.1
 ./tools/emr/submit_datagen_job.py --bucket ${BUCKET_NAME} --platform-version ${PLATFORM_VERSION} --emr-release emr-6.3.0 ${JOB_NAME} ${SCALE_FACTOR} -- --format csv --mode raw
 ```
 
+### Using a parameter file
+
+The generator accepts an optional parameter file. To use one, upload it to the `params` directory of the bucket:
+
+```bash
+aws s3 cp params-csv-basic-sf10000.ini s3://${BUCKET_NAME}/params/params-csv-basic-sf10000.ini
+```
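
The setup steps this change documents (create the bucket, set `BUCKET_NAME`, add the two default roles, optionally upload a parameter file) can be sketched as one dry-run script. This is only an illustration, not part of the Datagen tooling: the bucket name is a placeholder, `DRY_RUN` and the `run` helper are hypothetical, and `aws emr create-default-roles` is used here as a CLI alternative to the IAM console steps the README describes. By default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the README setup steps; not part of the Datagen scripts.
set -euo pipefail

# Placeholder bucket name; override by exporting BUCKET_NAME beforehand.
BUCKET_NAME="${BUCKET_NAME:-my-datagen-bucket}"
# Hypothetical switch: 1 prints the commands, anything else executes them.
DRY_RUN="${DRY_RUN:-1}"
PARAM_FILE="params-csv-basic-sf10000.ini"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the bucket; the remaining layout is created by the scripts/jobs.
run aws s3 mb "s3://${BUCKET_NAME}"

# Create EMR_DefaultRole and EMR_EC2_DefaultRole via the CLI
# (assumed here as an alternative to creating them in the IAM console).
run aws emr create-default-roles

# Optional: upload a parameter file to the params directory.
run aws s3 cp "${PARAM_FILE}" "s3://${BUCKET_NAME}/params/${PARAM_FILE}"
```

Running it with `DRY_RUN=0` executes the commands for real, which requires a configured AWS CLI and sufficient IAM permissions.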