Commit a42f878

author
Mallika Sahay
committed
Small updates to manual set up doc
1 parent 1adae92 commit a42f878

2 files changed

+20
-15
lines changed

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 20 additions & 15 deletions
@@ -33,24 +33,16 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 - **1** master node
 - **2** core nodes
 - **2** task nodes ![Configure the number of nodes](images/03_hardware-node-instances.png)
-
-For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
-
-### Enable EMR managed scaling for the Core and Task nodes
 
-On the **Cluster Scaling** settings, select **Use EMR-managed scaling**, and select the following number of task units:
-- Minimum: **2**
-- Maximum: **8**
-- On-demand limit: **8**
-- Maximum Core Node: **2**
+For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
 
-![Configure the Cluster scaling options](images/04_cluster-scaling.png)
 
 ### Configure logging
 
 On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs will be written to a new prefix, and separated from the final processed data.
 
-Set value of the **vendor** tag to `segment`.
+Set value of the **vendor** tag to `segment`. This is used in the IAM policy to provide Segment access to submit jobs in the EMR cluster.
+
 
 ![Configure logging](images/05_logging.png)
 
@@ -63,6 +55,9 @@ On the Security step, ensure that the following steps have been completed:
 
 ![Secure the cluster](images/06_secure-cluster.png)
 
+The image uses the default settings, however these settings can be made more restrictive, if required.
+
+
 ## Step 3 - Create an Access Management role and policy
 
 The following steps provide examples of the IAM Role and IAM Policy.
@@ -104,7 +99,7 @@ Create a `segment-data-lake-role` role for Segment to assume. Attach the followi
 
 ### IAM Policy
 
-Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3
+Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
 
 ```json
 {
@@ -162,7 +157,7 @@ Add a policy to the role created above to give Segment access to the relevant Gl
 "Effect": "Allow",
 "Action": "*",
 "Resource": [
-"arn:aws:s3:::$BUCKET_NAME/*",
+"arn:aws:s3:::$BUCKET_NAME/*",
 "arn:aws:s3:::$BUCKET_NAME"
 ]
 },
@@ -174,6 +169,16 @@ Add a policy to the role created above to give Segment access to the relevant Gl
 "Resource": [
 "*"
 ]
+},
+{
+"Sid": "",
+"Effect": "Allow",
+"Action": "iam:PassRole",
+"Resource": [
+"arn:aws:iam::$ACCOUNT_ID:role/EMR_DefaultRole",
+"arn:aws:iam::$ACCOUNT_ID:role/EMR_AutoScaling_DefaultRole",
+"arn:aws:iam::$ACCOUNT_ID:role/EMR_EC2_DefaultRole"
+]
 }
 ]
 }
@@ -188,5 +193,5 @@ Segment requires access to the data and schema for debugging data quality issues
 - Access the individual objects stored in S3 and the associated schema in order to understand data discrepancies
 - Run an Athena query on the underlying data stored in S3
 - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
-- An easier alternative is to create a new account that has Athena backed by Glue as the default.
-
+![Debugging](images/dl_setup_glueerror.png)
+- An easier alternative is to create a new account that has Athena backed by Glue as the default.
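
Review aid for the `iam:PassRole` statement this commit adds to the IAM policy: a minimal Python sketch that renders the statement for a concrete account, assuming `$ACCOUNT_ID` is a placeholder for the reader's 12-digit AWS account ID. The function name `render_pass_role_statement` and the example ID `123456789012` are hypothetical and not part of the Segment docs. The statement text is copied from the diff.

```python
import json
from string import Template

# Statement text copied from the policy in the diff; $ACCOUNT_ID is the
# placeholder the doc expects the reader to replace.
PASS_ROLE_TEMPLATE = Template("""
{
  "Sid": "",
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": [
    "arn:aws:iam::$ACCOUNT_ID:role/EMR_DefaultRole",
    "arn:aws:iam::$ACCOUNT_ID:role/EMR_AutoScaling_DefaultRole",
    "arn:aws:iam::$ACCOUNT_ID:role/EMR_EC2_DefaultRole"
  ]
}
""")


def render_pass_role_statement(account_id: str) -> dict:
    """Hypothetical helper: fill in $ACCOUNT_ID and parse the result as JSON."""
    # AWS account IDs are 12 decimal digits; reject anything else early.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("expected a 12-digit AWS account ID")
    rendered = PASS_ROLE_TEMPLATE.substitute(ACCOUNT_ID=account_id)
    # json.loads fails loudly if the substituted text is not valid JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # Example account ID is illustrative only.
    print(json.dumps(render_pass_role_statement("123456789012"), indent=2))
```

Parsing the rendered text with `json.loads` is a cheap way to catch a stray comma or an unreplaced placeholder before the statement is pasted into the policy.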