src/connections/storage/data-lakes/data-lakes-manual-setup.md
Lines changed: 20 additions & 15 deletions
@@ -33,24 +33,16 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 -**1** master node
 -**2** core nodes
 -**2** task nodes
-
-For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
-
-### Enable EMR managed scaling for the Core and Task nodes
 
-On the **Cluster Scaling** settings, select **Use EMR-managed scaling**, and select the following number of task units:
-- Minimum: **2**
-- Maximum: **8**
-- On-demand limit: **8**
-- Maximum Core Node: **2**
+For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
 
-
 
 ### Configure logging
 
 On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs will be written to a new prefix, and separated from the final processed data.
 
-Set value of the **vendor** tag to `segment`.
+Set value of the **vendor** tag to `segment`. This is used in the IAM policy to provide Segment access to submit jobs in the EMR cluster.
+
 
 
 
@@ -63,6 +55,9 @@ On the Security step, ensure that the following steps have been completed:
 
 
 
+The image uses the default settings, however these settings can be made more restrictive, if required.
+
+
 ## Step 3 - Create an Access Management role and policy
 
 The following steps provide examples of the IAM Role and IAM Policy.
@@ -104,7 +99,7 @@ Create a `segment-data-lake-role` role for Segment to assume. Attach the followi
 
 ### IAM Policy
 
-Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3
+Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
 
 ```json
 {
@@ -162,7 +157,7 @@ Add a policy to the role created above to give Segment access to the relevant Gl
 "Effect": "Allow",
 "Action": "*",
 "Resource": [
-"arn:aws:s3:::$BUCKET_NAME/*",
+"arn:aws:s3:::$BUCKET_NAME/*",
 "arn:aws:s3:::$BUCKET_NAME"
 ]
 },
@@ -174,6 +169,16 @@ Add a policy to the role created above to give Segment access to the relevant Gl
@@ -188,5 +193,5 @@ Segment requires access to the data and schema for debugging data quality issues
 - Access the individual objects stored in S3 and the associated schema in order to understand data discrepancies
 - Run an Athena query on the underlying data stored in S3
 - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
-- An easier alternative is to create a new account that has Athena backed by Glue as the default.
-
+
+- An easier alternative is to create a new account that has Athena backed by Glue as the default.
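
Reviewer note: the S3 statement touched in this diff uses a `$BUCKET_NAME` placeholder that has to be replaced before the policy is attached. The following is only a minimal sketch of that substitution, not part of the changed file. The names `S3_STATEMENT_TEMPLATE` and `render_s3_statement` are hypothetical, and the statement body is copied from the lines visible in the hunk above rather than from the full policy.

```python
import json
from string import Template

# Hypothetical template mirroring only the S3 statement visible in the diff;
# the rest of the IAM policy is not reproduced here.
S3_STATEMENT_TEMPLATE = Template("""{
  "Effect": "Allow",
  "Action": "*",
  "Resource": [
    "arn:aws:s3:::$BUCKET_NAME/*",
    "arn:aws:s3:::$BUCKET_NAME"
  ]
}""")


def render_s3_statement(bucket_name: str) -> dict:
    """Substitute the bucket name and parse the result as JSON."""
    rendered = S3_STATEMENT_TEMPLATE.substitute(BUCKET_NAME=bucket_name)
    # json.loads raises ValueError if the substitution produced invalid JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # `segment-data-lakes` is the example bucket name used in the document.
    print(json.dumps(render_s3_statement("segment-data-lakes"), indent=2))
```

Parsing the rendered text with `json.loads` is a cheap way to catch a stray quote or comma before the statement is pasted into the IAM console.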