Commit f92e670

author
markzegarelli
authored
Merge pull request #1115 from segmentio/dl_setup_guide
Small updates to manual set up doc
2 parents 1adae92 + 9d72c27 commit f92e670

2 files changed

+27
-22
lines changed

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 27 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ The instructions below will guide you through the process required to configure
1111
In this step, you'll create the S3 bucket that will store both the intermediate and final data.
1212

1313
> info ""
14-
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, `segment-data-lake` is used.
14+
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, the name is `segment-data-lake`.
1515
1616
During the set up process, create a Lifecycle rule and set it to expire staging data after **14 days**. For more information, see Amazon's documentation, [How do I create a lifecycle?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html).
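As a rough illustration of the lifecycle rule described above, the following Python sketch builds a rule that expires objects after 14 days. The `staging/` prefix and the rule ID are assumptions made for illustration (the instructions do not name the staging prefix). The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def staging_expiration_rule(prefix, days=14):
    """Build an S3 lifecycle configuration that expires objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "expire-staging-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# "staging/" is an assumed prefix; substitute the prefix your staging data uses.
lifecycle = staging_expiration_rule("staging/")
print(json.dumps(lifecycle, indent=2))
```

If you use boto3, a dictionary like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`, with `Bucket` set to your bucket name.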
1717

@@ -25,44 +25,39 @@ Segment requires access to an EMR cluster to perform necessary data processing.
2525

2626
1. Locate and select EMR from the AWS console.
2727
2. Click **Create Cluster**, and open the **Advanced Options**.
28-
3. In the Advanced Options, on Step 1: Software and Steps, ensure the following options are selected, along with the defaults:
28+
3. In the Advanced Options, on Step 1: Software and Steps, ensure you select the following options, along with the defaults:
2929
- `Use for Hive table metadata`
3030
- `Use for Spark table metadata` ![Select to use for both Hive and Spark table metadata](images/02_hive-spark-table.png)
31-
4. In the Networking setup section, select to create the cluster in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires some additional configuration. Creating a cluster in a public subnet is accessible from the internet. However, you can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security before you configure your EMR cluster.
31+
4. In the Networking setup section, select to create the cluster in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration. A cluster created in a public subnet is accessible from the internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
3232
5. In the Hardware Configuration section, create a cluster with the nodes listed below. This configuration uses the default **On demand** purchasing option for the instances.
3333
- **1** master node
3434
- **2** core nodes
3535
- **2** task nodes ![Configure the number of nodes](images/03_hardware-node-instances.png)
36-
37-
For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
38-
39-
### Enable EMR managed scaling for the Core and Task nodes
4036

41-
On the **Cluster Scaling** settings, select **Use EMR-managed scaling**, and select the following number of task units:
42-
- Minimum: **2**
43-
- Maximum: **8**
44-
- On-demand limit: **8**
45-
- Maximum Core Node: **2**
37+
For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
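For readers who script cluster creation instead of using the console, the node layout above (1 master, 2 core, 2 task, all On-Demand) might be expressed as follows. This is a sketch only: the instance type `m5.xlarge` and the group names are assumptions, not values from these instructions. The dictionaries follow the `InstanceGroups` shape used by boto3's EMR `run_job_flow`.

```python
def instance_groups(instance_type):
    """Return EMR instance groups: 1 master, 2 core, 2 task, all On-Demand."""
    counts = {"MASTER": 1, "CORE": 2, "TASK": 2}
    return [
        {
            "Name": f"{role.lower()}-nodes",  # hypothetical group name
            "InstanceRole": role,
            "InstanceType": instance_type,
            "InstanceCount": count,
            "Market": "ON_DEMAND",
        }
        for role, count in counts.items()
    ]


# "m5.xlarge" is an assumed instance type; choose one that fits your workload.
groups = instance_groups("m5.xlarge")
for group in groups:
    print(group["InstanceRole"], group["InstanceCount"])
```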
4638

47-
![Configure the Cluster scaling options](images/04_cluster-scaling.png)
4839

4940
### Configure logging
5041

51-
On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs will be written to a new prefix, and separated from the final processed data.
42+
On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are written to a new prefix, separated from the final processed data.
43+
44+
Set the value of the **vendor** tag to `segment`. The IAM policy uses this tag to give Segment access to submit jobs in the EMR cluster.
5245

53-
Set value of the **vendor** tag to `segment`.
5446

5547
![Configure logging](images/05_logging.png)
5648

5749
### Secure the cluster
5850

59-
On the Security step, ensure that the following steps have been completed:
51+
On the Security step, be sure to complete the following steps:
6052
1. Create or select an **EC2 key pair**.
6153
2. Choose the appropriate roles in the **EC2 instance profile**.
6254
3. Select the appropriate security groups for the Master and Core & Task types.
6355

6456
![Secure the cluster](images/06_secure-cluster.png)
6557

58+
The image uses the default settings. You can make these settings more restrictive, if required.
59+
60+
6661
## Step 3 - Create an Access Management role and policy
6762

6863
The following steps provide examples of the IAM Role and IAM Policy.
@@ -104,7 +99,7 @@ Create a `segment-data-lake-role` role for Segment to assume. Attach the followi
10499
105100
### IAM Policy
106101

107-
Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3
102+
Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
108103

109104
```json
110105
{
@@ -162,7 +157,7 @@ Add a policy to the role created above to give Segment access to the relevant Gl
162157
"Effect": "Allow",
163158
"Action": "*",
164159
"Resource": [
165-
"arn:aws:s3:::$BUCKET_NAME/*",
160+
"arn:aws:s3:::$BUCKET_NAME/*",
166161
"arn:aws:s3:::$BUCKET_NAME"
167162
]
168163
},
@@ -174,19 +169,29 @@ Add a policy to the role created above to give Segment access to the relevant Gl
174169
"Resource": [
175170
"*"
176171
]
172+
},
173+
{
174+
"Sid": "",
175+
"Effect": "Allow",
176+
"Action": "iam:PassRole",
177+
"Resource": [
178+
"arn:aws:iam::$ACCOUNT_ID:role/EMR_DefaultRole",
179+
"arn:aws:iam::$ACCOUNT_ID:role/EMR_AutoScaling_DefaultRole",
180+
"arn:aws:iam::$ACCOUNT_ID:role/EMR_EC2_DefaultRole"
181+
]
177182
}
178183
]
179184
}
180185
```
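The policy above uses `$BUCKET_NAME` and `$ACCOUNT_ID` as placeholders. One possible way to fill them in is Python's standard `string.Template`, which uses the same `$NAME` syntax. The sketch below works on a trimmed excerpt of the policy (only the S3 statement and one `iam:PassRole` resource), and the account ID `123456789012` is a made-up example value.

```python
import json
from string import Template

# Trimmed excerpt of the policy above, for illustration only.
POLICY_EXCERPT = Template("""
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": [
        "arn:aws:s3:::$BUCKET_NAME/*",
        "arn:aws:s3:::$BUCKET_NAME"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::$ACCOUNT_ID:role/EMR_DefaultRole"
      ]
    }
  ]
}
""")


def render_policy(bucket_name, account_id):
    """Substitute the placeholders and confirm the result is valid JSON."""
    rendered = POLICY_EXCERPT.substitute(
        BUCKET_NAME=bucket_name, ACCOUNT_ID=account_id
    )
    return json.loads(rendered)


# Both arguments are example values; use your own bucket name and account ID.
policy = render_policy("segment-data-lake", "123456789012")
print(json.dumps(policy, indent=2))
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which helps catch a forgotten value before the policy is attached to the role.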
181186

182187
> note ""
183-
> **NOTE:** The policy above grants full access to Athena, but the individual Glue and S3 policies decide which table can be queried. Segment queries only for debugging purposes, and will notify you be for running any queries.
188+
> **NOTE:** The policy above grants full access to Athena, but the individual Glue and S3 policies decide which table is queryable. Segment queries for debugging purposes, and will notify you before running any queries.
184189
185190
## Debugging
186191

187192
Segment requires access to the data and schema for debugging data quality issues. The modes available for debugging are:
188-
- Access the individual objects stored in S3 and the associated schema in order to understand data discrepancies
193+
- Access the individual objects stored in S3 and the associated schema to understand data discrepancies
189194
- Run an Athena query on the underlying data stored in S3
190195
- Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
191-
- An easier alternative is to create a new account that has Athena backed by Glue as the default.
192-
196+
![Debugging](images/dl_setup_glueerror.png)
197+
- An easier alternative is to create a new account that has Athena backed by Glue as the default.