
Commit 7388307

Merge pull request #1093 from segmentio/dl-manual-setup
Authored by markzegarelli

moving this from a google doc into an actual doc

2 parents b9cf19c + ec249ac commit 7388307

File tree

8 files changed: +195 -3 lines changed

src/connections/storage/catalog/data-lakes/index.md

Lines changed: 3 additions & 3 deletions
@@ -20,7 +20,7 @@ Before you set up Segment Data Lakes, you need the following resources:

  You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the set up work to get Data Lakes up and running. If you're familiar with Terraform, you can modify the module to meet your organization's needs; however, Segment guarantees support only for the template as provided. The Data Lakes set up uses Terraform v0.11+. To support more versions of Terraform, the aws provider must use v2, which is included in our example main.tf.

- You can also use our [manual set up instructions](https://docs.google.com/document/d/1GlWzS5KO4QaiVZx9pwfpgF-N-Xy2e_QQcdYSX-nLMDU/view) to configure these AWS resources if you prefer.
+ You can also use our [manual set up instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.

  The Terraform module and manual set up instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
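(For illustration only: the example `main.tf` referenced in the hunk above isn't part of this diff. Pinning the aws provider to v2 looks roughly like the following, rendered here in Terraform's JSON syntax with `$REGION` as a placeholder; consult the Terraform module repository for the authoritative example.)

```json
{
  "terraform": {
    "required_version": ">= 0.11.0"
  },
  "provider": {
    "aws": {
      "version": "~> 2.0",
      "region": "$REGION"
    }
  }
}
```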

@@ -57,12 +57,12 @@ Once the Data Lakes destination is enabled, the first sync will begin approximately

  ## Step 3 - Verify Data is Synced to S3 and Glue

- You will see event data and [sync reports](https://segment.com/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However if an [insufficient permission](https://segment.com/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](https://segment.com/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during set up, the first data lake sync will fail.
+ You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However, if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during set up, the first data lake sync will fail.

  To be alerted of sync failures via email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications).
  ![](images/dl_activity_notifications2.png)

- `Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](https://segment.com/docs/connections/storage/data-lakes/sync-reports/#sync-errors).
+ `Sync Failed` emails are sent on the 1st, 5th, and 20th sync failures. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).

  ## (Optional) Step 4 - Replay Historical Data
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
---
hidden: true
title: Configure the Data Lakes AWS Environment
---

The instructions below guide you through configuring the AWS environment required to begin loading data into your Segment Data Lake. For a more automated process, see [Step 1 - Configure AWS Resources](/docs/connections/storage/catalog/data-lakes/#step-1---configure-aws-resources) in the main Data Lakes setup guide.

## Step 1 - Create an S3 Bucket

In this step, you'll create the S3 bucket that will store both the intermediate and final data.

> info ""
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, `segment-data-lake` is used.

During the set up process, create a Lifecycle rule and set it to expire staging data after **14 days**. For more information, see Amazon's documentation, [How do I create a lifecycle rule?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html).

![Create a Lifecycle rule to expire staging data after 14 days](images/01_14-day-lifecycle.png)
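If you script your bucket setup instead of using the console, the same expiration can be expressed as a lifecycle configuration document (for example, for the S3 `PutBucketLifecycleConfiguration` API). This is a minimal sketch, not taken from Segment's instructions; `$STAGING_PREFIX` is a placeholder for wherever your staging data is written:

```json
{
  "Rules": [
    {
      "ID": "expire-staging-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "$STAGING_PREFIX/" },
      "Expiration": { "Days": 14 }
    }
  ]
}
```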
## Step 2 - Configure an EMR cluster

Segment requires access to an EMR cluster to perform necessary data processing. We recommend starting with a small cluster, with the option to add more compute as required.

### Configure the hardware and networking configuration

1. Locate and select EMR from the AWS console.
2. Click **Create Cluster**, and open the **Advanced Options**.
3. In the Advanced Options, on Step 1: Software and Steps, ensure the following options are selected, along with the defaults:
   - `Use for Hive table metadata`
   - `Use for Spark table metadata`

   ![Select to use for both Hive and Spark table metadata](images/02_hive-spark-table.png)
4. In the Networking setup section, choose whether to create the cluster in a public or private subnet. Creating the cluster in a private subnet is more secure, but requires some additional configuration. A cluster created in a public subnet is accessible from the internet; however, you can configure strict security groups to prevent inbound access to it. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html), for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
5. In the Hardware Configuration section, create a cluster with the nodes listed below (a sketch of the equivalent API configuration follows this section). This configuration uses the default **On demand** purchasing option for the instances.
   - **1** master node
   - **2** core nodes
   - **2** task nodes

   ![Configure the number of nodes](images/03_hardware-node-instances.png)

For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
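For reference, the same node layout can be expressed as an instance-groups definition for the EMR `RunJobFlow` API. This sketch is illustrative only; `$INSTANCE_TYPE` is a placeholder, as these instructions don't prescribe a specific instance type:

```json
[
  { "Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND", "InstanceType": "$INSTANCE_TYPE", "InstanceCount": 1 },
  { "Name": "Core",   "InstanceRole": "CORE",   "Market": "ON_DEMAND", "InstanceType": "$INSTANCE_TYPE", "InstanceCount": 2 },
  { "Name": "Task",   "InstanceRole": "TASK",   "Market": "ON_DEMAND", "InstanceType": "$INSTANCE_TYPE", "InstanceCount": 2 }
]
```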
### Enable EMR managed scaling for the Core and Task nodes

On the **Cluster Scaling** settings, select **Use EMR-managed scaling**, and set the following unit limits:
- Minimum: **2**
- Maximum: **8**
- On-demand limit: **8**
- Maximum Core Node: **2**

![Configure the Cluster scaling options](images/04_cluster-scaling.png)
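These console settings correspond to the EMR `ManagedScalingPolicy` structure (as passed to the `PutManagedScalingPolicy` API). A sketch of the equivalent limits, assuming the cluster is measured in instances rather than vCPUs or instance-fleet units:

```json
{
  "ComputeLimits": {
    "UnitType": "Instances",
    "MinimumCapacityUnits": 2,
    "MaximumCapacityUnits": 8,
    "MaximumOnDemandCapacityUnits": 8,
    "MaximumCoreCapacityUnits": 2
  }
}
```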
### Configure logging

On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lake` in this case). Once configured, logs are written to a new prefix, separated from the final processed data.

Set the value of the **vendor** tag to `segment`.

![Configure logging](images/05_logging.png)
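In API terms, these two settings map to the `LogUri` and `Tags` fields of the EMR `RunJobFlow` request. A sketch, assuming the `segment-data-lake` bucket from Step 1 and a `logs/` prefix of your choosing:

```json
{
  "LogUri": "s3://segment-data-lake/logs/",
  "Tags": [
    { "Key": "vendor", "Value": "segment" }
  ]
}
```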
### Secure the cluster

On the Security step, ensure that the following steps have been completed:
1. Create or select an **EC2 key pair**.
2. Choose the appropriate roles in the **EC2 instance profile**.
3. Select the appropriate security groups for the Master and Core & Task types.

![Secure the cluster](images/06_secure-cluster.png)
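If you create the cluster through the API instead of the console, these security settings live in the `Instances` attributes of the `RunJobFlow` request. A sketch with placeholders for your own key pair, subnet, and security groups:

```json
{
  "Ec2KeyName": "$YOUR_KEY_PAIR",
  "Ec2SubnetId": "$YOUR_SUBNET_ID",
  "EmrManagedMasterSecurityGroup": "$MASTER_SECURITY_GROUP",
  "EmrManagedSlaveSecurityGroup": "$CORE_AND_TASK_SECURITY_GROUP"
}
```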
## Step 3 - Create an Access Management role and policy

The following steps provide examples of the IAM Role and IAM Policy.

### IAM Role

Create a `segment-data-lake-role` role for Segment to assume. Attach the following trust relationship document to the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::294048959147:role/customer-datalakes-prod-admin",
          "arn:aws:iam::294048959147:role/datalakes-aws-worker",
          "arn:aws:iam::294048959147:role/datalakes-customer-service"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": [
            "SOURCE_1",
            "SOURCE_N"
          ]
        }
      }
    }
  ]
}
```

> note ""
> **NOTE:** Replace the `ExternalId` list with the Segment `SourceId` values that are synced to the Data Lake.
### IAM Policy

Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:TerminateJobFlows",
        "elasticmapreduce:RunJobFlow",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:CancelSteps",
        "elasticmapreduce:AddJobFlowSteps"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "elasticmapreduce:ResourceTag/vendor": "segment"
        }
      }
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "glue:UpdateTable",
        "glue:UpdatePartition",
        "glue:GetTables",
        "glue:GetTableVersions",
        "glue:GetTableVersion",
        "glue:GetTable",
        "glue:GetPartitions",
        "glue:GetPartition",
        "glue:DeleteTableVersion",
        "glue:DeleteTable",
        "glue:DeletePartition",
        "glue:CreateTable",
        "glue:CreatePartition",
        "glue:CreateDatabase",
        "glue:BatchGetPartition",
        "glue:BatchDeleteTableVersion",
        "glue:BatchDeleteTable",
        "glue:BatchDeletePartition",
        "glue:BatchCreatePartition"
      ],
      "Resource": [
        "arn:aws:glue:$REGION:$YOUR_ACCOUNT:table/*",
        "arn:aws:glue:$REGION:$YOUR_ACCOUNT:database/default",
        "arn:aws:glue:$REGION:$YOUR_ACCOUNT:database/*",
        "arn:aws:glue:$REGION:$YOUR_ACCOUNT:catalog"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": [
        "arn:aws:s3:::$BUCKET_NAME/*",
        "arn:aws:s3:::$BUCKET_NAME"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

> note ""
> **NOTE:** The policy above grants full access to Athena, but the individual Glue and S3 policies decide which tables can be queried. Segment queries only for debugging purposes, and will notify you before running any queries.
## Debugging

Segment requires access to the data and schema for debugging data quality issues. The modes available for debugging are:
- Access the individual objects stored in S3 and the associated schema in order to understand data discrepancies
- Run an Athena query on the underlying data stored in S3
  - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
  - An easier alternative is to create a new account that has Athena backed by Glue as the default.
