`src/connections/storage/data-lakes/data-lakes-manual-setup.md` (42 additions, 23 deletions)
The instructions below will guide you through the process required to configure …
## Step 1 - Create an S3 Bucket

In this step, you'll create the S3 bucket that will store both the intermediate and final data. For instructions on creating an S3 bucket, see Amazon's documentation, [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).

> info ""
> Take note of the S3 bucket name you set in this step, as the rest of the setup flow requires it. In these instructions, the name is `segment-data-lake`.
After you create your S3 bucket, create a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For help setting lifecycle configurations, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).

Select the following lifecycle settings:

* **Expire after:** 14 days
* **Permanently delete after:** 14 days
* **Clean up incomplete multipart uploads:** after 14 days

<!---  --->
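If you manage AWS resources with code rather than the console, the same lifecycle rule can be expressed through the AWS SDK. The snippet below is a minimal sketch using Python and boto3, not part of the original instructions; the rule ID and the empty prefix are illustrative, and the bucket name is the `segment-data-lake` example used above.

```python
import boto3

s3 = boto3.client("s3")

# Example bucket name from the instructions above; replace with your own.
bucket = "segment-data-lake"

# Expire staging objects, permanently delete noncurrent versions, and abort
# incomplete multipart uploads after 14 days, mirroring the console settings.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-segment-staging-data",  # illustrative rule name
                "Filter": {"Prefix": ""},             # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 14},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 14},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 14},
            }
        ]
    },
)
```

However you apply the rule, the important part is that staging data in the bucket is cleaned up after 14 days.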
## Step 2 - Configure an EMR cluster

Segment requires access to an EMR cluster to perform necessary data processing. We recommend starting with a small cluster, with the option to add more compute resources as required.

### Configure the hardware and networking configuration

1. Select EMR from the AWS console by navigating to Services > Analytics > EMR.
2. Click **Create Cluster**, and select **Go to advanced options**.
3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following software libraries:
    - Hadoop 2.10.1
    - Hive 2.3.7
    - Hue 4.9.0
    - Spark 2.4.7
    - Pig 0.17.0
4. Under the AWS Glue Data Catalog settings, select the following options:
    - Use for Hive table metadata
    - Use for Spark table metadata

<!---  --->
5. Select **Next** to move to Step 2: Hardware.
6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. You can create EMR instances in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet makes it accessible from the Internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. <br />
    As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
7. In the Hardware Configuration section, create a cluster that includes the following on-demand nodes:
    - **1** master node
    - **2** core nodes
    - **2** task nodes

    For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
8. Select **Next** to proceed to Step 3: General Cluster Settings.
### Configure logging

9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are written to a new prefix and separated from the final processed data.
10. Add a new key-value pair in the Tags section: a **vendor** key with a value of `segment`. The IAM policy uses this tag to provide Segment access to submit jobs in the EMR cluster.
11. Select **Next** to proceed to Step 4: Security.
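For teams that script their AWS setup, the console walkthrough above corresponds roughly to a single `run_job_flow` call with the AWS SDK. The sketch below uses Python and boto3 and is illustrative only, not a Segment-provided script: the cluster name, region, subnet ID, instance types, and IAM roles (`EMR_DefaultRole` and `EMR_EC2_DefaultRole`) are assumptions to adjust for your environment. The Glue Data Catalog options are expressed through the `hive-site` and `spark-hive-site` classifications.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

# Route Hive and Spark table metadata to the AWS Glue Data Catalog,
# matching the "Use for Hive/Spark table metadata" console options.
glue_catalog = {
    "hive.metastore.client.factory.class": (
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    )
}

response = emr.run_job_flow(
    Name="segment-data-lake-emr",  # placeholder cluster name
    ReleaseLabel="emr-5.33.0",
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "Hive"},
        {"Name": "Hue"},
        {"Name": "Spark"},
        {"Name": "Pig"},
    ],
    Configurations=[
        {"Classification": "hive-site", "Properties": glue_catalog},
        {"Classification": "spark-hive-site", "Properties": glue_catalog},
    ],
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"InstanceRole": "TASK", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
    },
    # Send cluster logs to the same bucket that holds the final data.
    LogUri="s3://segment-data-lakes/logs/",
    # The vendor=segment tag is what the IAM policy keys on.
    Tags=[{"Key": "vendor", "Value": "segment"}],
    JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Whichever way you create the cluster, the `vendor: segment` tag and the Glue metadata settings are what Segment relies on when it submits jobs.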
This document contains the instructions to manually update an existing Segment Data Lake destination to use a new EMR cluster with version 5.33.0. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.

The Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster.
<!--- Any existing EMR clusters will

What happens to the existing EMR cluster? If there's an ongoing sync, what will happen to that?
If there is an ongoing sync in the existing cluster, the sync will complete (success/fail) in the existing cluster. If the sync ends up failing and the cluster setting has been updated to use the new cluster, the next retry will be performed in the new cluster.

Does one need to stop a sync or disable the Segment Data Lake when performing this update?
No, ongoing syncs don't need to be stopped, and the Segment Data Lake doesn't need to be disabled. We will automatically restart any failed sync on the new cluster, so there should not be any manual intervention required.

When can the customer safely delete the old EMR cluster?
The old EMR cluster can be deleted after all the Segment Data Lakes have been updated to use the new cluster and the old EMR cluster doesn't have any ongoing syncs. The general recommendation is:
1. Update the EMR cluster setting in all the Segment Data Lakes.
2. Wait for the next sync to be started and completed in the new cluster.
3. Confirm new data is synced using the new cluster.
4. Confirm there are no ongoing jobs in the old cluster.
5. Delete the old cluster. --->
## Prerequisites
* S3 bucket with a lifecycle rule of 14 days
* An EMR cluster version 5.33.0 (for instructions)
* The ID of your EMR Cluster
## Procedure
1. Open your Segment App workspace and select your Data Lakes destination.
2. On the Settings tab, select the EMR Cluster ID field and enter your EMR ID. For more information about your EMR Cluster, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
    **Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name should remain the same.
3. Select **Save**.
4. You can delete your old EMR cluster from AWS when the following conditions have been met:
    * You have updated all Data Lakes to use the new EMR cluster
    * A sync has successfully completed in the new cluster
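If you'd rather verify these conditions from a script than from the console, a quick check with boto3 along the following lines can help. This is an illustrative sketch, not part of Segment's instructions; the region and the old cluster ID are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

# List active clusters to find the ID of the new EMR cluster
# (the value to enter in the EMR Cluster ID field in Segment).
for cluster in emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]:
    print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])

# Before deleting the old cluster, confirm it has no pending or running steps.
old_cluster_id = "j-XXXXXXXXXXXXX"  # placeholder for your old cluster ID
active_steps = emr.list_steps(
    ClusterId=old_cluster_id,
    StepStates=["PENDING", "RUNNING"],
)["Steps"]
if not active_steps:
    print(f"{old_cluster_id} has no active steps and can be terminated.")
```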