
Commit f2a32f4

DOC-356 First pass of the data lakes setup doc, adding page for upgrading emr cluster
1 parent c484675 commit f2a32f4

2 files changed (+88 / -23 lines)

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 42 additions & 23 deletions
@@ -10,55 +10,74 @@ The instructions below will guide you through the process required to configure

## Step 1 - Create an S3 Bucket

-In this step, you'll create the S3 bucket that will store both the intermediate and final data.
+In this step, you'll create the S3 bucket that will store both the intermediate and final data. For instructions on creating an S3 bucket, please see Amazon's documentation, [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
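
If you prefer to script this step, the bucket can also be created with the AWS SDK. A minimal boto3 sketch, using the bucket name from these instructions (`segment-data-lake`) and assuming the `us-west-2` region (substitute your own region):

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create the bucket that will hold both the intermediate (staging) and final data.
# In us-east-1, omit the CreateBucketConfiguration argument.
s3.create_bucket(
    Bucket="segment-data-lake",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```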

> info ""
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, the name is `segment-data-lake`.

-During the set up process, create a Lifecycle rule and set it to expire staging data after **14 days**. For more information, see Amazon's documentation, [How do I create a lifecycle?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html).
+After you create your S3 bucket, create a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For help on setting lifecycle configurations, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).

-![Create a Lifecycle rule to expire staging data after 14 days](images/01_14-day-lifecycle.png)
+The following lifecycle settings should be selected:
+* **Expire after:** 14 days
+* **Permanently delete after:** 14 days
+* **Clean up incomplete multipart uploads:** after 14 days
+
+<!--- ![Create a Lifecycle rule to expire staging data after 14 days](images/01_14-day-lifecycle.png) --->
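
The three settings above correspond to a single lifecycle rule on the bucket. A minimal boto3 sketch, assuming the same `segment-data-lake` bucket (the rule ID is arbitrary):

```python
import boto3

s3 = boto3.client("s3")

# One rule that expires staging objects, permanently deletes old versions,
# and aborts incomplete multipart uploads -- all after 14 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="segment-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-segment-staging-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Expiration": {"Days": 14},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 14},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 14},
            }
        ]
    },
)
```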


## Step 2 - Configure an EMR cluster

-Segment requires access to an EMR cluster to perform necessary data processing. We recommend starting with a small cluster, with the option to add more compute as required.
+Segment requires access to an EMR cluster to perform necessary data processing. We recommend starting with a small cluster, with the option to add more compute resources as required.

### Configure the hardware and networking configuration

-1. Locate and select EMR from the AWS console.
-2. Click **Create Cluster**, and open the **Advanced Options**.
-3. In the Advanced Options, on Step 1: Software and Steps, ensure you select the following options, along with the defaults:
-   - `Use for Hive table metadata`
-   - `Use for Spark table metadata` ![Select to use for both Have and Spark table metadata](images/02_hive-spark-table.png)
-4. In the Networking setup section, select to create the cluster in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration. Creating a cluster in a public subnet is accessible from the internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security before you configure your EMR cluster.
-5. In the Hardware Configuration section, create a cluster with the nodes listed below. This configuration uses the default **On demand** purchasing option for the instances.
+1. Select EMR from the AWS console by navigating to Services > Analytics > EMR.
+2. Click **Create Cluster**, and select **Go to advanced options**.
+3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following software libraries:
+   - Hadoop 2.10.1
+   - Hive 2.3.7
+   - Hue 4.9.0
+   - Spark 2.4.7
+   - Pig 0.17.0
+4. Under the AWS Glue Data Catalog settings, select the following options:
+   - Use for Hive table metadata
+   - Use for Spark table metadata
+   <!--- ![Select to use for both Hive and Spark table metadata](images/02_hive-spark-table.png) --->
+5. Select **Next** to move to Step 2: Hardware.
+6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. You can create EMR instances in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet makes it accessible from the Internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. <br />
+   As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
+7. In the Hardware Configuration section, create a cluster that includes the following on-demand nodes:
   - **1** master node
   - **2** core nodes
-  - **2** task nodes ![Configure the number of nodes](images/03_hardware-node-instances.png)
+  - **2** task nodes
+   <!--- ![Configure the number of nodes](images/03_hardware-node-instances.png) --->
+   For more information about configuring cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).

-For more information about configuring the cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
+8. Select **Next** to proceed to Step 3: General Cluster Settings.
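
For reference, steps 1-8 map onto a single EMR `RunJobFlow` request. A minimal boto3 sketch with a placeholder cluster name, subnet ID, and instance types (the `m5.xlarge` sizes are an assumption, not a Segment requirement):

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="segment-data-lakes",                      # placeholder cluster name
    ReleaseLabel="emr-5.33.0",                      # release selected in step 3
    Applications=[
        {"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Hue"},
        {"Name": "Spark"}, {"Name": "Pig"},
    ],
    # Step 4: use the AWS Glue Data Catalog for Hive and Spark table metadata.
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        },
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        },
    ],
    # Steps 6 and 7: networking and on-demand instance groups.
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "ON_DEMAND"},
            {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "ON_DEMAND"},
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",              # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])                        # the EMR cluster ID (j-XXXXXXXXXXXXX)
```

The logging location, tags, key pair, and security groups from the remaining steps can be added to the same call; see the sketches after those sections.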


### Configure logging

-On the General Options step, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are to a new prefix, and separated from the final processed data.
+9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lake` in this case). Once configured, logs are given a new prefix and separated from the final processed data.

-Set value of the **vendor** tag to `segment`. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.
+10. Add a new key-value pair to the Tags section: a **vendor** key with a value of `segment`. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.

+11. Select **Next** to proceed to Step 4: Security.

-![Configure logging](images/05_logging.png)
+<!--- ![Configure logging](images/05_logging.png) --->
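
In API terms, step 9 is the `LogUri` parameter of the `RunJobFlow` request sketched above, and step 10 is a cluster tag. A minimal boto3 sketch for tagging an existing cluster (the cluster ID is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Step 10 on an existing cluster: tag it so Segment's IAM policy can submit jobs.
emr.add_tags(
    ResourceId="j-XXXXXXXXXXXXX",                   # placeholder EMR cluster ID
    Tags=[{"Key": "vendor", "Value": "segment"}],
)

# Step 9 at creation time: pass the log location to run_job_flow, for example
# LogUri="s3://segment-data-lake/emr-logs/"
```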

### Secure the cluster
+12. Create or select an **EC2 key pair**.
+13. Choose the appropriate roles in the **EC2 instance profile**.
+14. Expand the EC2 security group section and select the appropriate security groups for the Master and Core & Task types.
+15. Update any additional security options, then select **Create cluster**.

-On the Security step, be sure to complete the following steps:
-1. Create or select an **EC2 key pair**.
-2. Choose the appropriate roles in the **EC2 instance profile**.
-3. Select the appropriate security groups for the Master and Core & Task types.
-
-![Secure the cluster](images/06_secure-cluster.png)
+<!--- ![Secure the cluster](images/06_secure-cluster.png)

-The image uses the default settings. You can make these settings more restrictive, if required.
+The image uses the default settings. You can make these settings more restrictive, if required. --->
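
If you script cluster creation, the choices from steps 12-15 correspond to a few additional fields on the `Instances` block of the `RunJobFlow` request sketched earlier. A hedged fragment with placeholder values:

```python
# Extra Instances fields for the run_job_flow sketch above.
# All values are placeholders -- substitute your own key pair and security group IDs.
security_settings = {
    "Ec2KeyName": "segment-data-lake-keypair",                # step 12: EC2 key pair
    "EmrManagedMasterSecurityGroup": "sg-0aaaaaaaaaaaaaaaa",  # step 14: Master
    "EmrManagedSlaveSecurityGroup": "sg-0bbbbbbbbbbbbbbbb",   # step 14: Core & Task
}
```

The EC2 instance profile chosen in step 13 maps to the `JobFlowRole` parameter (`EMR_EC2_DefaultRole` by default).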

+> note ""
+> If you are updating your Data Lakes instance, take note of the EMR cluster ID.
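
To look up the cluster ID programmatically, a short boto3 sketch:

```python
import boto3

emr = boto3.client("emr")

# List active clusters and print their IDs (the j-XXXXXXXXXXXXX values).
clusters = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
for cluster in clusters["Clusters"]:
    print(cluster["Id"], cluster["Name"])
```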


## Step 3 - Create an Access Management role and policy

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# Upgrading Data Lakes
+
+This document contains the instructions to manually update an existing Segment Data Lake destination to use a new EMR cluster with version 5.33.0. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.
+
+The Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster.
+
+<!--- Any existing EMR clusters will
+
+Q: What happens to the existing EMR cluster? If there's an ongoing sync, what will happen to that?
+A: If there is an ongoing sync in the existing cluster, the sync will complete (success or fail) in the existing cluster. If the sync fails and the cluster setting has been updated to use the new cluster, the next retry will be performed in the new cluster.
+
+Q: Does one need to stop a sync or disable the Segment Data Lake when performing this update?
+A: No. Ongoing syncs don't need to be stopped, and the Segment Data Lake doesn't need to be disabled. We will automatically restart any failed sync on the new cluster, so no manual intervention should be required.
+
+Q: When can the customer safely delete the old EMR cluster?
+A: The old EMR cluster can be deleted after all the Segment Data Lakes have been updated to use the new cluster and the old EMR cluster doesn't have any ongoing syncs. The general recommendation is:
+1. Update the EMR cluster setting in all the Segment Data Lakes.
+2. Wait for the next sync to be started and completed in the new cluster.
+3. Confirm new data is synced using the new cluster.
+4. Confirm there are no ongoing jobs in the old cluster.
+5. Delete the old cluster. --->
+
+## Prerequisites
+* An S3 bucket with a lifecycle rule set to 14 days
+* An EMR cluster running version 5.33.0 (for instructions, see the manual set up documentation in `data-lakes-manual-setup.md`)
+* The ID of your EMR cluster
+
+## Procedure
+1. Open your Segment App workspace and select your Data Lakes destination.
+2. On the Settings tab, select the EMR Cluster ID field and enter your EMR cluster ID. For more information about your EMR cluster, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
+   **Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name should remain the same.
+3. Select **Save**.
+4. You can delete your old EMR cluster from AWS when the following conditions have been met:
+   * You have updated all Data Lakes to use the new EMR cluster
+   * A sync has successfully completed in the new cluster
+   * Data is synced into the new cluster
+   * There are no ongoing jobs in the old cluster
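
The last condition can be checked programmatically before deleting the old cluster. A minimal boto3 sketch, with a placeholder cluster ID:

```python
import boto3

emr = boto3.client("emr")

# List steps that are still queued or running on the old cluster.
# An empty list means there are no ongoing jobs and it is safe to delete it.
steps = emr.list_steps(
    ClusterId="j-OLDCLUSTERID12",                   # placeholder old cluster ID
    StepStates=["PENDING", "RUNNING", "CANCEL_PENDING"],
)
print(steps["Steps"])
```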
