src/connections/storage/data-lakes/data-lakes-manual-setup.md
8 additions & 8 deletions
@@ -13,7 +13,8 @@ The instructions below will guide you through the process required to configure
 In this step, you'll create the S3 bucket that will store both the intermediate and final data. For instructions on creating an S3 bucket, please see Amazon's documentation, [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).

 > info ""
-> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, the name is `segment-data-lake`.
+> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it.
+<!--- In these instructions, the name is `segment-data-lake`. --->

 After you create your S3 bucket, create a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For help on setting lifecycle configurations, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).

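The 14-day lifecycle rule described in this hunk could be expressed as a configuration document like the one below. This is a minimal sketch, not Segment's own tooling: the function name, the rule ID, and the `staging/` prefix are assumptions made for illustration, since the diff does not say which prefix holds the staging data.

```python
import json


def staging_lifecycle_configuration(prefix: str, days: int = 14) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                # Hypothetical rule name; any unique ID works.
                "ID": f"expire-staging-after-{days}-days",
                # Limit the rule to the staging prefix only.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# "staging/" is a placeholder; substitute the prefix your bucket actually uses.
config = staging_lifecycle_configuration("staging/")
print(json.dumps(config, indent=2))
```

If you use boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, together with your bucket name.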
@@ -32,7 +33,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.

 1. Select EMR from the AWS console by navigating to Services > Analytics > EMR.
 2. Click **Create Cluster**, and select **Go to advanced options**.
-3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following software libraries:
+3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following applications:
 - Hadoop 2.10.1
 - Hive 2.3.7
 - Hue 4.9.0
@@ -43,8 +44,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 - Use for Spark table metadata
 <!---  --->
 5. Select **Next** to move to Step 2: Hardware.
-6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. You can create EMR instances in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet makes it accessible from the Internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. <br />
-As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
+6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. EMR instances can be created in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. You can configure strict security groups for EMR clusters on public subnets to prevent inbound access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.

 7. In the Hardware Configuration section, create a cluster that includes the following on-demand nodes:
 -**1** master node
@@ -55,12 +55,12 @@ For more information about configuring cluster hardware and networking, see Amaz

 8. Select **Next** to proceed to Step 3: General Cluster Settings.

-
 ### Configure logging

-9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are given a new prefix, and separated from the final processed data.
+9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data. Once configured, logs are assigned a new prefix and separated from the final processed data.
+<!--- (`segment-data-lakes` in this case) --->

-10. Add a new key-value pair to the Tags section, a **vendor** key with a value of `segment`. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.
+10. Add a new key-value pair to the Tags section, a **vendor** key with a value of **segment**. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.

 11. Select **Next** to proceed to Step 4: Security.

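The logging and tagging settings from steps 9 and 10 of this hunk can also be written down as data, which makes the required `vendor` tag easy to verify. The sketch below is illustrative only: the helper names, the `example-bucket` name, and the `logs/` prefix are assumptions, and the `LogUri` and `Tags` keys follow the shape boto3's EMR `run_job_flow` call accepts rather than anything stated in the diff.

```python
def cluster_logging_and_tags(bucket: str, log_prefix: str = "logs/") -> dict:
    """Return the logging and tag settings described in steps 9 and 10."""
    return {
        # Logs go to the same bucket as the final data, under their own prefix.
        "LogUri": f"s3://{bucket}/{log_prefix}",
        # The IAM policy relies on this tag to grant Segment access.
        "Tags": [{"Key": "vendor", "Value": "segment"}],
    }


def has_segment_vendor_tag(tags: list) -> bool:
    """Check a list of {"Key": ..., "Value": ...} tags for vendor=segment."""
    return any(
        tag.get("Key") == "vendor" and tag.get("Value") == "segment"
        for tag in tags
    )


# "example-bucket" is a placeholder for the bucket created in the earlier step.
settings = cluster_logging_and_tags("example-bucket")
print(settings["LogUri"])
print(has_segment_vendor_tag(settings["Tags"]))
```

The same `has_segment_vendor_tag` check could be applied to the tag list of an existing cluster before you point Segment at it.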
@@ -77,7 +77,7 @@ For more information about configuring cluster hardware and networking, see Amaz
 The image uses the default settings. You can make these settings more restrictive, if required. --->

 > note ""
-> If you are updating your Data Lakes instance, take note of the EMR cluster ID.
+> **NOTE:** If you are updating the EMR cluster for your Data Lakes instance, note the EMR cluster ID.

 ## Step 3 - Create an Access Management role and policy
src/connections/storage/data-lakes/upgrade-emr-cluster.md
12 additions & 29 deletions
@@ -1,42 +1,25 @@
-# Upgrading Data Lakes
+---
+hidden: true
+title: Upgrading EMR Clusters
+---
+{% include content/plan-grid.md name="data-lakes" %}

+# Upgrading EMR Clusters
 This document contains the instructions to manually update an existing Segment
-Data Lake destination to use a new EMR cluster with version 5.33.0. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.
+Data Lake destination to use a new v5.33.0 EMR cluster. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.

-The Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster.
+By updating your EMR cluster from 5.27.0 to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). Clusters running version 5.33.0 also allow for faster Parquet jobs and dynamic auto-scaling.

-<!--- Any existing EMR clusters will
-
-What happens to the existing EMR cluster? If there’s an ongoing sync, what will
-happen to that?
-If there is an ongoing sync in the existing cluster, the sync will complete (success/
-fail) in the existing cluster. If the sync ends up failing and if the cluster setting has
-been updated to use the new cluster, the next retry will be performed in the new
-cluster.
-. Does one need to stop a sync or disable the Segment Data Lake when
-performing this update?
-No, on-going syncs don’t need not be stopped nor Segment Data Lake needs to be
-disabled. We will automatically restart any failed sync on the new cluster so there
-should not be any manual intervention required.
-
-. When can the customer safely delete the old EMR cluster?
-The old EMR cluster could be deleted after all the Segment Data Lakes have been
-updated to use the new cluster and the old EMR cluster doesn’t have any on-going
-syncs. General recommendation is
-Update EMR cluster setting in all the Segment Data Lakes
-Wait for the next sync to be started and completed in the new cluster
-Confirm new data is synced using the new cluster
-Confirm no on-going jobs in the old cluster
-Delete the old cluster --->
+> info ""
+> Your Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are setting up a new EMR cluster will be restarted on the new cluster.

 ## Prerequisites
 * S3 bucket with a lifecycle rule of 14 days
-* An EMR cluster version 5.33.0 (for instructions)
-* The ID of your EMR Cluster
+* An EMR cluster version 5.33.0 (for help creating a v5.33.0 EMR cluster, please see [Configure the Data Lakes AWS Environment](data-lakes-manual-setup.md))

 ## Procedure
 1. Open your Segment App workspace and select your Data Lakes destination.
-2. On the Settings tab, select EMR Cluster ID field and enter your EMR ID. For more information about your EMR Cluster, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
+2. On the Settings tab, select EMR Cluster ID field and enter the ID of your new EMR cluster. For more information about your EMR Cluster, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
 **Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name should remain the same.
 3. Select **Save**.
 4. You can delete your old EMR cluster from AWS when the following conditions have been met: