Commit 515ad92

Clarifying info on EMR cluster updates page, copyediting data lakes setup page
1 parent f2a32f4 commit 515ad92

2 files changed: 20 additions, 37 deletions

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 8 additions & 8 deletions
@@ -13,7 +13,8 @@ The instructions below will guide you through the process required to configure
1313
In this step, you'll create the S3 bucket that will store both the intermediate and final data. For instructions on creating an S3 bucket, please see Amazon's documentation, [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
1414

1515
> info ""
16-
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it. In these instructions, the name is `segment-data-lake`.
16+
> Take note of the S3 bucket name you set in this step, as the rest of the set up flow requires it.
17+
<!--- In these instructions, the name is `segment-data-lake`. --->
1718
1819
After you create your S3 bucket, create a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For help on setting lifecycle configurations, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).
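
As a rough illustration of the 14-day lifecycle rule described above, the following Python sketch builds a lifecycle configuration in the shape that S3 accepts. The `staging/` prefix and the rule ID are assumptions made for this example; the page does not say which prefix holds staging data, so substitute the prefix your bucket actually uses.

```python
import json


def staging_expiration_rule(prefix: str, days: int = 14) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    The prefix is a hypothetical placeholder; the source page does not
    name the staging prefix.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-staging-after-{days}-days",  # illustrative name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the configuration so it can be reviewed before it is applied.
    print(json.dumps(staging_expiration_rule("staging/"), indent=2))
```

If you use boto3, a dictionary like this can be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, together with your bucket name as `Bucket`.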
1920

@@ -32,7 +33,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.
3233

3334
1. Select EMR from the AWS console by navigating to Services > Analytics > EMR.
3435
2. Click **Create Cluster**, and select **Go to advanced options**.
35-
3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following software libraries:
36+
3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following applications:
3637
- Hadoop 2.10.1
3738
- Hive 2.3.7
3839
- Hue 4.9.0
@@ -43,8 +44,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.
4344
- Use for Spark table metadata
4445
<!--- ![Select to use for both Have and Spark table metadata](images/02_hive-spark-table.png) --->
4546
5. Select **Next** to move to Step 2: Hardware.
46-
6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. You can create EMR instances in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet makes it accessible from the Internet. You can configure strict security groups to prevent inbound access to the cluster. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. <br />
47-
As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
47+
6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. EMR instances can be created in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. You can configure strict security groups for EMR clusters on public subnets to prevent inbound access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
4848

4949
7. In the Hardware Configuration section, create a cluster that includes the following on-demand nodes:
5050
- **1** master node
@@ -55,12 +55,12 @@ For more information about configuring cluster hardware and networking, see Amaz
5555

5656
8. Select **Next** to proceed to Step 3: General Cluster Settings.
5757

58-
5958
### Configure logging
6059

61-
9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data (`segment-data-lakes` in this case). Once configured, logs are given a new prefix, and separated from the final processed data.
60+
9. On Step 3: General Cluster Settings, configure logging to use the same S3 bucket you configured as the destination for the final data. Once configured, logs are assigned a new prefix and separated from the final processed data.
61+
<!--- (`segment-data-lakes` in this case) --->
6262

63-
10. Add a new key-value pair to the Tags section, a **vendor** key with a value of `segment`. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.
63+
10. Add a new key-value pair to the Tags section, a **vendor** key with a value of **segment**. The IAM policy uses this to provide Segment access to submit jobs in the EMR cluster.
6464

6565
11. Select **Next** to proceed to Step 4: Security.
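
For readers who script cluster creation rather than using the console, the settings named in the steps above (release label, applications, log location, and the `vendor` tag) could be collected roughly as follows. This is a partial sketch, not a complete cluster definition: the bucket name and `logs/` prefix are hypothetical, the application list contains only the entries visible in this diff, and hardware, networking, and Glue catalog settings are left out.

```python
def emr_cluster_settings(bucket: str, log_prefix: str = "logs/") -> dict:
    """Collect the EMR settings mentioned in the setup steps.

    `bucket` and `log_prefix` are placeholders chosen for this example.
    The returned dict is a fragment; it omits instance and network
    configuration, which a real cluster request also needs.
    """
    return {
        "ReleaseLabel": "emr-5.33.0",
        # Only the applications visible in this diff hunk are listed here.
        "Applications": [{"Name": name} for name in ("Hadoop", "Hive", "Hue")],
        # Logs go to the same bucket as the final data, under their own prefix.
        "LogUri": f"s3://{bucket}/{log_prefix}",
        # The IAM policy relies on this tag to grant Segment access.
        "Tags": [{"Key": "vendor", "Value": "segment"}],
    }


if __name__ == "__main__":
    settings = emr_cluster_settings("my-data-lake-bucket")
    for key, value in settings.items():
        print(f"{key}: {value}")
```

The keys follow the parameter names of the EMR `RunJobFlow` API, so the fragment could be merged into a fuller request once the remaining required fields are supplied.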
6666

@@ -77,7 +77,7 @@ For more information about configuring cluster hardware and networking, see Amaz
7777
The image uses the default settings. You can make these settings more restrictive, if required. --->
7878

7979
> note ""
80-
> If you are updating your Data Lakes instance, take note of the EMR cluster ID.
80+
> **NOTE:** If you are updating the EMR cluster for your Data Lakes instance, note the EMR cluster ID.
8181
8282
## Step 3 - Create an Access Management role and policy
8383

src/connections/storage/data-lakes/upgrade-emr-cluster.md

Lines changed: 12 additions & 29 deletions
@@ -1,42 +1,25 @@
1-
# Upgrading Data Lakes
1+
---
2+
hidden: true
3+
title: Upgrading EMR Clusters
4+
---
5+
{% include content/plan-grid.md name="data-lakes" %}
26

7+
# Upgrading EMR Clusters
38
This document contains the instructions to manually update an existing Segment
4-
Data Lake destination to use a new EMR cluster with version 5.33.0. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.
9+
Data Lake destination to use a new v5.33.0 EMR cluster. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.
510

6-
The Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster.
11+
By updating your EMR cluster from 5.27.0 to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). Clusters running version 5.33.0 also allow for faster Parquet jobs and dynamic auto-scaling.
712

8-
<!--- Any existing EMR clusters will
9-
10-
What happens to the existing EMR cluster? If there’s an ongoing sync, what will
11-
happen to that?
12-
If there is an ongoing sync in the existing cluster, the sync will complete (success/
13-
fail) in the existing cluster. If the sync ends up failing and if the cluster setting has
14-
been updated to use the new cluster, the next retry will be performed in the new
15-
cluster.
16-
. Does one need to stop a sync or disable the Segment Data Lake when
17-
performing this update?
18-
No, on-going syncs don’t need not be stopped nor Segment Data Lake needs to be
19-
disabled. We will automatically restart any failed sync on the new cluster so there
20-
should not be any manual intervention required.
21-
22-
. When can the customer safely delete the old EMR cluster?
23-
The old EMR cluster could be deleted after all the Segment Data Lakes have been
24-
updated to use the new cluster and the old EMR cluster doesn’t have any on-going
25-
syncs. General recommendation is
26-
Update EMR cluster setting in all the Segment Data Lakes
27-
Wait for the next sync to be started and completed in the new cluster
28-
Confirm new data is synced using the new cluster
29-
Confirm no on-going jobs in the old cluster
30-
Delete the old cluster --->
13+
> info ""
14+
> Your Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are setting up a new EMR cluster will be restarted on the new cluster.
3115
3216
## Prerequisites
3317
* S3 bucket with a lifecycle rule of 14 days
34-
* An EMR cluster version 5.33.0 (for instructions)
35-
* The ID of your EMR Cluster
18+
* An EMR cluster version 5.33.0 (for help creating a v5.33.0 EMR cluster, see [Configure the Data Lakes AWS Environment](data-lakes-manual-setup.md))
3619

3720
## Procedure
3821
1. Open your Segment App workspace and select your Data Lakes destination.
39-
2. On the Settings tab, select EMR Cluster ID field and enter your EMR ID. For more information about your EMR Cluster, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
22+
2. On the Settings tab, select the EMR Cluster ID field and enter the ID of your new EMR cluster. For more information about your EMR Cluster, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
4023
**Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name should remain the same.
4124
3. Select **Save**.
4225
4. You can delete your old EMR cluster from AWS when the following conditions have been met:
