Commit 74868cd

Upgrade > update, copyediting, adding minimum node specs
1 parent 515ad92 commit 74868cd

2 files changed: 22 additions, 15 deletions
src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 10 additions & 4 deletions
@@ -33,7 +33,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.

  1. Select EMR from the AWS console by navigating to Services > Analytics > EMR.
  2. Click **Create Cluster**, and select **Go to advanced options**.
- 3. In Advanced Options, on Step 1: Software and Steps, select the `emr-5.33.0` release and the following applications:
+ 3. In Advanced Options, on Step 1: Software and Steps, select both the `emr-5.33.0` release and the following applications:
  - Hadoop 2.10.1
  - Hive 2.3.7
  - Hue 4.9.0
@@ -44,14 +44,20 @@ Segment requires access to an EMR cluster to perform necessary data processing.
  - Use for Spark table metadata
  <!--- ![Select to use for both Hive and Spark table metadata](images/02_hive-spark-table.png) --->
  5. Select **Next** to move to Step 2: Hardware.
- 6. Under the Networking section, select a Network and EC2 Subnet for your EMR instance. EMR instances can be created in either a public or private subnet. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. You can configure strict security groups for EMR clusters on public subnets to prevent inbound access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
+ 6. Under the Networking section, select a Network and EC2 Subnet (either public or private) for your EMR instance. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. Users who create clusters in public EC2 subnets can configure strict security groups for EMR clusters on public subnets to prevent inbound access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.

  7. In the Hardware Configuration section, create a cluster that includes the following on-demand nodes:
  - **1** master node
  - **2** core nodes
- - **2** task nodes
+ - **2** task nodes
  <!--- ![Configure the number of nodes](images/03_hardware-node-instances.png) --->
- For more information about configuring cluster hardware and networking, see Amazon's document, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
+ Each of the master, core, and task nodes should meet or exceed the following specifications:
+ * Instance type: m5.xlarge
+ * Number of vCores: 4
+ * Memory: 16 GiB
+ * EBS Storage: 64 GiB, EBS only storage
+
+ For more information about configuring cluster hardware and networking, please see Amazon's documentation, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).

  8. Select **Next** to proceed to Step 3: General Cluster Settings.
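
The console choices in this file's setup steps (release label, node counts, and minimum node specifications) can also be written down as data. The sketch below is illustrative only and is not part of the commit: it builds a dictionary shaped like the keyword arguments of boto3's `run_job_flow` call, without contacting AWS. The cluster name, subnet ID, `gp2` volume type, and `KeepJobFlowAliveWhenNoSteps` setting are assumptions, and `m5.xlarge` is assumed to be the intended instance type because it provides 4 vCPUs and 16 GiB of memory. Only the three applications visible in the hunk are listed, and a real request would also need IAM roles and other settings not shown here.

```python
"""Sketch of an EMR cluster request that mirrors the setup steps.

The dictionary follows the shape of boto3's EMR `run_job_flow`
keyword arguments. Nothing in this module contacts AWS.
"""

# Values taken from the setup steps.
RELEASE_LABEL = "emr-5.33.0"
NODE_COUNTS = {"MASTER": 1, "CORE": 2, "TASK": 2}
MIN_EBS_GIB = 64

# Assumption: m5.xlarge (4 vCPUs, 16 GiB) satisfies the minimum node spec.
INSTANCE_TYPE = "m5.xlarge"


def build_cluster_request(name, subnet_id, applications):
    """Return run_job_flow-style keyword arguments for the cluster."""
    instance_groups = [
        {
            "Name": f"{role.lower()}-nodes",
            "InstanceRole": role,
            "Market": "ON_DEMAND",  # the steps call for on-demand nodes
            "InstanceType": INSTANCE_TYPE,
            "InstanceCount": count,
            "EbsConfiguration": {
                "EbsBlockDeviceConfigs": [
                    {
                        "VolumeSpecification": {
                            # Assumption: the steps do not name a volume type.
                            "VolumeType": "gp2",
                            "SizeInGB": MIN_EBS_GIB,
                        },
                        "VolumesPerInstance": 1,
                    }
                ]
            },
        }
        for role, count in NODE_COUNTS.items()
    ]
    return {
        "Name": name,
        "ReleaseLabel": RELEASE_LABEL,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": instance_groups,
            "Ec2SubnetId": subnet_id,
            # Assumption: the cluster should stay up between syncs.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


request = build_cluster_request(
    name="segment-data-lake",               # hypothetical cluster name
    subnet_id="subnet-0123456789abcdef0",   # placeholder subnet ID
    applications=["Hadoop", "Hive", "Hue"],  # only those visible in the hunk
)
```

Keeping the request as plain data makes it easy to review with network and security teams before anything is created.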

src/connections/storage/data-lakes/upgrade-emr-cluster.md

Lines changed: 12 additions & 11 deletions
@@ -1,28 +1,29 @@
  ---
  hidden: true
- title: Upgrading EMR Clusters
+ title: Updating EMR Clusters
  ---
  {% include content/plan-grid.md name="data-lakes" %}

- # Upgrading EMR Clusters
- This document contains the instructions to manually update an existing Segment
- Data Lake destination to use a new v5.33.0 EMR cluster. The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.
+ # Updating EMR Clusters
+ You can manually update an existing Segment Data Lake destination to use a v5.33.0 EMR cluster.
+ The Segment Data Lake on the new version will continue to use the Glue data catalog you have previously configured.

- By updating your EMR cluster from 5.27.0 to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). Clusters running version 5.33.0 also allow for faster Parquet jobs and dynamic auto-scaling.
+ By updating your EMR cluster from 5.27.0 to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc), use dynamic auto-scaling, and experience faster Parquet jobs.

  > info""
- > Your Segment Data Lake does not need to be disabled during the upgrade process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are setting up a new EMR cluster will be restarted on the new cluster.
+ > Your Segment Data Lake does not need to be disabled during the update process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are setting up a new EMR cluster will be restarted on the new cluster.

  ## Prerequisites
  * S3 bucket with a lifecycle rule of 14 days
- * An EMR cluster version 5.33.0 (for help creating an v 5.33.0 EMR cluster, please see [Configure the Data Lakes AWS Environment](data-lakes-manual-setup.md))
+ * An EMR v5.33.0 cluster (for instructions on creating an EMR cluster, please see [Configure the Data Lakes AWS Environment](data-lakes-manual-setup.md))

  ## Procedure
- 1. Open your Segment App workspace and select your Data Lakes destination.
- 2. On the Settings tab, select EMR Cluster ID field and enter the ID of your new EMR cluster. For more information about your EMR Cluster, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
- **Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name should remain the same.
+ 1. Open your Segment App workspace and select the Data Lakes destination.
+ 2. On the Settings tab, select the EMR Cluster ID field and enter the ID of your v5.33.0 EMR cluster. For help finding the cluster ID, please see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html) documentation. <br/>
+ **Note:** Your Glue Catalog ID, IAM Role ARN, and Glue database name fields in Segment should remain the same.
  3. Select **Save**.
- 4. You can delete your old EMR cluster from AWS when the following conditions have been met:
+ 4. View the EMR cluster in the AWS EMR Clusters page to verify the cluster is working correctly.
+ 5. Delete your v5.27.0 EMR cluster from AWS after the following conditions have been met:
  * You have updated all Data Lakes to use the EMR cluster
  * A sync has successfully completed in the new cluster
  * Data is synced into the new cluster
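
The deletion conditions at the end of the procedure amount to a simple all-or-nothing check. The sketch below is not part of the commit and is one possible way to encode that checklist. The `UpdateStatus` record and both function names are hypothetical, and the status values are assumed to be collected by hand or by whatever monitoring you already use.

```python
from dataclasses import dataclass


@dataclass
class UpdateStatus:
    """Hypothetical record of the conditions listed in the procedure."""

    all_data_lakes_updated: bool  # every Data Lake uses the new cluster ID
    sync_completed_on_new: bool   # a sync finished successfully on the new cluster
    data_synced_on_new: bool      # data has been synced into the new cluster


def parse_release(label):
    """Turn a release label such as 'emr-5.33.0' into the tuple (5, 33, 0)."""
    prefix, _, version = label.partition("-")
    if prefix != "emr" or not version:
        raise ValueError(f"unexpected release label: {label!r}")
    return tuple(int(part) for part in version.split("."))


def can_delete_old_cluster(new_release, status, target=(5, 33, 0)):
    """Return True only if the new cluster runs the target release
    and every condition from the procedure has been met."""
    return (
        parse_release(new_release) == target
        and status.all_data_lakes_updated
        and status.sync_completed_on_new
        and status.data_synced_on_new
    )
```

For example, `can_delete_old_cluster("emr-5.33.0", UpdateStatus(True, True, True))` returns `True`, while any unmet condition keeps the old v5.27.0 cluster in place.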