Commit 23a88c6

Copyediting, adding more context about VPCs
1 parent dbf9f5a commit 23a88c6

3 files changed

+46
-42
lines changed

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 17 additions & 12 deletions
@@ -7,21 +7,26 @@ title: Configure the Data Lakes AWS Environment
77

88
The instructions below guide you through configuring the environment required to begin loading data into your Segment Data Lake. For a more automated process, see [Set Up Segment Data Lakes](/src/connections/storage/catalog/data-lakes/index.md).
99

10+
As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
1011

11-
## Step 1 - Create an S3 Bucket
12+
## Step 1 - Create a VPC and an S3 bucket
1213

13-
In this step, you'll create the S3 bucket that will store both the intermediate and final data. For instructions on creating an S3 bucket, please see Amazon's documentation, [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
14+
In this step, you'll create a Virtual Private Cloud (VPC) in which to securely launch your AWS resources, and an S3 bucket that stores both the intermediate and final data.
15+
16+
To create a VPC, follow the instructions outlined in Amazon's documentation, [Create and configure your VPC](https://docs.aws.amazon.com/directoryservice/latest/admin-guide/gsg_create_vpc.html).
17+
18+
To create an S3 bucket, see Amazon's [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) instructions.
1419

1520
> info ""
1621
> Take note of the S3 bucket name you set in this step, as the rest of the setup flow requires it.
1722
<!--- In these instructions, the name is `segment-data-lake`. --->
1823
19-
After you create your S3 bucket, create a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For help on setting lifecycle configurations, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).
24+
After creating an S3 bucket, configure a lifecycle rule for the bucket and set it to expire staging data after **14 days**. For instructions on configuring lifecycle rules, see Amazon's documentation, [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).
2025

21-
The following lifecycle settings should be selected:
26+
The following lifecycle settings should be applied to your staging data:
2227
* **Expire after:** 14 days
2328
* **Permanently delete after:** 14 days
24-
* **Clean up incomplete mulitpart uploads:** after 14 days
29+
* **Clean up incomplete multipart uploads:** after 14 days
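
The three lifecycle settings above can also be expressed as a single S3 lifecycle rule. The following is a minimal sketch, not Segment's own tooling: the `segment-stage/` prefix and the rule ID are hypothetical placeholders, and mapping **Permanently delete after** to `NoncurrentVersionExpiration` is an assumption based on how the S3 console labels noncurrent-version deletion.

```python
import json

# Hypothetical prefix: replace with the prefix your staging data is written under.
STAGING_PREFIX = "segment-stage/"
RETENTION_DAYS = 14


def build_staging_lifecycle(prefix: str, days: int = RETENTION_DAYS) -> dict:
    """Return an S3 lifecycle configuration that expires staging data."""
    return {
        "Rules": [
            {
                "ID": f"expire-staging-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Expire after: 14 days
                "Expiration": {"Days": days},
                # Permanently delete after: 14 days (assumed mapping, see lead-in)
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                # Clean up incomplete multipart uploads: after 14 days
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


lifecycle = build_staging_lifecycle(STAGING_PREFIX)
print(json.dumps(lifecycle, indent=2))
```

You could save the printed JSON to a file and apply it with `aws s3api put-bucket-lifecycle-configuration`, or pass the dictionary as `LifecycleConfiguration` to boto3's `put_bucket_lifecycle_configuration`. Check the resulting rule in the S3 console before relying on it.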
2530

2631
<!--- ![Create a Lifecycle rule to expire staging data after 14 days](images/01_14-day-lifecycle.png) --->
2732

@@ -43,8 +48,8 @@ Segment requires access to an EMR cluster to perform necessary data processing.
4348
- Use for Hive table metadata
4449
- Use for Spark table metadata
4550
<!--- ![Select to use for both Hive and Spark table metadata](images/02_hive-spark-table.png) --->
46-
5. Select **Next** to move to Step 2: Hardware.
47-
6. Under the Networking section, select a Network and EC2 Subnet (either public or private) for your EMR instance. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. Users who create clusters in public EC2 subnets can configure strict security groups for EMR clusters on public subnets to prevent inbound access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information. As a best practice, Segment recommends that you consult with your network and security teams before you configure your EMR cluster.
51+
5. Select **Next** to proceed to Step 2: Hardware.
52+
6. Under the Networking section, select a Network (the VPC you created in [Step 1](#step-1---create-a-vpc-and-an-s3-bucket)) and EC2 Subnet for your EMR instance. Creating the cluster in a private subnet is more secure, but requires additional configuration, while creating a cluster in a public subnet leaves it accessible from the Internet. Users who create clusters in public subnets can configure strict security groups to prevent unauthorized inbound EMR cluster access. See Amazon's document, [Amazon VPC Options - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-clusters-in-a-vpc.html) for more information.
4853

4954
7. In the Cluster Nodes and Instances section, create a cluster that includes the following on-demand nodes:
5055
- **1** master node
@@ -57,7 +62,7 @@ Segment requires access to an EMR cluster to perform necessary data processing.
5762
* Memory: 16 GiB
5863
* EBS Storage: 64 GiB, EBS only storage
5964

60-
For more information about configuring cluster hardware and networking, please see Amazon's documentation, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
65+
For more information about configuring cluster hardware and networking, see Amazon's documentation, [Configure Cluster Hardware and Networking](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html).
6166

6267
8. Select **Next** to proceed to Step 3: General Cluster Settings.
6368

@@ -73,17 +78,17 @@ Segment requires access to an EMR cluster to perform necessary data processing.
7378
<!---![Configure logging](images/05_logging.png) --->
7479

7580
### Secure the cluster
76-
12. Create or select an **EC2 key pair**.
81+
12. On Step 4: Security, in the Security Options section, create or select an **EC2 key pair**.
7782
13. Choose the appropriate roles in the **EC2 instance profile**.
78-
14. Expand the EC2 security group section and select the appropriate security groups for the Master and Core & Task types.
79-
15. Update any additional security options, then select **Create cluster**.
83+
14. Expand the EC2 security groups section and select the appropriate security groups for the Master and Core & Task types.
84+
15. Select **Create cluster**.
8085

8186
<!--- ![Secure the cluster](images/06_secure-cluster.png)
8287
8388
The image uses the default settings. You can make these settings more restrictive, if required. --->
8489

8590
> note ""
86-
> **NOTE:** If you are updating the EMR cluster for your Data Lakes instance, note the EMR cluster ID.
91+
> **NOTE:** If you are updating the EMR cluster for an existing Data Lakes instance, note the EMR cluster ID on the confirmation page.
8792
8893
## Step 3 - Create an Access Management role and policy
8994

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
---
2+
hidden: true
3+
title: Updating EMR Clusters
4+
---
5+
{% include content/plan-grid.md name="data-lakes" %}
6+
7+
# Updating EMR Clusters
8+
You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After updating your EMR cluster, your Segment Data Lake will continue to use the Glue data catalog initially configured.
9+
10+
By updating your EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc), use dynamic auto-scaling, and experience faster Parquet jobs.
11+
12+
> info ""
13+
> Your Segment Data Lake does not need to be disabled during the update process, and any ongoing syncs will complete on the old cluster. Any syncs that fail while you are updating the cluster ID field will be restarted on the new cluster.
14+
15+
## Prerequisites
16+
* An EMR v5.33.0 cluster (for instructions on creating an EMR cluster, see [Configure the Data Lakes AWS Environment](data-lakes-manual-setup.md))
17+
* An existing Segment Data Lakes destination
18+
19+
## Procedure
20+
1. Open your Segment app workspace and select the Data Lakes destination.
21+
2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You do not need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
22+
3. Select **Save**.
23+
4. In AWS, view the Events tab for your cluster to verify it is receiving data.
24+
25+
You can delete the old EMR cluster from AWS after the following conditions have been met:
26+
* You have updated all Data Lakes to use the new EMR cluster
27+
* A sync has successfully completed in the new cluster
28+
* Data is synced into the new cluster
29+
* There are no ongoing jobs in the old cluster
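
The four conditions above amount to a pre-deletion checklist. The sketch below shows one way to encode it. `MigrationStatus` and `deletion_blockers` are hypothetical names, and the values are ones you would confirm by hand in the Segment app and the AWS console; nothing here calls Segment or AWS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrationStatus:
    """Hypothetical snapshot of what you verified in Segment and AWS."""
    data_lakes_on_old_cluster: int    # Data Lakes still set to the old cluster ID
    new_cluster_sync_succeeded: bool  # at least one sync completed on the new cluster
    new_cluster_has_data: bool        # data is synced into the new cluster
    old_cluster_running_jobs: int     # jobs still in progress on the old cluster


def deletion_blockers(status: MigrationStatus) -> List[str]:
    """Return the reasons the old cluster should not be deleted yet."""
    found = []
    if status.data_lakes_on_old_cluster > 0:
        found.append("Some Data Lakes still use the old EMR cluster ID")
    if not status.new_cluster_sync_succeeded:
        found.append("No sync has completed successfully on the new cluster")
    if not status.new_cluster_has_data:
        found.append("No data has been synced into the new cluster")
    if status.old_cluster_running_jobs > 0:
        found.append("The old cluster still has ongoing jobs")
    return found


# Example: everything is migrated, but two jobs are still running on the old cluster.
status = MigrationStatus(0, True, True, 2)
for reason in deletion_blockers(status):
    print("Wait:", reason)
```

An empty list means all four conditions are met and the old cluster can be deleted.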

src/connections/storage/data-lakes/upgrade-emr-cluster.md

Lines changed: 0 additions & 30 deletions
This file was deleted.
