
Commit 01cd54e

Added outline of Azure resource setup steps, Vale updates
1 parent 1981b91 commit 01cd54e

2 files changed (+53, -36 lines)

src/connections/storage/catalog/data-lakes/index.md

Lines changed: 32 additions & 17 deletions
@@ -18,19 +18,19 @@ To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lake

 Before you set up [AWS Data Lakes], you need the following resources:

-- An [AWS account](https://aws.amazon.com/account/)
-- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to receive data and store logs
+- An [AWS account](https://aws.amazon.com/account/){:target="_blank"}
+- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket){:target="_blank"} to receive data and store logs
 - A subnet within a VPC for the EMR cluster to run in

-### Step 1 - Set Up AWS Resources
+### Step 1 - Set up AWS resources

-You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the set up work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs, however Segment guarantees support only for the template as provided. The Data Lakes set up uses Terraform v0.12+. To support more versions of Terraform, the AWS provider must use v4, which is included in the example main.tf.
+You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake){:target="_blank"} to automate much of the setup work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment guarantees support only for the template as provided. The Data Lakes setup uses Terraform v0.12+. To support more versions of Terraform, the AWS provider must use v4, which is included in the example main.tf.

 You can also use Segment's [manual set up instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.

 The Terraform module and manual set up instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
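
A minimal `main.tf` along the lines described above might look like the following sketch. The provider pin reflects the v4 requirement; the module `source` path and the input names (`s3_bucket`, `external_ids`) are illustrative assumptions, so check the module's README for the exact inputs it expects.

```hcl
terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0" # the Data Lakes setup expects the v4 AWS provider
    }
  }
}

provider "aws" {
  region = "us-west-2" # region that holds your Data Lakes S3 bucket
}

# Hypothetical module invocation: the path and input names are illustrative,
# not authoritative; see the terraform-aws-data-lake README for the real ones.
module "segment_data_lake" {
  source = "github.com/segmentio/terraform-aws-data-lake//modules/iam"

  s3_bucket    = "my-segment-data-lake"          # bucket that receives data and stores logs
  external_ids = ["<your-segment-workspace-id>"] # workspace ID used as the IAM external ID
}
```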

-### Step 2 - Enable Data Lakes Destination
+### Step 2 - Enable Data Lakes destination

 After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:

@@ -61,43 +61,58 @@ After you set up the necessary AWS resources, the next step is to set up the Dat
 Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.


-### Step 3 - Verify Data is Synced to S3 and Glue
+### Step 3 - Verify data is synced to S3 and Glue

 You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during set up, the first data lake sync will fail.

-To be alerted of sync failures via email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications).
+To be alerted of sync failures by email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications){:target="_blank"}.


 `Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).


-### (Optional) Step 4 - Replay Historical Data
+### (Optional) Step 4 - Replay historical data

-If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.
+If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/){:target="_blank"} to request one.

-The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months to get started, and then replay additional data if you find you need more.
+The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, Segment recommends starting with data from the last six months, and then replaying additional data if you find you need more.

 Segment creates a separate EMR cluster to run replays, then destroys it when the replay finished. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.

 ## Set up [Azure Data Lakes]

-To set up [Azure Data Lakes], create your Azure resources and enable the Data Lakes destination in the Segment app.
+> info "[Azure Data Lakes] is currently in Public Beta"
+> [Azure Data Lakes] is available in Public Beta.
+
+To set up [Azure Data Lakes], create your [Azure resources](/docs/connections/storage/data-lakes/#set-up-azure-data-lakes) and then enable the Data Lakes destination in the Segment app.

 ### Prerequisites

-### Step 1 - Create and ALDS-enabled storage account
+Before you can configure your Azure resources, you must first [create an Azure subscription](https://azure.microsoft.com/en-us/free/){:target="_blank"}.
+
+### Step 1 - Create an ADLS-enabled storage account

-### Step 2 - Setup KeyVault
+To
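
The body of Step 1 is still a stub above. For orientation, an ADLS-enabled account is a standard Azure storage account created with the hierarchical namespace turned on; a minimal sketch using the `azurerm` Terraform provider follows, where the resource group, names, location, and replication setting are illustrative assumptions.

```hcl
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "segment" {
  name     = "segment-data-lakes" # illustrative name
  location = "westus2"
}

# ADLS Gen2 is a storage account with the hierarchical namespace (HNS) enabled.
resource "azurerm_storage_account" "data_lake" {
  name                     = "segmentdatalake01" # must be globally unique and lowercase
  resource_group_name      = azurerm_resource_group.segment.name
  location                 = azurerm_resource_group.segment.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # this flag is what makes the account ADLS-capable
}
```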

-### Step 3 - Setup Azure MySQL DB
+### Step 2 - Set up KeyVault
+
+### Step 3 - Set up Azure MySQL database

 ### Step 4 - Set up Databricks

-### Step 5 - Setup a Service Principal
+### Step 5 - Set up a Service Principal

 ### Step 6 - Configure Databricks cluster

-### Step 7 - Enable Data Lakes destination in the Segment app
+### Step 7 - Enable the Data Lakes destination in the Segment app
+
+After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
+
+<!-- TODO: Test this workflow in a staging environment to verify that the steps are correct -->
+
+1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**, then search for and select **Azure Data Lakes**.
+2.
+

 ### Optional - Set up the Data Lake using Terraform

@@ -147,7 +162,7 @@ To connect a new source to Data Lakes:
 {% endfaqitem %}

 {% faqitem Can I configure multiple sources to use the same EMR cluster? %}
-Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes job.
+Yes, you can configure multiple sources to use the same EMR cluster. Segment recommends that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
 {% endfaqitem %}

 {% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}

src/connections/storage/data-lakes/index.md

Lines changed: 21 additions & 19 deletions
@@ -15,12 +15,12 @@ Data lakes typically have four layers:

 ![A graphic showing the information flowing from the metadata into the query, compute, and metadata layers, and then into the storage layer](images/data_lakes_overview_graphic.png)

-Segment Data Lakes sends Segment data to a cloud data store (either AWS S3 or Azure Data Lake Storage Gen2) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
+Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.

 > info ""
 > Segment Data Lakes is available to Business tier customers only.

-To learn more about Segment Data Lakes, check out the [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"} blog post.
+To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.

 ## How Segment Data Lakes work

@@ -38,38 +38,40 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR

 ### How [Azure Data Lakes] works

-Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.
+Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.

 ![A diagram showing data flowing from Segment, through DataBricks, Parquet and Azure Data Lake Storage Gen2 into the Hive Metastore, and then into your post-processing systems](images/Azure_DL_setup.png)


 ## Set up Segment Data Lakes

-For more detailed information about setting up AWS and Azure Data Lakes, please see
+For more detailed information about setting up AWS and Azure Data Lakes, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

 ### Set up [AWS Data Lakes]
-For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.
+When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.

 #### EMR

-Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.
+Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.

 #### AWS IAM role

 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
-- **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} when navigating to the Settings > General Settings > ID.
+- **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the ID of the Segment workspace that you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to Settings > General Settings > ID.
 - **s3_bucket**: Name of the S3 bucket used by the Data Lake.
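
To make the `external_ids` input concrete, the sketch below shows the kind of trust relationship the external ID ends up in. The Terraform module and the manual instructions create this role for you; the principal ARN, role name, and policy shape here are placeholders rather than Segment's actual values.

```hcl
# Placeholder trust policy: Segment can assume this role only when it presents
# your workspace ID as the external ID.
data "aws_iam_policy_document" "segment_data_lake_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::<segment-aws-account-id>:root"] # use the account ID from the setup docs
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = ["<your-segment-workspace-id>"] # the external_ids input
    }
  }
}

resource "aws_iam_role" "segment_data_lake" {
  name               = "segment-data-lake-role" # illustrative name
  assume_role_policy = data.aws_iam_policy_document.segment_data_lake_assume.json
}
```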

 ### Set up [Azure Data Lakes]

+
+
 Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:

-- Azure Storage Account
-- Service Principal
-- Databricks Instance
-- Databricks Cluster
-- Azure MySQL Database
-- Azure KeyVault Instance:
+- [Azure Storage Account](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal){:target="_blank"}: An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks.
+- [Service Principal](https://docs.microsoft.com/en-us/azure/purview/create-service-principal-azure){:target="_blank"}: Service principals are identities used to access specific resources.
+- [Databricks Instance](https://azure.microsoft.com/en-us/services/databricks/#overview){:target="_blank"}: Azure Databricks is a data analytics platform that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
+- [Databricks Cluster](https://docs.microsoft.com/en-us/azure/purview/register-scan-hive-metastore-source){:target="_blank"}: The Databricks cluster is a set of computation resources that you can use to run data science and analytics workloads.
+- [Azure MySQL Database](https://docs.microsoft.com/en-us/azure/purview/register-scan-azure-mysql-database){:target="_blank"}: The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
+- [Azure KeyVault Instance](https://docs.microsoft.com/en-us/azure/key-vault/general/quick-create-portal){:target="_blank"}: Azure KeyVault provides a secure store for your keys, secrets, and certificates.
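
A rough Terraform sketch of how a few of these components can be declared, assuming the `azurerm` and `azuread` providers; every name, SKU, and location below is a placeholder, and the MySQL database and the Databricks cluster itself are omitted for brevity.

```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "segment" {
  name     = "segment-data-lakes"
  location = "westus2"
}

# Service principal that will be granted access to the Data Lakes resources.
resource "azuread_application" "segment_data_lake" {
  display_name = "segment-data-lake"
}

resource "azuread_service_principal" "segment_data_lake" {
  application_id = azuread_application.segment_data_lake.application_id # `client_id` in newer azuread provider versions
}

# Databricks workspace; the cluster is created inside it afterwards.
resource "azurerm_databricks_workspace" "segment" {
  name                = "segment-databricks"
  resource_group_name = azurerm_resource_group.segment.name
  location            = azurerm_resource_group.segment.location
  sku                 = "premium"
}

# Key Vault for the secrets the later setup steps reference.
resource "azurerm_key_vault" "segment" {
  name                = "segment-datalake-kv" # must be globally unique
  resource_group_name = azurerm_resource_group.segment.name
  location            = azurerm_resource_group.segment.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}
```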

 For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

@@ -112,7 +114,7 @@ Data Lakes stores the inferred schema and associated metadata of the S3 data in
 ![A screenshot of the AWS ios_prod_identify table, displaying the schema for the table, information about the table, and the table version](images/dl_gluecatalog.png)
 <!--
 TODO:
-add annotated glue image calling out different parts of inferred schema)
+add annotated glue image calling out different parts of inferred schema
 -->

 New columns are appended to the end of the table in the Glue Data Catalog as they are detected.
@@ -128,23 +130,23 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat

 ### Data types

-Data Lakes infers the data type for an event it receives. Groups of events are poled every hour to infer the data type for that each event.
+Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for each event.

-The data types supported in Glue are:
+The data types supported in [AWS Data Lakes] are:
 - bigint
 - boolean
 - decimal(38,6)
 - string
 - timestamp

-The data types supported in the Hive Metastore are:
+The data types supported in [Azure Data Lakes] are:
 - bigint
 - boolean
 - decimal(38,6)
 - string
 - timestamp
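
Data Lakes creates and maintains these tables automatically, so you never declare them yourself; purely to make the five types concrete, a hypothetical Glue table using all of them might look like this sketch.

```hcl
# Hypothetical table for illustration only: Data Lakes manages the real ones.
resource "aws_glue_catalog_table" "page_viewed" {
  name          = "page_viewed"
  database_name = "my_segment_source" # Glue database created for the source
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location = "s3://my-segment-data-lake/data/page_viewed/"

    columns {
      name = "received_at"
      type = "timestamp"
    }
    columns {
      name = "quantity"
      type = "bigint"
    }
    columns {
      name = "price"
      type = "decimal(38,6)"
    }
    columns {
      name = "is_member"
      type = "boolean"
    }
    columns {
      name = "product_name"
      type = "string"
    }
  }
}
```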

-#### Schema evolution
+### Schema evolution

 Once Data Lakes sets a data type for a column, all subsequent data will attempt to be cast into that data type. If incoming data does not match the data type, Data Lakes tries to cast the column to the target data type.

@@ -167,7 +169,7 @@ In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate

 The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).

-When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.
+When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or ADLS, or you can use Data Lakes in addition to a data warehouse.

 ## FAQ
 {% faq %}
