Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
> note "Lake Formation"
> You can also set up your [AWS Data Lakes] using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
## Set up [AWS Data Lakes]
To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lakes] destination in the Segment app, and verify that your Segment data synced to S3 and Glue.
### Prerequisites
Before you set up [AWS Data Lakes], you need the following resources:
- An [AWS account](https://aws.amazon.com/account/)
- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to receive data and store logs
- A subnet within a VPC for the EMR cluster to run in

### Step 1 - Set Up AWS Resources

You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the setup work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment only guarantees support for the template as provided. The Data Lakes setup uses Terraform v0.12+. To support newer versions of Terraform, the AWS provider must use v4, which is included in the example `main.tf`.
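If it helps to see the overall shape of that configuration, here is a minimal, illustrative `main.tf` sketch. The module source, region, and any inputs beyond `external_ids` and `s3_bucket` (the IAM-related inputs described later in this doc) are placeholders to fill in from the module's README, not values this guide prescribes.

```hcl
# Illustrative sketch only -- check the terraform-aws-data-lake README for the
# module's actual source path, version, and full list of required inputs.

terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0" # the example main.tf uses the v4 AWS provider
    }
  }
}

provider "aws" {
  region = "us-west-2" # placeholder -- use the region where your S3 bucket and EMR cluster live
}

module "segment_data_lake" {
  source = "<path-or-git-url-to-the-terraform-aws-data-lake-module>" # placeholder

  # Inputs described in this guide; the module may require others.
  external_ids = ["<your-segment-workspace-id>"] # Segment workspace ID(s) used as the IAM external ID
  s3_bucket    = "<your-data-lake-s3-bucket>"    # existing bucket that receives data and stores logs
}
```

Running `terraform init` and `terraform plan` against a sketch like this surfaces any inputs the module still expects before you apply it.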

You can also use Segment's [manual setup instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.

The Terraform module and manual setup instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions or other custom configurations, you can customize these manually.

### Step 2 - Enable Data Lakes Destination
After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:
Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.

### Step 3 - Verify Data is Synced to S3 and Glue

You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However, if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during setup, the first data lake sync will fail.
`Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).

### (Optional) Step 4 - Replay Historical Data
If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.

The time needed to process a Replay can vary depending on the volume of data and the number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months, and then replay additional data if you find you need more.

Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs aren't interrupted, and helps the replay finish faster.
## Set up [Azure Data Lakes]
To set up [Azure Data Lakes], create your Azure resources and enable the Data Lakes destination in the Segment app.
### Prerequisites

### Step 1 - Create an ADLS-enabled storage account

### Step 2 - Set up KeyVault

### Step 3 - Set up Azure MySQL DB
### Step 4 - Set up Databricks

### Step 5 - Set up a Service Principal
### Step 6 - Configure Databricks cluster
### Step 7 - Enable Data Lakes destination in the Segment app
### Optional - Set up the Data Lake using Terraform
## FAQ
### [AWS Data Lakes]
{% faq %}
{% faqitem Do I need to create Glue databases? %}
No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
{% endfaqitem %}
{% faqitem What IAM role do I use in the Settings page? %}
Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
{% endfaqitem %}
{% faqitem What level of access do the AWS roles have? %}

The roles which Data Lakes assigns during setup are:

- **`segment_emr_autoscaling_role`** - Restricted role that can only be assumed by EMR and EC2. This is set up based on [AWS best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-automatic-scaling.html).
{% endfaqitem %}
{% faqitem Why doesn't the Data Lakes Terraform module create an S3 bucket? %}
The module doesn't create a new S3 bucket so you can re-use an existing bucket for your Data Lakes.
{% endfaqitem %}
{% faqitem Does my S3 bucket need to be in the same region as the other infrastructure? %}
Yes, the S3 bucket and the EMR cluster must be in the same region.
{% endfaqitem %}
{% faqitem How do I connect a new source to Data Lakes? %}
To connect a new source to Data Lakes:
1. Ensure that the `workspace_id` of the Segment workspace is in the list of [external ids](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids) in the IAM policy. You can either update this from the AWS console, or re-run the [Terraform](https://github.com/segmentio/terraform-aws-data-lake) job.
2. From your Segment workspace, connect the source to the Data Lakes destination.
{% endfaqitem %}
{% faqitem Can I configure multiple sources to use the same EMR cluster? %}

Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
{% endfaqitem %}
{% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}
If you don't see data after enabling a source, check the following:
- Does the IAM role have the Segment account ID and workspace ID as the external ID?
If all of these look correct and you're still not seeing any data, please [contact the Support team](https://segment.com/help/contact/).
{% endfaqitem %}
{% faqitem What are "Segment Output" tables in S3? %}
The `output` tables are temporary tables Segment creates when loading data. They are deleted after each sync.
{% endfaqitem %}
{% faqitem Can I make additional directories in the S3 bucket Data Lakes is using? %}
Yes, you can create new directories in S3 without interfering with Segment data.

Do not modify or create additional directories with the following names:

- `segment-data/`
- `segment-logs/`
{% endfaqitem %}
{% faqitem What does "partitioned" mean in the table name? %}
`Partitioned` just means that the table has partition columns (day and hour). All tables are partitioned, so you should see this on all table names.
{% endfaqitem %}
{% faqitem How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data? %}
You can use the following command to create external tables in Spectrum to access tables in Glue and join the data with Redshift:
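The statement takes the standard Redshift Spectrum `CREATE EXTERNAL SCHEMA` form sketched below. The bracketed values are placeholders, and the Glue database and IAM role names shown here are assumptions for illustration rather than names this guide defines.

```sql
-- Sketch only: replace every bracketed value with your own names.
CREATE EXTERNAL SCHEMA [spectrum_schema_name]
FROM DATA CATALOG
DATABASE '[glue_database_name]'
IAM_ROLE 'arn:aws:iam::[account_id]:role/[redshift_spectrum_role]';
```

Once the external schema exists, the tables in the Glue database appear under `[spectrum_schema_name]` and can be joined with local Redshift tables in ordinary queries.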

Replace:

- [spectrum_schema_name] = The schema name in Redshift you want to map to

`src/connections/storage/data-lakes/index.md`
Segment currently supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.

### How [AWS Data Lakes] works
Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or machine learning vendors like DataBricks or DataRobot.


### How [Azure Data Lakes] works
Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.

## Set up Segment Data Lakes

For more detailed information about setting up AWS and Azure Data Lakes, see the sections below.
### Set up [AWS Data Lakes]
For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.

#### EMR

Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the setup instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster running at all times; however, the cluster auto-scales so that it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.

#### AWS IAM role

Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are listed below, followed by a sketch of the trust relationship they define:

- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role and access your AWS account. Set the external ID in the IAM role to the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the Segment workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} under Settings > General Settings > ID.
- **s3_bucket**: Name of the S3 bucket used by the Data Lake.
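Concretely, the `external_ids` value ends up in the role's trust policy as an `sts:ExternalId` condition. The Terraform module (or the manual setup instructions) creates this role for you; the sketch below only illustrates the shape of that trust relationship, and the Segment principal ARN is a placeholder.

```hcl
# Illustrative sketch of the trust relationship behind the external_ids input.
# The principal ARN is a placeholder -- the real value comes from the Terraform
# module or Segment's manual setup instructions.
resource "aws_iam_role" "segment_data_lake" {
  name = "segment-data-lake-iam-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "<segment-principal-arn>" } # placeholder
      Condition = {
        # Your Segment workspace ID(s), matching the external_ids input
        StringEquals = { "sts:ExternalId" = ["<your-segment-workspace-id>"] }
      }
    }]
  })
}
```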
### Set up [Azure Data Lakes]
Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:
- Azure Storage Account
- Service Principal
- Databricks Instance
- Databricks Cluster
- Azure MySQL Database
- Azure KeyVault Instance
For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
## Data Lakes schema
### [AWS Data Lakes] schema
#### S3 partition structure
Segment partitions the data in S3 by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.
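For example, an event lands under a path shaped roughly like the one below; the bucket name, source, and event names are placeholders, and the exact prefix depends on your configuration.

```
s3://<your-data-lake-bucket>/segment-data/<source>/<event-type>/day=2022-06-14/hr=03/<file>.parquet
```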

By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>`. Other supported date partition structures include:

- Year/Month/Day [YYYY/MM/DD]
- Day [YYYY-MM-DD]

#### AWS Glue data catalog
Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.
New columns are appended to the end of the table in the Glue Data Catalog as they are detected.

##### Glue database
The schema inferred by Segment is stored in a Glue database within Glue Data Catalog. Segment stores the schema for each source in its own Glue database to organize the data so it is easier to query. To make it easier to find, Segment writes the schema to a Glue database named using the source slug by default. The database name can be modified from the Data Lakes settings.
> info ""
> The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
### [Azure Data Lakes] schema
### Data types
Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for each event.

The data types supported in Glue are:

- string
- timestamp
The data types supported in the Hive Metastore are:
- bigint
- boolean
- decimal(38,6)
- string
- timestamp
#### Schema evolution
Once Data Lakes sets a data type for a column, it attempts to cast all subsequent data into that data type. If incoming data does not match the data type, Data Lakes tries to cast the column to the target data type.
If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/){:target="_blank"} if you find a data type needs to be corrected.
### Data Lake deduplication
> info ""
> As of June 2022, deduplication is only supported for [AWS Data Lakes].
In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
### Using a Data Lake with a Data Warehouse
The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).
When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.