
Commit 01cd54e

Added outline of Azure resource setup steps, Vale updates
1 parent 1981b91 commit 01cd54e

2 files changed (+53, -36 lines)

src/connections/storage/catalog/data-lakes/index.md

Lines changed: 32 additions & 17 deletions
@@ -18,19 +18,19 @@ To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lake

 Before you set up [AWS Data Lakes], you need the following resources:

-- An [AWS account](https://aws.amazon.com/account/)
-- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to receive data and store logs
+- An [AWS account](https://aws.amazon.com/account/){:target="_blank"}
+- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket){:target="_blank"} to receive data and store logs
 - A subnet within a VPC for the EMR cluster to run in

-### Step 1 - Set Up AWS Resources
+### Step 1 - Set up AWS resources

-You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the set up work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs, however Segment guarantees support only for the template as provided. The Data Lakes set up uses Terraform v0.12+. To support more versions of Terraform, the AWS provider must use v4, which is included in the example main.tf.
+You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake){:target="_blank"} to automate much of the setup work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment guarantees support only for the template as provided. The Data Lakes setup uses Terraform v0.12+. To support more versions of Terraform, the AWS provider must use v4, which is included in the example main.tf.

 You can also use Segment's [manual set up instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.

 The Terraform module and manual set up instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
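
A minimal `main.tf` along the lines described above might look like the following sketch. The provider pin reflects the v4 requirement; the module `source` path and the input names (`s3_bucket`, `external_ids`) are illustrative assumptions, so check the module's README for the exact inputs it expects.

```hcl
terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0" # the Data Lakes setup expects the v4 AWS provider
    }
  }
}

provider "aws" {
  region = "us-west-2" # region that holds your Data Lakes S3 bucket
}

# Hypothetical module invocation: the path and input names are illustrative,
# not authoritative; see the terraform-aws-data-lake README for the real ones.
module "segment_data_lake" {
  source = "github.com/segmentio/terraform-aws-data-lake//modules/iam"

  s3_bucket    = "my-segment-data-lake"          # bucket that receives data and stores logs
  external_ids = ["<your-segment-workspace-id>"] # workspace ID used as the IAM external ID
}
```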

-### Step 2 - Enable Data Lakes Destination
+### Step 2 - Enable Data Lakes destination

 After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:

@@ -61,43 +61,58 @@ After you set up the necessary AWS resources, the next step is to set up the Dat
 Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.


-### Step 3 - Verify Data is Synced to S3 and Glue
+### Step 3 - Verify data is synced to S3 and Glue

 You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during set up, the first data lake sync will fail.

-To be alerted of sync failures via email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications).
+To be alerted of sync failures by email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications){:target="_blank"}.


 `Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).


-### (Optional) Step 4 - Replay Historical Data
+### (Optional) Step 4 - Replay historical data

-If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.
+If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/){:target="_blank"} to request one.

-The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months to get started, and then replay additional data if you find you need more.
+The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, Segment recommends starting with data from the last six months, and then replaying additional data if you find you need more.

 Segment creates a separate EMR cluster to run replays, then destroys it when the replay finished. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.

 ## Set up [Azure Data Lakes]

-To set up [Azure Data Lakes], create your Azure resources and enable the Data Lakes destination in the Segment app.
+> info "[Azure Data Lakes] is currently in Public Beta"
+> [Azure Data Lakes] is available in Public Beta.
+
+To set up [Azure Data Lakes], create your [Azure resources](/docs/connections/storage/data-lakes/#set-up-azure-data-lakes) and then enable the Data Lakes destination in the Segment app.

 ### Prerequisites

-### Step 1 - Create and ALDS-enabled storage account
+Before you can configure your Azure resources, you must first [create an Azure subscription](https://azure.microsoft.com/en-us/free/){:target="_blank"}.
+
+### Step 1 - Create an ADLS-enabled storage account

-### Step 2 - Setup KeyVault
+To
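
The body of Step 1 is still a stub above. For orientation, an ADLS-enabled account is a standard Azure storage account created with the hierarchical namespace turned on; a minimal sketch using the `azurerm` Terraform provider follows, where the resource group, names, location, and replication setting are illustrative assumptions.

```hcl
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "segment" {
  name     = "segment-data-lakes" # illustrative name
  location = "westus2"
}

# ADLS Gen2 is a storage account with the hierarchical namespace (HNS) enabled.
resource "azurerm_storage_account" "data_lake" {
  name                     = "segmentdatalake01" # must be globally unique and lowercase
  resource_group_name      = azurerm_resource_group.segment.name
  location                 = azurerm_resource_group.segment.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # this flag is what makes the account ADLS-capable
}
```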

-### Step 3 - Setup Azure MySQL DB
+### Step 2 - Set up KeyVault
+
+### Step 3 - Set up Azure MySQL database

 ### Step 4 - Set up Databricks

-### Step 5 - Setup a Service Principal
+### Step 5 - Set up a Service Principal

 ### Step 6 - Configure Databricks cluster

-### Step 7 - Enable Data Lakes destination in the Segment app
+### Step 7 - Enable the Data Lakes destination in the Segment app
+
+After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
+
+<!-- TODO: Test this workflow in a staging environment to verify that the steps are correct -->
+
+1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**, then search for and select **Azure Data Lakes**.
+2.
+

 ### Optional - Set up the Data Lake using Terraform

@@ -147,7 +162,7 @@ To connect a new source to Data Lakes:
 {% endfaqitem %}

 {% faqitem Can I configure multiple sources to use the same EMR cluster? %}
-Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes job.
+Yes, you can configure multiple sources to use the same EMR cluster. Segment recommends that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
 {% endfaqitem %}

 {% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}

src/connections/storage/data-lakes/index.md

Lines changed: 21 additions & 19 deletions
@@ -15,12 +15,12 @@ Data lakes typically have four layers:

 ![A graphic showing the information flowing from the metadata into the query, compute, and metadata layers, and then into the storage layer](images/data_lakes_overview_graphic.png)

-Segment Data Lakes sends Segment data to a cloud data store (either AWS S3 or Azure Data Lake Storage Gen2) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
+Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.

 > info ""
 > Segment Data Lakes is available to Business tier customers only.

-To learn more about Segment Data Lakes, check out the [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"} blog post.
+To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.

 ## How Segment Data Lakes work

@@ -38,38 +38,40 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR

 ### How [Azure Data Lakes] works

-Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.
+Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.

 ![A diagram showing data flowing from Segment, through DataBricks, Parquet and Azure Data Lake Storage Gen2 into the Hive Metastore, and then into your post-processing systems](images/Azure_DL_setup.png)


 ## Set up Segment Data Lakes

-For more detailed information about setting up AWS and Azure Data Lakes, please see
+For more detailed information about setting up AWS and Azure Data Lakes, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

 ### Set up [AWS Data Lakes]
-For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.
+When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.

 #### EMR

-Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.
+Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.

 #### AWS IAM role

 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
-- **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} when navigating to the Settings > General Settings > ID.
+- **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the ID of the Segment workspace that you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to Settings > General Settings > ID.
 - **s3_bucket**: Name of the S3 bucket used by the Data Lake.
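
To make the `external_ids` input concrete, the sketch below shows the kind of trust relationship the external ID ends up in. The Terraform module and the manual instructions create this role for you; the principal ARN, role name, and policy shape here are placeholders rather than Segment's actual values.

```hcl
# Placeholder trust policy: Segment can assume this role only when it presents
# your workspace ID as the external ID.
data "aws_iam_policy_document" "segment_data_lake_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::<segment-aws-account-id>:root"] # use the account ID from the setup docs
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = ["<your-segment-workspace-id>"] # the external_ids input
    }
  }
}

resource "aws_iam_role" "segment_data_lake" {
  name               = "segment-data-lake-role" # illustrative name
  assume_role_policy = data.aws_iam_policy_document.segment_data_lake_assume.json
}
```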

 ### Set up [Azure Data Lakes]

+
+
 Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:

-- Azure Storage Account
-- Service Principal
-- Databricks Instance
-- Databricks Cluster
-- Azure MySQL Database
-- Azure KeyVault Instance:
+- [Azure Storage Account](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal){:target="_blank"}: An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks.
+- [Service Principal](https://docs.microsoft.com/en-us/azure/purview/create-service-principal-azure){:target="_blank"}: Service principals are identities used to access specific resources.
+- [Databricks Instance](https://azure.microsoft.com/en-us/services/databricks/#overview){:target="_blank"}: Azure Databricks is a data analytics platform that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
+- [Databricks Cluster](https://docs.microsoft.com/en-us/azure/purview/register-scan-hive-metastore-source){:target="_blank"}: The Databricks cluster is a set of computation resources that you can use to run data science and analytics workloads.
+- [Azure MySQL Database](https://docs.microsoft.com/en-us/azure/purview/register-scan-azure-mysql-database){:target="_blank"}: The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
+- [Azure KeyVault Instance](https://docs.microsoft.com/en-us/azure/key-vault/general/quick-create-portal){:target="_blank"}: Azure KeyVault provides a secure store for your keys, secrets, and certificates.
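
A rough Terraform sketch of how a few of these components can be declared, assuming the `azurerm` and `azuread` providers; every name, SKU, and location below is a placeholder, and the MySQL database and the Databricks cluster itself are omitted for brevity.

```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "segment" {
  name     = "segment-data-lakes"
  location = "westus2"
}

# Service principal that will be granted access to the Data Lakes resources.
resource "azuread_application" "segment_data_lake" {
  display_name = "segment-data-lake"
}

resource "azuread_service_principal" "segment_data_lake" {
  application_id = azuread_application.segment_data_lake.application_id # `client_id` in newer azuread provider versions
}

# Databricks workspace; the cluster is created inside it afterwards.
resource "azurerm_databricks_workspace" "segment" {
  name                = "segment-databricks"
  resource_group_name = azurerm_resource_group.segment.name
  location            = azurerm_resource_group.segment.location
  sku                 = "premium"
}

# Key Vault for the secrets the later setup steps reference.
resource "azurerm_key_vault" "segment" {
  name                = "segment-datalake-kv" # must be globally unique
  resource_group_name = azurerm_resource_group.segment.name
  location            = azurerm_resource_group.segment.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}
```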

 For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

@@ -112,7 +114,7 @@ Data Lakes stores the inferred schema and associated metadata of the S3 data in
 ![A screenshot of the AWS ios_prod_identify table, displaying the schema for the table, information about the table, and the table version](images/dl_gluecatalog.png)
 <!--
 TODO:
-add annotated glue image calling out different parts of inferred schema)
+add annotated glue image calling out different parts of inferred schema
 -->

 New columns are appended to the end of the table in the Glue Data Catalog as they are detected.
@@ -128,23 +130,23 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat

 ### Data types

-Data Lakes infers the data type for an event it receives. Groups of events are poled every hour to infer the data type for that each event.
+Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for each event.

-The data types supported in Glue are:
+The data types supported in [AWS Data Lakes] are:
 - bigint
 - boolean
 - decimal(38,6)
 - string
 - timestamp

-The data types supported in the Hive Metastore are:
+The data types supported in [Azure Data Lakes] are:
 - bigint
 - boolean
 - decimal(38,6)
 - string
 - timestamp
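
Data Lakes creates and maintains these tables automatically, so you never declare them yourself; purely to make the five types concrete, a hypothetical Glue table using all of them might look like this sketch.

```hcl
# Hypothetical table for illustration only: Data Lakes manages the real ones.
resource "aws_glue_catalog_table" "page_viewed" {
  name          = "page_viewed"
  database_name = "my_segment_source" # Glue database created for the source
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location = "s3://my-segment-data-lake/data/page_viewed/"

    columns {
      name = "received_at"
      type = "timestamp"
    }
    columns {
      name = "quantity"
      type = "bigint"
    }
    columns {
      name = "price"
      type = "decimal(38,6)"
    }
    columns {
      name = "is_member"
      type = "boolean"
    }
    columns {
      name = "product_name"
      type = "string"
    }
  }
}
```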

-#### Schema evolution
+### Schema evolution

 Once Data Lakes sets a data type for a column, all subsequent data will attempt to be cast into that data type. If incoming data does not match the data type, Data Lakes tries to cast the column to the target data type.

@@ -167,7 +169,7 @@ In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate

 The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).

-When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.
+When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or ADLS, or you can use Data Lakes in addition to a data warehouse.

 ## FAQ
 {% faq %}
