`src/connections/storage/catalog/data-lakes/index.md`
@@ -18,19 +18,19 @@ To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lake
Before you set up [AWS Data Lakes], you need the following resources:
- An [AWS account](https://aws.amazon.com/account/){:target="_blank"}
- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket){:target="_blank"} to receive data and store logs
- A subnet within a VPC for the EMR cluster to run in
### Step 1 - Set up AWS resources
You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake){:target="_blank"} to automate much of the setup work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment guarantees support only for the template as provided. The Data Lakes setup uses Terraform v0.12+. To support more versions of Terraform, the AWS provider must use v4, which is included in the example `main.tf`.
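Since the module is standard Terraform, the version pinning described above lives in your own `main.tf`. The snippet below is a minimal, hedged sketch of that pinning only — the region is a placeholder, and the module's actual inputs are documented in the `segmentio/terraform-aws-data-lake` repository:

```hcl
# Minimal main.tf sketch: pin Terraform v0.12+ and the v4 AWS provider.
# The region is a placeholder — use the region that hosts your S3 bucket and EMR cluster.
terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = "~> 4.0" # the Data Lakes setup expects the v4 AWS provider
  }
}

provider "aws" {
  region = "us-west-2" # placeholder
}
```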
You can also use Segment's [manual setup instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.
The Terraform module and manual setup instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
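As an illustration of what such a customization might look like, the sketch below grants a role a narrow set of Glue permissions in Terraform. This is a hedged example, not the exact policy Segment requires — the action list, resource scope, and role reference are assumptions, so verify them against the Terraform module or the manual setup instructions:

```hcl
# Illustrative only: a Glue policy attached to the role Segment assumes.
# The actions and resources shown are assumptions, not Segment's required policy.
resource "aws_iam_role_policy" "segment_data_lakes_glue" {
  name = "segment-data-lakes-glue"
  role = aws_iam_role.segment_data_lakes.id # hypothetical role defined elsewhere in your config

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:UpdateTable"
      ]
      Resource = "*" # tighten to specific Glue catalog and database ARNs for stricter permissions
    }]
  })
}
```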
### Step 2 - Enable Data Lakes destination
After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:
@@ -61,43 +61,58 @@ After you set up the necessary AWS resources, the next step is to set up the Dat
Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.
### Step 3 - Verify data is synced to S3 and Glue
You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However, if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during setup, the first data lake sync will fail.
To be alerted of sync failures by email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications){:target="_blank"}.
`Sync Failed` emails are sent on the 1st, 5th, and 20th sync failures. Learn more about the [types of errors that can cause sync failures](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).
### (Optional) Step 4 - Replay historical data
If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/){:target="_blank"} to request one.
The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, Segment recommends starting with data from the last six months, and then replaying additional data if you find you need more.
Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
## Set up [Azure Data Lakes]
> info "[Azure Data Lakes] is currently in Public Beta"
> [Azure Data Lakes] is available in Public Beta.
To set up [Azure Data Lakes], create your [Azure resources](/docs/connections/storage/data-lakes/#set-up-azure-data-lakes) and then enable the Data Lakes destination in the Segment app.
### Prerequisites
Before you can configure your Azure resources, you must first [create an Azure subscription](https://azure.microsoft.com/en-us/free/){:target="_blank"}.
### Step 1 - Create an ADLS-enabled storage account
To
### Step 2 - Set up KeyVault
### Step 3 - Set up Azure MySQL database
### Step 4 - Set up Databricks
### Step 5 - Set up a Service Principal
### Step 6 - Configure Databricks cluster
### Step 7 - Enable the Data Lakes destination in the Segment app
After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
<!-- TODO: Test this workflow in a staging environment to verify that the steps are correct-->
1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**, then search for and select **Azure Data Lakes**.
2.
### Optional - Set up the Data Lake using Terraform
@@ -147,7 +162,7 @@ To connect a new source to Data Lakes:
{% endfaqitem %}
{% faqitem Can I configure multiple sources to use the same EMR cluster? %}
Yes, you can configure multiple sources to use the same EMR cluster. Segment recommends that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
{% endfaqitem %}
{% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}
`src/connections/storage/data-lakes/index.md`
@@ -15,12 +15,12 @@ Data lakes typically have four layers:
Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
> info ""
> Segment Data Lakes is available to Business tier customers only.
To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
## How Segment Data Lakes work
@@ -38,38 +38,40 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
### How [Azure Data Lakes] works
Data Lakes stores Segment data in ADLS in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight, or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
## Set up Segment Data Lakes
For more detailed information about setting up AWS and Azure Data Lakes, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
### Set up [AWS Data Lakes]
When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.
#### EMR
Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the setup instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running; however, the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.
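Purely for orientation — the Segment Terraform module provisions and manages the real cluster for you — a comparable cluster definition written by hand might look roughly like the sketch below. Every value here is an assumption (release label, roles, subnet variable, instance counts), not Segment's actual configuration:

```hcl
# Illustrative sketch only; the Segment Terraform module creates the actual cluster.
# Roles, release label, subnet, and counts below are placeholder assumptions.
resource "aws_emr_cluster" "data_lakes_example" {
  name          = "segment-data-lakes-example"
  release_label = "emr-5.33.0"       # placeholder EMR release
  applications  = ["Spark", "Hive"]
  service_role  = "EMR_DefaultRole"  # placeholder: AWS default EMR service role

  ec2_attributes {
    subnet_id        = var.emr_subnet_id      # hypothetical variable: the VPC subnet from the prerequisites
    instance_profile = "EMR_EC2_DefaultRole"  # placeholder: AWS default EMR instance profile
  }

  master_instance_group {
    instance_type = "m5.xlarge" # the node type referenced above
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2 # placeholder; the real cluster auto-scales
  }
}
```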
#### AWS IAM role
Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role and access your AWS account. Define the external ID in the IAM role as the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the Segment Workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to Settings > General Settings > ID. (A minimal trust-policy sketch in Terraform follows this list.)
- **s3_bucket**: Name of the S3 bucket used by the Data Lake.
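As a rough illustration of how the `external_ids` input maps onto the role, a hand-written trust policy might look like the sketch below. The principal ARN and variable name are placeholders, not Segment's real values — use the principal and permissions defined by the Terraform module or the manual setup instructions:

```hcl
# Sketch of an IAM role whose trust policy uses the Segment workspace ID as the external ID.
# The principal ARN is a placeholder, NOT Segment's actual AWS account.
resource "aws_iam_role" "segment_data_lakes" {
  name = "segment-data-lakes"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRole"
      Principal = {
        AWS = "arn:aws:iam::123456789012:root" # placeholder for the Segment AWS account
      }
      Condition = {
        StringEquals = {
          # hypothetical variable holding your workspace ID (Settings > General Settings > ID)
          "sts:ExternalId" = var.segment_workspace_id
        }
      }
    }]
  })
}
```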
### Set up [Azure Data Lakes]
Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:
- [Azure Storage Account](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal){:target="_blank"}: An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks. (A minimal Terraform sketch of an ADLS-enabled account follows this list.)
- [Service Principal](https://docs.microsoft.com/en-us/azure/purview/create-service-principal-azure){:target="_blank"}: Service principals are identities used to access specific resources.
- [Databricks Instance](https://azure.microsoft.com/en-us/services/databricks/#overview){:target="_blank"}: Azure Databricks is a data analytics cluster that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
- [Databricks Cluster](https://docs.microsoft.com/en-us/azure/purview/register-scan-hive-metastore-source){:target="_blank"}: The Databricks cluster is a cluster of computation resources that you can use to run data science and analytics workloads.
- [Azure MySQL Database](https://docs.microsoft.com/en-us/azure/purview/register-scan-azure-mysql-database){:target="_blank"}: The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
- [Azure KeyVault Instance](https://docs.microsoft.com/en-us/azure/key-vault/general/quick-create-portal){:target="_blank"}: Azure KeyVault provides a secure store for your keys, secrets, and certificates.
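To make the first component concrete, an ADLS Gen2-capable storage account can be declared in Terraform by enabling the hierarchical namespace. This is a hedged sketch — the resource group, names, and replication settings are placeholders, not Segment requirements:

```hcl
# Sketch of an ADLS Gen2-capable storage account; all names and settings are placeholders.
resource "azurerm_resource_group" "data_lakes_example" {
  name     = "segment-data-lakes-example"
  location = "westus2"
}

resource "azurerm_storage_account" "data_lakes_example" {
  name                     = "segmentdatalakesexample" # must be globally unique, lowercase alphanumeric
  resource_group_name      = azurerm_resource_group.data_lakes_example.name
  location                 = azurerm_resource_group.data_lakes_example.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # hierarchical namespace = Azure Data Lake Storage Gen2
}
```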
For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
@@ -112,7 +114,7 @@ Data Lakes stores the inferred schema and associated metadata of the S3 data in
<!--
TODO:
add annotated glue image calling out different parts of inferred schema
-->
New columns are appended to the end of the table in the Glue Data Catalog as they are detected.
@@ -128,23 +130,23 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat
### Data types
Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for each event.
The data types supported in [AWS Data Lakes] are:
- bigint
- boolean
- decimal(38,6)
- string
- timestamp
The data types supported in [Azure Data Lakes] are:
- bigint
- boolean
- decimal(38,6)
- string
- timestamp
### Schema evolution
Once Data Lakes sets a data type for a column, it attempts to cast all subsequent data into that data type. If incoming data does not match the data type, Data Lakes tries to cast the column to the target data type.
@@ -167,7 +169,7 @@ In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate
The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).
When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or ADLS, or you can use Data Lakes in addition to a data warehouse.