Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
> note "Lake Formation"
> You can also set up your [AWS Data Lakes] using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
## Set up [AWS Data Lakes]
To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lakes] destination in the Segment app, and verify that your Segment data synced to S3 and Glue.
### Prerequisites
Before you set up [AWS Data Lakes], you need the following resources:
- An [AWS account](https://aws.amazon.com/account/)
- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to receive data and store logs
- A subnet within a VPC for the EMR cluster to run in

### Step 1 - Set Up AWS Resources

You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the setup work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment only guarantees support for the template as provided. The Data Lakes setup uses Terraform v0.12+. To support newer versions of Terraform, the AWS provider must use v4, which is included in the example `main.tf`.
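If it helps to see the overall shape of that configuration, here is a minimal, illustrative `main.tf` sketch. The module source, region, and any inputs beyond `external_ids` and `s3_bucket` (the IAM-related inputs described later in this doc) are placeholders to fill in from the module's README, not values this guide prescribes.

```hcl
# Illustrative sketch only -- check the terraform-aws-data-lake README for the
# module's actual source path, version, and full list of required inputs.

terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0" # the example main.tf uses the v4 AWS provider
    }
  }
}

provider "aws" {
  region = "us-west-2" # placeholder -- use the region where your S3 bucket and EMR cluster live
}

module "segment_data_lake" {
  source = "<path-or-git-url-to-the-terraform-aws-data-lake-module>" # placeholder

  # Inputs described in this guide; the module may require others.
  external_ids = ["<your-segment-workspace-id>"] # Segment workspace ID(s) used as the IAM external ID
  s3_bucket    = "<your-data-lake-s3-bucket>"    # existing bucket that receives data and stores logs
}
```

Running `terraform init` and `terraform plan` against a sketch like this surfaces any inputs the module still expects before you apply it.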

You can also use Segment's [manual setup instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.

The Terraform module and manual setup instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions or other custom configurations, you can customize these manually.

### Step 2 - Enable Data Lakes Destination
After you set up the necessary AWS resources, the next step is to set up the Data Lakes destination within Segment:
Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.

### Step 3 - Verify Data is Synced to S3 and Glue

You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However, if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during setup, the first data lake sync will fail.
`Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).

### (Optional) Step 4 - Replay Historical Data
If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.

The time needed to process a Replay can vary depending on the volume of data and the number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months, and then replay additional data if you find you need more.

Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs aren't interrupted, and helps the replay finish faster.
## Set up [Azure Data Lakes]
To set up [Azure Data Lakes], create your Azure resources and enable the Data Lakes destination in the Segment app.
### Prerequisites

### Step 1 - Create an ADLS-enabled storage account

### Step 2 - Set up KeyVault

### Step 3 - Set up Azure MySQL DB
### Step 4 - Set up Databricks

### Step 5 - Set up a Service Principal
### Step 6 - Configure Databricks cluster
### Step 7 - Enable Data Lakes destination in the Segment app
### Optional - Set up the Data Lake using Terraform
## FAQ
### [AWS Data Lakes]
{% faq %}
{% faqitem Do I need to create Glue databases? %}
No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
{% endfaqitem %}
{% faqitem What IAM role do I use in the Settings page? %}
Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
{% endfaqitem %}
{% faqitem What level of access do the AWS roles have? %}

The roles which Data Lakes assigns during setup are:

- **`segment_emr_autoscaling_role`** - Restricted role that can only be assumed by EMR and EC2. This is set up based on [AWS best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-automatic-scaling.html).
{% endfaqitem %}
{% faqitem Why doesn't the Data Lakes Terraform module create an S3 bucket? %}
The module doesn't create a new S3 bucket so you can re-use an existing bucket for your Data Lakes.
{% endfaqitem %}
{% faqitem Does my S3 bucket need to be in the same region as the other infrastructure? %}
Yes, the S3 bucket and the EMR cluster must be in the same region.
{% endfaqitem %}
{% faqitem How do I connect a new source to Data Lakes? %}
To connect a new source to Data Lakes:
1. Ensure that the `workspace_id` of the Segment workspace is in the list of [external ids](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids) in the IAM policy. You can either update this from the AWS console, or re-run the [Terraform](https://github.com/segmentio/terraform-aws-data-lake) job.
2. From your Segment workspace, connect the source to the Data Lakes destination.
{% endfaqitem %}
{% faqitem Can I configure multiple sources to use the same EMR cluster? %}

Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
{% endfaqitem %}
{% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}
If you don't see data after enabling a source, check the following:
- Does the IAM role have the Segment account ID and workspace ID as the external ID?
If all of these look correct and you're still not seeing any data, please [contact the Support team](https://segment.com/help/contact/).
{% endfaqitem %}
{% faqitem What are "Segment Output" tables in S3? %}
The `output` tables are temporary tables Segment creates when loading data. They are deleted after each sync.
{% endfaqitem %}
{% faqitem Can I make additional directories in the S3 bucket Data Lakes is using? %}
Yes, you can create new directories in S3 without interfering with Segment data.

Do not modify or create additional directories with the following names:

- `segment-data/`
- `segment-logs/`
{% endfaqitem %}
{% faqitem What does "partitioned" mean in the table name? %}
`Partitioned` just means that the table has partition columns (day and hour). All tables are partitioned, so you should see this on all table names.
{% endfaqitem %}
{% faqitem How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data? %}
You can use the following command to create external tables in Spectrum to access tables in Glue and join the data with Redshift:
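The statement takes the standard Redshift Spectrum `CREATE EXTERNAL SCHEMA` form sketched below. The bracketed values are placeholders, and the Glue database and IAM role names shown here are assumptions for illustration rather than names this guide defines.

```sql
-- Sketch only: replace every bracketed value with your own names.
CREATE EXTERNAL SCHEMA [spectrum_schema_name]
FROM DATA CATALOG
DATABASE '[glue_database_name]'
IAM_ROLE 'arn:aws:iam::[account_id]:role/[redshift_spectrum_role]';
```

Once the external schema exists, the tables in the Glue database appear under `[spectrum_schema_name]` and can be joined with local Redshift tables in ordinary queries.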

Replace:

- [spectrum_schema_name] = The schema name in Redshift you want to map to

`src/connections/storage/data-lakes/index.md`
Segment currently supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.

### How [AWS Data Lakes] works
Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or machine learning vendors like DataBricks or DataRobot.


### How [Azure Data Lakes] works
Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure DataBricks or Azure Synapse Analytics.

## Set up Segment Data Lakes

For more detailed information about setting up AWS and Azure Data Lakes, see the sections below.
### Set up [AWS Data Lakes]
For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.

#### EMR

Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the setup instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster running at all times; however, the cluster auto-scales so that it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr){:target="_blank"}.

#### AWS IAM role

Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are listed below, followed by a sketch of the trust relationship they define:

- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role and access your AWS account. Set the external ID in the IAM role to the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the Segment workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} under Settings > General Settings > ID.
- **s3_bucket**: Name of the S3 bucket used by the Data Lake.
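Concretely, the `external_ids` value ends up in the role's trust policy as an `sts:ExternalId` condition. The Terraform module (or the manual setup instructions) creates this role for you; the sketch below only illustrates the shape of that trust relationship, and the Segment principal ARN is a placeholder.

```hcl
# Illustrative sketch of the trust relationship behind the external_ids input.
# The principal ARN is a placeholder -- the real value comes from the Terraform
# module or Segment's manual setup instructions.
resource "aws_iam_role" "segment_data_lake" {
  name = "segment-data-lake-iam-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "<segment-principal-arn>" } # placeholder
      Condition = {
        # Your Segment workspace ID(s), matching the external_ids input
        StringEquals = { "sts:ExternalId" = ["<your-segment-workspace-id>"] }
      }
    }]
  })
}
```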
### Set up [Azure Data Lakes]
Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:
- Azure Storage Account
- Service Principal
- Databricks Instance
- Databricks Cluster
- Azure MySQL Database
- Azure KeyVault Instance
For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
## Data Lakes schema
### [AWS Data Lakes] schema
#### S3 partition structure
Segment partitions the data in S3 by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.
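For example, an event lands under a path shaped roughly like the one below; the bucket name, source, and event names are placeholders, and the exact prefix depends on your configuration.

```
s3://<your-data-lake-bucket>/segment-data/<source>/<event-type>/day=2022-06-14/hr=03/<file>.parquet
```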

By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>`. Other supported date partition structures include:

- Year/Month/Day [YYYY/MM/DD]
- Day [YYYY-MM-DD]

#### AWS Glue data catalog
Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.
New columns are appended to the end of the table in the Glue Data Catalog as they are detected.

##### Glue database
The schema inferred by Segment is stored in a Glue database within Glue Data Catalog. Segment stores the schema for each source in its own Glue database to organize the data so it is easier to query. To make it easier to find, Segment writes the schema to a Glue database named using the source slug by default. The database name can be modified from the Data Lakes settings.
> info ""
> The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
### [Azure Data Lakes] schema
### Data types
Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for each event.

The data types supported in Glue are:

- string
- timestamp
The data types supported in the Hive Metastore are:
- bigint
- boolean
- decimal(38,6)
- string
- timestamp
#### Schema evolution
Once Data Lakes sets a data type for a column, it attempts to cast all subsequent data into that data type. If incoming data does not match the data type, Data Lakes tries to cast the column to the target data type.
If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/){:target="_blank"} if you find a data type needs to be corrected.
### Data Lake deduplication
> info ""
> As of June 2022, deduplication is only supported for [AWS Data Lakes].
In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
### Using a Data Lake with a Data Warehouse
The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).
When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.