Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to both Warehouses and Data Lakes.

[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
## Object vs event data
Warehouses support both event and object data, while Data Lakes supports only event data.

{% include content/plan-grid.md name="data-lakes" %}

> warning "Segment Data Lakes (Azure) deletion policies"
> Data deletion is not supported by Segment Data Lakes (Azure), as customers retain data in systems that they manage.

A **data lake** is a centralized cloud storage location that holds structured and unstructured data.

Data lakes typically have four layers:
- **Storage layer:** Holds large files and raw data.
- **Metadata store:** Stores the schema, or the process used to organize the files in the object store.
- **Query layer:** Allows you to run SQL queries on the object store.
- **Compute layer:** Allows you to write to and transform the data in the storage layer.

Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
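
For example, once Segment has written data to your lake, the Parquet output can be read with standard tooling. The following PySpark sketch is illustrative only: the bucket name and prefix are placeholders, and your actual path layout comes from your Data Lakes configuration.

```python
# Illustrative sketch: read the Parquet files Segment wrote to your data lake.
# The bucket and prefix are placeholders; substitute the S3 (or ADLS) location
# from your own Data Lakes settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-segment-data-lake").getOrCreate()

events = spark.read.parquet("s3://your-segment-data-lake-bucket/segment-data/")

events.printSchema()  # columns are inferred from your event properties
events.show(10)
```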
> info ""
> Segment Data Lakes is available to Business tier customers only.
To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
## How Data Lakes work

Segment supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.

### How Segment Data Lakes works

Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account.

### How Segment Data Lakes (Azure) works

Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
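
For example, from a Spark environment such as Azure Databricks (shown here purely as an illustration, with placeholder names), the Parquet output in ADLS Gen2 can be read directly:

```python
# Illustrative sketch for a Databricks (or other Spark) notebook that already has
# access to the storage account. The container, account, and path are placeholders.
events = spark.read.parquet(
    "abfss://<your-container>@<your-storage-account>.dfs.core.windows.net/<segment-output-path>/"
)

events.printSchema()
events.show(10)
```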

## Set up Segment Data Lakes (Azure)

For detailed Segment Data Lakes (Azure) setup instructions, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

### Set up Segment Data Lakes
When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.
#### EMR

Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes.

#### AWS IAM role

Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
- **external_ids**: The external ID is the value Segment uses when assuming the IAM role that grants access to your AWS account. Define the external ID on the IAM role as the ID of the Segment workspace you want to connect to Data Lakes; you can find the workspace ID in the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} under **Settings > General Settings > ID**. (See the trust-policy sketch below.)
- **s3_bucket**: Name of the S3 bucket used by the Data Lake.
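
The external ID and role map onto a standard IAM trust policy. The sketch below uses boto3 and is illustrative only: the role name is arbitrary, the Segment principal ARN is a placeholder for the value provided during setup, and the role's access policies (for example, for the S3 bucket above) still need to be attached as described on the setup page.

```python
# Illustrative sketch, not Segment's official setup script.
# SEGMENT_PRINCIPAL_ARN is a placeholder for the Segment AWS principal from the
# setup instructions; WORKSPACE_ID is your workspace ID (Settings > General Settings > ID).
import json
import boto3

SEGMENT_PRINCIPAL_ARN = "arn:aws:iam::XXXXXXXXXXXX:root"
WORKSPACE_ID = "your-segment-workspace-id"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": SEGMENT_PRINCIPAL_ARN},
            "Action": "sts:AssumeRole",
            # Segment presents your workspace ID as the external ID when assuming the role.
            "Condition": {"StringEquals": {"sts:ExternalId": WORKSPACE_ID}},
        }
    ],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="segment-data-lakes-role",  # arbitrary name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```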

### Set up Segment Data Lakes (Azure)

To connect Segment Data Lakes (Azure), you must set up the following components in your Azure environment:

- [Azure Storage Account](/docs/connections/storage/catalog/data-lakes/#step-1---create-an-alds-enabled-storage-account): An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks.
- [Azure KeyVault Instance](/docs/connections/storage/catalog/data-lakes/#step-2---set-up-key-vault): Azure KeyVault provides a secure store for your keys, secrets, and certificates.
- [Azure MySQL Database](/docs/connections/storage/catalog/data-lakes/#step-3---set-up-azure-mysql-database): The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
- [Databricks Instance](/docs/connections/storage/catalog/data-lakes/#step-4---set-up-databricks): Azure Databricks is a data analytics cluster that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
- [Databricks Cluster](/docs/connections/storage/catalog/data-lakes/#step-6---configure-databricks-cluster): The Databricks cluster is a cluster of computation resources that you can use to run data science and analytics workloads.
- [Service Principal](/docs/connections/storage/catalog/data-lakes/#step-5---set-up-a-service-principal): Service principals are identities used to access specific resources.

For more information about configuring Segment Data Lakes (Azure), see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes-azure).
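
Once these components exist, the service principal is also what your own tools can use to reach the lake. The sketch below is illustrative only; every identifier is a placeholder rather than a value from this page.

```python
# Illustrative sketch: authenticate as the service principal and list the paths
# Segment has written to the ADLS Gen2 filesystem. All identifiers are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(
    tenant_id="<your-tenant-id>",
    client_id="<service-principal-client-id>",
    client_secret="<service-principal-client-secret>",
)

service = DataLakeServiceClient(
    account_url="https://<your-storage-account>.dfs.core.windows.net",
    credential=credential,
)

filesystem = service.get_file_system_client(file_system="<your-container>")

for path in filesystem.get_paths():
    print(path.name)
```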
## Data Lakes schema

The schema inferred by Segment is stored in a Glue database within Glue Data Catalog.

> info ""
> The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
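
Because the schema lives in Glue Data Catalog, the tables Segment creates can be queried with Glue-integrated engines such as Athena. The sketch below is illustrative only; the database, table, and results bucket are placeholders, and your Glue database names come from your own sources.

```python
# Illustrative sketch: query one of the Glue tables through Athena.
# The database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM my_segment_source_db.my_event_table LIMIT 10",
    QueryExecutionContext={"Database": "my_segment_source_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])
```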

### Segment Data Lakes (Azure) schema

Segment Data Lakes (Azure) applies a consistent schema to make raw data accessible for queries. A transformer automatically calculates the desired schema and uploads a schema JSON file for each event type to your Azure Data Lake Storage (ADLS) in the `/staging/` directory.

Segment partitions the data in ADLS by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.

If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best-effort conversion to cast the field to the target data type.

### Data Lake deduplication

In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last seven days, based on the `messageId` field.
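
This deduplication happens inside Segment's pipeline, so there is nothing to configure. Purely as an illustration of the idea (not Segment's internal implementation), a `messageId`-based pass over data you manage yourself might look like this:

```python
# Illustration only; Segment performs this deduplication for you during ingestion.
# The column name in your lake may differ (for example, message_id).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-by-message-id").getOrCreate()

events = spark.read.parquet("s3://your-segment-data-lake-bucket/segment-data/")

deduped = events.dropDuplicates(["messageId"])  # keep one row per messageId
deduped.write.mode("overwrite").parquet("s3://your-bucket/deduped-events/")
```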
### Using a Data Lake with a Data Warehouse

When you use Data Lakes, you can either use Data Lakes as your _only_ source of data, or use Data Lakes in addition to a data warehouse.

## FAQ
### Can I send all of my Segment data into Data Lakes?
Data Lakes supports data from all event sources, including website libraries, mobile, server, and event cloud sources. Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources) or the users and accounts tables from event cloud sources.

As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns.

### How does Data Lakes work with Protocols?
Data Lakes has no direct integration with [Protocols](/docs/protocols/).
Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.

Segment stops creating new tables for the events after you exceed this limit.

You should also read the [additional considerations in Amazon's documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.

### What analytics tools are available to use with Segment Data Lakes (Azure)?
Segment Data Lakes (Azure) supports the following analytics tools: