{% include content/plan-grid.md name="data-lakes" %}

A **data lake** is a centralized cloud storage location that holds structured and unstructured data.

Data lakes typically have four layers (the sketch after this list maps them onto example services):
- **Storage layer:** Holds large files and raw data.
- **Metadata store:** Stores the schema, or the process used to organize the files in the object store.
- **Query layer:** Allows you to run SQL queries on the object store.
- **Compute layer:** Allows you to write to and transform the data in the storage layer.
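
As a rough illustration, here is a minimal sketch of how these four layers can map onto concrete services in an AWS-based lake; the service choices are examples, not requirements:

```python
# Hypothetical mapping of the four data lake layers onto AWS services.
# Other clouds have equivalents (for example, Azure Data Lake Storage
# Gen2 for storage and the Hive Metastore for metadata).
DATA_LAKE_LAYERS = {
    "storage":  "Amazon S3",              # holds large files and raw data
    "metadata": "AWS Glue Data Catalog",  # stores the schema for the object store
    "query":    "Amazon Athena",          # runs SQL queries on the object store
    "compute":  "Amazon EMR (Spark)",     # writes to and transforms stored data
}
```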

Segment Data Lakes sends Segment data to a cloud data store (either AWS S3 or Azure Data Lake Storage Gen2) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large-scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
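
Because the stored data is Parquet, a columnar format, downstream jobs can read just the columns they need. A minimal sketch using pandas; the bucket, path layout, and column names are hypothetical and depend on your Data Lakes configuration:

```python
import pandas as pd  # reading s3:// paths also requires s3fs or pyarrow with S3 support

# Hypothetical location; actual buckets, prefixes, and partitions
# depend on how your Data Lake is configured.
path = "s3://your-segment-data-lake/segment-data/page/day=2022-06-01/"

# Parquet is columnar, so selecting columns avoids scanning the rest.
df = pd.read_parquet(path, columns=["anonymous_id", "context_page_url", "timestamp"])
print(df.head())
```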

> info ""
> Segment Data Lakes is available to Business tier customers only.

To learn more about Segment Data Lakes, check out the [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"} blog post.

## How Segment Data Lakes work

Segment currently supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.

### How [AWS Data Lakes] work

Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or machine learning vendors like Databricks or DataRobot.
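
As an example of what the Glue-cataloged tables enable, here is a sketch that runs an Athena query over an event table; the region, database, table, partition key, and results bucket are all hypothetical placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Database, table, and partition names are hypothetical; Data Lakes
# registers one table per event type in the Glue Data Catalog.
response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM page WHERE day = '2022-06-01'",
    QueryExecutionContext={"Database": "segment_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results/"},
)
print(response["QueryExecutionId"])
```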

Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.
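
Because the EMR cluster runs in your own AWS account, you can inspect it with your usual tooling. A minimal sketch using boto3 (Segment's orchestration itself is internal; this only lists the clusters in your account):

```python
import boto3

# List EMR clusters in your account, for example to find the one your
# Data Lake uses. Region and cluster states are illustrative choices.
emr = boto3.client("emr", region_name="us-west-2")
for cluster in emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]:
    print(cluster["Id"], cluster["Name"])
```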

### How [Azure Data Lakes] work

Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight, or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
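
Once synced, the Parquet data in Azure Data Lake Storage Gen2 can be read from Spark, for example in Azure Databricks. A minimal sketch; the storage account, container, and path are hypothetical, and the cluster is assumed to already have access to the storage account:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 location; the real layout depends on your
# Data Lakes configuration.
path = "abfss://segment-data@yourstorageaccount.dfs.core.windows.net/page/"
df = spark.read.parquet(path)
df.select("anonymous_id", "timestamp").show(5)
```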

### Data Lake deduplication

> info ""
> As of June 2022, deduplication is only supported for [AWS Data Lakes].

In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24-hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
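
To make the idea concrete, here is a sketch of deduplicating a batch of events on `message_id` in Spark. This is illustrative only, not Segment's internal implementation, and the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: Data Lakes performs this deduplication for you at
# ingest, keyed on message_id across a 7-day window.
events = spark.read.parquet("s3://your-segment-data-lake/segment-data/page/")
deduped = events.dropDuplicates(["message_id"])
print(events.count() - deduped.count(), "duplicates removed")
```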

### Using a Data Lake with a Data Warehouse

The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).

When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.

## Set up Segment Data Lakes

### Set up [AWS Data Lakes]

For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.

### EMR

Data types and labels available in Protocols aren't supported by Data Lakes.

- **Labels** - Labels set in Protocols aren't sent to Data Lakes.
{% endfaqitem %}

{% faqitem How frequently does my Data Lake sync? %}
Data Lakes offers 12 syncs in a 24-hour period and doesn't offer a custom sync schedule or selective sync.
{% endfaqitem %}

{% faqitem What is the cost to use AWS Glue? %}
You can find details on Amazon's [pricing for Glue](https://aws.amazon.com/glue/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
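
Since Data Lakes adds one partition per hour to each event table, a back-of-the-envelope sketch of monthly partition growth (the event-type count is a hypothetical example):

```python
# Rough sizing sketch: Data Lakes creates 1 table per event type and
# adds 1 partition per hour to each event table.
event_types = 25              # hypothetical number of event types in a source
hours_per_month = 24 * 30
new_partitions_per_month = event_types * hours_per_month
print(new_partitions_per_month)  # 18000
```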