
Commit 9c19530

Started editing the DL overview page [DOC-493]
1 parent 6d2010b commit 9c19530

File tree

3 files changed: +33 −7 lines changed

(Two image files: 191 KB and 102 KB)

src/connections/storage/data-lakes/index.md

Lines changed: 33 additions & 7 deletions
@@ -5,40 +5,62 @@ redirect_from: '/connections/destinations/catalog/data-lakes/'

{% include content/plan-grid.md name="data-lakes" %}

-Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes can reduce the amount of processing required to get real value out of your data.
+A **data lake** is a centralized cloud storage location that holds structured and unstructured data.
+
+Data lakes typically have four layers:
+- **Storage layer:** Holds large files and raw data.
+- **Metadata store:** Stores the schema, or the process used to organize the files in the object store.
+- **Query layer:** Allows you to run SQL queries on the object store.
+- **Compute layer:** Allows you to write to and transform the data in the storage layer.
+
+![A graphic showing information flowing through the query, compute, and metadata layers, and then into the storage layer](images/data_lakes_overview_graphic.png)
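To make the four layers concrete, here is one plausible mapping of each layer to the components named later on this page. This is an editor's illustration only, not an exhaustive or authoritative list; the AWS and Azure columns correspond to the two deployment options described in the following sections.

```python
# Illustrative mapping of the four data lake layers to components named on this page.
DATA_LAKE_LAYERS = {
    "storage":  {"aws": "S3",                    "azure": "Azure Data Lake Storage Gen2"},
    "metadata": {"aws": "AWS Glue Data Catalog", "azure": "Hive Metastore"},
    "query":    {"aws": "Athena",                "azure": "Power BI / Azure HDInsight"},
    "compute":  {"aws": "EMR",                   "azure": "Azure Databricks"},
}
```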
+
+Segment Data Lakes sends Segment data to a cloud data store (either AWS S3 or Azure Data Lake Storage Gen2) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.

> info ""
> Segment Data Lakes is available to Business tier customers only.

-To learn more, check out the blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
-
+To learn more about Segment Data Lakes, check out the [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"} blog post.

## How Segment Data Lakes work

-Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or Machine Learning vendors like DataBricks or DataRobot.
+Segment currently supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.
+
+### How [AWS Data Lakes] work
+
+Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, and EMR, or machine learning vendors like Databricks or DataRobot.

![A diagram showing data flowing from Segment, through Parquet and S3, into Glue, and then into your Data Lake](images/dl_overview2.png)
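For example, once the Glue Data Catalog holds the table and partition metadata, the Parquet files in S3 can be queried with Athena. The boto3 sketch below is illustrative: the database, table, result bucket, and partition column names are hypothetical placeholders, not values from a real setup.

```python
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Hypothetical names: Data Lakes creates one table per event type in the Glue
# database configured during setup.
response = athena.start_query_execution(
    QueryString=(
        "SELECT received_at, user_id "
        "FROM pages "                   # hypothetical event table
        "WHERE day = '2022-06-01' "     # illustrative partition column
        "LIMIT 10"
    ),
    QueryExecutionContext={"Database": "my_segment_data_lake"},        # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # hypothetical bucket
)
print(response["QueryExecutionId"])
```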

Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own these AWS services and pay AWS for them directly.

![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)
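The role assumption itself is handled by Segment, but as a rough illustration of the mechanism, cross-account access through an assumed role looks like the boto3 sketch below. The role ARN, external ID, and session name are placeholders, not values from a real Data Lakes configuration.

```python
import boto3

sts = boto3.client("sts")

# Placeholder ARN and external ID -- real values come from your Data Lakes IAM setup.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/segment-data-lakes-role",
    RoleSessionName="segment-data-lakes-sync",
    ExternalId="example-external-id",
)["Credentials"]

# The temporary credentials can then be used to call AWS services
# (for example EMR) in the target account.
emr = boto3.client(
    "emr",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(emr.list_clusters()["Clusters"])
```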

-Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
+### How [Azure Data Lakes] work
+
+Data Lakes store Segment data in Azure Data Lake Storage Gen2 in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight, or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
+
+![A diagram showing data flowing from Segment, through Databricks, Parquet, and Azure Data Lake Storage Gen2 into the Hive Metastore, and then into your post-processing systems](images/Azure_DL_setup.png)
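Once the data lands in Azure Data Lake Storage Gen2, downstream systems can read the Parquet files directly. A minimal PySpark sketch, assuming a cluster that is already authenticated to the storage account; the account, container, and path below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("segment-data-lake-example").getOrCreate()

# Hypothetical storage account, container, and path -- substitute your own.
events = spark.read.parquet(
    "abfss://segment-data@mystorageaccount.dfs.core.windows.net/data/pages/"
)
events.select("received_at", "user_id").show(10)
```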

### Data Lake deduplication

+> info ""
+> As of June 2022, deduplication is only supported for [AWS Data Lakes].
+
In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24-hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `message_id` field.
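This deduplication happens inside Segment's ingestion pipeline, but the idea is simple enough to sketch. A rough pandas illustration, assuming local Parquet copies of an incoming batch and of the last 7 days of already-ingested events (the file names are hypothetical):

```python
import pandas as pd

# Hypothetical local copies of the data involved in a sync.
recent = pd.read_parquet("last_7_days_of_events.parquet")   # already-ingested events
incoming = pd.read_parquet("incoming_batch.parquet")        # new sync

# Drop any incoming event whose message_id was already ingested in the last 7 days.
deduped = incoming[~incoming["message_id"].isin(recent["message_id"])]
print(f"kept {len(deduped)} of {len(incoming)} events")
```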

### Using a Data Lake with a Data Warehouse

The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can [read more about the differences between Data Lakes and Warehouses](/docs/connections/storage/data-lakes/comparison/).

-When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3, or you can use Data Lakes in addition to a data warehouse.
+When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or Azure Data Lake Storage Gen2, or you can use Data Lakes in addition to a data warehouse.


## Set up Segment Data Lakes

-For detailed instructions on how to configure Segment Data Lakes, see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.
+
+### Set up [AWS Data Lakes]
+For detailed instructions on how to configure [AWS Data Lakes], see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.

### EMR

@@ -157,6 +179,10 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
- **Labels** - Labels set in Protocols aren't sent to Data Lakes.
{% endfaqitem %}

+{% faqitem How frequently does my Data Lake sync? %}
+Data Lakes offers 12 syncs in a 24-hour period and doesn't offer a custom sync schedule or selective sync.
+{% endfaqitem %}
+
{% faqitem What is the cost to use AWS Glue? %}
You can find details on Amazon's [pricing for Glue](https://aws.amazon.com/glue/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
{% endfaqitem %}
