Commit 7be5d5b

Author: markzegarelli
Commit message: merge master
Parents: f8a4a3f + 20fa705

File tree: 10 files changed, +253 −261 lines

src/_data/catalog/destinations.yml

Lines changed: 129 additions & 125 deletions
Large diffs are not rendered by default.

src/_data/catalog/slugs.yml

Lines changed: 1 addition & 1 deletion

@@ -45,4 +45,4 @@
 - original: "rokt"
   override: "rokt-integration"
 - original: "salesforce-actions"
-  override: "actions-salesforce"
+  override: "actions-salesforce"

src/connections/storage/catalog/data-lakes/index.md

Lines changed: 66 additions & 67 deletions
Large diffs are not rendered by default.

src/connections/storage/data-lakes/comparison.md

Lines changed: 0 additions & 3 deletions

@@ -21,9 +21,6 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat
 
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
-> note "Secondary deduplication is not supported during the Azure Data Lakes public beta"
-> During the Azure Data Lakes public beta, Segment's guarantee of 99% no duplicates applies, but secondary deduplication is not supported.
-
 ## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.

src/connections/storage/data-lakes/index.md

Lines changed: 24 additions & 35 deletions
@@ -5,29 +5,29 @@ redirect_from: '/connections/destinations/catalog/data-lakes/'
 
 {% include content/plan-grid.md name="data-lakes" %}
 
-> warning "Azure Data Lakes public beta data deletion policies"
-> Data deletion is not supported by the Azure Data Lakes product, as customers retain data in systems that they manage. Azure Data Lakes is not supported in EU during the public beta, so European data recency requirements do not apply.
+> warning "Segment Data Lakes (Azure) deletion policies"
+> Data deletion is not supported by Segment Data Lakes (Azure), as customers retain data in systems that they manage.
 
-A **data lake** is a centralized cloud storage location that holds structured and unstructured data.
+A **data lake** is a centralized cloud storage location that holds structured and unstructured data.
 
-Data lakes typically have four layers:
-- **Storage layer:** Holds large files and raw data.
-- **Metadata store:** Stores the schema, or the process used to organize the files in the object store.
-- **Query layer:** Allows you to run SQL queries on the object store.
+Data lakes typically have four layers:
+- **Storage layer:** Holds large files and raw data.
+- **Metadata store:** Stores the schema, or the process used to organize the files in the object store.
+- **Query layer:** Allows you to run SQL queries on the object store.
 - **Compute layer:** Allows you to write to and transform the data in the storage layer.
 
 ![A graphic showing the information flowing from the metadata into the query, compute, and metadata layers, and then into the storage layer](images/data_lakes_overview_graphic.png)
 
 Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
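To make the storage and query layers described above concrete, here is a minimal PySpark sketch of reading the Parquet output a Data Lake writes to object storage. The bucket, prefix, and column names are hypothetical placeholders, not values from this commit.

```python
# A minimal sketch, not Segment's pipeline: read the Parquet files a Data Lake
# writes to object storage and run a simple query over them with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("segment-lake-sketch").getOrCreate()

# Hypothetical bucket and prefix; the real path comes from your Data Lakes settings.
events = spark.read.parquet("s3a://my-segment-lake/data/page_viewed/")

# Column names are illustrative only.
events.select("anonymous_id", "received_at").where("received_at >= '2021-01-01'").show(5)
```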
 
 > info ""
-> Segment Data Lakes is available to Business tier customers only. Azure Data Lakes is currently in Public Beta.
+> Segment Data Lakes is available to Business tier customers only.
 
 To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
 ## How Data Lakes work
 
-Segment supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offer different query engines, post-processing systems, and analytics options.
+Segment supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.
 
 ### How Segment Data Lakes works
 
@@ -39,18 +39,18 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
 
 ![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)
 
-### How Azure Data Lakes works
+### How Segment Data Lakes (Azure) works
 
 Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
 
 ![A diagram showing data flowing from Segment, through DataBricks, Parquet and Azure Data Lake Storage Gen2 into the Hive Metastore, and then into your post-processing systems](images/Azure_DL_setup.png)
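Because the passage above describes event tables registered in the Hive Metastore, a hedged sketch of querying one from a Databricks notebook may help; in Databricks notebooks a `SparkSession` named `spark` is predefined, and the database and table names below are hypothetical.

```python
# Sketch only: query an event table registered in the Hive Metastore, from a
# Databricks notebook where a SparkSession named `spark` already exists.
# `my_segment_source.tracks` is a hypothetical database.table name.
top_events = spark.sql("""
    SELECT event, COUNT(*) AS event_count
    FROM my_segment_source.tracks
    GROUP BY event
    ORDER BY event_count DESC
    LIMIT 10
""")
top_events.show()
```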
 
+## Set up Segment Data Lakes (Azure)
 
-## Set up Segment Data Lakes
-
-For more detailed information about setting up Segment and Azure Data Lakes, please see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
+For detailed Segment Data Lakes (Azure) setup instructions, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
 
 ### Set up Segment Data Lakes
+
 When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.
 
 #### EMR
@@ -60,21 +60,21 @@ Data Lakes uses an EMR cluster to run jobs that load events from all sources int
 
 #### AWS IAM role
 
 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
-- **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to Settings > General Settings > ID.
+- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role granting access to your AWS account. Set the external ID in the IAM role to the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to **Settings > General Settings > ID**.
 - **s3_bucket**: Name of the S3 bucket used by the Data Lake.
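The external ID described in the list above ends up in the IAM role's trust policy. A minimal boto3 sketch of that trust relationship follows; the Segment principal ARN, role name, and workspace ID are placeholders, not values from this commit — use the values given in the setup instructions.

```python
# Sketch of the trust relationship described above, using boto3. The principal
# ARN and names below are placeholders; take the real values from the Data
# Lakes setup instructions.
import json
import boto3

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<SEGMENT_ACCOUNT_ID>:root"},  # placeholder
            "Action": "sts:AssumeRole",
            # The external ID is your Segment workspace ID
            # (Settings > General Settings > ID).
            "Condition": {"StringEquals": {"sts:ExternalId": "<YOUR_WORKSPACE_ID>"}},
        }
    ],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="segment-data-lakes",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```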
 
-### Set up Azure Data Lakes
+### Set up Segment Data Lakes (Azure)
 
-To connect your Azure Data Lake to Segment, you must set up the following components in your Azure environment:
+To connect Segment Data Lakes (Azure), you must set up the following components in your Azure environment:
 
 - [Azure Storage Account](/docs/connections/storage/catalog/data-lakes/#step-1---create-an-alds-enabled-storage-account): An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks.
 - [Azure KeyVault Instance](/docs/connections/storage/catalog/data-lakes/#step-2---set-up-key-vault): Azure KeyVault provides a secure store for your keys, secrets, and certificates.
 - [Azure MySQL Database](/docs/connections/storage/catalog/data-lakes/#step-3---set-up-azure-mysql-database): The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
-- [Databricks Instance](/docs/connections/storage/catalog/data-lakes/#step-4---set-up-databricks): Azure Databricks is a data analytics cluster that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
+- [Databricks Instance](/docs/connections/storage/catalog/data-lakes/#step-4---set-up-databricks): Azure Databricks is a data analytics cluster that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
 - [Databricks Cluster](/docs/connections/storage/catalog/data-lakes/#step-6---configure-databricks-cluster): The Databricks cluster is a cluster of computation resources that you can use to run data science and analytics workloads.
 - [Service Principal](/docs/connections/storage/catalog/data-lakes/#step-5---set-up-a-service-principal): Service principals are identities used to access specific resources.
 
-For more information about configuring Azure Data Lakes, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/#set-up-azure-data-lakes).
+For more information about configuring Segment Data Lakes (Azure), see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes-azure).
 
 ## Data Lakes schema
 
@@ -127,9 +127,9 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat
 > info ""
 > The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
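For the manual path mentioned in the note above, a hedged boto3 sketch of creating a Glue database follows; the database name is an illustrative placeholder, not a value from this commit.

```python
# Sketch only: manually create a Glue database for Segment to write to, for the
# case where Segment is not granted permission to create it on your behalf.
import boto3

glue = boto3.client("glue")
glue.create_database(
    DatabaseInput={
        "Name": "my_segment_source",  # hypothetical database name
        "Description": "Schema inferred by Segment Data Lakes",
    }
)
```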
 
-### Azure Data Lakes schema
+### Segment Data Lakes (Azure) schema
 
-Azure Data Lakes applies a consistent schema to make raw data accessible for queries. A transformer automatically calculates the desired schema and uploads a schema JSON file for each event type to your Azure Data Lake Storage (ADLS) in the `/staging/` directory.
+Segment Data Lakes (Azure) applies a consistent schema to make raw data accessible for queries. A transformer automatically calculates the desired schema and uploads a schema JSON file for each event type to your Azure Data Lake Storage (ADLS) in the `/staging/` directory.
 
 Segment partitions the data in ADLS by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.
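The source/event-type/day/hour partitioning described above implies a directory layout roughly like the sketch below. This is illustrative only: the Hive-style `day=`/`hr=` segment names and the prefixes are assumptions, and the exact paths depend on your configuration.

```
<adls-filesystem>/
└── <source-slug>/
    └── <event-type>/
        └── day=2021-05-01/
            └── hr=10/
                └── part-00000.parquet
```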
@@ -161,10 +161,7 @@ If Data Lakes sees a bad data type, for example text in place of a number or an
 
 ### Data Lake deduplication
 
-In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last 7 days, based on the `messageId` field.
-
-> note "Secondary deduplication is not supported during the Azure Data Lakes public beta"
-> During the Azure Data Lakes public beta, Segment's guarantee of 99% no duplicates within the 24-hour look-back window applies, but secondary deduplication is not supported.
+In addition to Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24-hour look-back window, Data Lakes have another layer of deduplication to ensure clean data in your Data Lake. Segment removes duplicate events at the time your Data Lake ingests data. Data Lakes deduplicate any data synced within the last seven days, based on the `messageId` field.
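As a rough illustration of this secondary pass (not Segment's actual implementation), deduplicating on `messageId` over a seven-day window might look like the PySpark sketch below; the path and column names are hypothetical.

```python
# Sketch of the deduplication idea described above, not Segment's pipeline:
# keep one row per messageId among events synced in the last seven days.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

events = spark.read.parquet("s3a://my-segment-lake/data/")  # hypothetical path

recent = events.where(F.col("received_at") >= F.date_sub(F.current_date(), 7))
deduped = recent.dropDuplicates(["message_id"])  # column name illustrative
```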
 
 ### Using a Data Lake with a Data Warehouse
 
@@ -174,13 +171,6 @@ When you use Data Lakes, you can either use Data Lakes as your _only_ source of
 
 ## FAQ
 
-### What are some limitations of the Azure Data Lakes public beta?
-The following capabilities are not supported during the Azure Data Lakes public beta:
-- EU region support
-- Deduplication
-- Sync History and Sync Health in Segment app
-
-
 #### Can I send all of my Segment data into Data Lakes?
 Data Lakes supports data from all event sources, including website libraries, mobile, server, and event cloud sources. Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources), or the users and accounts tables from event cloud sources.
 
@@ -194,7 +184,7 @@ As the data schema evolves and new columns are added, Segment Data Lakes will de
 
 
 ### How does Data Lakes work with Protocols?
-Data Lakes doesn't have a direct integration with [Protocols](/docs/protocols/).
+Data Lakes has no direct integration with [Protocols](/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
 
@@ -231,10 +221,9 @@ Segment stops creating new tables for the events after you exceed this limit. Ho
 
 You should also read the [additional considerations in Amazon's documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
 
-### What analytics tools are available to use with my Azure Data Lake?
-Azure Data Lakes supports the following analytics tools:
+### What analytics tools are available to use with Segment Data Lakes (Azure)?
+Segment Data Lakes (Azure) supports the following analytics tools:
 - Power BI
 - Azure HDInsight
 - Azure Synapse Analytics
 - Databricks
-
