src/connections/storage/data-lakes/comparison.md (4 additions, 4 deletions)
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
 Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.
 
-## Data Freshness
+## Data freshness
 
 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
-## Object vs Event Data
+## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.
@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources
 ## Schema
 
-### Data Types
+### Data types
 
 Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach, however because it receives data every hour, Data Lakes is able to look at a group of events to infer the data type.
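As a hypothetical illustration of how this difference can play out: if the first event in an hour arrives with `revenue: 42`, a Warehouse may type that column as an integer for the hour, while Data Lakes, inferring from the full hourly batch, can also see later events such as `revenue: 42.5` and infer a decimal instead. (The field name and values here are invented for illustration.)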
@@ -84,7 +84,7 @@ This approach leads to a few scenarios where the data type for an event may be d
 Variance in data types between Warehouses and Data Lakes don't happen often for booleans, strings, and timestamps, however it can occur for decimals and integers.
 
-If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
src/connections/storage/data-lakes/data-lakes-manual-setup.md (5 additions, 5 deletions)
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 The following steps provide examples of the IAM Role and IAM Policy.
 
-### IAM Role
+### IAM role
 
 Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.
 
-#### IAM Role for Data Lakes created in US workspaces:
+#### IAM role for Data Lakes created in US workspaces:
 
 Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:
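For context, a trust relationship document of this kind is a standard `sts:AssumeRole` policy that names the Segment AWS principal and pins the `ExternalId` to your workspace. The sketch below is illustrative only: the account ID in the principal ARN is a placeholder and `YOUR_WORKSPACE_ID` stands in for the Segment Workspace ID, so use the exact document from the setup instructions rather than this one.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": ["YOUR_WORKSPACE_ID"]
        }
      }
    }
  ]
}
```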
@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-#### IAM Role for Data Lakes created in EU workspaces:
+#### IAM role for Data Lakes created in EU workspaces:
 
 > info ""
 > EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-### IAM Policy
+### IAM policy
 
 Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
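For context, the permissions policy attached to that role follows the usual IAM policy-document shape. The sketch below is illustrative only: the actions shown are a plausible subset, the bucket name is a placeholder, and the real policy in the setup instructions is more complete.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAccess",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable", "glue:UpdateTable"],
      "Resource": "*"
    },
    {
      "Sid": "EMRAccess",
      "Effect": "Allow",
      "Action": ["elasticmapreduce:DescribeCluster", "elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeStep"],
      "Resource": "*"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-data-lake-bucket",
        "arn:aws:s3:::your-data-lake-bucket/*"
      ]
    }
  ]
}
```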
@@ -259,7 +259,7 @@ Segment requires access to the data and schema for debugging data quality issues
 
 - An easier alternative is to create a new account that has Athena backed by Glue as the default.
 
-## Updating EMR Clusters
+## Updating EMR clusters
 
 You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.
 
 When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
src/connections/storage/data-lakes/index.md (36 additions, 20 deletions)
@@ -10,18 +10,18 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)
 > info ""
 > Segment Data Lakes is available to Business tier customers only.
 
-To learn more, check out our [blog post](https://segment.com/blog/introducing-segment-data-lakes/).
+To learn more, check out the Segment blog post, [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
 ## How Segment Data Lakes work
 
 Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or Machine Learning vendors like DataBricks or DataRobot.
 
-[image]
+[image]
 
 Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.
 
-[image]
+[image]
 
 Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
@@ -44,7 +44,7 @@ For detailed instructions on how to configure Segment Data Lakes, see the [Data
 Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr).
 
-### AWS IAM Role
+### AWS IAM role
 
 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
 - **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview) when navigating to Settings > General Settings > ID.
 By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give you granular access to the S3 data. You can change the partition structure during the [set up process](/docs/connections/storage/catalog/data-lakes/), where you can choose from the following options:
 - Day/Hour [YYYY-MM-DD/HH] (Default)
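As a hypothetical illustration of the default structure: events received at 15:00 UTC on 1 May 2021 would land under an S3 prefix ending in `day=2021-05-01/hr=15/`; the bucket name and any leading prefix depend on your own configuration.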
@@ -79,7 +85,7 @@ By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give y
 
 Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.
 
-[image]
+[image]
 
 <!--
 TODO:
 add annotated glue image calling out different parts of inferred schema)
@@ -111,29 +117,33 @@ Once Data Lakes sets a data type for a column, all subsequent data will attempt
 
 **Size mismatch**
 
-If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp).
+If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp){:target="_blank"}.
 
 **Data mismatch**
 
-If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/) if you find a data type needs to be corrected.
+If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/){:target="_blank"} if you find a data type needs to be corrected.
 
 ## FAQ
+{% faq %}
 
-#### Can I send all of my Segment data into Data Lakes?
+{% faqitem Can I send all of my Segment data into Data Lakes? %}
 Data Lakes supports data from all event sources, including website libraries, mobile, server and event cloud sources.
 
-Data Lakes doesn't support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+{% endfaqitem %}
 
-#### Are user deletions and suppression supported?
-Segment doesn't support User deletions in Data Lakes, but supports [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% faqitem Are user deletions and suppression supported? %}
+Segment doesn't support User deletions in Data Lakes, but supports [user suppression](/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% endfaqitem %}
 
-#### How does Data Lakes handle schema evolution?
+{% faqitem How does Data Lakes handle schema evolution? %}
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
+{% endfaqitem %}
 
-#### How does Data Lakes work with Protocols?
-Data Lakes doesn't have a direct integration with [Protocols](https://segment.com/docs/protocols/).
+{% faqitem How does Data Lakes work with Protocols? %}
+Data Lakes doesn't have a direct integration with [Protocols](/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
@@ -145,12 +155,14 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
 
 - **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
 - **Labels** - Labels set in Protocols aren't sent to Data Lakes.
+{% endfaqitem %}
 
-#### What is the cost to use AWS Glue?
-You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/). For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% faqitem What is the cost to use AWS Glue? %}
+You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/){:target="_blank"}. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% endfaqitem %}
 
-#### What limits does AWS Glue have?
-AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue) for more information.
+{% faqitem What limits does AWS Glue have? %}
+AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue){:target="_blank"} for more information.
 
 The most common limits to keep in mind are:
 - Databases per account: 10,000
@@ -159,4 +171,8 @@ The most common limits to keep in mind are:
 
 Segment stops creating new tables for the events after you exceed this limit. However you can contact your AWS account representative to increase these limits.
 
-You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) when using AWS Glue Data Catalog.
+You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
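To put those table and partition counts in perspective with an invented example: a source emitting 50 distinct event types would create 50 Glue tables, and each table gains 24 partitions per day (one per hour), so that source alone adds roughly 1,200 partitions per day toward the Glue limits above.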
src/connections/storage/data-lakes/sync-history.md (14 additions, 7 deletions)
@@ -32,17 +32,24 @@ Above the Daily Row Volume table is an overview of the total syncs for the curre
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-##### How long is a data point available?
+{% faq %}
+
+{% faqitem How long is a data point available? %}
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
+{% endfaqitem %}
 
-##### How do sync history and health compare?
-The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% faqitem How do sync history and health compare? %}
+The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% endfaqitem %}
 
-##### What timezone is the time and date information in?
+{% faqitem What timezone is the time and date information in? %}
 All dates and times on the sync history and health pages are in the user's local time.
+{% endfaqitem %}
 
-##### When does the data update?
+{% faqitem When does the data update? %}
 The sync data for both reports updates in real time.
+{% endfaqitem %}
 
-##### When do syncs occur?
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
+{% faqitem When do syncs occur? %}
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.