src/connections/spec/common.md (2 additions, 2 deletions)
@@ -276,10 +276,10 @@ Other libraries only collect `context.library`, any other context variables must
| screen.height || ✅ | ✅ |
| screen.width || ✅ | ✅ |
| traits || ✅ | ✅ |
-| userAgent | ✅ |✅| ✅ |
+| userAgent | ✅ || ✅ |
| timezone || ✅ | ✅ |

-- IP Address is not collected by Segment's libraries, but instead filled in by Segmen'ts servers when it receives a message for **client side events only**.
+- IP Address isn't collected by Segment's libraries, but is instead filled in by Segment's servers when it receives a message for **client side events only**.
- The Android library collects `screen.density` with [this method](/docs/connections/spec/common/#context-fields-automatically-collected).
src/connections/storage/data-lakes/comparison.md (7 additions, 7 deletions)
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.

-## Data Freshness
+## Data freshness

Data Lakes and Warehouses offer different sync frequencies:
- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat
[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.

-## Object vs Event Data
+## Object vs event data

Warehouses support both event and object data, while Data Lakes supports only event data.
@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources
## Schema

-### Data Types
+### Data types

Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach, however because it receives data every hour, Data Lakes is able to look at a group of events to infer the data type.
@@ -84,15 +84,15 @@ This approach leads to a few scenarios where the data type for an event may be d
Variance in data types between Warehouses and Data Lakes don't happen often for booleans, strings, and timestamps, however it can occur for decimals and integers.

-If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.

### Tables

Tables between Warehouses and Data Lakes will be the same, except for in these two cases:

--`tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [here](/docs/connections/storage/warehouses/schema/).
--`users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [here](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
+-`tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/).
+-`users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
-`accounts` - Group calls generate the `accounts` table in Warehouses. However because Data Lakes does not support object data (Groups are objects not events), there is no `accounts` table in Data Lakes.
-*(Redshift only)***Table names which begin with numbers** - Table names are not allowed to begin with numbers in the Redshift Warehouse, so they are automatically given an underscore ( _ ) prefix. Glue Data Catalog does not have this restriction, so Data Lakes don't assign this prefix. For example, in Redshift a table name may be named `_101_account_update`, however in Data Lakes it would be named `101_account_update`. While this nuance is specific to Redshift, other warehouses may show similar behavior for other reserved words.
@@ -105,4 +105,4 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
-`channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about it's transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach to flatten the whole event.
- (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; Other warehouses may have similar columns. These aren't relevant for Data Lakes so the columns won't appear there.
-`sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event in order to account for clock skews and timestamps in the future. This was done when the Segment pipeline didn't do this on it's own, however it now calculates for this so Data Lakes does not need to do any additional computation, and will send the value as-is when computed at ingestion.
--`integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [here](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
+-`integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
src/connections/storage/data-lakes/data-lakes-manual-setup.md (7 additions, 8 deletions)
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.
The following steps provide examples of the IAM Role and IAM Policy.

-### IAM Role
+### IAM role

Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.

-#### IAM Role for Data Lakes created in US workspaces:
+#### IAM role for Data Lakes created in US workspaces:

Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:
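The trust relationship document itself is not reproduced in this changeset. As a rough, generic sketch only, an AWS trust policy of this shape is what the step refers to; the account ID and external ID values below are placeholders rather than Segment's actual values, so use the exact document from the Data Lakes setup guide:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": ["YOUR_WORKSPACE_ID"]
        }
      }
    }
  ]
}
```

The `sts:ExternalId` condition is the list that the notes below ask you to replace with the Segment `WorkspaceID` that contains the sources to sync.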
@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
> note ""
> Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

-#### IAM Role for Data Lakes created in EU workspaces:
+#### IAM role for Data Lakes created in EU workspaces:

> info ""
> EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
> note ""
> **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

-### IAM Policy
+### IAM policy

Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
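The full policy document is likewise not part of this changeset. As a minimal sketch under assumed names only (the bucket name and the exact set of Glue, S3, and EMR actions below are illustrative placeholders, not the definitive list from the setup guide), such a policy combines statements along these lines:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable", "glue:UpdateTable"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-data-lake-bucket",
        "arn:aws:s3:::your-data-lake-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["elasticmapreduce:DescribeCluster", "elasticmapreduce:AddJobFlowSteps"],
      "Resource": "*"
    }
  ]
}
```

Where the Glue database names and EMR cluster ARN are known, scoping those statements to specific resources instead of `*` is the tighter choice.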
@@ -255,11 +255,10 @@ Add a policy to the role created above to give Segment access to the relevant Gl
Segment requires access to the data and schema for debugging data quality issues. The modes available for debugging are:
- Access the individual objects stored in S3 and the associated schema to understand data discrepancies
- Run an Athena query on the underlying data stored in S3
-- Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
-
+- Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade. The warning reads: <br/> **Upgrade to the AWS Glue Data Catalog** <br/> To use the AWS Glue Data Catalog with Amazon Athena and Amazon Redshift Spectrum, you must upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum. Start the upgrade in the [Athena console](https://console.aws.amazon.com/athena/){:target="_blank"}.
- An easier alternative is to create a new account that has Athena backed by Glue as the default.

-## Updating EMR Clusters
+## Updating EMR clusters

You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.

When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
@@ -273,7 +272,7 @@ When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Forma
## Procedure
1. Open your Segment app workspace and select the Data Lakes destination.
-2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
+2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html){:target="_blank"}. You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
3. Click **Save**.
4. In the AWS EMR console, view the Events tab for your cluster to verify it is receiving data.