src/connections/storage/data-lakes/comparison.md (3 additions, 3 deletions)

@@ -91,8 +91,8 @@ If a bad data type is seen, such as text in place of a number or an incorrectly

Tables between Warehouses and Data Lakes will be the same, except in the following cases:

-- `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [here](/docs/connections/storage/warehouses/schema/).
-- `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [here](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
+- `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/).
+- `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
- `accounts` - Group calls generate the `accounts` table in Warehouses. However, because Data Lakes does not support object data (Groups are objects, not events), there is no `accounts` table in Data Lakes.
- *(Redshift only)* **Table names which begin with numbers** - Table names are not allowed to begin with numbers in the Redshift Warehouse, so they are automatically given an underscore (`_`) prefix. Glue Data Catalog does not have this restriction, so Data Lakes doesn't assign this prefix. For example, in Redshift a table may be named `_101_account_update`, while in Data Lakes it would be named `101_account_update`. While this nuance is specific to Redshift, other warehouses may show similar behavior for other reserved words.

@@ -105,4 +105,4 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
- `channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about its transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach of flattening the whole event.
- (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes, so the columns won't appear there.
- `sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event, in order to account for clock skew and timestamps in the future. This was done when the Segment pipeline didn't handle this on its own; the pipeline now accounts for it, so Data Lakes does not need to do any additional computation and sends the value as-is, computed at ingestion.
-- `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [here](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
+- `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
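
A hedged aside on the Redshift table-name nuance in this diff: a minimal SQL sketch, assuming a hypothetical `my_source` schema/database and reusing the `101_account_update` example, of how the same event table would be addressed in each system.

```sql
-- Redshift: table names can't begin with a number, so Segment
-- writes the table with an underscore prefix.
SELECT COUNT(*) FROM my_source._101_account_update;

-- Athena over the Glue Data Catalog (Data Lakes): no prefix is added,
-- but identifiers that begin with a digit must be double-quoted in DML.
SELECT COUNT(*) FROM my_source."101_account_update";
```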

src/connections/storage/data-lakes/data-lakes-manual-setup.md (1 addition, 1 deletion)

@@ -272,7 +272,7 @@ When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Forma
## Procedure
1. Open your Segment app workspace and select the Data Lakes destination.
-2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
+2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html){:target="_blank"}. You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
3. Click **Save**.
4. In the AWS EMR console, view the Events tab for your cluster to verify it is receiving data.

src/connections/storage/data-lakes/index.md (5 additions, 5 deletions)

@@ -10,7 +10,7 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)
> info ""
> Segment Data Lakes is available to Business tier customers only.

-To learn more, check out the blog post,[Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
+To learn more, check out the blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.

## How Segment Data Lakes work
@@ -38,7 +38,7 @@ When you use Data Lakes, you can either use Data Lakes as your _only_ source of
## Set up Segment Data Lakes

-For detailed instructions on how to configure Segment Data Lakes, see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below."
+For detailed instructions on how to configure Segment Data Lakes, see the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/). Be sure to consider the EMR and AWS IAM components listed below.

### EMR
@@ -85,7 +85,7 @@ By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give y
Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.

-![Schema inferred with Glue Crawler](images/dl_gluecatalog.png)
+![Schema inferred with Glue Crawler](images/dl_gluecatalog.png)
<!--
TODO:
add annotated glue image calling out different parts of inferred schema)
@@ -158,7 +158,7 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
{% endfaqitem %}

{% faqitem What is the cost to use AWS Glue? %}
-You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/){:target="_blank"}. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+You can find details on Amazon's [pricing for Glue](https://aws.amazon.com/glue/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
{% endfaqitem %}

{% faqitem What limits does AWS Glue have? %}
@@ -171,7 +171,7 @@ The most common limits to keep in mind are:

Segment stops creating new tables for events after you exceed this limit. However, you can contact your AWS account representative to increase these limits.

-You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
+You should also read the [additional considerations in Amazon's documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
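
Since the hunk above concerns the hourly partition layout (`day=<YYYY-MM-DD>/hr=<HH>`, with one Glue partition added per hour per event table), a minimal Athena sketch of partition pruning may be useful; the `my_source.pages` table and its columns are hypothetical.

```sql
-- Restrict the scan to a single hourly partition. `day` and `hr` mirror
-- the day=<YYYY-MM-DD>/hr=<HH> S3 layout, so Athena reads one prefix
-- instead of the whole table.
SELECT anonymous_id, received_at
FROM my_source.pages
WHERE day = '2021-05-01'
  AND hr = '12';
```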

src/connections/storage/data-lakes/sync-reports.md (4 additions, 4 deletions)

@@ -13,7 +13,7 @@ The table has the following columns in its schema:
|**Sync Metric**|**Description**|
| ----------------- | ------------------- |
-|`workspace_id`| Distinct ID assigned to each Segment workspace and [found in the workspace settings](https://app.segment.com/goto-my-workspace/settings/basic). |
+|`workspace_id`| Distinct ID assigned to each Segment workspace and [found in the workspace settings](https://app.segment.com/goto-my-workspace/settings/basic){:target="_blank"}. |
|`source_id`| Distinct ID assigned to each Segment source, found in the Source Settings > API Keys > Source ID. |
|`database`| Name of the Glue Database used to store sync report tables. Segment automatically creates this database during the Data Lakes set up process. |
|`emr_cluster_id`| ID of the EMR cluster which Data Lakes uses, found in the [Data Lakes Settings page](). |
@@ -223,7 +223,7 @@ WHERE source_id='9IP56Shn6' AND status='failed' AND date(day) >= (CURRENT_DATE -
The following error types can cause your data lake syncs to fail:
- **[Insufficient permissions](#insufficient-permissions)** - Segment does not have the permissions necessary to perform a critical operation. You must grant Segment additional permissions.
- **[Invalid settings](#invalid-settings)** - The settings are invalid. This could be caused by a missing required field, or a validation check that fails. The invalid setting must be corrected before the sync can succeed.
-- **[Internal error](#internal-error)** - An error occurred in Segment's internal systems. This should resolve on its own. [Contact the Segment Support team](https://segment.com/help/contact/) if the sync failure persists.
+- **[Internal error](#internal-error)** - An error occurred in Segment's internal systems. This should resolve on its own. [Contact the Segment Support team](https://segment.com/help/contact/){:target="_blank"} if the sync failure persists.

### Insufficient permissions
@@ -253,11 +253,11 @@ If you have invalid settings, you might see one of the error messages below:
- "External ID is invalid. Please ensure the external ID in the IAM role used to connect to your Data Lake matches the source ID."
- "External ID is not set. Please ensure that the IAM role used to connect to your Data Lake has the source ID in the list of external IDs."

-The most common error occurs when you do not list all Source IDs in the External ID section of the IAM role. You can find your Source IDs in the Segment workspace, and you must add each one to the list of [External IDs](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids) in the IAM policy. You can either update the IAM policy from the AWS Console, or re-run the [Data Lakes set up Terraform job](https://github.com/segmentio/terraform-aws-data-lake).
+The most common error occurs when you do not list all Source IDs in the External ID section of the IAM role. You can find your Source IDs in the Segment workspace, and you must add each one to the list of [External IDs](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids){:target="_blank"} in the IAM policy. You can either update the IAM policy from the AWS Console, or re-run the [Data Lakes set up Terraform job](https://github.com/segmentio/terraform-aws-data-lake){:target="_blank"}.

### Internal error

-Internal errors occur in Segment's internal systems, and should resolve on their own. If sync failures persist, [contact the Segment Support team](https://segment.com/help/contact/).
+Internal errors occur in Segment's internal systems, and should resolve on their own. If sync failures persist, [contact the Segment Support team](https://segment.com/help/contact/){:target="_blank"}.
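
The hunk header above preserves a fragment of an Athena query over the sync-report table (`WHERE source_id='9IP56Shn6' AND status='failed' AND date(day) >= (CURRENT_DATE - interval '7' day)`). A fuller sketch, assuming a placeholder `segment_reports.sync_summary` database and table name, combines that filter with columns from the schema table documented in this file:

```sql
-- List failed syncs for one source over the last 7 days.
-- `segment_reports.sync_summary` is a placeholder; use the Glue database
-- Segment created for sync reports in your account.
SELECT workspace_id, source_id, emr_cluster_id, status, day
FROM segment_reports.sync_summary
WHERE source_id = '9IP56Shn6'
  AND status = 'failed'
  AND date(day) >= (CURRENT_DATE - interval '7' day);
```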