src/connections/storage/data-lakes/comparison.md: 6 additions, 3 deletions
@@ -12,15 +12,18 @@ Data Lakes and Warehouses are not identical, but are compatible with a configura
 ## Data freshness
 
 Data Lakes and Warehouses offer different sync frequencies:
-- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
+- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/storage/warehouses/warehouse-syncs/#warehouse-selective-sync) collections and properties within a source to Warehouses.
 - Data Lakes offers 12 syncs in a 24 hour period, and doesn't offer custom sync schedules or selective sync.
 
 ## Duplicates
 
-Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Data Lakes and Warehouses.
+Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Segment Data Lakes and Warehouses.
 
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
+> note "Secondary deduplication is not supported during the Azure Data Lakes public beta"
+> During the Azure Data Lakes public beta, Segment's guarantee of 99% no duplicates applies, but secondary deduplication is not supported.
+
 ## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.
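For the Duplicates section in the hunk above: the 99% guarantee and secondary deduplication are handled by Segment's pipeline, but you can also check residual duplicate volume yourself when querying the lake. A minimal sketch using Athena through boto3, assuming a hypothetical Glue database `segment_data_lake`, a table named `pages` that exposes the event's `message_id`, and a results bucket `s3://my-bucket/athena-results/` (all illustrative names); this is a user-side check, not the secondary deduplication system described in the docs:

```python
import boto3

# Count events that share a message_id, i.e. residual duplicates.
# Database, table, column, and bucket names are hypothetical examples.
athena = boto3.client("athena", region_name="us-west-2")

duplicate_check = """
SELECT message_id,
       COUNT(*) AS copies
FROM pages
GROUP BY message_id
HAVING COUNT(*) > 1
"""

response = athena.start_query_execution(
    QueryString=duplicate_check,
    QueryExecutionContext={"Database": "segment_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```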
@@ -103,6 +106,6 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
 
 - `event` and `event_text` - Each property within an event has its own column, however the naming convention for these columns differs between Warehouses and Data Lakes. Warehouses snake case the original payload value and preserves the original text within the `event_text` column. Data Lakes use the original payload value as-is for the column name, and does not need an `event_text` column.
 - `channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about its transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach to flatten the whole event.
-- (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes so the columns won't appear there.
+- *(Redshift only)* `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes so the columns won't appear there.
 - `sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event in order to account for clock skews and timestamps in the future. This was done when the Segment pipeline didn't do this on its own, however it now calculates for this so Data Lakes does not need to do any additional computation, and will send the value as-is when computed at ingestion.
 - `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
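Because the column sets differ between the two destinations, it can help to inspect what a Data Lakes table actually exposes before writing queries. A minimal sketch using the Glue Data Catalog through boto3, assuming a hypothetical database `segment_data_lake` and table `checkout_started`:

```python
import boto3

# List the columns of a Data Lakes table registered in the Glue Data Catalog,
# to see which of the fields discussed above (event, channel, metadata_*,
# sent_at, ...) are present. Database and table names are hypothetical examples.
glue = boto3.client("glue", region_name="us-west-2")

table = glue.get_table(DatabaseName="segment_data_lake", Name="checkout_started")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```

Per the comparison above, columns such as `channel`, `metadata_*`, and `sent_at` should show up for Data Lakes, while `event_text` and the Redshift-only `uuid` and `uuid_ts` are warehouse-side columns and won't appear here.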
src/connections/storage/data-lakes/lake-formation.md: 3 additions, 3 deletions
@@ -4,10 +4,10 @@ title: Lake Formation
 
 {% include content/plan-grid.md name="data-lakes" %}
 
-Lake Formation is a fully managed service built on top of the AWS Glue Data Catalog that provides one central set of tools to build and manage a Data Lake. These tools help import, catalog, transform, and deduplicate data, as well as provide strategies to optimize data storage and security.
+Lake Formation is a fully managed service built on top of the AWS Glue Data Catalog that provides one central set of tools to build and manage a Data Lake. These tools help import, catalog, transform, and deduplicate data, as well as provide strategies to optimize data storage and security. To learn more about Lake Formation features, see [Amazon Web Services documentation](https://aws.amazon.com/lake-formation/features/){:target="_blank"}.
 
-> note "Learn more about Lake Formation features"
-> To learn more about Lake Formation features, refer to the [Amazon Web Services documentation](https://aws.amazon.com/lake-formation/features/){:target="_blank"}.
+> note "This feature is not supported in the Azure Data Lakes public beta"
+> Lake Formation is only supported for Segment Data Lakes. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
 
 The security policies in Lake Formation use two layers of permissions: each resource is protected by Lake Formation permissions (which control access to Data Catalog resources and S3 locations) and IAM permissions (which control access to Lake Formation and AWS Glue API resources). When any user or role reads or writes to a resource, that action must pass both a Lake Formation and an IAM resource check: for example, a user trying to create a new table in the Data Catalog may have Lake Formation access to the Data Catalog, but if they don't have the correct Glue API permissions, they will be unable to create the table.
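The two-layer check described above means a grant has to line up on both sides: Lake Formation permissions on the catalog resource, plus IAM permissions for the underlying Glue API calls. A minimal sketch of the Lake Formation half using boto3, assuming a hypothetical role ARN, database, and table; the same role would still need the relevant Glue IAM permissions before the combined check passes:

```python
import boto3

# Grant Lake Formation SELECT on a catalog table to an IAM role.
# Role ARN, database, and table names are hypothetical examples.
lakeformation = boto3.client("lakeformation", region_name="us-west-2")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "segment_data_lake", "Name": "checkout_started"}},
    Permissions=["SELECT"],
)
```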
src/connections/storage/data-lakes/sync-history.md: 10 additions, 13 deletions
@@ -5,6 +5,9 @@ title: Data Lakes Sync History and Health
 
 The Segment Data Lakes sync history and health tabs generate real-time information about data syncs so you can monitor the health and performance of your data lakes. These tools provide monitoring and debugging capabilities within the Data Lakes UI, so you can identify and proactively address data sync or data pipeline failures.
 
+> note "This feature is not supported for the Azure Data Lakes public beta"
+> The Sync History/Sync Health tabs are currently not supported for the Azure Data Lakes public beta. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
+
 ## Sync History
 The 'Sync History' table shows detailed information about the latest 100 syncs to the data lake. The table includes the following fields:
 * **Sync status:** The status of the sync: either 'Success,' indicating that all rows synced correctly, 'Partial Success,' indicating that some rows synced correctly, or 'Failed,' indicating that no rows synced correctly
@@ -32,24 +35,18 @@ Above the Daily Row Volume table is an overview of the total syncs for the curre
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-{% faq %}
-{% faqitem How long is a data point available? %}
+
+### How long is a data point available?
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
-{% endfaqitem %}
 
-{% faqitem How do sync history and health compare? %}
+### How do sync history and health compare?
 The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
-{% endfaqitem %}
 
-{% faqitem What timezone is the time and date information in? %}
+### What timezone is the time and date information in?
 All dates and times on the sync history and health pages are in the user's local time.
-{% endfaqitem %}
 
-{% faqitem When does the data update? %}
+### When does the data update?
 The sync data for both reports updates in real time.
-{% endfaqitem %}
 
-{% faqitem When do syncs occur? %}
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
-{% endfaqitem %}
-{% endfaq %}
+### When do syncs occur?
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
src/connections/storage/data-lakes/sync-reports.md: 7 additions, 7 deletions
@@ -6,6 +6,9 @@ title: Data Lakes Sync Reports and Errors
 
 Segment Data Lakes generates reports with operational metrics about each sync to your data lake so you can monitor sync performance. These sync reports are stored in your S3 bucket and Glue Data Catalog. This means you have access to the raw data, so you can query it to answer questions and set up alerting and monitoring tools.
 
+> note "This feature is not supported for the Azure Data Lakes public beta"
+> The Sync Report tab is currently not supported for the Azure Data Lakes public beta. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
+
 ## Sync Report schema
 
 Your sync_report table stores all of your sync data. You can query it to answer common questions about data synced to your data lake.
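Because the sync report lands in your own S3 bucket and Glue Data Catalog, a small monitoring job can query it directly, which is what makes the alerting use case above possible. A minimal sketch with Athena through boto3, assuming a hypothetical database `segment_data_lake`, a results bucket `s3://my-bucket/athena-results/`, and an illustrative status column; the real field names are the ones documented in the Sync Report schema:

```python
import time
import boto3

# Count failed syncs recorded in the sync_report table for a simple alert.
# Database, bucket, and column names are hypothetical examples; use the
# fields from the Sync Report schema in your own queries.
athena = boto3.client("athena", region_name="us-west-2")

start = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS failed FROM sync_report WHERE sync_status = 'failed'",
    QueryExecutionContext={"Database": "segment_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until Athena finishes executing the query.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    failed_count = rows[1]["Data"][0]["VarCharValue"]  # row 0 is the header row
    print(f"Failed syncs: {failed_count}")
```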
@@ -261,13 +264,10 @@ Internal errors occur in Segment's internal systems, and should resolve on their
 
 ## FAQ
 
-{% faq %}
-{% faqitem How are Data Lakes sync reports different from the sync data for Segment Warehouses? %}
+### How are Data Lakes sync reports different from the sync data for Segment Warehouses?
 Both Warehouses and Data Lakes provide similar information about syncs, including the start and finish time, rows synced, and errors.
 
 However, Warehouse sync information is only available in the Segment app: on the Sync History page and Warehouse Health pages. With Data Lakes sync reports, the raw sync information is sent directly to your data lake. This means you can query the raw data and answer your own questions about syncs, and use the data to power alerting and monitoring tools.
-{% endfaqitem %}
-{% faqitem What happens if a sync is partly successful? %}
-Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.
-{% endfaqitem %}
-{% endfaq %}
+
+### What happens if a sync is partly successful?
+Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.