
Commit ed032e7

Merge pull request #3200 from segmentio/DOC-493-IG
Azure Data Lakes private beta docs [DOC-493]
2 parents: 73d204f + a309a40; commit: ed032e7

File tree

8 files changed: +496 -113 lines

src/connections/storage/catalog/data-lakes/index.md

Lines changed: 366 additions & 45 deletions
Large diffs are not rendered by default.

src/connections/storage/data-lakes/comparison.md

Lines changed: 6 additions & 3 deletions
@@ -12,15 +12,18 @@ Data Lakes and Warehouses are not identical, but are compatible with a configura
 ## Data freshness
 
 Data Lakes and Warehouses offer different sync frequencies:
-- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
+- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/storage/warehouses/warehouse-syncs/#warehouse-selective-sync) collections and properties within a source to Warehouses.
 - Data Lakes offers 12 syncs in a 24 hour period, and doesn't offer custom sync schedules or selective sync.
 
 ## Duplicates
 
-Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Data Lakes and Warehouses.
+Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for data within a 24 hour look-back window applies to data in Segment Data Lakes and Warehouses.
 
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
+> note "Secondary deduplication is not supported during the Azure Data Lakes public beta"
+> During the Azure Data Lakes public beta, Segment's guarantee of 99% no duplicates applies, but secondary deduplication is not supported.
+
 ## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.
@@ -103,6 +106,6 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
 
 - `event` and `event_text` - Each property within an event has its own column, however the naming convention for these columns differs between Warehouses and Data Lakes. Warehouses snake cases the original payload value and preserves the original text within the `event_text` column. Data Lakes uses the original payload value as-is for the column name, and does not need an `event_text` column.
 - `channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about its transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach to flatten the whole event.
-- (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes, so the columns won't appear there.
+- *(Redshift only)* `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes, so the columns won't appear there.
 - `sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event in order to account for clock skews and timestamps in the future. This was done when the Segment pipeline didn't do this on its own; the pipeline now accounts for this, so Data Lakes does not need to do any additional computation, and will send the value as-is when computed at ingestion.
 - `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
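To make the column-naming difference described in the hunk above concrete, here is a minimal illustrative sketch. It is not Segment's transformation code, and the property names are made up; it only shows the general idea that Warehouses snake_cases the property name while Data Lakes keeps the original payload value as-is.

```python
import re

def warehouse_column(property_name: str) -> str:
    """Rough approximation of the Warehouses convention: snake_case the
    original property name (Segment's real rules handle more edge cases)."""
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", property_name)  # split camelCase
    s = re.sub(r"[\s\-]+", "_", s)                             # spaces/dashes -> underscores
    return s.lower()

def data_lake_column(property_name: str) -> str:
    """Data Lakes uses the original payload value as-is for the column name."""
    return property_name

for prop in ["membershipLevel", "Coupon Code"]:
    print(f"{prop!r}: warehouse={warehouse_column(prop)!r}, data_lake={data_lake_column(prop)!r}")
# 'membershipLevel': warehouse='membership_level', data_lake='membershipLevel'
# 'Coupon Code': warehouse='coupon_code', data_lake='Coupon Code'
```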
(2 image files changed: 191 KB and 102 KB; previews not rendered.)

src/connections/storage/data-lakes/index.md

Lines changed: 104 additions & 42 deletions
Large diffs are not rendered by default.

src/connections/storage/data-lakes/lake-formation.md

Lines changed: 3 additions & 3 deletions
@@ -4,10 +4,10 @@ title: Lake Formation
 
 {% include content/plan-grid.md name="data-lakes" %}
 
-Lake Formation is a fully managed service built on top of the AWS Glue Data Catalog that provides one central set of tools to build and manage a Data Lake. These tools help import, catalog, transform, and deduplicate data, as well as provide strategies to optimize data storage and security.
+Lake Formation is a fully managed service built on top of the AWS Glue Data Catalog that provides one central set of tools to build and manage a Data Lake. These tools help import, catalog, transform, and deduplicate data, as well as provide strategies to optimize data storage and security. To learn more about Lake Formation features, see the [Amazon Web Services documentation](https://aws.amazon.com/lake-formation/features/){:target="_blank"}.
 
-> note "Learn more about Lake Formation features"
-> To learn more about Lake Formation features, refer to the [Amazon Web Services documentation](https://aws.amazon.com/lake-formation/features/){:target="_blank"}.
+> note "This feature is not supported in the Azure Data Lakes public beta"
+> Lake Formation is only supported for Segment Data Lakes. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
 
 The security policies in Lake Formation use two layers of permissions: each resource is protected by Lake Formation permissions (which control access to Data Catalog resources and S3 locations) and IAM permissions (which control access to Lake Formation and AWS Glue API resources). When any user or role reads or writes to a resource, that action must pass both a Lake Formation and an IAM resource check: for example, a user trying to create a new table in the Data Catalog may have Lake Formation access to the Data Catalog, but if they don't have the correct Glue API permissions, they will be unable to create the table.
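The last paragraph in the hunk above describes the two permission layers a caller must pass. A rough boto3 sketch of what configuring both layers can look like; the account ID, role name, database name, and policy contents are placeholders for illustration, not part of the Segment setup:

```python
import boto3

lf = boto3.client("lakeformation")
iam = boto3.client("iam")

# Layer 1: Lake Formation permissions on the Data Catalog resource itself.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/MyDataLakeRole"},
    Resource={"Database": {"Name": "my_data_lake_db"}},
    Permissions=["CREATE_TABLE", "DESCRIBE"],
)

# Layer 2: IAM permissions on the Glue / Lake Formation APIs. Without these,
# the same role still can't create the table even though layer 1 allows it.
iam.put_role_policy(
    RoleName="MyDataLakeRole",
    PolicyName="AllowGlueTableWrites",
    PolicyDocument="""{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:CreateTable", "glue:GetDatabase", "lakeformation:GetDataAccess"],
        "Resource": "*"
      }]
    }""",
)
```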

src/connections/storage/data-lakes/sync-history.md

Lines changed: 10 additions & 13 deletions
@@ -5,6 +5,9 @@ title: Data Lakes Sync History and Health
 
 The Segment Data Lakes sync history and health tabs generate real-time information about data syncs so you can monitor the health and performance of your data lakes. These tools provide monitoring and debugging capabilities within the Data Lakes UI, so you can identify and proactively address data sync or data pipeline failures.
 
+> note "This feature is not supported for the Azure Data Lakes public beta"
+> The Sync History/Sync Health tabs are currently not supported for the Azure Data Lakes public beta. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
+
 ## Sync History
 The 'Sync History' table shows detailed information about the latest 100 syncs to the data lake. The table includes the following fields:
 * **Sync status:** The status of the sync: either 'Success,' indicating that all rows synced correctly, 'Partial Success,' indicating that some rows synced correctly, or 'Failed,' indicating that no rows synced correctly
@@ -32,24 +35,18 @@ Above the Daily Row Volume table is an overview of the total syncs for the curre
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-{% faq %}
-{% faqitem How long is a data point available? %}
+
+### How long is a data point available?
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
-{% endfaqitem %}
 
-{% faqitem How do sync history and health compare? %}
+### How do sync history and health compare?
 The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
-{% endfaqitem %}
 
-{% faqitem What timezone is the time and date information in? %}
+### What timezone is the time and date information in?
 All dates and times on the sync history and health pages are in the user's local time.
-{% endfaqitem %}
 
-{% faqitem When does the data update? %}
+### When does the data update?
 The sync data for both reports updates in real time.
-{% endfaqitem %}
 
-{% faqitem When do syncs occur? %}
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
-{% endfaqitem %}
-{% endfaq %}
+### When do syncs occur?
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.

src/connections/storage/data-lakes/sync-reports.md

Lines changed: 7 additions & 7 deletions
@@ -6,6 +6,9 @@ title: Data Lakes Sync Reports and Errors
 
 Segment Data Lakes generates reports with operational metrics about each sync to your data lake so you can monitor sync performance. These sync reports are stored in your S3 bucket and Glue Data Catalog. This means you have access to the raw data, so you can query it to answer questions and set up alerting and monitoring tools.
 
+> note "This feature is not supported for the Azure Data Lakes public beta"
+> The Sync Report tab is currently not supported for the Azure Data Lakes public beta. For more information about Azure Data Lakes, see the [Data Lakes overview documentation](/docs/connections/storage/data-lakes/index/#how-azure-data-lakes-works).
+
 ## Sync Report schema
 
 Your sync_report table stores all of your sync data. You can query it to answer common questions about data synced to your data lake.
@@ -261,13 +264,10 @@ Internal errors occur in Segment's internal systems, and should resolve on their
 
 ## FAQ
 
-{% faq %}
-{% faqitem How are Data Lakes sync reports different from the sync data for Segment Warehouses? %}
+### How are Data Lakes sync reports different from the sync data for Segment Warehouses?
 Both Warehouses and Data Lakes provide similar information about syncs, including the start and finish time, rows synced, and errors.
 
 However, Warehouse sync information is only available in the Segment app: on the Sync History page and Warehouse Health pages. With Data Lakes sync reports, the raw sync information is sent directly to your data lake. This means you can query the raw data and answer your own questions about syncs, and use the data to power alerting and monitoring tools.
-{% endfaqitem %}
-{% faqitem What happens if a sync is partly successful? %}
-Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.
-{% endfaqitem %}
-{% endfaq %}
+
+### What happens if a sync is partly successful?
+Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.
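Because the sync report data described in this diff lands in your own S3 bucket and Glue Data Catalog, one way to inspect it is with an Athena query. Below is a minimal smoke-test sketch using boto3; the Glue database name, the `sync_report` table location, and the results bucket are assumptions to adapt to your own setup, and you can extend the query once you know the schema from the Sync Report schema section.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Assumed names: replace the Glue database and the Athena results bucket
# with the ones configured for your data lake.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM sync_report LIMIT 20",
    QueryExecutionContext={"Database": "my_segment_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```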
