
Commit 7a5bf3a

Updating data lakes pages with consistent FAQ format, Vale updates
1 parent 783fc00 commit 7a5bf3a

5 files changed: +99 -51 lines changed


src/connections/storage/data-lakes/comparison.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
 Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.
 
 
-## Data Freshness
+## Data freshness
 
 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat
 
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
-## Object vs Event Data
+## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.
 
@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources
 
 ## Schema
 
-### Data Types
+### Data types
 
 Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach, however because it receives data every hour, Data Lakes is able to look at a group of events to infer the data type.
 
@@ -84,7 +84,7 @@ This approach leads to a few scenarios where the data type for an event may be d
 
 Variance in data types between Warehouses and Data Lakes don't happen often for booleans, strings, and timestamps, however it can occur for decimals and integers.
 
-If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
 
 
 ### Tables
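
The "best effort conversion" described in comparison.md can be sketched in a few lines. This is an illustration of the idea only, not Segment's pipeline code; the function name and type labels are invented for the example.

```python
from datetime import datetime
from typing import Any, Optional

def best_effort_cast(value: Any, target_type: str) -> Optional[Any]:
    """Try to coerce a raw field value into the column's target type.

    Returns the cast value, or None when the value cannot be cast,
    which models the "field may be dropped" case described in the doc.
    """
    try:
        if target_type == "integer":
            return int(value)
        if target_type == "decimal":
            return float(value)
        if target_type == "boolean":
            return str(value).lower() in ("true", "1")
        if target_type == "timestamp":
            return datetime.fromisoformat(str(value))
        return str(value)  # everything else is kept as a string
    except (ValueError, TypeError):
        return None

print(best_effort_cast("42", "integer"))            # 42
print(best_effort_cast("not-a-number", "integer"))  # None -> field dropped
```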

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 5 additions & 5 deletions
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 
 The following steps provide examples of the IAM Role and IAM Policy.
 
-### IAM Role
+### IAM role
 
 Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.
 
-#### IAM Role for Data Lakes created in US workspaces:
+#### IAM role for Data Lakes created in US workspaces:
 
 Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:
 
@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-#### IAM Role for Data Lakes created in EU workspaces:
+#### IAM role for Data Lakes created in EU workspaces:
 
 > info ""
 > EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-### IAM Policy
+### IAM policy
 
 Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
 
@@ -259,7 +259,7 @@ Segment requires access to the data and schema for debugging data quality issues
 ![Debugging](images/dl_setup_glueerror.png)
 - An easier alternative is to create a new account that has Athena backed by Glue as the default.
 
-## Updating EMR Clusters
+## Updating EMR clusters
 You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.
 
 When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
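
The trust relationship documents referenced in this file aren't reproduced in the diff, so the snippet below is only a rough boto3 sketch of what attaching one to `segment-data-lake-role` looks like. The Segment principal ARN and workspace ID are placeholders, not real values; take the actual document from the Data Lakes setup instructions for your region.

```python
import json
import boto3

SEGMENT_PRINCIPAL_ARN = "arn:aws:iam::SEGMENT_AWS_ACCOUNT_ID:root"  # placeholder
WORKSPACE_ID = "YOUR_SEGMENT_WORKSPACE_ID"                          # placeholder

# A minimal trust relationship: the Segment principal may assume the role,
# but only when it presents your workspace ID as the ExternalId.
trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": SEGMENT_PRINCIPAL_ARN},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": [WORKSPACE_ID]}},
        }
    ],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="segment-data-lake-role",
    AssumeRolePolicyDocument=json.dumps(trust_relationship),
    Description="Role Segment assumes to run Data Lakes jobs",
)
```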

src/connections/storage/data-lakes/index.md

Lines changed: 36 additions & 20 deletions
@@ -10,18 +10,18 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)
 > info ""
 > Segment Data Lakes is available to Business tier customers only.
 
-To learn more, check out our [blog post](https://segment.com/blog/introducing-segment-data-lakes/).
+To learn more, check out the Segment blog post, [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
 
 ## How Segment Data Lakes work
 
 Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or Machine Learning vendors like DataBricks or DataRobot.
 
-![](images/dl_overview2.png)
+![A diagram showing data flowing from Segment, through Parquet and S3, into Glue, and then into your Data Lake](images/dl_overview2.png)
 
 Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.
 
-![](images/dl_vpc.png)
+![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)
 
 Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
 
@@ -44,7 +44,7 @@ For detailed instructions on how to configure Segment Data Lakes, see the [Data
 
 Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr).
 
-### AWS IAM Role
+### AWS IAM role
 
 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
 - **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview)] when navigating to the Settings > General Settings > ID.
@@ -67,7 +67,13 @@ The file path looks like:
 `s3://<top-level-Segment-bucket>/data/<source-id>/segment_type=<event type>/day=<YYYY-MM-DD>/hr=<HH>`
 
 Here are a few examples of what events look like:
-![](images/dl_s3bucket.png)
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=11/`
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=12/`
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=13/`
+
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=11/`
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=12/`
+`s3:YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=13/`
 
 By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give you granular access to the S3 data. You can change the partition structure during the [set up process](/docs/connections/storage/catalog/data-lakes/), where you can choose from the following options:
 - Day/Hour [YYYY-MM-DD/HH] (Default)
@@ -79,7 +85,7 @@ By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give y
 
 Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.
 
-![](images/dl_gluecatalog.png)
+![A screenshot of the AWS ios_prod_identify table, containing the schema for the table, information about the table, and the table version](images/dl_gluecatalog.png)
 <!--
 TODO:
 add annotated glue image calling out different parts of inferred schema)
@@ -111,29 +117,33 @@ Once Data Lakes sets a data type for a column, all subsequent data will attempt
 
 **Size mismatch**
 
-If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp).
+If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp){:target="_blank"}.
 
 **Data mismatch**
 
-If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/) if you find a data type needs to be corrected.
+If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/){:target="_blank"} if you find a data type needs to be corrected.
 
 
 
 
 ## FAQ
+{% faq %}
 
-#### Can I send all of my Segment data into Data Lakes?
+{% faqitem Can I send all of my Segment data into Data Lakes? %}
 Data Lakes supports data from all event sources, including website libraries, mobile, server and event cloud sources.
 
-Data Lakes doesn't support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+{% endfaqitem %}
 
-#### Are user deletions and suppression supported?
-Segment doesn't support User deletions in Data Lakes, but supports [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% faqitem Are user deletions and suppression supported? %}
+Segment doesn't support User deletions in Data Lakes, but supports [user suppression](/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% endfaqitem %}
 
-#### How does Data Lakes handle schema evolution?
+{% faqitem How does Data Lakes handle schema evolution? %}
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
+{% endfaqitem %}
 
-#### How does Data Lakes work with Protocols?
-Data Lakes doesn't have a direct integration with [Protocols](https://segment.com/docs/protocols/).
+{% faqitem How does Data Lakes work with Protocols? %}
+Data Lakes doesn't have a direct integration with [Protocols](/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
 
@@ -145,12 +155,14 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
 
 - **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
 - **Labels** - Labels set in Protocols aren't sent to Data Lakes.
+{% endfaqitem %}
 
-#### What is the cost to use AWS Glue?
-You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/). For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% faqitem What is the cost to use AWS Glue? %}
+You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/){:target="_blank"}. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% endfaqitem %}
 
-#### What limits does AWS Glue have?
-AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue) for more information.
+{% faqitem What limits does AWS Glue have? %}
+AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue){:target="_blank"} for more information.
 
 The most common limits to keep in mind are:
 - Databases per account: 10,000
@@ -159,4 +171,8 @@ The most common limits to keep in mind are:
 
 Segment stops creating new tables for the events after you exceed this limit. However you can contact your AWS account representative to increase these limits.
 
-You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) when using AWS Glue Data Catalog.
+You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
+
+{% endfaqitem %}
+
+{% endfaq %}
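
The two structures the index.md changes describe, hourly S3 partitions and the inferred schema registered in the Glue Data Catalog, can be inspected with a short boto3 sketch. The bucket, source ID, database, and table names below are placeholders used only for illustration.

```python
import boto3

BUCKET = "YOUR_BUCKET"                 # placeholder
SOURCE_ID = "SOURCE_ID"                # placeholder
GLUE_DATABASE = "YOUR_GLUE_DATABASE"   # the database Data Lakes writes to
EVENT_TABLE = "identify"               # one table per event type

# 1. List the Parquet objects for one hourly partition of one event type.
s3 = boto3.client("s3")
prefix = f"segment-data/data/{SOURCE_ID}/segment_type=identify/day=2020-05-11/hr=11/"
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
    print(obj["Key"])

# 2. Read back the column names and types Data Lakes inferred for the table.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName=GLUE_DATABASE, Name=EVENT_TABLE)
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```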

src/connections/storage/data-lakes/sync-history.md

Lines changed: 14 additions & 7 deletions
@@ -32,17 +32,24 @@ Above the Daily Row Volume table is an overview of the total syncs for the curre
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-##### How long is a data point available?
+{% faq %}
+{% faqitem How long is a data point available? %}
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
+{% endfaqitem %}
 
-##### How do sync history and health compare?
-The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% faqitem How do sync history and health compare? %}
+The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% endfaqitem %}
 
-##### What timezone is the time and date information in?
+{% faqitem What timezone is the time and date information in? %}
 All dates and times on the sync history and health pages are in the user's local time.
+{% endfaqitem %}
 
-##### When does the data update?
+{% faqitem When does the data update? %}
 The sync data for both reports updates in real time.
+{% endfaqitem %}
 
-##### When do syncs occur?
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
+{% faqitem When do syncs occur? %}
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
+{% endfaqitem %}
+{% endfaq %}
