
Commit 81865b1

Merge branch 'master' into dj/cookies
2 parents 712b772 + 08ecefc commit 81865b1

21 files changed: +546 / -180 lines

src/_data/sidenav/main.yml

Lines changed: 5 additions & 5 deletions
```diff
@@ -169,25 +169,25 @@ sections:
         title: Functions environment
       - path: /connections/functions/usage
         title: Functions usage limits
-  - section_title: Data Storage
+  - section_title: Storage Destinations
     slug: connections/storage
     section:
       - path: /connections/storage
-        title: Data Storage overview
+        title: Storage Destinations overview
       - path: /connections/storage/catalog
-        title: Data Storage catalog
+        title: Storage Destinations catalog
         menu_icon: read-more
   - section_title: Segment Data Lakes
     slug: connections/storage/data-lakes
     section:
       - path: /connections/storage/data-lakes
         title: Data Lakes overview
-      - path: /connections/storage/data-lakes/comparison
-        title: Data Lakes vs Warehouses
       - path: /connections/storage/catalog/data-lakes
         title: Set up Data Lakes
       - path: /connections/storage/data-lakes/sync-reports
         title: Sync Reports and error reporting
+      - path: /connections/storage/data-lakes/comparison
+        title: Data Lakes vs Warehouses
   - section_title: Data Warehouses
     slug: connections/storage/warehouses
     section:
```

src/connections/destinations/catalog/amplitude/index.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -427,21 +427,21 @@ it's how you tell Segment which trait should be used as the group type.
 `.group()` calls will contain the Amplitude "group value". It's how you tell
 Segment which trait should be used as the group value.
 
-For example, if you specified `industry` as the "Amplitude Group Type Trait",
+For example, if you specified `group_type` as the "Amplitude Group Type Trait",
 and `name` as the "Amplitude Group Value Trait", then the following call:
 
 ```js
 analytics.group("082108c8-f51e-485f-9d2d-b6ba57ee2c40", {
-  industry: "Placeholding",
+  group_type: "Organization",
   name: "ExampleCorp, LLC",
   employees: "20",
 
 });
 ```
 
-Would associate the current user with the group with type `"Placeholding"` and
+Would associate the current user with the group with type `"Organization"` and
 value `"ExampleCorp, LLC"`. On client-side, that's all that happens. On
-server-side and Android, the traits you pass (in this case, `industry`, `name`,
+server-side and Android, the traits you pass (in this case, `group_type`, `name`,
 `employees`, and `email`) will be provided as `group_properties` of that group.
 
 What you provide as group ID doesn't matter, but Segment requires that all
````
Lines changed: 61 additions & 64 deletions
```diff
@@ -1,10 +1,9 @@
 ---
-hidden: true
-title: Data Lakes (Beta)
+title: Set Up Segment Data Lakes
 redirect_from: '/connections/destinations/catalog/data-lakes/'
 ---
 
-Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from warehouses](/docs/connections/storage/data-lakes/comparison/) in our documentation.
+Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in our documentation.
 
 > info ""
 > Segment Data Lakes is available to Business tier customers only.
```
```diff
@@ -13,15 +12,15 @@ Segment Data Lakes provide a way to collect large quantities of data in a format
 
 Before you set up Segment Data Lakes, you need the following resources:
 
-- An authorized [AWS account](https://aws.amazon.com/account/)
-- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to send data to and store logs
+- An [AWS account](https://aws.amazon.com/account/)
+- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to receive data and store logs
 - A subnet within a VPC for the EMR cluster to run in
 
 ## Step 1 - Set Up AWS Resources
 
-You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the set up work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs, however we can only guarantee support for the template as provided. The Terraform version should be > 0.12.
+You can use the [open source Terraform module](https://github.com/segmentio/terraform-aws-data-lake) to automate much of the set up work to get Data Lakes up and running. If you’re familiar with Terraform, you can modify the module to meet your organization’s needs; however, Segment guarantees support only for the template as provided. The Data Lakes set up uses Terraform v0.11+. To support more versions of Terraform, the aws provider must use v2, which is included in our example main.tf.
 
-You can also use our [manual set up instructions](https://docs.google.com/document/d/1GlWzS5KO4QaiVZx9pwfpgF-N-Xy2e_QQcdYSX-nLMDU/view) to configure these AWS resources if you prefer.
+You can also use our [manual set up instructions](/docs/connections/storage/data-lakes/data-lakes-manual-setup) to configure these AWS resources if you prefer.
 
 The Terraform module and manual set up instructions both provide a base level of permissions to Segment (for example, the correct IAM role to allow Segment to create Glue databases on your behalf). If you want stricter permissions, or other custom configurations, you can customize these manually.
 
```
```diff
@@ -31,53 +30,61 @@ After you set up the necessary AWS resources, the next step is to set up the Dat
 
 1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview), click **Add Destination**, then search for and select **Data Lakes**.
 
-2. Click **Configure Data Lakes** and select the source to connect to the Data Lakes destination.
-   > **Warning**: You must include all source ids in the external ID list in the IAM policy, or else the source data cannot be synced to S3.
+2. Click **Configure Data Lakes** and select the source to connect to the Data Lakes destination.
+   **Warning**: You must include all source ids in the external ID list in the IAM policy, or else the source data cannot be synced to S3.
 
-4. In the Settings tab, enter and save the following connection settings:
-   - **AWS Region**: The AWS Region where your EMR cluster, S3 Bucket and Glue DB reside.
+3. In the Settings tab, enter and save the following connection settings:
+   - **AWS Region**: The AWS Region where your EMR cluster, S3 Bucket and Glue DB reside. Ex: `us-west-2`
    - **EMR Cluster ID**: The EMR Cluster ID where the Data Lakes jobs will be run.
    - **Glue Catalog ID**: The Glue Catalog ID (this must be the same as your AWS account ID).
-   - **IAM Role ARN**: The ARN of the IAM role that Segment will use to connect to Data Lakes.
-   - **S3 Bucket**: Name of the S3 bucket used by Data Lakes. The EMR cluster will store logs in this bucket.
-
+   - **IAM Role ARN**: The ARN of the IAM role that Segment will use to connect to Data Lakes. Ex: `arn:aws:iam::000000000000:role/SegmentDataLakeRole`
+   - **S3 Bucket**: Name of the S3 bucket used by Data Lakes. The EMR cluster will store logs in this bucket. Ex: `segment-data-lake`
+
    You must individually connect each source to the Data Lakes destination. However, you can copy the settings from another source by clicking **** ("more") (next to the button for “Set up Guide”).
 
-5. _(Optional)_ **Date Partition**: Optional advanced setting to change the date partition structure, with a default structure `day=<YYYY-MM-DD>/hr=<HH>`. To use the default, leave this setting unchanged. To partition the data by a different date structure, choose one of the following options:
+4. _(Optional)_ **Date Partition**: Optional advanced setting to change the date partition structure, with a default structure `day=<YYYY-MM-DD>/hr=<HH>`. To use the default, leave this setting unchanged. To partition the data by a different date structure, choose one of the following options:
    - Day/Hour [YYYY-MM-DD/HH] (Default)
    - Year/Month/Day/Hour [YYYY/MM/DD/HH]
    - Year/Month/Day [YYYY/MM/DD]
    - Day [YYYY-MM-DD]
 
-6. _(Optional)_ **Glue Database Name**: Optional advanced setting to change the name of the Glue Database which is set to the source slug by default. Each source connected to Data Lakes must have a different Glue Database name otherwise data from different sources will collide in the same database.
+5. _(Optional)_ **Glue Database Name**: Optional advanced setting to change the name of the Glue Database which is set to the source slug by default. Each source connected to Data Lakes must have a different Glue Database name, otherwise data from different sources will collide in the same database.
 
-7. Enable the Data Lakes destination by clicking the toggle near the **Set up Guide** button.
+6. Enable the Data Lakes destination by clicking the toggle near the **Set up Guide** button.
 
 Once the Data Lakes destination is enabled, the first sync will begin approximately 2 hours later.
 
 
-## (Optional) Step 3 - Replay Historical Data
+## Step 3 - Verify Data is Synced to S3 and Glue
 
-If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.
+You will see event data and [sync reports](/docs/connections/storage/data-lakes/sync-reports) populated in S3 and Glue after the first sync successfully completes. However, if an [insufficient permission](/docs/connections/storage/data-lakes/sync-reports/#insufficient-permissions) or [invalid setting](/docs/connections/storage/data-lakes/sync-reports/#invalid-settings) is provided during set up, the first data lake sync will fail.
 
-The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months to get started, and then replay additional data if you find you need more.
+To be alerted of sync failures via email, subscribe to the `Storage Destination Sync Failed` activity email notification within the App Settings > User Preferences > [Notification Settings](https://app.segment.com/goto-my-workspace/settings/notifications).
+![](images/dl_activity_notifications2.png)
 
-Segment uses a creates a separate EMR cluster to run replays, then destroys it when the replay finished. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
+`Sync Failed` emails are sent on the 1st, 5th and 20th sync failure. Learn more about the types of errors which can cause sync failures [here](/docs/connections/storage/data-lakes/sync-reports/#sync-errors).
 
-# Common Questions
 
-## Data Lakes Set Up
+## (Optional) Step 4 - Replay Historical Data
 
-##### Do I need to create Glue databases?
+If you want to add historical data to your data set using a [replay of historical data](/docs/guides/what-is-replay/) into Data Lakes, [contact the Segment Support team](https://segment.com/help/contact/) to request one.
 
-No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
+The time needed to process a Replay can vary depending on the volume of data and number of events in each source. If you decide to run a Replay, we recommend that you start with data from the last six months, and then replay additional data if you find you need more.
 
-##### What IAM role do I use in the Settings page?
+Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
 
-Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
+## FAQ
 
-##### What level of access do the AWS roles have?
+### Data Lakes Set Up
 
+{% faq %}
+{% faqitem Do I need to create Glue databases? %}
+No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
+{% endfaqitem %}
+{% faqitem What IAM role do I use in the Settings page? %}
+Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
+{% endfaqitem %}
+{% faqitem What level of access do the AWS roles have? %}
 The roles which Data Lakes assigns during set up are:
 
 - **`segment-datalake-iam-role`** - This is the role that Segment assumes to access S3, Glue and the EMR cluster. It allows Segment access to:
```
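The new "Step 3 - Verify Data is Synced to S3 and Glue" section above asks you to confirm the first sync in S3 and Glue. Below is a minimal sketch of that spot-check with the AWS SDK for JavaScript; it is not part of this commit, and the region, bucket name, Glue database name, and `segment-data/` prefix are assumptions drawn from the examples in the diff.

```js
// Hedged sketch: list Glue tables and a few S3 keys after the first Data Lakes sync.
// Region, bucket, and database names below are illustrative placeholders.
const AWS = require('aws-sdk');

const region = 'us-west-2';
const glue = new AWS.Glue({ region });
const s3 = new AWS.S3({ region });

async function verifyFirstSync() {
  // The Glue database defaults to the source slug (see the Glue Database Name setting above).
  const { TableList } = await glue.getTables({ DatabaseName: 'my-source-slug' }).promise();
  console.log('Glue tables:', TableList.map((t) => t.Name));

  // `segment-data/` is one of the directories the docs say Data Lakes manages in the bucket.
  const { Contents = [] } = await s3
    .listObjectsV2({ Bucket: 'segment-data-lake', Prefix: 'segment-data/', MaxKeys: 10 })
    .promise();
  console.log('Sample S3 keys:', Contents.map((o) => o.Key));
}

verifyFirstSync().catch(console.error);
```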
```diff
@@ -92,61 +99,49 @@ The roles which Data Lakes assigns during set up are:
   - Access only to the specific S3 bucket used for Data Lakes.
 
 - **`segment_emr_autoscaling_role`** - Restricted role that can only be assumed by EMR and EC2. This is set up based on [AWS best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-automatic-scaling.html).
-
-##### Why doesn't the Data Lakes Terraform module create an S3 bucket?
-
+{% endfaqitem %}
+{% faqitem Why doesn't the Data Lakes Terraform module create an S3 bucket? %}
 The module doesn't create a new S3 bucket so you can re-use an existing bucket for your Data Lakes.
-
-##### Does my S3 bucket need to be in the same region as the other infrastructure?
-
+{% endfaqitem %}
+{% faqitem Does my S3 bucket need to be in the same region as the other infrastructure? %}
 Yes, the S3 bucket and the EMR cluster must be in the same region.
-
-##### How do I connect a new source to Data Lakes?
-
+{% endfaqitem %}
+{% faqitem How do I connect a new source to Data Lakes? %}
 To connect a new source to Data Lakes:
 
 1. Add the `source_id` found in the Segment workspace into the list of [external ids](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids) in the IAM policy. You can either update this from the AWS console, or re-run the [Terraform](https://github.com/segmentio/terraform-aws-data-lake) job.
 2. From your Segment workspace, connect the source to the Data Lakes destination.
-
-##### Can I configure multiple sources to use the same EMR cluster?
-
-Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
-
-
-## Post-Set Up
-
-##### Why don't I see any data in S3 or Glue after enabling a source?
-
+{% endfaqitem %}
+{% faqitem Can I configure multiple sources to use the same EMR cluster? %}
+Yes, you can configure multiple sources to use the same EMR cluster. We recommend that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
+{% endfaqitem %}
+{% endfaq %}
+
+### Post-Set Up
+{% faq %}
+{% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}
 If you don't see data after enabling a source, check the following:
+- Does the IAM role have the Segment account ID and source IDs as the external IDs?
 - Is the EMR cluster running?
 - Is the correct IAM role and S3 bucket configured in the settings?
-- Does the IAM role have the Segment account ID and source IDs as the external IDs?
 
 If all of these look correct and you're still not seeing any data, please [contact the Support team](https://segment.com/help/contact/).
-
-##### What are "Segment Output" tables in S3?
-
+{% endfaqitem %}
+{% faqitem What are "Segment Output" tables in S3? %}
 The `output` tables are temporary tables Segment creates when loading data. They are deleted after each sync.
-
-##### Can I make additional directories in the S3 bucket Data Lakes is using?
-
+{% endfaqitem %}
+{% faqitem Can I make additional directories in the S3 bucket Data Lakes is using? %}
 Yes, you can create new directories in S3 without interfering with Segment data.
 Do not modify, or create additional directories with the following names:
 - `logs/`
 - `segment-stage/`
 - `segment-data/`
 - `segment-logs/`
-
-##### What does "partitioned" mean in the table name?
-
+{% endfaqitem %}
+{% faqitem What does "partitioned" mean in the table name? %}
 `Partitioned` just means that the table has partition columns (day and hour). All tables are partitioned, so you should see this on all table names.
-
-##### Why are the Filters, Event Tester and Event Delivery tabs in-app empty?
-
-Data Lakes does not currently support these features. Sync history information will be available soon.
-
-##### How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data?
-
+{% endfaqitem %}
+{% faqitem How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data? %}
 You can use the following command to create external tables in Spectrum to access tables in Glue and join the data with Redshift:
 
 Run the `CREATE EXTERNAL SCHEMA` command:
```
```diff
@@ -162,3 +157,5 @@ create external database if not exists;
 Replace:
 - [glue_db_name] = The Glue database created by Data Lakes which is named after the source slug
 - [spectrum_schema_name] = The schema name in Redshift you want to map to
+{% endfaqitem %}
+{% endfaq %}
```
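The Post-Set Up FAQ above lists "Does the IAM role have the Segment account ID and source IDs as the external IDs?" as the first check when no data appears. A hedged sketch of inspecting the role's trust policy with the AWS SDK for JavaScript follows; it is not part of this commit and assumes the default `segment-data-lake-iam-role` name from the Terraform module.

```js
// Hedged sketch: print the Data Lakes role's trust policy so you can confirm
// your source IDs appear as external IDs. The role name is the Terraform default;
// adjust it if you customized the module.
const AWS = require('aws-sdk');
const iam = new AWS.IAM();

async function checkExternalIds() {
  const { Role } = await iam.getRole({ RoleName: 'segment-data-lake-iam-role' }).promise();

  // AssumeRolePolicyDocument is returned as URL-encoded JSON.
  const trustPolicy = JSON.parse(decodeURIComponent(Role.AssumeRolePolicyDocument));
  console.log(JSON.stringify(trustPolicy, null, 2));
  // Look for your source IDs in the policy's external ID condition (typically an
  // sts:ExternalId StringEquals condition, though the exact shape depends on how
  // the role was created).
}

checkExternalIds().catch(console.error);
```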
