src/connections/storage/catalog/data-lakes/index.md (19 additions, 39 deletions)

@@ -384,16 +384,14 @@ Running the `plan` command gives you an output that creates 19 new objects, …
 
 ### Segment Data Lakes
 
-{% faq %}
-{% faqitem Do I need to create Glue databases? %}
+
+#### Do I need to create Glue databases?
 No, Data Lakes automatically creates one Glue database per source. This database uses the source slug as its name.
-{% endfaqitem %}
 
-{% faqitem What IAM role do I use in the Settings page? %}
+#### What IAM role do I use in the Settings page?
 Four roles are created when you set up Data Lakes using Terraform. You add the `arn:aws:iam::$ACCOUNT_ID:role/segment-data-lake-iam-role` role to the Data Lakes Settings page in the Segment web app.
-{% endfaqitem %}
 
-{% faqitem What level of access do the AWS roles have? %}
+#### What level of access do the AWS roles have?
 The roles which Data Lakes assigns during set up are:
 
 - **`segment-datalake-iam-role`** - This is the role that Segment assumes to access S3, Glue and the EMR cluster. It allows Segment access to:
@@ -408,54 +406,46 @@ The roles which Data Lakes assigns during set up are:
 - Access only to the specific S3 bucket used for Data Lakes.
 
 - **`segment_emr_autoscaling_role`** - Restricted role that can only be assumed by EMR and EC2. This is set up based on [AWS best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-automatic-scaling.html).
-{% endfaqitem %}
 
-{% faqitem Why doesn't the Data Lakes Terraform module create an S3 bucket? %}
+
+#### Why doesn't the Data Lakes Terraform module create an S3 bucket?
 The module doesn't create a new S3 bucket so you can re-use an existing bucket for your Data Lakes.
-{% endfaqitem %}
 
-{% faqitem Does my S3 bucket need to be in the same region as the other infrastructure? %}
+#### Does my S3 bucket need to be in the same region as the other infrastructure?
 Yes, the S3 bucket and the EMR cluster must be in the same region.
-{% endfaqitem %}
 
-{% faqitem How do I connect a new source to Data Lakes? %}
+#### How do I connect a new source to Data Lakes?
 To connect a new source to Data Lakes:
 
 1. Ensure that the `workspace_id` of the Segment workspace is in the list of [external ids](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/iam#external_ids) in the IAM policy. You can either update this from the AWS console, or re-run the [Terraform](https://github.com/segmentio/terraform-aws-data-lake) job.
 2. From your Segment workspace, connect the source to the Data Lakes destination.
-{% endfaqitem %}
 
-{% faqitem Can I configure multiple sources to use the same EMR cluster? %}
+#### Can I configure multiple sources to use the same EMR cluster?
 Yes, you can configure multiple sources to use the same EMR cluster. Segment recommends that the EMR cluster only be used for Data Lakes to ensure there aren't interruptions from non-Data Lakes jobs.
-{% endfaqitem %}
 
-{% faqitem Why don't I see any data in S3 or Glue after enabling a source? %}
+#### Why don't I see any data in S3 or Glue after enabling a source?
 If you don't see data after enabling a source, check the following:
 - Does the IAM role have the Segment account ID and workspace ID as the external ID?
 - Is the EMR cluster running?
 - Is the correct IAM role and S3 bucket configured in the settings?
 
 If all of these look correct and you're still not seeing any data, please [contact the Support team](https://segment.com/help/contact/).
-{% endfaqitem %}
 
-{% faqitem What are "Segment Output" tables in S3? %}
+#### What are "Segment Output" tables in S3?
 The `output` tables are temporary tables Segment creates when loading data. They are deleted after each sync.
-{% endfaqitem %}
 
-{% faqitem Can I make additional directories in the S3 bucket Data Lakes is using? %}
+#### Can I make additional directories in the S3 bucket Data Lakes is using?
 Yes, you can create new directories in S3 without interfering with Segment data.
 Do not modify or create additional directories with the following names:
 - `logs/`
 - `segment-stage/`
 - `segment-data/`
 - `segment-logs/`
-{% endfaqitem %}
 
-{% faqitem What does "partitioned" mean in the table name? %}
+#### What does "partitioned" mean in the table name?
 `Partitioned` just means that the table has partition columns (day and hour). All tables are partitioned, so you should see this on all table names.
-{% endfaqitem %}
 
-{% faqitem How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data? %}
+#### How can I use AWS Spectrum to access Data Lakes tables in Glue, and join it with Redshift data?
 You can use the following command to create external tables in Spectrum to access tables in Glue and join the data with Redshift:
 
 Run the `CREATE EXTERNAL SCHEMA` command:
@@ -471,35 +461,25 @@ create external database if not exists;
 Replace:
 - [glue_db_name] = The Glue database created by Data Lakes, which is named after the source slug
 - [spectrum_schema_name] = The schema name in Redshift you want to map to
-{% endfaqitem %}
-{% endfaq %}
 
 ### Azure Data Lakes
 
-{% faq %}
-
-{% faqitem Does my ALDS-enabled storage account need to be in the same region as the other infrastructure? %}
+#### Does my ADLS-enabled storage account need to be in the same region as the other infrastructure?
 Yes, your storage account and Databricks instance should be in the same region.
-{% endfaqitem %}
 
-{% faqitem What analytics tools are available to use with my Azure Data Lake? %}
+#### What analytics tools are available to use with my Azure Data Lake?
 Azure Data Lakes supports the following post-processing tools:
 - PowerBI
 - Azure HDInsight
 - Azure Synapse Analytics
 - Databricks
-{% endfaqitem %}
 
-{% faqitem What can I do to troubleshoot my Databricks database? %}
+#### What can I do to troubleshoot my Databricks database?
 If you encounter errors related to your Databricks database, try adding the following line to the config: <br/>
 <br/>After you've added the line to your config, restart your cluster so that your changes can take effect. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
-{% endfaqitem %}
-
-{% faqitem What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database? %}
-Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
-{% endfaqitem %}
 
-{% endfaq %}
+#### What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database?
+Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
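
Editor's note: the body of the `CREATE EXTERNAL SCHEMA` command is collapsed between the hunks above. For illustration only, a minimal sketch of the Spectrum command follows; the Glue database name (`analytics_ios`, the source slug), the Redshift schema name (`segment_data_lake`), and the IAM role ARN are hypothetical placeholders, not values taken from this changeset.

```sql
-- Sketch only: substitute your own Glue database name (the source slug),
-- Redshift schema name, and IAM role ARN.
CREATE EXTERNAL SCHEMA segment_data_lake            -- [spectrum_schema_name]
FROM DATA CATALOG
DATABASE 'analytics_ios'                            -- [glue_db_name], the source slug
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

Once the external schema exists, Glue tables can be queried as `segment_data_lake.<table>` and joined with local Redshift tables.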

src/connections/storage/data-lakes/index.md (18 additions, 24 deletions)

@@ -22,7 +22,7 @@ Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or …
 
 To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
-## How Segment Data Lakes work
+## How Data Lakes work
 
 Segment currently supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offers different query engines, post-processing systems, and analytics options.
 
@@ -170,28 +170,27 @@ The Data Lakes and Warehouses products are compatible using a mapping, but do …
 When you use Data Lakes, you can either use Data Lakes as your _only_ source of data and query all of your data directly from S3 or ADLS, or you can use Data Lakes in addition to a data warehouse.
 
 ## FAQ
-{% faq %}
 
-{% faqitem What AWS Data Lake features are not supported in the Azure Data Lakes public beta? %}
+### What AWS Data Lake features are not supported in the Azure Data Lakes public beta?
 The following capabilities are supported by Segment Data Lakes but not by the Azure Data Lakes public beta:
 - EU region support
 - Deduplication
 - Sync History and Sync Health in the Segment app
-{% endfaqitem %}
 
-{% faqitem Can I send all of my Segment data into Data Lakes? %}
+
+### Can I send all of my Segment data into Data Lakes?
 Data Lakes supports data from all event sources, including website libraries, mobile, server, and event cloud sources. Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources) or the users and accounts tables from event cloud sources.
-{% endfaqitem %}
 
-{% faqitem Are user deletions and suppression supported? %}
+
+### Are user deletions and suppression supported?
 Segment doesn't support user deletions in Data Lakes, but supports [user suppression](/docs/privacy/user-deletion-and-suppression/#suppressed-users).
-{% endfaqitem %}
 
-{% faqitem How does Data Lakes handle schema evolution? %}
+
+### How does Data Lakes handle schema evolution?
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
-{% endfaqitem %}
 
-{% faqitem How does Data Lakes work with Protocols? %}
+
+### How does Data Lakes work with Protocols?
 Data Lakes doesn't have a direct integration with [Protocols](/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
@@ -204,21 +203,20 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
 
 - **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
 - **Labels** - Labels set in Protocols aren't sent to Data Lakes.
-{% endfaqitem %}
 
-{% faqitem How frequently does my Data Lake sync? %}
+
+### How frequently does my Data Lake sync?
 Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
-{% endfaqitem %}
 
-{% faqitem What is the cost to use AWS Glue? %}
+
+### What is the cost to use AWS Glue?
 You can find details on Amazon's [pricing for Glue](https://aws.amazon.com/glue/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
-{% endfaqitem %}
 
-{% faqitem What is the cost to use Microsoft Azure? %}
+### What is the cost to use Microsoft Azure?
 You can find details on Microsoft's [pricing for Azure](https://azure.microsoft.com/en-us/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
-{% endfaqitem %}
 
-{% faqitem What limits does AWS Glue have? %}
+
+### What limits does AWS Glue have?
 AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue){:target="_blank"} for more information.
 
 The most common limits to keep in mind are:
@@ -230,14 +228,10 @@ Segment stops creating new tables for the events after you exceed this limit. …
 
 You should also read the [additional considerations in Amazon's documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
 
-{% endfaqitem %}
-
-{% faqitem What analytics tools are available to use with my Azure Data Lake? %}
+### What analytics tools are available to use with my Azure Data Lake?
 Azure Data Lakes supports the following analytics tools:

src/connections/storage/data-lakes/sync-history.md (7 additions, 13 deletions)

@@ -35,24 +35,18 @@ Above the Daily Row Volume table is an overview of the total syncs for the …
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-{% faq %}
-{% faqitem How long is a data point available? %}
+
+### How long is a data point available?
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
-{% endfaqitem %}
 
-{% faqitem How do sync history and health compare? %}
+### How do sync history and health compare?
 The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
-{% endfaqitem %}
 
-{% faqitem What timezone is the time and date information in? %}
+### What timezone is the time and date information in?
 All dates and times on the sync history and health pages are in the user's local time.
-{% endfaqitem %}
 
-{% faqitem When does the data update? %}
+### When does the data update?
 The sync data for both reports updates in real time.
-{% endfaqitem %}
 
-{% faqitem When do syncs occur? %}
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
-{% endfaqitem %}
-{% endfaq %}
+### When do syncs occur?
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.

src/connections/storage/data-lakes/sync-reports.md (4 additions, 7 deletions)

@@ -264,13 +264,10 @@ Internal errors occur in Segment's internal systems, and should resolve on their …
 
 ## FAQ
 
-{% faq %}
-{% faqitem How are Data Lakes sync reports different from the sync data for Segment Warehouses? %}
+### How are Data Lakes sync reports different from the sync data for Segment Warehouses?
 Both Warehouses and Data Lakes provide similar information about syncs, including the start and finish time, rows synced, and errors.
 
 However, Warehouse sync information is only available in the Segment app: on the Sync History and Warehouse Health pages. With Data Lakes sync reports, the raw sync information is sent directly to your data lake. This means you can query the raw data and answer your own questions about syncs, and use the data to power alerting and monitoring tools.
-{% endfaqitem %}
-
-{% faqitem What happens if a sync is partly successful? %}
-Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.
-{% endfaqitem %}
-{% endfaq %}
+
+### What happens if a sync is partly successful?
+Sync reports are currently generated only when a sync completes, or when it fails. Partial failure reporting is not currently supported.
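
Editor's note: because sync reports land in your own data lake, you can query them directly. As a rough sketch only, an Athena-style query over the last week of syncs might look like the following; the table and column names are hypothetical placeholders, not the actual schema Segment writes.

```sql
-- Sketch only: table and column names are hypothetical placeholders for
-- wherever the sync report data lands in your Glue catalog.
SELECT source_id,
       sync_start_time,
       sync_end_time,
       rows_synced,
       error_code
FROM data_lake_sync_reports
WHERE sync_end_time > date_add('day', -7, current_timestamp)
ORDER BY sync_end_time DESC;
```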