
Commit 9106f2d

Merge pull request #3088 from segmentio/develop
Release 22.25.1
2 parents: 998e7ec + f0206f2

File tree: 11 files changed (+180 -109 lines)


scripts/add_id.js

Lines changed: 1 addition & 3 deletions
@@ -23,9 +23,7 @@ const {
 const {
   type
 } = require('os');
-const {
-  autocomplete
-} = require('@algolia/autocomplete-js');
+

 require('dotenv').config();


src/connections/spec/common.md

Lines changed: 2 additions & 2 deletions
@@ -276,10 +276,10 @@ Other libraries only collect `context.library`, any other context variables must
 | screen.height | |||
 | screen.width | |||
 | traits | |||
-| userAgent || ||
+| userAgent || ||
 | timezone | |||

-- IP Address is not collected by Segment's libraries, but instead filled in by Segmen'ts servers when it receives a message for **client side events only**.
+- IP Address isn't collected by Segment's libraries, but is instead filled in by Segment's servers when it receives a message for **client side events only**.
 - The Android library collects `screen.density` with [this method](/docs/connections/spec/common/#context-fields-automatically-collected).

 ## Integrations
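(Not part of this diff.) The fields in the table above live under the `context` object of a Segment message, and libraries that don't collect them automatically need them passed in explicitly. A minimal, hypothetical server-side sketch using `analytics-node` — the write key and all values are placeholders:

```js
// Hypothetical sketch: manually supplying context fields from a server-side
// library, since (per the table above) they aren't collected automatically.
const Analytics = require('analytics-node');
const analytics = new Analytics('YOUR_WRITE_KEY'); // placeholder write key

analytics.track({
  userId: 'user_123',
  event: 'Order Completed',
  context: {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    timezone: 'Europe/Amsterdam',
    // Segment's servers only fill in the IP for client-side events,
    // so server-side events must supply it themselves if it's wanted.
    ip: '203.0.113.42'
  }
});
```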

src/connections/storage/data-lakes/comparison.md

Lines changed: 7 additions & 7 deletions
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
 Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.


-## Data Freshness
+## Data freshness

 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat

 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.

-## Object vs Event Data
+## Object vs event data

 Warehouses support both event and object data, while Data Lakes supports only event data.

@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources

 ## Schema

-### Data Types
+### Data types

 Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach, however because it receives data every hour, Data Lakes is able to look at a group of events to infer the data type.

@@ -84,15 +84,15 @@ This approach leads to a few scenarios where the data type for an event may be d

 Variance in data types between Warehouses and Data Lakes don't happen often for booleans, strings, and timestamps, however it can occur for decimals and integers.

-If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.


 ### Tables

 Tables between Warehouses and Data Lakes will be the same, except for in these two cases:

-- `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [here](/docs/connections/storage/warehouses/schema/).
-- `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [here](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
+- `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/).
+- `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
 - `accounts` - Group calls generate the `accounts` table in Warehouses. However because Data Lakes does not support object data (Groups are objects not events), there is no `accounts` table in Data Lakes.
 - *(Redshift only)* **Table names which begin with numbers** - Table names are not allowed to begin with numbers in the Redshift Warehouse, so they are automatically given an underscore ( _ ) prefix. Glue Data Catalog does not have this restriction, so Data Lakes don't assign this prefix. For example, in Redshift a table name may be named `_101_account_update`, however in Data Lakes it would be named `101_account_update`. While this nuance is specific to Redshift, other warehouses may show similar behavior for other reserved words.

@@ -105,4 +105,4 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
 - `channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about it's transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach to flatten the whole event.
 - (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; Other warehouses may have similar columns. These aren't relevant for Data Lakes so the columns won't appear there.
 - `sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event in order to account for clock skews and timestamps in the future. This was done when the Segment pipeline didn't do this on it's own, however it now calculates for this so Data Lakes does not need to do any additional computation, and will send the value as-is when computed at ingestion.
-- `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [here](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
+- `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
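As an illustration of the `integrations` bullet above (not part of this diff): an event carrying an `integrations` object is dropped by Warehouses but flattened into columns by Data Lakes. The flattened column names in the comments below are indicative only; the exact names follow Data Lakes' flattening rules.

```js
// Illustrative only: an event carrying an integrations object.
const message = {
  event: 'Order Completed',
  integrations: {
    All: false,
    Amplitude: true
  }
};

// In a Data Lakes table this surfaces as flattened columns, roughly:
//   integrations_all       -> false
//   integrations_amplitude -> true
// A Warehouse table for the same event omits the integrations object entirely.
console.log(message.integrations);
```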

src/connections/storage/data-lakes/data-lakes-manual-setup.md

Lines changed: 7 additions & 8 deletions
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.

 The following steps provide examples of the IAM Role and IAM Policy.

-### IAM Role
+### IAM role

 Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.

-#### IAM Role for Data Lakes created in US workspaces:
+#### IAM role for Data Lakes created in US workspaces:

 Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:

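The trust relationship document itself is collapsed out of this diff view. As a rough, non-authoritative sketch of the general shape such a document takes, expressed here as a plain object — the principal ARN and workspace ID are placeholders, and the authoritative policy is the one in the Data Lakes setup docs:

```js
// Sketch only: general shape of an IAM trust relationship document with an
// ExternalId condition. All identifiers below are placeholders; use the
// policy from the Data Lakes setup documentation, not this object.
const trustRelationship = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Principal: { AWS: 'arn:aws:iam::123456789012:root' }, // placeholder principal
      Action: 'sts:AssumeRole',
      Condition: {
        StringEquals: {
          'sts:ExternalId': ['YOUR_SEGMENT_WORKSPACE_ID'] // your workspace ID(s)
        }
      }
    }
  ]
};

module.exports = trustRelationship;
```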
@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

-#### IAM Role for Data Lakes created in EU workspaces:
+#### IAM role for Data Lakes created in EU workspaces:

 > info ""
 > EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

-### IAM Policy
+### IAM policy

 Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.

@@ -255,11 +255,10 @@ Add a policy to the role created above to give Segment access to the relevant Gl
 Segment requires access to the data and schema for debugging data quality issues. The modes available for debugging are:
 - Access the individual objects stored in S3 and the associated schema to understand data discrepancies
 - Run an Athena query on the underlying data stored in S3
-- Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
-![Debugging](images/dl_setup_glueerror.png)
+- Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade. The warning reads: <br/> **Upgrade to the AWS Glue Data Catalog** <br/> To use the AWS Glue Data Catalog with Amazon Athena and Amazon Redshift Spectrum, you must upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum. Start the upgrade in the [Athena console](https://console.aws.amazon.com/athena/){:target="_blank"}.
 - An easier alternative is to create a new account that has Athena backed by Glue as the default.

-## Updating EMR Clusters
+## Updating EMR clusters
 You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.

 When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
@@ -273,7 +272,7 @@ When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Forma

 ## Procedure
 1. Open your Segment app workspace and select the Data Lakes destination.
-2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
+2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html){:target="_blank"}. You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
 3. Click **Save**.
 4. In the AWS EMR console, view the Events tab for your cluster to verify it is receiving data.


0 commit comments
