src/connections/storage/data-lakes/comparison.md (4 additions, 4 deletions)
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
 Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.
 
-## Data Freshness
+## Data freshness
 
 Data Lakes and Warehouses offer different sync frequencies:
 - Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat
 [Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
 
-## Object vs Event Data
+## Object vs event data
 
 Warehouses support both event and object data, while Data Lakes supports only event data.
@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources
 ## Schema
 
-### Data Types
+### Data types
 
 Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach, however because it receives data every hour, Data Lakes is able to look at a group of events to infer the data type.
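As a hypothetical illustration of how this difference can play out: if the first event in an hour arrives with `revenue: 42`, a Warehouse may type that column as an integer for the hour, while Data Lakes, inferring from the full hourly batch, can also see later events such as `revenue: 42.5` and infer a decimal instead. (The field name and values here are invented for illustration.)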
@@ -84,7 +84,7 @@ This approach leads to a few scenarios where the data type for an event may be d
 Variance in data types between Warehouses and Data Lakes don't happen often for booleans, strings, and timestamps, however it can occur for decimals and integers.
 
-If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best effort conversion to cast the fields to the target data type. Fields that cannot be casted may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
src/connections/storage/data-lakes/data-lakes-manual-setup.md (5 additions, 5 deletions)
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.
 The following steps provide examples of the IAM Role and IAM Policy.
 
-### IAM Role
+### IAM role
 
 Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.
 
-#### IAM Role for Data Lakes created in US workspaces:
+#### IAM role for Data Lakes created in US workspaces:
 
 Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:
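For context, a trust relationship document of this kind is a standard `sts:AssumeRole` policy that names the Segment AWS principal and pins the `ExternalId` to your workspace. The sketch below is illustrative only: the account ID in the principal ARN is a placeholder and `YOUR_WORKSPACE_ID` stands in for the Segment Workspace ID, so use the exact document from the setup instructions rather than this one.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": ["YOUR_WORKSPACE_ID"]
        }
      }
    }
  ]
}
```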
@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-#### IAM Role for Data Lakes created in EU workspaces:
+#### IAM role for Data Lakes created in EU workspaces:
 
 > info ""
 > EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
 > note ""
 > **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.
 
-### IAM Policy
+### IAM policy
 
 Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
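For context, the permissions policy attached to that role follows the usual IAM policy-document shape. The sketch below is illustrative only: the actions shown are a plausible subset, the bucket name is a placeholder, and the real policy in the setup instructions is more complete.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAccess",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable", "glue:UpdateTable"],
      "Resource": "*"
    },
    {
      "Sid": "EMRAccess",
      "Effect": "Allow",
      "Action": ["elasticmapreduce:DescribeCluster", "elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeStep"],
      "Resource": "*"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-data-lake-bucket",
        "arn:aws:s3:::your-data-lake-bucket/*"
      ]
    }
  ]
}
```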
@@ -259,7 +259,7 @@ Segment requires access to the data and schema for debugging data quality issues
 
 - An easier alternative is to create a new account that has Athena backed by Glue as the default.
 
-## Updating EMR Clusters
+## Updating EMR clusters
 
 You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.
 
 When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
src/connections/storage/data-lakes/index.md (36 additions, 20 deletions)
@@ -10,18 +10,18 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)
 > info ""
 > Segment Data Lakes is available to Business tier customers only.
 
-To learn more, check out our [blog post](https://segment.com/blog/introducing-segment-data-lakes/).
+To learn more, check out the Segment blog post, [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
 ## How Segment Data Lakes work
 
 Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or Machine Learning vendors like DataBricks or DataRobot.
 
-[image]
+[image]
 
 Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.
 
-[image]
+[image]
 
 Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync schedule or selective sync.
@@ -44,7 +44,7 @@ For detailed instructions on how to configure Segment Data Lakes, see the [Data
 Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster always running, however the cluster auto-scales to ensure it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr).
 
-### AWS IAM Role
+### AWS IAM role
 
 Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
 - **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview) when navigating to Settings > General Settings > ID.
 By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give you granular access to the S3 data. You can change the partition structure during the [set up process](/docs/connections/storage/catalog/data-lakes/), where you can choose from the following options:
 - Day/Hour [YYYY-MM-DD/HH] (Default)
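As a hypothetical illustration of the default structure: events received at 15:00 UTC on 1 May 2021 would land under an S3 prefix ending in `day=2021-05-01/hr=15/`; the bucket name and any leading prefix depend on your own configuration.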
@@ -79,7 +85,7 @@ By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give y
 
 Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, data converted into Parquet format, column names inferred from the Segment event, nested properties and traits which are now flattened, and the inferred data type.
 
-[image]
+[image]
 
 <!--
 TODO:
 add annotated glue image calling out different parts of inferred schema)
@@ -111,29 +117,33 @@ Once Data Lakes sets a data type for a column, all subsequent data will attempt
 
 **Size mismatch**
 
-If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp).
+If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and [replay](/docs/guides/what-is-replay/) to ensure no data is lost. Learn more about type casting [here](https://www.w3schools.com/java/java_type_casting.asp){:target="_blank"}.
 
 **Data mismatch**
 
-If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/) if you find a data type needs to be corrected.
+If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. [Contact Segment Support](https://segment.com/help/contact/){:target="_blank"} if you find a data type needs to be corrected.
 
 ## FAQ
+{% faq %}
 
-#### Can I send all of my Segment data into Data Lakes?
+{% faqitem Can I send all of my Segment data into Data Lakes? %}
 Data Lakes supports data from all event sources, including website libraries, mobile, server and event cloud sources.
 
-Data Lakes doesn't support loading [object cloud source data](https://segment.com/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+Data Lakes doesn't support loading [object cloud source data](/docs/connections/sources/#object-cloud-sources), as well as the users and accounts tables from event cloud sources.
+{% endfaqitem %}
 
-#### Are user deletions and suppression supported?
-Segment doesn't support User deletions in Data Lakes, but supports [user suppression](https://segment.com/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% faqitem Are user deletions and suppression supported? %}
+Segment doesn't support User deletions in Data Lakes, but supports [user suppression](/docs/privacy/user-deletion-and-suppression/#suppressed-users).
+{% endfaqitem %}
 
-#### How does Data Lakes handle schema evolution?
+{% faqitem How does Data Lakes handle schema evolution? %}
 As the data schema evolves and new columns are added, Segment Data Lakes will detect any new columns. New columns will be appended to the end of the table in the Glue Data Catalog.
+{% endfaqitem %}
 
-#### How does Data Lakes work with Protocols?
-Data Lakes doesn't have a direct integration with [Protocols](https://segment.com/docs/protocols/).
+{% faqitem How does Data Lakes work with Protocols? %}
+Data Lakes doesn't have a direct integration with [Protocols](/docs/protocols/).
 
 Any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
@@ -145,12 +155,14 @@ Data types and labels available in Protocols aren't supported by Data Lakes.
 
 - **Data Types** - Data Lakes infers the data type for each event using its own schema inference systems instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the tracking plan. For example, if you set `product_id` to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer this data type as a string in the Glue Data Catalog.
 - **Labels** - Labels set in Protocols aren't sent to Data Lakes.
+{% endfaqitem %}
 
-#### What is the cost to use AWS Glue?
-You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/). For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% faqitem What is the cost to use AWS Glue? %}
+You can find details on Amazon's [pricing for Glue page](https://aws.amazon.com/glue/pricing/){:target="_blank"}. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
+{% endfaqitem %}
 
-#### What limits does AWS Glue have?
-AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue) for more information.
+{% faqitem What limits does AWS Glue have? %}
+AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue){:target="_blank"} for more information.
 
 The most common limits to keep in mind are:
 - Databases per account: 10,000
@@ -159,4 +171,8 @@ The most common limits to keep in mind are:
 
 Segment stops creating new tables for the events after you exceed this limit. However you can contact your AWS account representative to increase these limits.
 
-You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html) when using AWS Glue Data Catalog.
+You should also read the [additional considerations](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
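To put those table and partition counts in perspective with an invented example: a source emitting 50 distinct event types would create 50 Glue tables, and each table gains 24 partitions per day (one per hour), so that source alone adds roughly 1,200 partitions per day toward the Glue limits above.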
src/connections/storage/data-lakes/sync-history.md (14 additions, 7 deletions)
@@ -32,17 +32,24 @@ Above the Daily Row Volume table is an overview of the total syncs for the curre
 To access the Sync history page from the Segment app, open the **My Destinations** page and select the data lake. On the data lakes settings page, select the **Health** tab.
 
 ## Data Lakes Reports FAQ
-##### How long is a data point available?
+{% faq %}
+
+{% faqitem How long is a data point available? %}
 The health tab shows an aggregate view of the last 30 days worth of data, while the sync history retains the last 100 syncs.
+{% endfaqitem %}
 
-##### How do sync history and health compare?
-The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% faqitem How do sync history and health compare? %}
+The sync history feature shows detailed information about the most recent 100 syncs to a data lake, while the health tab shows just the number of rows synced to the data lake over the last 30 days.
+{% endfaqitem %}
 
-##### What timezone is the time and date information in?
+{% faqitem What timezone is the time and date information in? %}
 All dates and times on the sync history and health pages are in the user's local time.
+{% endfaqitem %}
 
-##### When does the data update?
+{% faqitem When does the data update? %}
 The sync data for both reports updates in real time.
+{% endfaqitem %}
 
-##### When do syncs occur?
-Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.
+{% faqitem When do syncs occur? %}
+Syncs occur approximately every two hours. Users cannot choose how frequently the data lake syncs.