src/connections/storage/catalog/data-lakes/index.md (47 additions, 21 deletions)
@@ -79,12 +79,12 @@ The time needed to process a Replay can vary depending on the volume of data and
Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.

- ## Set up [Azure Data Lakes]
+ ## Set up Azure Data Lakes

- > info "[Azure Data Lakes] is currently in Public Beta"
- > [Azure Data Lakes] is available in Public Beta.
+ > info ""
+ > Azure Data Lakes is available in Public Beta.

- To set up [Azure Data Lakes], create your [Azure resources](/docs/src/connections/storage/data-lakes/#set-up-[azure-data-lakes]) and then enable the Data Lakes destination in the Segment app.
+ To set up Azure Data Lakes, create your [Azure resources](/docs/connections/storage/data-lakes/#set-up-azure-data-lakes) and then enable the Data Lakes destination in the Segment app.

### Prerequisites
@@ -141,7 +141,7 @@ Before you can configure your Azure resources, you must first [create an Azure s
6. Click **Review + create**.
7. Review your chosen settings. When you are satisfied with your selections, click **Create**.
8. After your resource is deployed, click **Go to resource**.
- 9. From the resouce page, select the **Connection security** tab.
+ 9. From the resource page, select the **Connection security** tab.
10. Under the Firewall rules section, select **Yes** to allow access to Azure services, and click the **Allow current client IP address (xx.xxx.xxx.xx)** button to allow access from your current IP address.
11. Click **Save** to save the changes you made on the **Connection security** page, and select the **Server parameters** tab.
12. Update the `lower_case_table_names` value to 2, and click **Save**.
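For reference, step 12 can also be scripted. A minimal Azure CLI sketch, assuming the single-server `az mysql` command group and placeholder resource names:

```sh
# Set the lower_case_table_names server parameter to 2 on an
# Azure Database for MySQL server. <resource-group> and <server-name>
# are placeholders for your own values.
az mysql server configuration set \
  --resource-group <resource-group> \
  --server-name <server-name> \
  --name lower_case_table_names \
  --value 2
```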
@@ -163,16 +163,42 @@ Before you can configure your Azure resources, you must first [create an Azure s
### Step 4 - Set up Databricks

> note "Databricks pricing tier"
- > If you create a Databricks instance only for [Azure Data Lakes] to use, only the standard pricing tier is required. However, if you use your Databricks instance for other applications, you may require premium pricing.
+ > If you create a Databricks instance only for Azure Data Lakes to use, only the standard pricing tier is required. However, if you use your Databricks instance for other applications, you may require premium pricing.
+
+ 1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank"}, select **Create a resource**.
+ 2. Search for and select **Azure Databricks**.
+ 3. On the Azure Databricks resource page, select the **Azure Databricks** plan and click **Create**.
+ 4. On the **Basics** tab, select an existing subscription and resource group, enter a name for your workspace, select the region in which you'd like to host your Databricks instance, and select a pricing tier. If you use the Databricks instance only for Azure Data Lakes, the Standard pricing tier is appropriate. If you plan to use your Databricks instance for more than Azure Data Lakes, you may require the Premium pricing tier.
+ 5. Click **Review + create**.
+ 6. Review your chosen settings. When you are satisfied with your selections, click **Create**.
+ 7. After your resource is deployed, click **Go to resource**.
+ 8. On the Azure Databricks Service overview page, click **Launch Workspace**.
+ 9. On the Databricks page, select **Create a cluster**.
+ 10. On the Compute page, select **Create Cluster**.
+ 11. Enter a name for your cluster and select the `Standard_DS4_v2` worker type. Set the minimum number of workers to 2 and the maximum to 8. __Segment recommends deselecting the "Terminate after X minutes" setting, as the time it takes to restart a cluster may delay your data lake syncs.__
+ 12. Click **Create Cluster**.
+ 13. Open [your Azure portal](https://portal.azure.com/#home){:target="_blank"} and select the Key Vault you created in a previous step.
+ 14. On the Key Vault page, select the **JSON View** link to view the Resource ID and Vault URI. Take note of these values, as you'll need them in the next step to configure your Databricks instance.
+ 15. Open `https://<databricks-instance>#secrets/createScope` and enter the following information to connect your Databricks instance to the Key Vault you created in an earlier step:
+    - **Scope Name**: Set this value to `segment`.
+    - **Manage Principal**: Select **All Users**.
+    - **DNS Name**: Set this value to the Vault URI of your Key Vault instance.
+    - **Resource ID**: The Resource ID of your Azure Key Vault instance.
+ 16. When you've entered all of your information, click **Create**.
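A sketch of the same scope creation using the legacy Databricks CLI, assuming the Resource ID and Vault URI noted in step 14 (placeholders below):

```sh
# Create a Key Vault-backed secret scope named "segment".
# <key-vault-resource-id> and <vault-uri> come from step 14.
databricks secrets create-scope --scope segment \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "<key-vault-resource-id>" \
  --dns-name "<vault-uri>" \
  --initial-manage-principal users
```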
> warning " "
> Before continuing, note the Cluster ID, Workspace name, Workspace URL, and the Azure Resource Group for Databricks Workspace: you'll need these variables when configuring the Azure Data Lakes destination in the Segment app.

### Step 5 - Set up a Service Principal

- ### Step 6 - Configure Databricks cluster
+ 1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank"}, select the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
+ 2. On the overview page for your Databricks instance, select **Access control (IAM)**.
+ 3. Click **Add** and select **Add role assignment**.
+ 4. On the **Members** tab, assign access to a **User, group, or service principal**.
+ 5. Click **Select members**.
+ 6. Search for and select the `Databricks Resource Provider` service principal.
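The role assignment in steps 3 through 6 can also be scripted. A hedged Azure CLI sketch; the role name and IDs below are placeholders, since the exact role to grant isn't specified here:

```sh
# Assign a role on the Databricks workspace to the service principal.
# <role-name>, <sp-object-id>, and the scope segments are placeholders.
az role assignment create \
  --assignee "<sp-object-id>" \
  --role "<role-name>" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>"
```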
- ### Step 7 - Enable the Data Lakes destination in the Segment app
+ ### Step 6 - Enable the Data Lakes destination in the Segment app

After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
@@ -182,22 +208,22 @@ After you set up the necessary resources in Azure, the next step is to set up th
2. Search for and select **Azure Data Lakes**.
3. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
4. In the **Connection Settings** section, enter the following values:
-    - Azure Storage Account (The name of the Azure Storage account that you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account))
-    - Azure Storage Container (The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account))
-    - Azure Subscription ID
-    - Azure Tenant ID
-    - Databricks Cluster ID
-    - Databricks Instance URL
-    - Databricks Workspace Name
-    - Databricks Workspace Resource Group
-    - Region (The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account))
-    - Service Principal Client ID
-    - Service Principal Client Secret
+    - **Azure Storage Account**: The name of the Azure Storage account that you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
+    - **Azure Storage Container**: The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
+    - **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank"}.
+    - **Azure Tenant ID**: The Tenant ID of your [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank"}.
+    - **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank"}.
+    - **Databricks Instance URL**: The URL of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
+    - **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
+    - **Databricks Workspace Resource Group**: The resource group that hosts your Azure Databricks instance. This is visible in Azure on the overview page for your Databricks instance.
+    - **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
+    - **Service Principal Client ID**: The Client ID of the Service Principal you created in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
+    - **Service Principal Client Secret**: The Client Secret of the Service Principal you created in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).

### Optional - Set up the Data Lake using Terraform

- Instead of manually configuring your Data Lake, you can create a Data Lake using the script in the [`terraform-azure-data-lake`](https://github.com/segmentio/terraform-azure-data-lakes)Github repository.
+ Instead of manually configuring your Data Lake, you can create a Data Lake using the script in the [`terraform-azure-data-lake`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
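For orientation, consuming that repository as a Terraform module might look like the sketch below. The input variable names are hypothetical placeholders, since the module's actual interface isn't documented here:

```hcl
# Hypothetical usage sketch: the variable names below are illustrative,
# not the module's documented inputs.
module "segment_azure_data_lake" {
  source = "github.com/segmentio/terraform-azure-data-lakes"

  region              = "<region>"         # placeholder
  resource_group_name = "<resource-group>" # placeholder
}
```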
src/connections/storage/data-lakes/index.md (5 additions, 5 deletions)
@@ -36,7 +36,7 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
- ### How [Azure Data Lakes] works
+ ### How Azure Data Lakes works
Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight, or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
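As an illustration, reading that Parquet output back from ADLS with PySpark might look like the following. This is a minimal sketch: the container, storage account, and path are placeholders, and authentication configuration is omitted:

```python
# Minimal PySpark sketch: read Segment's Parquet output from ADLS Gen2.
# Container, storage account, and path below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-segment-data-lake").getOrCreate()

events = spark.read.parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/<path-to-event-table>"
)
events.printSchema()
```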
@@ -60,7 +60,7 @@ Data Lakes uses an IAM role to grant Segment secure access to your AWS account.
- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role granting access to your AWS account. Define the external ID in the IAM role as the ID of the Segment workspace you want to connect to Data Lakes. You can retrieve the Segment Workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to Settings > General Settings > ID.
- **s3_bucket**: Name of the S3 bucket used by the Data Lake.
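For context, the external ID typically appears in the IAM role's trust policy as an `sts:ExternalId` condition. A minimal sketch, with the Segment principal ARN and workspace ID left as placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<segment-aws-principal-arn>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<your-segment-workspace-id>" }
      }
    }
  ]
}
```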
- ### Set up [Azure Data Lakes]
+ ### Set up Azure Data Lakes

Before you can connect your [Azure Data Lake] to Segment, you must set up the following components in your Azure environment:
@@ -71,7 +71,7 @@ Before you can connect your [Azure Data Lake] to Segment, you must set up the fo
- [Azure MySQL Database](https://docs.microsoft.com/en-us/azure/purview/register-scan-azure-mysql-database){:target="_blank"}: The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
- [Azure KeyVault Instance](https://docs.microsoft.com/en-us/azure/key-vault/general/quick-create-portal){:target="_blank"}: Azure KeyVault provides a secure store for your keys, secrets, and certificates.

- For more information about configuring [Azure Data Lakes], see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
+ For more information about configuring Azure Data Lakes, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).

## Data Lakes schema
@@ -124,7 +124,7 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat
> info ""
> The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.

- ### [Azure Data Lakes] schema
+ ### Azure Data Lakes schema

### Data types
@@ -137,7 +137,7 @@ The data types supported in [AWS Data Lakes] are:
- string
- timestamp

- The data types supported in the [Azure Data Lakes] are:
+ The data types supported in Azure Data Lakes are: