Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
> note "Lake Formation"
> You can also set up your Segment Data Lakes using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
## Set up Segment Data Lakes
To set up Segment Data Lakes, create your AWS resources, enable the Segment Data Lakes destination in the Segment app, and verify that your Segment data synced to S3 and Glue.
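
If you want to spot-check a sync from the command line, the sketch below uses the AWS CLI; the bucket and Glue database names are placeholders for your own resources, and the exact S3 path layout depends on your configuration:

```powershell
# Spot-check that Segment data is landing in your S3 bucket
aws s3 ls s3://<your-s3-bucket>/ --recursive

# Confirm that Glue tables were created for your synced events
aws glue get-tables --database-name <your-glue-database> --query "TableList[].Name"
```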
### Prerequisites
Before you set up Segment Data Lakes, you need the following resources:
- An [AWS account](https://aws.amazon.com/account/){:target="_blank"}
- An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket){:target="_blank"} to receive data and store logs
> info " "
> Azure Data Lakes is available in Public Beta.
To set up Azure Data Lakes, create your Azure resources and then enable the Data Lakes destination in the Segment app.
### Prerequisites
2. Search for and select **Key Vault**.
3. On the Key Vault resource page, select the **Key Vault** plan and click **Create**.
4. On the **Basic** tab, select an existing subscription and resource group, give your Key Vault a name, and update the **Days to retain deleted vaults** setting, if desired.
5. Click **Review + create**.
6. Review your chosen settings. When you are satisfied with your selections, click **Review + create**.
7. After your resource is deployed, click **Go to resource**.
8. On the Key Vault page, select the **Access control (IAM)** tab.
9. Click **Add** and select **Add role assignment**.
10. On the **Roles** tab, select the `Key Vault Secrets User` role. Click **Next**.
11. On the **Members** tab, select a **User, group, or service principal**.
12. Click **Select members**.
13. Search for and select the `Databricks Resource Provider` service principal.
14. Click **Select**.
15. Under the **Members** header, verify that you selected the Databricks Resource Provider. Click **Review + assign**.
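
If you'd rather script this step, the sketch below shows a roughly equivalent sequence in the Azure CLI. All `<...>` values are placeholders, and the role assignment assumes you know the object ID of the `Databricks Resource Provider` service principal in your tenant:

```powershell
# Create the Key Vault in an existing resource group
az keyvault create --name <your-key-vault> --resource-group <your-resource-group> --location <region>

# Grant the Databricks Resource Provider the Key Vault Secrets User role on the vault
az role assignment create --assignee <databricks-resource-provider-object-id> --role "Key Vault Secrets User" --scope <key-vault-resource-id>
```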
### Step 3 - Set up Azure MySQL database
8. On the Azure Databricks Service overview page, click **Launch Workspace**.
9. On the Databricks page, select **Create a cluster**.
10. On the Compute page, select **Create Cluster**.
11. Enter a name for your cluster and select the `Standard_DS4_v2` worker type. Set the minimum number of workers to 2, and the maximum number of workers to 8. __Segment recommends deselecting the "Terminate after X minutes" setting, as the time it takes to restart a cluster may delay your Data Lake syncs.__
12. Click **Create Cluster**.
13. Open [your Azure portal](https://portal.azure.com/#home){:target="_blank"} and select the Key Vault you created in a previous step.
14. On the Key Vault page, select the JSON View link to view the Resource ID and Vault URI. Take note of these values, as you'll need them in the next step to configure your Databricks instance.
16. When you've entered all of your information, click **Create**.
> warning " "
> Before continuing, note the Cluster ID, Workspace name, Workspace URL, and the Azure Resource Group for your Databricks Workspace: you'll need these variables when configuring the Azure Data Lakes destination in the Segment app.
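
You can also read these values back with the Azure CLI instead of the portal. This sketch assumes the `databricks` CLI extension is installed (`az extension add --name databricks`); verify the output field names in your environment:

```powershell
# Workspace URL and resource group for your Databricks instance
az databricks workspace show --name <workspace-name> --resource-group <resource-group> --query "{url:workspaceUrl, resourceGroup:resourceGroup}"

# Resource ID and vault URI for the Key Vault you created in Step 2
az keyvault show --name <your-key-vault> --query "{resourceId:id, vaultUri:properties.vaultUri}"
```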
### Step 5 - Set up a Service Principal
1. Open your Azure CLI and create a new service principal using the following commands: <br/>
```powershell
az login
az ad sp create-for-rbac --name <ServicePrincipalName>
```
2. In your Azure portal, select the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
3. On the overview page for your Databricks instance, select **Access control (IAM)**.
4. Click **Add** and select **Add role assignment**.
5. On the **Roles** tab, select the `Managed Application Operator` role. Click **Next**.
6. On the **Members** tab, select a **User, group, or service principal**.
7. Click **Select members**.
8. Search for and select the Service Principal you created above.
9. Click **Select**.
10. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
11. Return to the Azure home page. Select your storage account.
12. On the overview page for your storage account, select **Access control (IAM)**.
13. Click **Add** and select **Add role assignment**.
14. On the **Roles** tab, select the `Storage Blob Data Contributor` role. Click **Next**.
15. On the **Members** tab, select a **User, group, or service principal**.
16. Click **Select members**.
17. Search for and select the Service Principal you created above.
18. Click **Select**.
19. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
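
As a scripted alternative to the portal steps above, the two role assignments might look like the following in the Azure CLI. The role names come from the steps above (confirm the exact built-in role names in your tenant), and every `<...>` value is a placeholder:

```powershell
# Let the Service Principal operate the managed Databricks application
az role assignment create --assignee <service-principal-client-id> --role "Managed Application Operator" --scope <databricks-workspace-resource-id>

# Let the Service Principal read and write blobs in your storage account
az role assignment create --assignee <service-principal-client-id> --role "Storage Blob Data Contributor" --scope <storage-account-resource-id>
```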
### Step 6 - Configure Databricks Cluster
> warning "Optional configuration settings for log4j vulnerability"
> While Databricks released a statement that clusters are likely unaffected by the log4j vulnerability, out of an abundance of caution, Databricks recommends updating to log4j 2.15+ or adding the following options to the Spark configuration: <br/> `spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`<br/>`spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`
1. Connect to a [Hive metastore](https://docs.databricks.com/data/metastores/external-hive-metastore.html){:target="_blank"} on your Databricks cluster.
2. Copy the following Spark configuration, replacing the variables (`<example_variable>`) with information from your workspace: <br/>
```py
## Configs so we can read from the storage account
```
3. Log in to your Databricks instance and open your cluster.
4. On the overview page for your cluster, select **Edit**.
5. Open the **Advanced options** toggle and paste the Spark config you copied above, replacing the variables (`<example_variable>`) with information from your workspace.
6. Select **Confirm and restart**. On the popup window, select **Confirm**.
7. Log in to your Azure MySQL database using the following command: <br/>
```powershell
mysql --host=[HOSTNAME] --port=3306 --user=[USERNAME] --password=[PASSWORD]
```
8. Once you've logged in to your MySQL database, run the following commands: <br/>
```sql
USE <db-name>;
INSERT INTO VERSION (VER_ID, SCHEMA_VERSION) VALUES (0, '2.3.7');
```
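
The `INSERT` seeds the row that the Hive metastore's schema check expects. To confirm the row landed before moving on, a quick check:

```sql
SELECT VER_ID, SCHEMA_VERSION FROM VERSION;
```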
9. Log in to your Databricks workspace.
10. Click **Create** and select **Notebook**.
11. Give your notebook a name, select **SQL** as the default language, and make sure it's attached to the cluster you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
12. Click **Create**.
13. On the overview page for your new notebook, run the following command: <br/>
```sql
CREATE TABLE test (id string);
```
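
This table is only a smoke test to confirm that the metastore accepts writes; assuming you don't need it afterward, you can drop it:

```sql
DROP TABLE test;
```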
14. Open your cluster.
15. On the overview page for your cluster, select **Edit**.
16. Open the **Advanced options** toggle and paste the following code snippet: <br/>
```py
datanucleus.autoCreateSchema false
datanucleus.autoCreateTables false
spark.sql.hive.metastore.schema.verification true
datanucleus.fixedDatastore true
```
17. Select **Confirm and restart**. On the popup window, select **Confirm**.
### Step 7 - Enable the Data Lakes destination in the Segment app
After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**.
2. Search for and select **Azure Data Lakes**.
3. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
- **Azure Storage Container**: The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
- **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank"}.
- **Azure Tenant ID**: The Tenant ID of your [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank"}.
- **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank"}.
- **Databricks Instance URL**: The URL of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
- **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
- **Databricks Workspace Resource Group**: The resource group that hosts your Azure Databricks instance. This is visible in Azure on the overview page for your Databricks instance.
- **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
- **Service Principal Client ID**: The Client ID of the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
- **Service Principal Client Secret**: The Client Secret of the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
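
Several of these values can be looked up with the Azure CLI rather than the portal; the field names below reflect current CLI output and are worth verifying in your environment:

```powershell
# Subscription ID and Tenant ID for your current login
az account show --query "{subscriptionId:id, tenantId:tenantId}"

# Client ID of the Service Principal you created in Step 5
az ad sp list --display-name <ServicePrincipalName> --query "[].appId" --output tsv
```

The Client Secret is shown only once, in the `password` field of the `az ad sp create-for-rbac` output from Step 5; if you've lost it, create a new credential for the Service Principal.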
### (Optional) Set up your Azure Data Lake using Terraform
Instead of manually configuring your Data Lake, you can create it using the script in the [`terraform-azure-data-lakes`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
> note " "
> This script requires Terraform versions 0.12+.
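
Assuming you've cloned the repository and filled in the required variables, the standard Terraform workflow applies (a sketch, not a full runbook):

```powershell
terraform init    # download the required providers and modules
terraform plan    # preview the resources Terraform will create
terraform apply   # create the Azure resources
```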
## FAQ
### Segment Data Lakes
{% faq %}
{% faqitem Do I need to create Glue databases? %}
{% endfaqitem %}
{% endfaq %}
### Azure Data Lakes
{% faq %}
{% faqitem Does my ALDS-enabled storage account need to be in the same region as the other infrastructure? %}
Yes, your storage account and Databricks instance should be in the same region.
{% endfaqitem %}
{% faqitem What analytics tools are available to use with my Azure Data Lake? %}
Azure Data Lakes supports the following post-processing tools:
- Power BI
- Azure HDInsight
- Azure Synapse Analytics
- Databricks
{% endfaqitem %}
{% faqitem What can I do to troubleshoot my Databricks database? %}
If you encounter errors related to your Databricks database, try adding the following line to the config: <br/>
<br/>After you've added the line to your config, restart your cluster so that your changes take effect. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
{% endfaqitem %}
{% faqitem What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database? %}
Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
{% endfaqitem %}
{% endfaq %}