
Commit 6163d55

Finishing steps, adding schema info, callouts on pages unrelated to Azure Data Lakes
DOC-493
1 parent 6424ba5 commit 6163d55

6 files changed: +209 -68 lines changed


src/connections/storage/catalog/data-lakes/index.md

Lines changed: 149 additions & 34 deletions
@@ -8,15 +8,15 @@ redirect_from: '/connections/destinations/catalog/data-lakes/'
 Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
 
 > note "Lake Formation"
-> You can also set up your [AWS Data Lakes] using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
+> You can also set up your Segment Data Lakes using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
 
-## Set up [AWS Data Lakes]
+## Set up Segment Data Lakes
 
-To set up [AWS Data Lakes], create your AWS resources, enable the [AWS Data Lakes] destination in the Segment app, and verify that your Segment data synced to S3 and Glue.
+To set up Segment Data Lakes, create your AWS resources, enable the Segment Data Lakes destination in the Segment app, and verify that your Segment data synced to S3 and Glue.
 
 ### Prerequisites
 
-Before you set up [AWS Data Lakes], you need the following resources:
+Before you set up Segment Data Lakes, you need the following resources:
 
 - An [AWS account](https://aws.amazon.com/account/){:target="_blank”}
 - An [Amazon S3 bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket){:target="_blank”} to receive data and store logs
@@ -84,7 +84,7 @@ Segment creates a separate EMR cluster to run replays, then destroys it when the
 > info " "
 > Azure Data Lakes is available in Public Beta.
 
-To set up Azure Data Lakes, create your [Azure resources](/docs/src/connections/storage/data-lakes/#set-up-[azure-data-lakes]) and then enable the Data Lakes destination in the Segment app.
+To set up Azure Data Lakes, create your Azure resources and then enable the Data Lakes destination in the Segment app.
 
 ### Prerequisites
 
@@ -120,16 +120,17 @@ Before you can configure your Azure resources, you must first [create an Azure s
 2. Search for and select **Key Vault**.
 3. On the Key Vault resource page, select the **Key Vault** plan and click **Create**.
 4. On the **Basic** tab, select an existing subscription and resource group, give your Key Vault a name, and update the **Days to retain deleted vaults** setting, if desired.
-6. Click **Review + create**.
-7. Review your chosen settings. When you are satisfied with your selections, click **Review + create**.
-8. After your resource is deployed, click **Go to resource**.
-9. On the Key Vault page, select the **Access control (IAM)** tab.
-10. Click **Add** and select **Add role assignment**.
-11. On the **Roles** tab, select the `Key Vault Secrets User` role. Click **Next**.
-12. On the **Members** tab, assign access to a **User, group, or service principal**.
-13. Click **Select members**.
-14. Search for and select the `Databricks Resource Provider` service principal.
-15.
+5. Click **Review + create**.
+6. Review your chosen settings. When you are satisfied with your selections, click **Review + create**.
+7. After your resource is deployed, click **Go to resource**.
+8. On the Key Vault page, select the **Access control (IAM)** tab.
+9. Click **Add** and select **Add role assignment**.
+10. On the **Roles** tab, select the `Key Vault Secrets User` role. Click **Next**.
+11. On the **Members** tab, select a **User, group, or service principal**.
+12. Click **Select members**.
+13. Search for and select the `Databricks Resource Provider` service principal.
+14. Click **Select**.
+15. Under the **Members** header, verify that you selected the Databricks Resource Provider. Click **Review + assign**.
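
If you prefer to script the Key Vault portion of this step, a rough Azure CLI sketch follows. The vault, resource group, and region names are placeholders, it assumes your Key Vault uses the Azure RBAC permission model (role assignments have no effect with vault access policies), and the `AzureDatabricks` display name and `id` query field are assumptions you should verify against your own tenant and CLI version:

```powershell
# Create the Key Vault (names and region are placeholders)
az keyvault create --name <key-vault-name> --resource-group <resource-group> --location <region>

# Look up the object ID of the Databricks first-party service principal in your tenant
az ad sp list --display-name "AzureDatabricks" --query "[].id" --output tsv

# Grant it the Key Vault Secrets User role on the vault
az role assignment create `
  --role "Key Vault Secrets User" `
  --assignee-object-id <databricks-sp-object-id> `
  --assignee-principal-type ServicePrincipal `
  --scope $(az keyvault show --name <key-vault-name> --resource-group <resource-group> --query id --output tsv)
```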
 
 ### Step 3 - Set up Azure MySQL database
 
@@ -175,7 +176,7 @@ Before you can configure your Azure resources, you must first [create an Azure s
 8. On the Azure Databricks Service overview page, click **Launch Workspace**.
 9. On the Databricks page, select **Create a cluster**.
 10. On the Compute page, select **Create Cluster**.
-11. Enter a name for your cluster and select the `Standard_DS4_v2` worker type. Set the minimum number of workers to 2, and the maximum number of workers to 8. __Segment recommends deselecting the "Terminate after X minutes" setting, as the time it takes to restart a cluster may delay your data lake syncs.__
+11. Enter a name for your cluster and select the `Standard_DS4_v2` worker type. Set the minimum number of workers to 2, and the maximum number of workers to 8. __Segment recommends deselecting the "Terminate after X minutes" setting, as the time it takes to restart a cluster may delay your Data Lake syncs.__
 12. Click **Create Cluster**.
 13. Open [your Azure portal](https://portal.azure.com/#home){:target="_blank”} and select the Key Vault you created in a previous step.
 14. On the Key Vault page, select the JSON View link to view the Resource ID and vaultURI. Take note of these values, as you'll need them in the next step to configure your Databricks instance.
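
The cluster described in step 11 can also be created outside the portal. A minimal sketch, assuming you have the legacy Databricks CLI installed and configured against your workspace; the cluster name and Spark runtime version are placeholders you should replace with values valid for your workspace:

```powershell
# Write a cluster spec matching step 11: Standard_DS4_v2 workers, autoscale 2-8, no auto-termination
@'
{
  "cluster_name": "<cluster-name>",
  "spark_version": "<spark-runtime-version>",
  "node_type_id": "Standard_DS4_v2",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "autotermination_minutes": 0
}
'@ | Set-Content cluster.json

# Create the cluster with the legacy Databricks CLI
databricks clusters create --json-file cluster.json
```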
@@ -187,23 +188,107 @@ Before you can configure your Azure resources, you must first [create an Azure s
 16. When you've entered all of your information, click **Create**.
 
 > warning " "
-> Before continuing, note the Cluster ID, Workspace name, Workspace URL, and the Azure Resource Group for Databricks Workspace: you'll need these variables when configuring the Azure Data Lakes destination in the Segment app.
+> Before continuing, note the Cluster ID, Workspace name, Workspace URL, and the Azure Resource Group for your Databricks Workspace: you'll need these variables when configuring the Azure Data Lakes destination in the Segment app.
 
 ### Step 5 - Set up a Service Principal
 
-1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank”}, select the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
+1. Open your Azure CLI and create a new service principal using the following commands: <br/>
+``` powershell
+az login
+az ad sp create-for-rbac --name <ServicePrincipalName>
+```
+2. In your Azure portal, select the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
 2. On the overview page for your Databricks instance, select **Access control (IAM)**.
 3. Click **Add** and select **Add role assignment**.
-4. On the **Members** tab, assign access to a **User, group, or service principal**.
-5. Click **Select members**.
-6. Search for and select the `Databricks Resource Provider` service principal.
+4. On the **Roles** tab, select the `Managed Application Operator` role. Click **Next**.
+5. On the **Members** tab, select a **User, group, or service principal**.
+6. Click **Select members**.
+7. Search for and select the Service Principal you created above.
+8. Click **Select**.
+9. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
+10. Return to the Azure home page. Select your storage account.
+11. On the overview page for your storage account, select **Access control (IAM)**.
+12. Click **Add** and select **Add role assignment**.
+13. On the **Roles** tab, select the `Storage Blob Data Contributor` role. Click **Next**.
+14. On the **Members** tab, select a **User, group, or service principal**.
+15. Click **Select members**.
+16. Search for and select the Service Principal you created above.
+17. Click **Select**.
+18. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
+
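The two role assignments above can also be scripted. A sketch under the assumption that you use the `appId` and `password` values printed by `az ad sp create-for-rbac` (those map to the Service Principal Client ID and Client Secret that the Segment app asks for later); every resource ID and name below is a placeholder:

```powershell
# Service principal application (client) ID from the create-for-rbac output
$spAppId = "<service-principal-app-id>"

# Managed Application Operator on the Databricks workspace
az role assignment create `
  --role "Managed Application Operator" `
  --assignee $spAppId `
  --scope "/subscriptions/<subscription-id>/resourceGroups/<databricks-resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>"

# Storage Blob Data Contributor on the storage account
az role assignment create `
  --role "Storage Blob Data Contributor" `
  --assignee $spAppId `
  --scope "/subscriptions/<subscription-id>/resourceGroups/<storage-resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"
```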
+### Step 6 - Configure Databricks Cluster
+
+> warning "Optional configuration settings for log4j vulnerability"
+> While Databricks released a statement that clusters are likely unaffected by the log4j vulnerability, out of an abundance of caution, Databricks recommends updating to log4j 2.15+ or adding the following options to the Spark configuration: <br/> `spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`<br/>`spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`
+
+1. Connect to a [Hive metastore](https://docs.databricks.com/data/metastores/external-hive-metastore.html){:target="_blank”} on your Databricks cluster.
+2. Copy the following Spark configuration, replacing the variables (`<example_variable>`) with information from your workspace: <br/>
+```py
+## Configs so we can read from the storage account
+spark.hadoop.fs.azure.account.oauth.provider.type.<storage_account_name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
+spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage_account_name>.dfs.core.windows.net https://login.microsoftonline.com/<azure-tenant-id>/oauth2/token
+spark.hadoop.fs.azure.account.oauth2.client.secret.<storage_account_name>.dfs.core.windows.net <service-principal-secret>
+spark.hadoop.fs.azure.account.auth.type.<storage_account_name>.dfs.core.windows.net OAuth
+spark.hadoop.fs.azure.account.oauth2.client.id.<storage_account_name>.dfs.core.windows.net <service_principal_client_id>
+##
+##
+spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
+spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<db-host>:<port>/<database-name>?useSSL=true&requireSSL=false
+spark.hadoop.javax.jdo.option.ConnectionUserName <database_user>
+spark.hadoop.javax.jdo.option.ConnectionPassword <database_password>
+##
+##
+##
+spark.hive.mapred.supports.subdirectories true
+spark.sql.storeAssignmentPolicy Legacy
+mapreduce.input.fileinputformat.input.dir.recursive true
+spark.sql.hive.convertMetastoreParquet false
+##
+datanucleus.autoCreateSchema true
+datanucleus.autoCreateTables true
+spark.sql.hive.metastore.schema.verification false
+datanucleus.fixedDatastore false
+##
+spark.sql.hive.metastore.version 2.3.7
+spark.sql.hive.metastore.jars builtin
+```
+
+3. Log in to your Databricks instance and open your cluster.
+4. On the overview page for your cluster, select **Edit**.
+5. Open the **Advanced options** toggle and paste the Spark config you copied above, replacing the variables (`<example_variable>`) with information from your workspace.
+6. Select **Confirm and restart**. On the popup window, select **Confirm**.
+7. Log in to your Azure MySQL database using the following command: <br/>
+```powershell
+mysql --host=[HOSTNAME] --port=3306 --user=[USERNAME] --password=[PASSWORD]
+```
+8. Once you've logged in to your MySQL database, run the following commands: <br/>
+```sql
+USE <db-name>
+INSERT INTO VERSION (VER_ID, SCHEMA_VERSION) VALUES (0, '2.3.7');
+```
+9. Log in to your Databricks cluster.
+10. Click **Create** and select **Notebook**.
+11. Give your notebook a name, select **SQL** as the default language, and make sure it's located in the cluster you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
+12. Click **Create**.
+13. On the overview page for your new notebook, run the following command: <br/>
+```sql
+CREATE TABLE test (id string);
+```
+14. Open your cluster.
+15. On the overview page for your cluster, select **Edit**.
+16. Open the **Advanced options** toggle and paste the following code snippet: <br/>
+```py
+datanucleus.autoCreateSchema false
+datanucleus.autoCreateTables false
+spark.sql.hive.metastore.schema.verification true
+datanucleus.fixedDatastore true
+```
+17. Select **Confirm and restart**. On the popup window, select **Confirm**.
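
If syncs later fail with a Hive metastore version error, it can help to confirm that the row written in step 8 is actually present. A quick check, sketched with the same placeholder connection values used above:

```powershell
# Confirm the Hive metastore schema version row written in step 8 exists
mysql --host=[HOSTNAME] --port=3306 --user=[USERNAME] --password=[PASSWORD] `
  --database=<db-name> --execute="SELECT VER_ID, SCHEMA_VERSION FROM VERSION;"
```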
 
-### Step 6 - Enable the Data Lakes destination in the Segment app
+### Step 7 - Enable the Data Lakes destination in the Segment app
 
 After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
 
-<!-- TODO: Test this workflow in a staging environment to verify that the steps are correct -->
-
 1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank”}, click **Add Destination**.
 2. Search for and select **Azure Data Lakes**.
 2. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
@@ -212,18 +297,18 @@ After you set up the necessary resources in Azure, the next step is to set up th
 - **Azure Storage Container**: The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
 - **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank”}.
 - **Azure Tenant ID**: The Tenant ID of your [Azure Active directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank”}.
-- **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank”}
-- **Databricks Instance URL**: The ID of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}
-- **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}
+- **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank”}.
+- **Databricks Instance URL**: The ID of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}.
+- **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}.
 - **Databricks Workspace Resource Group**: The resource group that hosts your Azure Databricks instance. This is visible in Azure on the overview page for your Databricks instance.
-- **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account)
-- **Service Principal Client ID**:
-- **Service Principal Client Secret**:
+- **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
+- **Service Principal Client ID**: The Client ID of the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
+- **Service Principal Client Secret**: The Client Secret of the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
 
 
-### Optional - Set up the Data Lake using Terraform
+### (Optional) Set up your Azure Data Lake using Terraform
 
-Instead of manually configuring your Data Lake, you can create a Data Lake using the script in the [`terraform-azure-data-lake`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
+Instead of manually configuring your Data Lake, you can create it using the script in the [`terraform-azure-data-lake`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
 
 > note " "
 > This script requires Terraform versions 0.12+.
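
The usual Terraform workflow applies once you've cloned the repository and filled in the variables it defines (the variable names and any required `terraform.tfvars` entries come from the repository itself, not from this guide):

```powershell
git clone https://github.com/segmentio/terraform-azure-data-lakes.git
cd terraform-azure-data-lakes

# Initialize providers, preview the resources, then apply
terraform init
terraform plan
terraform apply
```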
@@ -262,7 +347,7 @@ Running the `plan` command gives you an output that creates 19 new objects, unle
 
 ## FAQ
 
-### [AWS Data Lakes]
+### Segment Data Lakes
 
 {% faq %}
 {% faqitem Do I need to create Glue databases? %}
@@ -354,4 +439,34 @@ Replace:
 {% endfaqitem %}
 {% endfaq %}
 
-### Azure Data Lakes
+### Azure Data Lakes
+
+{% faq %}
+
+{% faqitem Does my ALDS-enabled storage account need to be in the same region as the other infrastructure? %}
+Yes, your storage account and Databricks instance should be in the same region.
+{% endfaqitem %}
+
+{% faqitem What analytics tools are available to use with my Azure Data Lake? %}
+Azure Data Lakes supports the following post-processing tools:
+- PowerBI
+- Azure HDInsight
+- Azure Synapse Analytics
+- Databricks
+{% endfaqitem %}
+
+{% faqitem What can I do to troubleshoot my Databricks database? %}
+If you encounter errors related to your Databricks database, try adding the following line to the config: <br/>
+```py
+spark.sql.hive.metastore.schema.verification.record.version false
+```
+<br/>After you've added it to your config, restart your cluster so that your changes can take effect. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
+{% endfaqitem %}
+
+{% faqitem What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database? %}
+Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
+{% endfaqitem %}
+
+
+
+{% endfaq %}
