Commit 034a034

Merge pull request #595 from segmentio/repo-sync
repo sync
2 parents defb034 + f1ae0c7

File tree

4 files changed: +12, -7 lines

.github/styles/Vocab/Docs/accept.txt (1 addition, 0 deletions)

@@ -39,6 +39,7 @@ Cocoapods
 Contentful
 Criteo
 csv
+Databricks
 datetime
 deeplink
 Dev
Two binary image files added (20.2 KB and 36.5 KB): the storage account and storage container screenshots referenced in index.md below.

src/connections/storage/catalog/data-lakes/index.md (11 additions, 7 deletions)
@@ -6,6 +6,9 @@ redirect_from: '/connections/destinations/catalog/data-lakes/'
 
 
 Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
+Segment supports two types of data lakes:
+- [AWS Data Lakes](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes)
+- [Azure Data Lakes](/docs/connections/storage/catalog/data-lakes/#set-up-azure-data-lakes)
 
 > note "Lake Formation"
 > You can also set up your Segment Data Lakes using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
@@ -255,8 +258,7 @@ curl -X POST 'https://<per-workspace-url>/api/2.0/preview/scim/v2/ServicePrincip
 > warning "Optional configuration settings for log4j vulnerability"
 > While Databricks released a statement that clusters are likely unaffected by the log4j vulnerability, out of an abundance of caution, Databricks recommends updating to log4j 2.15+ or adding the following options to the Spark configuration: <br/> `spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`<br/>`spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`
 
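For readability, the two options quoted in the warning above, written out as individual Spark config lines (the values are copied verbatim from the warning):

```py
## log4j mitigation settings from the warning above, one per line,
## as they would be pasted into the cluster's Spark config
spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"
spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"
```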
-1. Connect to a [Hive metastore](https://docs.databricks.com/data/metastores/external-hive-metastore.html){:target="_blank"} on your Databricks cluster.
-2. Copy the following Spark configuration, replacing the variables (`<example_variable>`) with information from your workspace: <br/>
+1. Connect to a [Hive metastore](https://docs.databricks.com/data/metastores/external-hive-metastore.html){:target="_blank"} on your Databricks cluster using the following Spark configuration, replacing the variables (`<example_variable>`) with information from your workspace: <br/>
 ```py
 ## Configs so we can read from the storage account
 spark.hadoop.fs.azure.account.oauth.provider.type.<storage_account_name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
@@ -267,7 +269,7 @@ spark.hadoop.fs.azure.account.oauth2.client.id.<storage_account_name>.dfs.core.w
 ##
 ##
 spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
-spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<db-host>:<port>/<database-name>?useSSL=true&requireSSL=false
+spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<db-host>:<port>/<database-name>?useSSL=true&requireSSL=true&enabledSslProtocolSuites=TLSv1.2
 spark.hadoop.javax.jdo.option.ConnectionUserName <database_user>
 spark.hadoop.javax.jdo.option.ConnectionPassword <database_password>
 ##
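To make the variable substitution concrete, a hypothetical filled-in version of the metastore connection lines might look like the following; the host, database name, user, and password are invented placeholders, not values from this commit:

```py
## Hypothetical example only: every value below is a made-up placeholder
spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://example-metastore.mysql.database.azure.com:3306/hive_metastore?useSSL=true&requireSSL=true&enabledSslProtocolSuites=TLSv1.2
spark.hadoop.javax.jdo.option.ConnectionUserName hive_user
spark.hadoop.javax.jdo.option.ConnectionPassword example-password-123
```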
@@ -328,11 +330,13 @@ After you set up the necessary resources in Azure, the next step is to set up th
 2. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
 3. In the **Connection Settings** section, enter the following values:
 - **Azure Storage Account**: The name of the Azure Storage account that you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
+![The Azure Storage Account setting](images/storageaccount.png)
 - **Azure Storage Container**: The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
-- **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank"}.
-- **Azure Tenant ID**: The Tenant ID of your [Azure Active directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank"}.
+![The Azure Storage Container setting](images/storagecontainer.png)
+- **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank"}. <br> Enter it exactly as it appears in the Azure portal, in the format `********-****-****-****-************`.
+- **Azure Tenant ID**: The Tenant ID of your [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank"}. <br> Enter it exactly as it appears in the Azure portal, in the format `********-****-****-****-************`.
 - **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank"}.
-- **Databricks Instance URL**: The ID of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
+- **Databricks Instance URL**: The URL of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}. <br> Enter the URL in the format `adb-0000000000000000.00.azuredatabricks.net`.
 - **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank"}.
 - **Databricks Workspace Resource Group**: The resource group that hosts your Azure Databricks instance. This is visible in Azure on the overview page for your Databricks instance.
 - **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
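To illustrate the formats the new lines call out, here are hypothetical sample values for the three fields with format notes; none of these identifiers are real:

```py
## Hypothetical sample values matching the formats described above
azure_subscription_id   = "12345678-90ab-cdef-1234-567890abcdef"  # GUID, exactly as shown in the Azure portal
azure_tenant_id         = "fedcba98-7654-3210-fedc-ba9876543210"  # GUID, exactly as shown in the Azure portal
databricks_instance_url = "adb-1234567890123456.12.azuredatabricks.net"  # matches the format quoted above
```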
@@ -481,4 +485,4 @@ spark.sql.hive.metastore.schema.verification.record.version false
 <br/>After you've added to your config, restart your cluster so that your changes can take effect. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
 
 #### What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database?
-Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
+Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
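When re-checking that config, the schema-verification flags are worth confirming; a minimal sketch, where the second line appears in the hunk header above and the first is an assumption based on common Databricks external-metastore setups:

```py
## Relax Hive metastore schema verification so the cluster can
## initialize the metastore tables; restart the cluster after editing
spark.sql.hive.metastore.schema.verification false
spark.sql.hive.metastore.schema.verification.record.version false
```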
