
Commit 36f0225

First pass of steps 1, 7, & optional Terraform setup [DOC-493]
1 parent 672f03e

1 file changed: src/connections/storage/catalog/data-lakes/index.md (73 additions, 4 deletions)
The time needed to process a Replay can vary depending on the volume of data and the number of events in each source. If you decide to run a Replay, Segment recommends starting with data from the last six months, and then replaying additional data if you find you need more.
Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
## Set up [Azure Data Lakes]
### Step 1 - Create an ADLS-enabled storage account

> note " "
> Take note of the Location, Storage Account Name, and the name of your Azure Storage Container: you'll need these variables when configuring the Azure Data Lakes destination in the Segment app.
1. Sign in to your [Azure environment](https://portal.azure.com){:target="_blank"}.
2. From the Azure home page, select **Create a resource**.
3. Search for and select **Storage account**.
4. On the Storage account resource page, select the **Storage account** plan and click **Create**.
5. On the **Basic** tab, select an existing subscription and resource group, give your storage account a name, and update any necessary instance details. Take note of the **Region** you select in this step, as you'll need it when creating the [Azure Data Lakes] destination in the Segment app.
6. Click **Next: Advanced**.
7. On the **Advanced Settings** tab in the Security section, select the following options:
    - Require secure transfer for REST API operations
    - Enable blob public access
    - Enable storage account key access
    - Minimum TLS version: Version 1.2
8. In the Data Lake Storage Gen2 section, select **Enable hierarchical namespace**. In the Blob storage section, select the **Hot** option.
9. Click **Next: Networking**.
10. On the **Networking** page, select **Disable public access and use private access**.
11. Click **Review + create**. Take note of your location, as you'll need it when you configure the Azure Data Lakes destination in the Segment app. If you'd rather script this step, see the sketch after this list.
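
If you prefer working from the command line, the same storage account settings can be sketched with the Azure CLI. This is a minimal sketch rather than part of the documented setup: the resource group, account, and container names and the region below are hypothetical placeholders, and the private networking option from step 10 still needs to be configured separately.

```shell
# Hypothetical names and region; replace them with your own values.
az group create --name segment-datalakes-rg --location eastus

# ADLS Gen2 storage account: hierarchical namespace, Hot tier, TLS 1.2, HTTPS only.
az storage account create \
  --name segmentdatalakes \
  --resource-group segment-datalakes-rg \
  --location eastus \
  --kind StorageV2 \
  --sku Standard_LRS \
  --enable-hierarchical-namespace true \
  --min-tls-version TLS1_2 \
  --https-only true \
  --access-tier Hot

# Container the Azure Data Lakes destination writes to.
az storage container create \
  --name segment-data-lake \
  --account-name segmentdatalakes \
  --auth-mode login
```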
### Step 2 - Set up KeyVault
### Step 3 - Set up Azure MySQL database
### Step 4 - Set up Databricks
> note "Databricks pricing tier"
> If you create a Databricks instance solely for [Azure Data Lakes] to use, the standard pricing tier is sufficient. However, if you use your Databricks instance for other applications, you may require the premium pricing tier.
### Step 5 - Set up a Service Principal
### Step 6 - Configure Databricks cluster

After you set up the necessary resources in Azure, the next step is to set up the Azure Data Lakes destination in the Segment app:
1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**.
2. Search for and select **Azure Data Lakes**.
3. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
4. In the **Connection Settings** section, enter the following values (a lookup sketch for the Azure IDs follows this list):
    - Azure Storage Account (The name of the Azure Storage account that you set up in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Azure Storage Container (The name of the Azure Storage Container you created in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Azure Subscription ID
    - Azure Tenant ID
    - Databricks Cluster ID
    - Databricks Instance URL
    - Databricks Workspace Name
    - Databricks Workspace Resource Group
    - Region (The location of the Azure Storage account you set up in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Service Principal Client ID
    - Service Principal Client Secret
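
If you're not sure where to find the Azure Subscription ID and Tenant ID, one way to look them up is with the Azure CLI. This is a convenience sketch, not part of the documented flow; the Databricks values come from your Databricks workspace and cluster pages, and the Service Principal values come from [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).

```shell
# Azure Subscription ID for the account you're signed in to.
az account show --query id --output tsv

# Azure Tenant ID for the same account.
az account show --query tenantId --output tsv
```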
### Optional - Set up the Data Lake using Terraform
Instead of manually configuring your Data Lake, you can create one using the script in the [`terraform-azure-data-lakes`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
> note " "
> This script requires Terraform versions 0.12+.
Before you can run the Terraform script, create a Databricks workspace in the Azure UI using the instructions in [Step 4 - Set up Databricks](#step-4---set-up-databricks). Note the **Workspace URL**, as you will need it to run the script.
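
One way to retrieve the Workspace URL without clicking through the portal is the Azure CLI's `databricks` extension. Treat this as an assumption-level convenience rather than part of the documented flow: the extension and output field can vary across CLI versions, and the resource group and workspace names below are hypothetical.

```shell
# Requires the Azure CLI "databricks" extension: az extension add --name databricks
az databricks workspace show \
  --resource-group segment-datalakes-rg \
  --name segment-datalakes-workspace \
  --query workspaceUrl --output tsv
```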
In the setup file, set the following local variables:
```hcl
locals {
  region                   = "<segment-datalakes-region>"
  resource_group           = "<segment-datalakes-resource-group>"
  storage_account          = "<segment-datalakes-storage-account>"
  container_name           = "<segment-datalakes-container>"
  key_vault_name           = "<segment-datalakes-key-vault>"
  server_name              = "<segment-datalakes-server>"
  db_name                  = "<segment-datalakes-db-name>"
  db_password              = "<segment-datalakes-db-password>"
  db_admin                 = "<segment-datalakes-db-admin>"
  databricks_workspace_url = "<segment-datalakes-db-workspace-url>"
  cluster_name             = "<segment-datalakes-db-cluster>"
  tenant_id                = "<tenant-id>"
}
```
After you've configured your local variables, run the following commands:
```shell
terraform init
terraform plan
terraform apply
```
Running the `plan` command produces an output that shows 19 new objects to be created, unless you are reusing objects from other Azure applications. Running the `apply` command creates the resources and outputs a service principal password you can use to set up the destination.
## FAQ
### [AWS Data Lakes]
