`src/connections/storage/catalog/data-lakes/index.md`
The time needed to process a Replay can vary depending on the volume of data and the number of events in each source. If you decide to run a Replay, Segment recommends that you start with data from the last six months, and then replay additional data if you find you need more.
Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
## Set up Azure Data Lakes
### Step 1 - Create an ADLS-enabled storage account
> note " "
> Take note of the Location, Storage Account Name, and the name of your Azure Storage Container: you'll need these values when configuring the Azure Data Lakes destination in the Segment app.

1. Sign in to your [Azure environment](https://portal.azure.com){:target="_blank"}.
2. From the Azure home page, select **Create a resource**.
3. Search for and select **Storage account**.
4. On the Storage account resource page, select the **Storage account** plan and click **Create**.
5. On the **Basic** tab, select an existing subscription and resource group, give your storage account a name, and update any necessary instance details. Take note of the **Region** you select in this step, as you'll need it when creating the Azure Data Lakes destination in the Segment app.
6. Click **Next: Advanced**.
7. On the **Advanced Settings** tab, in the Security section, select the following options:
    - Require secure transfer for REST API operations
    - Enable blob public access
    - Enable storage account key access
    - Minimum TLS version: Version 1.2
8. In the Data Lake Storage Gen2 section, select **Enable hierarchical namespace**. In the Blob storage section, select the **Hot** option.
9. Click **Next: Networking**.
10. On the **Networking** page, select **Disable public access and use private access**.
11. Click **Review + create**. Take note of your location, as you'll need it when configuring the Azure Data Lakes destination in the Segment app.
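If you manage Azure resources as code, the storage account configured above can also be described with Terraform. The block below is a minimal, hypothetical sketch using the `azurerm` provider (3.x argument names); the resource names, resource group, and region are placeholders, and this is not the script referenced in the optional Terraform section later on — it only mirrors the portal settings from this list.

```hcl
# Hypothetical sketch of the ADLS Gen2 storage account described above.
# Names, resource group, and location are placeholders for your own values.
resource "azurerm_storage_account" "segment_data_lake" {
  name                      = "segmentdatalakestorage" # must be globally unique
  resource_group_name       = "my-resource-group"
  location                  = "eastus"                 # the Region noted in step 5
  account_tier              = "Standard"
  account_replication_type  = "LRS"
  account_kind              = "StorageV2"
  access_tier               = "Hot"                    # Blob storage access tier: Hot
  is_hns_enabled            = true                     # Enable hierarchical namespace (ADLS Gen2)
  min_tls_version           = "TLS1_2"                 # Minimum TLS version: 1.2
  enable_https_traffic_only = true                     # Require secure transfer for REST API operations
}

# The container whose name you enter as "Azure Storage Container" in the Segment app.
resource "azurerm_storage_container" "segment_data_lake" {
  name                 = "segment-data-lake"
  storage_account_name = azurerm_storage_account.segment_data_lake.name
}
```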
### Step 2 - Set up Key Vault
### Step 3 - Set up Azure MySQL database
### Step 4 - Set up Databricks
> note "Databricks pricing tier"
> If you create a Databricks instance solely for Azure Data Lakes, the standard pricing tier is sufficient. However, if you use your Databricks instance for other applications, you may require premium pricing.
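For reference, if you create the workspace with Terraform rather than the Azure UI, the pricing tier corresponds to the `sku` argument of the `azurerm_databricks_workspace` resource. A minimal, hypothetical sketch (names and resource group are placeholders):

```hcl
# Hypothetical sketch: a Databricks workspace on the standard pricing tier.
# Switch sku to "premium" if other applications on the workspace need premium features.
resource "azurerm_databricks_workspace" "segment_data_lake" {
  name                = "segment-data-lake-databricks"
  resource_group_name = "my-resource-group"
  location            = "eastus"
  sku                 = "standard"
}
```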
### Step 5 - Set up a Service Principal
### Step 6 - Configure Databricks cluster
After you set up the necessary resources in Azure, the next step is to set up the Azure Data Lakes destination in the Segment app:

1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank"}, click **Add Destination**.
2. Search for and select **Azure Data Lakes**.
3. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
4. In the **Connection Settings** section, enter the following values:
    - Azure Storage Account (The name of the Azure Storage account that you set up in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Azure Storage Container (The name of the Azure Storage Container you created in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Azure Subscription ID
    - Azure Tenant ID
    - Databricks Cluster ID
    - Databricks Instance URL
    - Databricks Workspace Name
    - Databricks Workspace Resource Group
    - Region (The location of the Azure Storage account you set up in [Step 1 - Create an ADLS-enabled storage account](#step-1---create-an-adls-enabled-storage-account))
    - Service Principal Client ID
    - Service Principal Client Secret
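If you created the Azure resources with Terraform (see the optional section below), several of these values can be read from the provider instead of copied out of the portal. A hedged sketch, assuming the hypothetical resource names used in the earlier examples:

```hcl
# Hypothetical outputs that surface values used in the Connection Settings above.
data "azurerm_client_config" "current" {}

output "azure_subscription_id" {
  value = data.azurerm_client_config.current.subscription_id
}

output "azure_tenant_id" {
  value = data.azurerm_client_config.current.tenant_id
}

output "databricks_instance_url" {
  value = azurerm_databricks_workspace.segment_data_lake.workspace_url
}

output "azure_storage_account" {
  value = azurerm_storage_account.segment_data_lake.name
}
```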
### Optional - Set up the Data Lake using Terraform
Instead of manually configuring your Data Lake, you can create a Data Lake using the script in the [`terraform-azure-data-lakes`](https://github.com/segmentio/terraform-azure-data-lakes) GitHub repository.
> note " "
> This script requires Terraform versions 0.12+.
Before you can run the Terraform script, create a Databricks workspace in the Azure UI using the instructions in [Step 4 - Set up Databricks](#step-4---set-up-databricks). Note the **Workspace URL**, as you will need it to run the script.
In the setup file, set the local variables the script requires.

After you've configured your local variables, run the following commands:

```bash
terraform init
terraform plan
terraform apply
```

Running the `plan` command produces an output showing 19 objects to be created, unless you are reusing objects from other Azure applications. Running the `apply` command creates the resources and outputs a service principal password you can use to set up the destination.
0 commit comments