articles/data-explorer/spark-connector.md
82 additions & 60 deletions
@@ -9,16 +9,15 @@ ms.topic: conceptual
ms.date: 1/14/2020
---
-# Azure Data Explorer Connector for Apache Spark (Preview)
+# Azure Data Explorer Connector for Apache Spark
[Apache Spark](https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of data.
-Azure Data Explorer connector for Spark implements data source and data sink for moving data across Azure Data Explorer and Spark clusters to use both of their capabilities. Using Azure Data Explorer and Apache Spark, you can build fast and scalable applications targeting data driven scenarios, such as machine learning (ML), Extract-Transform-Load (ETL), and Log Analytics.
-Writing to Azure Data Explorer can be done in batch and streaming mode.
-Reading from Azure Data Explorer supports column pruning and predicate pushdown, which reduces the volume of transferred data by filtering out data in Azure Data Explorer.
+The Azure Data Explorer connector for Spark is an [open source project](https://github.com/Azure/azure-kusto-spark) that can run on any Spark cluster. It implements data source and data sink for moving data across Azure Data Explorer and Spark clusters. Using Azure Data Explorer and Apache Spark, you can build fast and scalable applications targeting data-driven scenarios such as machine learning (ML), Extract-Transform-Load (ETL), and Log Analytics. With the connector, Azure Data Explorer becomes a valid data store for standard Spark source and sink operations, such as write, read, and writeStream.
-Azure Data Explorer Spark connector is an [open source project](https://github.com/Azure/azure-kusto-spark) that can run on any Spark cluster. The Azure Data Explorer Spark connector makes Azure Data Explorer a valid data store for standard Spark source
-and sink operations such as write, read and writeStream.
+You can write to Azure Data Explorer in either batch or streaming mode. Reading from Azure Data Explorer supports column pruning and predicate pushdown, which filters the data in Azure Data Explorer, reducing the volume of transferred data.
+
+This topic describes how to install and configure the Azure Data Explorer Spark connector and move data between Azure Data Explorer and Apache Spark clusters.
> [!NOTE]
> Although some of the examples below refer to an [Azure Databricks](https://docs.azuredatabricks.net/) Spark cluster, Azure Data Explorer Spark connector does not take direct dependencies on Databricks or any other Spark distribution.
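As a hedged illustration of the sink operations named above, the sketch below appends a DataFrame to an Azure Data Explorer table in batch mode. The cluster, database, and table names are placeholders, and the string option keys (`kustoCluster`, `kustoDatabase`, `kustoTable`, `kustoAadAppId`, `kustoAadAppSecret`, `kustoAadAuthorityID`) follow the connector's README; verify them against the [open source project](https://github.com/Azure/azure-kusto-spark) before relying on them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KustoConnectorSample").getOrCreate()

// Placeholder coordinates and Azure AD application credentials.
val cluster   = "MyCluster.westus"
val database  = "MyDatabase"
val appId     = sys.env("KUSTO_APP_ID")
val appKey    = sys.env("KUSTO_APP_KEY")
val authority = sys.env("AAD_TENANT_ID")

// Batch write: append a small DataFrame to an Azure Data Explorer table.
val df = spark.range(1000).toDF("value")
df.write
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", cluster)
  .option("kustoDatabase", database)
  .option("kustoTable", "MyTable")
  .option("kustoAadAppId", appId)
  .option("kustoAadAppSecret", appKey)
  .option("kustoAadAuthorityID", authority)
  .mode("append")
  .save()
```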
@@ -27,36 +26,36 @@ and sink operations such as write, read and writeStream.
* [Create an Azure Data Explorer cluster and database](/azure/data-explorer/create-cluster-database-portal)
* Create a Spark cluster
-* Install Azure Data Explorer connector library, and libraries listed in [dependencies](https://github.com/Azure/azure-kusto-spark#dependencies) including the following [Kusto Java SDK](/azure/kusto/api/java/kusto-java-client-library) libraries:
-  * [Kusto Data Client](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/kusto-data)
-  * Pre-built libraries for [Spark 2.4, Scala 2.11](https://github.com/Azure/azure-kusto-spark/releases) and [Maven repo](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/spark-kusto-connector)
+* Install Azure Data Explorer connector library:
+  * Pre-built libraries for [Spark 2.4, Scala 2.11](https://github.com/Azure/azure-kusto-spark/releases)
+  > 2.3.x versions are also supported, but may require some changes in pom.xml dependencies.
+1. Install the libraries listed in [dependencies](https://github.com/Azure/azure-kusto-spark#dependencies) including the following [Kusto Java SDK](/azure/kusto/api/java/kusto-java-client-library) libraries:
+   * [Kusto Data Client](https://mvnrepository.com/artifact/com.microsoft.azure.kusto/kusto-data)
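If you pull these libraries through a build tool instead of attaching pre-built jars, the Maven coordinates linked above can be declared in `build.sbt`. A minimal sketch; the version numbers are illustrative placeholders, not pinned recommendations:

```scala
// build.sbt -- versions are placeholders; match them to your Spark 2.4 / Scala 2.11 cluster.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "com.microsoft.azure.kusto" % "spark-kusto-connector" % "1.1.0", // connector (see releases page)
  "com.microsoft.azure.kusto" % "kusto-data"            % "1.4.1"  // Kusto Data Client
)
```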
-Azure Data Explorer Spark connector allows you to authenticate with Azure Active Directory (Azure AD) using an [Azure AD application](#azure-ad-application-authentication), [Azure AD access token](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#direct-authentication-with-access-token), [device authentication](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#device-authentication) (for non-production scenarios), or [Azure Key Vault](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#key-vault). The user must install azure-keyvault package and provide application credentials to access the Key Vault resource.
+Azure Data Explorer Spark connector enables you to authenticate with Azure Active Directory (Azure AD) using one of the following methods:
+* An [Azure AD application](#azure-ad-application-authentication)
+* An [Azure AD access token](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#direct-authentication-with-access-token)
+* An [Azure Key Vault](https://github.com/Azure/azure-kusto-spark/blob/dev/docs/Authentication.md#key-vault)
+To access the Key Vault resource, install the azure-keyvault package and provide application credentials.
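As a rough sketch of the Key Vault path, the read below passes Key Vault coordinates instead of raw application secrets. The option keys (`keyVaultUri`, `keyVaultAppId`, `keyVaultAppKey`) are assumptions based on the connector's Authentication.md; confirm the exact names there before use.

```scala
// Hypothetical Key Vault-backed authentication -- option keys are assumptions.
val df = spark.read
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "MyCluster.westus")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoQuery", "MyTable | take 100")
  .option("keyVaultUri", "https://myvault.vault.azure.net") // vault holding the Kusto app credentials
  .option("keyVaultAppId", sys.env("KEYVAULT_APP_ID"))      // app permitted to read the vault
  .option("keyVaultAppKey", sys.env("KEYVAULT_APP_KEY"))
  .load()
```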
### Azure AD application authentication
-Most simple and common authentication method. This method is recommended for Azure Data Explorer Spark connector usage.
+Azure AD application authentication is the simplest and most common authentication method and is recommended for the Azure Data Explorer Spark connector.
|Properties |Description |
|---------|---------|
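In practice, the properties in this table are collected once and reused across reads and writes. A minimal sketch, assuming the string option keys from the connector README (the exact constants live in the connector's options classes):

```scala
// Assumed option keys mirroring the properties table above; verify against the README.
val kustoAuth = Map(
  "kustoAadAppId"       -> sys.env("KUSTO_APP_ID"),  // application (client) ID
  "kustoAadAppSecret"   -> sys.env("KUSTO_APP_KEY"), // application secret
  "kustoAadAuthorityID" -> sys.env("AAD_TENANT_ID")  // Azure AD tenant (directory) ID
)

val df = spark.read
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "MyCluster.westus")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoQuery", "StormEvents | take 100")    // StormEvents is a hypothetical table
  .options(kustoAuth)
  .load()
```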
@@ -107,10 +116,10 @@ Most simple and common authentication method. This method is recommended for Azu
### Azure Data Explorer privileges
-The following privileges must be granted on an Azure Data Explorer cluster:
+Grant the following privileges on an Azure Data Explorer cluster:
-* For reading (data source), Azure AD application must have *viewer* privileges on the target database, or *admin* privileges on the target table.
-* For writing (data sink), Azure AD application must have *ingestor* privileges on the target database. It must also have *user* privileges on the target database to create new tables. If the target table already exists, *admin* privileges on the target table can be configured.
+* For reading (data source), the Azure AD identity must have *viewer* privileges on the target database, or *admin* privileges on the target table.
+* For writing (data sink), the Azure AD identity must have *ingestor* privileges on the target database. It must also have *user* privileges on the target database to create new tables. If the target table already exists, you must configure *admin* privileges on the target table.
For more information on Azure Data Explorer principal roles, see [role-based authorization](/azure/kusto/management/access-control/role-based-authorization). For managing security roles, see [security roles management](/azure/kusto/management/security-roles).
@@ -167,10 +176,9 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.streaming.Trigger
-// Set up a checkpoint and disable codeGen. Set up a checkpoint and disable codeGen as a workaround for an known issue
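For context on these imports, here is a hedged sketch of the streaming path: a rate-source stream written through the connector's streaming sink. The sink class name and option keys follow the connector README and are assumptions; the two `spark.conf` settings restate the checkpoint/codeGen workaround the comment above describes.

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.streaming.Trigger

// Workaround: set a checkpoint location and disable whole-stage codeGen.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/FileStore/temp/checkpoint")
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// A toy source stream; in practice this is your real streaming DataFrame.
val streamDf = spark.readStream.format("rate").load()

streamDf.writeStream
  .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider") // assumed sink class name
  .option("kustoCluster", "MyCluster.westus")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoTable", "MyStreamingTable")
  .option("kustoAadAppId", sys.env("KUSTO_APP_ID"))
  .option("kustoAadAppSecret", sys.env("KUSTO_APP_KEY"))
  .option("kustoAadAuthorityID", sys.env("AAD_TENANT_ID"))
  .trigger(Trigger.ProcessingTime(TimeUnit.SECONDS.toMillis(10))) // fire every 10 seconds
  .start()
```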
@@ -212,7 +220,8 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
display(df2)
```
-1. When reading large amounts of data, transient blob storage must be provided. Provide storage container SAS key, or storage account name, account key, and container name. This step is only required for the current preview release of the Spark connector.
+1. Optional: If **you** provide the transient blob storage (and not Azure Data Explorer), the created blobs are the caller's responsibility. This includes provisioning the storage, rotating access keys, and deleting transient artifacts.
+The KustoBlobStorageUtils module contains helper functions for deleting blobs based on either account and container coordinates and account credentials, or a full SAS URL with write, read, and list permissions. Each transaction stores transient blob artifacts in a separate directory; this directory is captured as part of read-transaction information logs reported on the Spark Driver node, and the blobs can be deleted once the corresponding RDD is no longer needed.
```scala
// Use either container/account-key/account name, or container SaS
@@ -222,28 +231,41 @@ For more information on Azure Data Explorer principal roles, see [role-based aut
// val storageSas = dbutils.secrets.get(scope = "KustoDemos", key = "blobStorageSasUrl")
```
-In the example above, we don't access the KeyVault using the connector interface. Alternatively, we use a simpler method of using the Databricks secrets.
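To connect the storage parameters above to a full read, here is a hedged sketch that supplies caller-provided transient blob storage. The blob option keys (`blobStorageAccountName`, `blobStorageAccountKey`, `blobContainer`, `blobStorageSasUrl`) follow the connector README and should be treated as assumptions; supply either the account/key/container trio or the SAS URL, not both.

```scala
// Transient storage coordinates -- names are placeholders, keys are README-based assumptions.
val storageAccount = "mytransientstore"
val storageKey     = dbutils.secrets.get(scope = "KustoDemos", key = "blobStorageAccountKey")
val container      = "kusto-transient"

val df3 = spark.read
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "MyCluster.westus")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoQuery", "StormEvents | where EventType == 'Flood'") // hypothetical query
  .option("kustoAadAppId", sys.env("KUSTO_APP_ID"))
  .option("kustoAadAppSecret", sys.env("KUSTO_APP_KEY"))
  .option("kustoAadAuthorityID", sys.env("AAD_TENANT_ID"))
  .option("blobStorageAccountName", storageAccount)
  .option("blobStorageAccountKey", storageKey)
  .option("blobContainer", container)
  // Or, with a SAS URL instead of account/key:
  // .option("blobStorageSasUrl", storageSas)
  .load()

display(df3)
```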